Lightgbm & Xgboost

先主要介绍树的基本知识，然后介绍LightGBM和XGBoost及其调参.

进入正文之前简单的说一下决策树。一棵树很容易过拟合或者欠拟合（根据树的深度），然后需要使用多棵树进行组合预测，而GBDT是实现这个手段的方式之一。sklearn中也是实现了 GBDT 这种思想，但是比较难用，训练速度跟不上。但是 lightGBM 和XGBoost 实现的效果就比较好。

lightGBM调参(常用参数)

Since lightGBM is based on decision tree algorithms, it splits the tree with the best fit whereas boosting algorithms split the tree depth wise or level wise rather than leaf-wise. So when growing on the same leaf in Light GBM, the leaf-wise algorithm can reduce more loss than the level-wise algorithm and hence results in much better accuracy which can rarely be achieved by any of the existing boosting algorithms. Also, it is surprisingly very fast, hence the word ‘Light’.

Leaf wise splits lead to increase in complexity and may lead to overfitting and it can be overcome by specifying another parameter max_depth parameter.

Advantages of LightGBM

faster training speed and higher efficiency Light GBM use histogram based algorithm i.e it buckets continuous feature values into discrete bins which fasten the training procedure.
lower memory usage Replaces continuous values to discrete bins which result in lower memory usage.
better accuracy than any other boosting algorithm It produces much more complex trees by following leaf wise split approach rather than a level-wise approach which is the main factor in achieving higher accuracy.
compatibility with large datasets It is capable of performing equally good with large datasets with a significant reduction in training time as compared to XGBOOST.
parallel learning supported

lightGBM调参(常用参数)

task default= train, option: train, prediction
application default= regression, option: regression, binary, multiclass, lambdarank(lambdarank application)
data training data, 这个比较诡异，你需要创建一个lightGBM类型的data
num_iterations default =100, 可以设置为的大一些，然后使用early_stopping进行调节。
early_stopping_round default =0, will stop training if one metric of one validation data doesn’t improve in last early_stopping_round rounds.
num_leaves default =31, number of leaves in a tree
device default =cpu, options: gpu, cpu, choose gpu for faster training.
max_depth specify the max depth to which tree will grow, which is very important.
feature_fraction default =1, specifies the fraction of features to be taken for each iteration.
bagging_fraction default =1, spefifies the fraction of data to be used for each iteration and it generally used to speed up the training and avoid overfitting.
max_bin max number of bins to bucket the feature values.因为模型是基于bin训练的，如果bin 数量越多，得到better accuracy,同时更加容易 overfitting.
num_threads
label specify the label columns.
categorical_feature specify the categorical features
num_class default =1, used only for multi-class classification

referrence

which-algorithm-takes-the-crown-light-gbm-vs-xgboost LightGBM 如何调参官方文档param_tuning 官方文档parameter

XGBoost

算法的核心思想

不断去增加树，不断进行特征分裂来生长一棵树，每次增加一个树，其实就是学习一个新函数 $f (x)$ ，去拟合上次预测的残差
当我们训练完成得到 $k$ 棵树，我们要预测一个样本的分数，其实就是根据这个样本的提特征，在每棵树中都会落到对应的一个叶子节点，每个叶子节点上对应一个分数
最后只需要将每个树对应的分数加起来就是该样本的预测值

树该如何生长？

简单说：枚举所有不同树结构的贪心法。

不断地枚举不同树的结构，然后通过打分函数寻找一个最优结构的树，接着加入模型中，不断重复这样的操作。选择一个 feature 分裂，计算 loss function 的最小值，然后再选择一个 feature 进行分裂，枚举完了，选择一个效果最好的。

如果停止树的生长？

有三种方式。

当分裂带来的增益小于设定的阈值时候，我们可以忽略这次分裂，类似预剪枝。
当树达到最大深度则停止建立决策树。可以设置超参数 max_depth
样本权重和小于设置阈值停止建树。min_child_weight，大概是当一个叶子上的样本比较少时候，停止建树，同样是为了防止过拟合。

Advantage of XGBoost

数据层面

handling missing values very useful property. XGBoost has an in-built routine to handle missing values. User is required to supply a different value than other observations and pass that as a parameter. XGBoost tries different things as it encounter a missing value on each node and learns which path to take for missing values in future.
parallel processing we know that boosting is sequential process so how can it be parallelized? this link to explore further.

模型层面

Tree pruning A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm. XGBoost on the other hand make splits upon the max_depth specified and then start pruning the tree backwards and remove splits beyond which there is no positive gain.
high flexibility XGBoost allow users to define custom optimization objectives and evaluation criteria

训练层面

built-in cross-validation This is unlike GBM where we have to run a grid-search and only a limited values can be tested.
regularization standard GBM implementation has no regularization, in fact, XGBoost is also known as ‘regularized boosting’ technique.
continue on existing model

XGBoost Parameters

xgboost的参数可以分为三类，通用参数/general parameters, 集成(增强)参数/booster parameters 和任务参数/task parameters。

general parameters

General Parameters: Guide the overall functioning

booster: default =gbtree, can be gbtree, gblinear or dart. 一般使用gbtree.
silent: default =0, silent mode is activated if set to 1(no running messages will be printed)
nthread: default to maximum of threads.

booster parameters

Booster Parameters: Guide the individual booster (tree/regression) at each step

eta(learning rate): default=0.3, typical final values to be used: 0.01-0.2, using CV to tune
min_child_weight: minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. default =1,too high values can lead to under-fitting, it should be tuned using CV. 数值越小越容易过拟合，越大越容易 under-fitting.
max_depth: default =6, typical values: 3-10, should be tuned using CV. 设置树的最大深度。默认是 6
gamma: default =0, Gamma specifies the minimum loss reduction required to make a split.如果在分裂过程中小于该值，那么就不会继续分裂。
subsample: default =1, typical values: 0.5-1. Denotes the fraction of observations to be randomly samples for each tree. 观察子样本的比率，即对总体进行随机采样的比例，默认是 1.
colsample_bytree: default =1, typical values: 0.5-1. colsample_bytree和subsample不同点：colsample_by是特征的随机fraction, subsample是rows的随机fraction。用于构造每棵树时变量的子样本比例，即特征抽样。默认是 1.
lambda: default =1, L2 regularization term on weights(analogous to Ridge regression). Though many data scientists don’t use it often, it should be explored to reduce overfitting.
alpha: default =0, L1 regularization term on weight (analogous to Lasso regression)
scale_pos_weight: default =1, a value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.

learning task parameters

Learning Task Parameters: Guide the optimization performed

objective binary: logistic- returns predicated probability(not class) multi: softmax- returns predicated class(not probabilities) multi: softprob- returns predicated probability of each data point belonging to each class.
eval_metirc default according to objective(rmse for regression and error for classification), used for validation data. typical values: rmse(root mean square error), mse(mean absolute error), logloss(negative log-likelihood), error(binary classification error rate, 0.5 threshold), auc(area under the curve) 评价指标，一般使用 auc
seed default =0, used for reproducible results and also for parameter tuning.

Control Overfitting

There are in general two ways that you can control overfitting in xgboost.

The first way is to directly control model complexity.
The second way is to add regularization parameters

XGBoost 和 GBDT 区别

GBDT 是机器学习算法， XGBoost 是算法的工程上的实现
在使用CART作为基分类器时，XGBoost显式地加入了正则项来控制模型的复杂度，有利于防止过拟合，从而提高模型的泛化能力。
GBDT在模型训练时只使用了代价函数的一阶导数信息，XGBoost对代价函数进行二阶泰勒展开，可以同时使用一阶和二阶导数。
传统的GBDT采用CART作为基分类器，XGBoost支持多种类型的基分类器，比如线性分类器
传统的GBDT在每轮迭代时使用全部的数据，XGBoost则采用了与随机森林相似的策略，支持对数据进行采样。
传统的GBDT没有设计对缺失值进行处理，XGBoost能够自动学习出缺失值的处理策略。

GBDT 是串行的，XGBoost 的并行不是 tree 的粒度上，而是特征粒度上的。XGBoost 在训练之前，预先对数据进行排序保存，后面重复使用这个结构，大大减少了计算量。在进行节点分裂的时候，需要计算每个特征的增益，最终选增益最大的特征进行分类，这个时候可以多线程并行去计算。

LightGBM 和 XGBoost 的一些区别

树的增长方式

这个原先是lightGBM所特有的，然后xgboost 在最新的版本上也实现该中方式，这估计就是开源，并不是一成不变的。两者都是基于叶子进行增长的，但是增长的方式是不同的。一种是 level-wise 一种是 leaf-wise both xgboost and lightGBM use the leaf-wise growth strategy when growing the decision tree. there are two strategies that can be employed: level-wise and leaf-wise. level-wise maintains a balanced tree（平衡树，左右子树的高度差不超过1）;但是 leaf-wise 这个就比较随意了。这个是这两者的区别：当叶子总数相同的时候，leaf-wise 这种生长方式得到的树的深度是大于 level-wise 的深度的。 Compared to the case of level-wise growth, a tree grown with leaf-wise growth will be deeper when the number of leaves is the same.

find the best split

The key challenge in training a GBDT is the process of finding the best split for each leaf. The computational complexity is thus $O (n_{data} n_{features})$ . 这两者采用的方式都是：现阶段的中的的数据数据量和特征量都是很大的，所以这种方式是不可取的。然后这两种方法都是采用了 Histogram-based methods，这样最后的时间复杂度降低到： reducing the computational complexity to $O (n_{d a t a} n_{b i n s})$ .

这个复杂度是取决于number of bins。这个是引入了一个超参数，number of bins ，trade off between precision and time, 当更多的 bins 的时候，这个precision 会提高，但是 time 也会增大。

Ignoring sparse inputs

这个是处理缺省值（或者 0）的手段：两者在split 分裂点的时候，都是先不处理数值 0；然后找到分裂点之后，把0 放到哪边造成的loss 下降的比较大，然后就放到哪边。

Subsampling the data: Gradient-based One-Side Sampling (lightGBM) biased sampling(抽样的基本原则是随机性，但是在抽样过程中由于一系列因素造成偏差抽样，造成样本是不符合真实样本的分布) 这个就是缓解，也不是为了彻底让其均衡，lightgbm increases the weigths of the samples. This means that it is more efficient to concentrate on data points with larger gradients. In order to mitigate this problem, lightGBM also randomly samples from data with small gradients. lightGBM increases the weight of the samples with small gradients when computing their contribution to the change in loss (this is a form of importance sampling, a technique for efficient sampling from an arbitrary distribution).

xgboost + lr

如果清楚GBDT原理的话（机器学习-一文理解GBDT的原理-20171001），最后模型预测结果也是将叶子结点进行线性加总，权值就是训练得到的各叶子结点的权重。如果将GBDT得到的叶子节点特征喂入LR的话，其实相当于用逻辑回归对叶子结点的权重进行了重新训练计算，在这里GBDT只是将数据中的非线性关系进行了一层转换，从而可以利用lr进行线性组合。

除了原理和实现方式之外，其实还有一些小的细节需要注意的，比如：

1、进行LR训练时，原始特征是否要加进去；原始特征的加入在xgb欠拟合的时候起的作用更大，当xgb树的颗数上升到一定数量时，原始特征的加入没什么提升。 2、gbdt训练的模型效果对最终LR的训练出来的模型有多大的影响；

当xgb训练充分时，lr直接利用xgb叶子结点的编码特征在合适的惩罚系数下可以训练得到和xgb一样的甚至更好的效果（不显著）；

3、树的棵数，叶子结点数，学习率，树的深度，采样会如何影响最终的模型效果等等；

lr利用xgb叶子结点的编码特征进行训练，分别在不同的C下训练得到train和test的auc值

简单地说，就是把gbdt的输出，作为logistic regression的输入，最后得到一个logistic regression模型。例如，gbdt里有3棵树T1,T2,T3，每棵树的叶节点个数为4，第i个树的第j个叶节点是Li,j。当gdbt训练完成之后，样本X1在第一棵树中被分到了第3个叶节点上，也就是L1,3，那么这个样本在T1上的向量表达为(0,0,1,0)。

样本X1在T2被分到了L2,1，那么X1在T2上的向量表达为(1,0,0,0) 样本X1在T3被分到了L3,4，那么X1在T3上的向量表达为(0,0,0,1) 那么X1在整个gbdt上的向量表达为

(0,0,1,0,1,0,0,0,0,0,0,1) 所以每个样本都会被表示为一个长度为12的0-1向量，其中有3个数值是1。

然后这类向量就是LR模型的输入数据。

LR 和 XGOOST 是 CTR 中常用的两种模型，二者各有优缺点，在 facebook 中使用 XGBOOST（提取特征） + LR（预测）的方式。GBDT 模型擅长处理连续特征值，而 LR 则擅长处理离散特征值。

在工业界，很少直接将连续值作为特征喂给逻辑回归模型，而是将连续特征离散化为一系列0、1特征交给逻辑回归模型，这样做的优势有以下几点：

1
2
3
4
5


1.稀疏向量内积乘法运算速度快，计算结果方便存储，容易scalable（扩展）。
2.离散化后的特征对异常数据有很强的鲁棒性：比如一个特征是年龄>30是1，否则0。如果特征没有离散化，一个异常数据“年龄300岁”会给模型造成很大的干扰。
3.逻辑回归属于广义线性模型，表达能力受限；单变量离散化为N个后，每个变量有单独的权重，相当于为模型引入了非线性，能够提升模型表达能力，加大拟合。
4.离散化后可以进行特征交叉，由M+N个变量变为M*N个变量，进一步引入非线性，提升表达能力。
5.特征离散化后，模型会更稳定，比如如果对用户年龄离散化，20-30作为一个区间，不会因为一个用户年龄长了一岁就变成一个完全不同的人。当然处于区间相邻处的样本会刚好相反，所以怎么划分区间是门学问。

先用已有特征训练Xgboost模型，然后利用Xgboost模型学习到的树来构造新特征，最后把这些新特征加入原有特征一起训练模型。构造的新特征向量是取值0/1的，向量的每个元素对应于Xgboost模型中树的叶子结点。当一个样本点通过某棵树最终落在这棵树的一个叶子结点上，那么在新特征向量中这个叶子结点对应的元素值为1，而这棵树的其他叶子结点对应的元素值为0。新特征向量的长度等于XGBoost模型里所有树包含的叶子结点数之和。最后将新的特征扔到LR模型进行训练。

xgboost+lr模型融合方法用于分类或者回归的思想最早由facebook在广告ctr预测中提出，其论文Practical Lessons from Predicting Clicks on Ads at Facebook有对其进行阐述。在这篇论文中他们提出了一种将xgboost作为feature transform的方法。大概的思想可以描述为如下：先用已有特征训练XGBoost模型，然后利用XGBoost模型学习到的树来构造新特征，最后把这些新特征加入原有特征一起训练模型。构造的新特征向量是取值0/1的，向量的每个元素对应于XGBoost模型中树的叶子结点。当一个样本点通过某棵树最终落在这棵树的一个叶子结点上，那么在新特征向量中这个叶子结点对应的元素值为1，而这棵树的其他叶子结点对应的元素值为0。新特征向量的长度等于XGBoost模型里所有树包含的叶子结点数之和。最后将新的特征扔到LR模型进行训练。实验结果表明xgboost+lr能取得比单独使用两个模型都好的效果。

实现角度

在实践中的关键点是如何获得每个样本落在训练后的每棵树的哪个叶子结点上。

A、对于Xgboost来说，因为其有sklearn接口和自带接口，因此有两种方法可以获得：

①、sklearn接口。可以设置pre_leaf=True获得每个样本在每颗树上的leaf_Index。XGBoost官方文档

②、自带接口。利用apply()方法可以获得leaf indices。SKlearn GBDT API

！！！此过程需注意：无论是设置pre_leaf=True还是利用apply()方法，获得的都是叶子节点的 index，也就是说落在了具体哪颗树的哪个叶子节点上，并非是0/1变量，因此需要自己动手去做 onehot 编码。onehot 可以在 sklearn 的预处理包中调用即可。

onehot的知识点可以参看这篇博客。

B、对于其它的树模型，如随机森林和GBDT，我们只能使用apply()方法获得leaf indices。

XGBoost+LR融合的原理和简单实现 xgboost和LR模型级联

对于原始数据的编码方式

连续特征离散化
离散的特征one-hot 或者特征组合
使用 boosting 方式将连续特征离散化（xgboost 特征提取）XGBoost + LR 是一种有效的特征工程手段

利用 GBDT+LR 融合的方案有很多好处，利用 GDBT 主要是发掘有区分度的特征和特征组合：

LR 模型无法实现特征组合，但是模型中特征组合很关键，依靠人工经验非常耗时而且不一定能有好的效果。利用 GBDT 可以自动发现有效的特征、特征组合。GBDT 每次迭代都是在减少残差的梯度方向上面新建一棵决策树，GBDT 的每个叶子结点对应一个路径，也就是一种组合方式。

为进一步理解GBDT-LR的运行过程，同时排查编程实现的错误。这里对整个过程中的数据形态变化进行检视。数据的处理流程为：

原始特征数据 -> 标签编码 -> GBDT -> 独热编码 -> LR -> 输出。

xgboost和LR模型级联

1
2
3
4
5


更新：：：经过实践检验后，我们发现xgb+lr相比xgb几乎没有提升。xgb不像DNN，其本身的特征交叉能力是有限的，用xgb作为lr的特征交叉部分是远远不够的，所以还是需要在LR这边做大量的人工特征交叉设计，我们当时并没有一个实践检验过很好的单LR模型，因为我们没有时间和人力做精细的人工特征交叉设计。

其实当人工的特征设计足够优秀的时候，特征维度很丰满的单LR已经很强了，xgb那点叶子结点的影响力其实不大。一个好的单LR模型其实根本不太需要xgb这点交叉，一个单xgb模型本身也足够优秀，级联一个垃圾LR不如不级联。 Facebook在2014年的思路是没错的，xgb能够很好的处理连续性型特征，LR来补齐xgb对离散类特征信息的盲区。但是工程上如果想有提升，不可避免还是要做特征工程，要踩得坑太多而且收益甚微，有这个精力不如搞深度学习。

想要有质变的话建议直接上深度学习，Wide&Deep用Deep部分代替xgb，DNN的特征交叉能力和信息量要比xgb大得多。

xgboost 确实能够处理连续特征，但是在特征交叉方面，还是有不足的，所以要么是特征工程，增强对离散特征LR。要么是深度网络， DNN（或者transformer）的特征交叉能力和信息量要比xgb大得多。 xgboost也就是做了特征筛选和特征交叉。人工特征工程做的够好，LR是可以俯瞰众生，但是考虑到花费的人力物力，还是选模型级联吧！特征工程慢慢做= =

LR是广义线性模型，能够并行化处理大量亿万级特征的训练样本。LR模型是只使用离散型特征的，连续型特征不是不能用，而是用了以后效果不好，对模型有负影响。LR快速高效，完全可以处理高纬度稀疏的离散化特征。

文章目录

lightGBM调参(常用参数)

Advantages of LightGBM

lightGBM调参(常用参数)

referrence

XGBoost

Advantage of XGBoost

XGBoost Parameters

general parameters

booster parameters

learning task parameters

Control Overfitting

XGBoost 和 GBDT 区别

LightGBM 和 XGBoost 的一些区别

xgboost + lr