Ensemble learning can be divided into basic ensemble methods, advanced ensemble methods, and algorithms based on ensemble learning.

Ensemble models in machine learning operate on a similar idea. They combine the decisions from multiple models to improve the overall performance. This can be achieved in various ways, which you will discover in this article.

Basic Ensemble Methods

(1) Max Voting

Max voting is a majority-voting mechanism: the final prediction for each sample is the majority (mode) of all models' predictions. It is commonly used for classification problems.

import numpy as np
from statistics import mode
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# x_train, y_train, x_test are assumed to be an existing train/test split
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)

# the final prediction for each sample is the majority vote of the three models
final_pred = np.array([])
for i in range(len(x_test)):
    final_pred = np.append(final_pred, mode([pred1[i], pred2[i], pred3[i]]))

(2) Averaging

Averaging can be used for making predictions in regression problems or for calculating probabilities in classification problems: for regression, average the predicted values; for classification, average the predicted class probabilities.

# imports and the train/test split are the same as in the Max Voting example
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)

# simple average of the three models' predicted class probabilities
finalpred = (pred1 + pred2 + pred3) / 3

(3) Weighted Averaging

The idea is similar to simple averaging, except that each model is given a weight based on metrics such as its accuracy: the better a model performs, the higher its weight.

# imports and the train/test split are the same as in the Max Voting example
model1 = tree.DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()

model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)

# weighted average; the weights (0.3, 0.3, 0.4) reflect model quality and sum to 1
finalpred = (pred1 * 0.3 + pred2 * 0.3 + pred3 * 0.4)

Advanced Ensemble Methods

This raises the question of how these models should be combined. There are three main "meta-algorithms" designed to combine weak learners:

  • bagging, which usually considers homogeneous weak learners, trains them independently of each other in parallel, and combines them through some deterministic averaging process.
  • boosting, which also usually considers homogeneous weak learners, trains them sequentially in a highly adaptive way (each base model depends on the previous ones), and combines them according to some deterministic strategy.
  • stacking, which usually considers heterogeneous weak learners, trains them in parallel, and combines them by training a "meta-model" that outputs a final prediction based on the predictions of the different weak models.

(1) Bagging

Idea: training homogeneous models on the same dataset will very likely produce the same results, which is why bagging uses a sampling technique called bootstrapping.

Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement.

Algorithm steps (a minimal code sketch follows the list):

  1. Randomly sample, with replacement, a subset $M$ ($|M| < |S|$) from the training set $S$;
  2. Train a classifier $C$ on it;
  3. Repeat the above steps $m$ times to obtain $m$ classifiers $C_1, C_2, \dots, C_m$;
  4. For classification, each model votes and the majority wins; for regression, take the average.
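
The sketch below is a minimal illustration of these steps using scikit-learn's BaggingClassifier (the variables x_train, y_train, x_test, y_test follow the earlier examples, and by default the base learner is a decision tree); the parameter values are illustrative only.

from sklearn.ensemble import BaggingClassifier

# m bootstrapped base models, each trained on a subset M sampled with replacement from S
bag = BaggingClassifier(
    n_estimators=10,   # m: number of base models
    max_samples=0.8,   # each subset M is smaller than the full training set S
    bootstrap=True,    # sample with replacement
    random_state=1,
)
bag.fit(x_train, y_train)
bag.score(x_test, y_test)  # classification: majority vote across the base models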

(2) Boosting

Idea: if the individual models themselves are not very accurate, simply combining many of them is not necessarily better. Boosting therefore trains the current model specifically on the errors made by the previous model.

(3) Stacking

Two-layer stacking is used here for illustration.

Algorithm steps:

  1. The train set is split into 10 parts.
  2. A base model (say, a decision tree) is fitted on 9 parts and predictions are made for the 10th part. This is done for each part of the train set.
  3. The base model (in this case, the decision tree) is then fitted on the whole train dataset.
  4. Using this model, predictions are made on the test set.
  5. Steps 2 to 4 are repeated for another base model (say, knn), resulting in another set of predictions for the train set and test set.
  6. The predictions from the train set are used as features to build a new model.
  7. This model is used to make final predictions on the test prediction set.

The key step is to use the outputs of the first-layer models as the inputs of the second layer. A code example is given below.

First, define a generic helper function.

from sklearn.model_selection import StratifiedKFold

def Stacking(model, train, y, test, n_fold):
    folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=1)
    test_pred = np.zeros((test.shape[0], n_fold))
    train_pred = np.zeros(train.shape[0])
    for i, (train_indices, val_indices) in enumerate(folds.split(train, y.values)):
        x_train, x_val = train.iloc[train_indices], train.iloc[val_indices]
        y_train, y_val = y.iloc[train_indices], y.iloc[val_indices]

        model.fit(X=x_train, y=y_train)
        # out-of-fold predictions, stored in the original row order,
        # become the meta-features for the train set
        train_pred[val_indices] = model.predict(x_val)
        # test-set predictions are made in every fold and averaged afterwards
        test_pred[:, i] = model.predict(test)
    return test_pred.mean(axis=1).reshape(-1, 1), train_pred

The first layer has two base models: a decision tree and k-nearest neighbours.

import pandas as pd

model1 = tree.DecisionTreeClassifier(random_state=1)
test_pred1, train_pred1 = Stacking(model=model1, n_fold=10, train=x_train, test=x_test, y=y_train)
train_pred1 = pd.DataFrame(train_pred1)
test_pred1 = pd.DataFrame(test_pred1)
model2 = KNeighborsClassifier()
test_pred2, train_pred2 = Stacking(model=model2, n_fold=10, train=x_train, test=x_test, y=y_train)
train_pred2 = pd.DataFrame(train_pred2)
test_pred2 = pd.DataFrame(test_pred2)

The two first-layer models produce predictions, which are concatenated to form the input dataset of the second-layer model; the second-layer model's labels are still the labels of the original training set.

df = pd.concat([train_pred1, train_pred2], axis=1)       # the base models' out-of-fold predictions become the training inputs
df_test = pd.concat([test_pred1, test_pred2], axis=1)    # apply the same transformation to the test set so the distributions match

model = LogisticRegression(random_state=1)
model.fit(df, y_train)        # the labels are still the original training labels used by the first-layer models
model.score(df_test, y_test)  # final score of the stacked model

Making predictions on the test set requires passing through both layers: the test data is first fed into the first-layer models (the decision tree and k-nearest neighbours), and their outputs are then fed into the second-layer logistic regression model, whose predictions are the final result.

(For comparison, boosting combines models differently: in each round, every sample's weight is dynamically adjusted according to the previous round's classification results, producing k weak classifiers, each with its own weight; the final classification result is a weighted combination of all the base models' predictions.)

(4) Blending

Compared with stacking, blending trains the base models on the training set, makes predictions on a held-out validation set and on the test set, and then uses the validation-set and test-set predictions as features for learning the next-level model.

Algorithm steps:

  1. The train set is split into a training set and a validation set.
  2. Model(s) are fitted on the training set.
  3. Predictions are made on the validation set and the test set.
  4. The validation set and its predictions are used as features to build a new model.
  5. This model is used to make the final predictions on the test set and its meta-features.

Sample code is given below.

# x_train / y_train and x_val / y_val come from splitting the original training set
model1 = tree.DecisionTreeClassifier()
model1.fit(x_train, y_train)
val_pred1 = model1.predict(x_val)
test_pred1 = model1.predict(x_test)
val_pred1 = pd.DataFrame(val_pred1)
test_pred1 = pd.DataFrame(test_pred1)

model2 = KNeighborsClassifier()
model2.fit(x_train, y_train)
val_pred2 = model2.predict(x_val)
test_pred2 = model2.predict(x_test)
val_pred2 = pd.DataFrame(val_pred2)
test_pred2 = pd.DataFrame(test_pred2)

The second layer uses logistic regression and makes predictions on the test set.

# combine the original validation/test features with the base models' predictions;
# reset_index keeps the rows aligned with the prediction DataFrames
df_val = pd.concat([x_val.reset_index(drop=True), val_pred1, val_pred2], axis=1)
df_test = pd.concat([x_test.reset_index(drop=True), test_pred1, test_pred2], axis=1)

model = LogisticRegression()
model.fit(df_val, y_val)
model.score(df_test, y_test)

Algorithms Based on Ensemble Learning

Boosting and bagging are the two most commonly used families of ensemble algorithms. The representative bagging algorithm is the random forest. Boosting algorithms include the following:

  • AdaBoost
  • GBM
  • XGBM
  • Light GBM
  • CatBoost

(1) Random Forest

Algorithm steps for Random Forest:

  1. Random subsets are created from the original dataset (bootstrapping).
  2. At each node in the decision tree, only a random set of features is considered to decide the best split.
  3. A decision tree model is fitted on each of the subsets.
  4. The final prediction is calculated by averaging the predictions from all decision trees.

In summary, a random forest randomly selects both data points and features, and combines many trees into an ensemble (a forest).

Common hyperparameters (a minimal sketch using them follows this list):

1). n_estimators (the number of trees)

  • It defines the number of decision trees to be created in a random forest.
  • Generally, a higher number makes the predictions stronger and more stable, but a very large number can result in higher training time.

2). max_features (the maximum number of features considered for each split)

  • It defines the maximum number of features allowed for the split in each decision tree.
  • Increasing max_features usually improves performance, but a very high value can reduce the diversity of the individual trees.

3). max_depth (the maximum depth of each tree)

  • The maximum depth of the tree.

4). min_samples_leaf

  • This defines the minimum number of samples required to be at a leaf node.
  • Smaller leaf sizes make the model more prone to capturing noise in the training data.
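
The following minimal sketch shows how these hyperparameters map onto scikit-learn's RandomForestClassifier; the data variables follow the earlier examples and the specific values are only illustrative.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features='sqrt',   # features considered at each split
    max_depth=8,           # maximum depth of each tree
    min_samples_leaf=5,    # minimum samples required at a leaf node
    random_state=1,
)
rf.fit(x_train, y_train)
rf.score(x_test, y_test)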

(2) AdaBoost

Adaptive boosting, or AdaBoost, is one of the simplest boosting algorithms. Usually, decision trees are used for modelling. Multiple sequential models are created, each correcting the errors of the previous one. AdaBoost assigns higher weights to the observations that were incorrectly predicted, and the subsequent model works to predict these values correctly. In other words, the boosting strategy focuses on the misclassified samples by giving them larger weights, so that the next model pays more attention to them.
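
A minimal sketch with scikit-learn's AdaBoostClassifier, which by default uses shallow decision trees as the weak learners; variable names follow the earlier examples and the parameter values are illustrative.

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
    n_estimators=100,    # number of sequential weak learners
    learning_rate=1.0,   # shrinks the contribution of each learner
    random_state=1,
)
ada.fit(x_train, y_train)
ada.score(x_test, y_test)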

(3) Gradient Boosting (GBM)

Gradient Boosting, or GBM, is another ensemble machine learning algorithm that works for both regression and classification problems. GBM uses the boosting technique, combining a number of weak learners to form a strong learner. Regression trees are used as the base learner, and each subsequent tree in the series is built on the errors calculated by the previous tree. The boosting strategy here is framed in terms of gradients: decision trees are fitted to reduce the loss, thereby improving the model's predictive power.
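
A minimal sketch with scikit-learn's GradientBoostingClassifier, where each new tree is fitted to the errors (negative gradients) of the current ensemble; the parameter values are illustrative.

from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=100,    # number of boosting stages (trees)
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    max_depth=3,         # depth of each regression tree
    random_state=1,
)
gbm.fit(x_train, y_train)
gbm.score(x_test, y_test)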

Pros

  1. On dense datasets, GBDT has strong generalization and expressive power, which is why it frequently tops the leaderboards in many Kaggle competitions.

Cons

  1. On high-dimensional sparse datasets, GBDT performs worse than support vector machines or neural networks. (Tree models of this kind are well suited to continuous numerical features; GBDT is not recommended for one-hot encoded features.)
  2. When handling text classification features, GBDT's advantage over other models is less pronounced than when handling numerical features.

(4) XGBoost

XGBoost has high predictive power and is almost 10 times faster than other gradient boosting techniques. It also includes a variety of regularization techniques that reduce overfitting and improve overall performance; hence it is also known as a 'regularized boosting' technique. In short: faster training, plus regularization to handle overfitting.

It mainly relies on the following techniques (a minimal code sketch follows the list):

1). Regularization:

  • Standard GBM implementations have no regularization, unlike XGBoost.
  • Thus XGBoost also helps to reduce overfitting.

2). Parallel Processing:

  • XGBoost implements parallel processing and is faster than GBM.
  • XGBoost also supports implementation on Hadoop.

3). High Flexibility:

  • XGBoost allows users to define custom optimization objectives and evaluation criteria, adding a whole new dimension to the model.

4). Handling Missing Values:

  • XGBoost has an in-built routine to handle missing values.

5). Tree Pruning:

  • XGBoost makes splits up to the specified max_depth, then prunes the tree backwards, removing splits beyond which there is no positive gain.

6). Built-in Cross-Validation:

  • XGBoost allows the user to run cross-validation at each iteration of the boosting process, making it easy to obtain the exact optimum number of boosting iterations in a single run.
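
A minimal sketch using the xgboost package's scikit-learn interface (assuming the xgboost package is installed); reg_lambda and reg_alpha correspond to the regularization discussed above, n_jobs enables parallel tree construction, and the values are illustrative.

import xgboost as xgb

xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=4,
    reg_lambda=1.0,   # L2 regularization
    reg_alpha=0.0,    # L1 regularization
    n_jobs=-1,        # parallel processing
    random_state=1,
)
xgb_model.fit(x_train, y_train)
xgb_model.score(x_test, y_test)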

Relationship and Differences Between GBDT and XGBoost

  1. GBDT is a machine learning algorithm, while XGBoost is an engineering implementation of it.
  2. When CART is used as the base classifier, XGBoost explicitly adds a regularization term to control model complexity, which helps prevent overfitting and improves generalization.
  3. GBDT uses only the first-order derivative of the cost function during training, whereas XGBoost performs a second-order Taylor expansion of the cost function and can use both first- and second-order derivatives.
  4. Traditional GBDT uses CART as the base classifier, whereas XGBoost supports multiple types of base classifiers, such as linear classifiers.
  5. Traditional GBDT uses all of the data in every iteration, whereas XGBoost adopts a strategy similar to random forests and supports sampling the data.
  6. Traditional GBDT has no built-in handling of missing values, whereas XGBoost can automatically learn a strategy for handling them.

(5) Light GBM

Light GBM beats the other algorithms when the dataset is extremely large. Compared to the other algorithms, Light GBM takes less time to run on a huge dataset. LightGBM is a gradient boosting framework that uses tree-based algorithms and follows a leaf-wise approach, while other algorithms grow trees level-wise.

In effect, it grows trees leaf-wise rather than level-wise.
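
A minimal sketch using the lightgbm package (assuming it is installed); num_leaves is the main knob for the leaf-wise growth described above, and the values are illustrative.

import lightgbm as lgb

lgb_model = lgb.LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,   # leaf-wise growth: caps the number of leaves per tree
    random_state=1,
)
lgb_model.fit(x_train, y_train)
lgb_model.score(x_test, y_test)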

(6) CatBoost

Handling categorical variables is a tedious process, especially when you have a large number of such variables. When your categorical variables have too many labels (i.e. they are highly cardinal), performing one-hot-encoding on them exponentially increases the dimensionality and it becomes really difficult to work with the dataset. CatBoost can automatically deal with categorical variables and does not require extensive data preprocessing like other machine learning algorithms. Here is an article that explains CatBoost in detail.

As the name suggests, on datasets with many categorical features CatBoost can be used directly, without extra preprocessing.
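
A minimal sketch using the catboost package (assuming it is installed); the column names in cat_features are hypothetical placeholders for whatever categorical columns the dataset actually contains.

from catboost import CatBoostClassifier

cat_model = CatBoostClassifier(iterations=200, learning_rate=0.1, verbose=0)
cat_features = ['gender', 'city']  # hypothetical categorical column names
cat_model.fit(x_train, y_train, cat_features=cat_features)  # CatBoost encodes these columns internally
cat_model.score(x_test, y_test)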

Code Implementations

1). A Comprehensive Guide to Ensemble Learning (with Python codes): a comprehensive walkthrough of ensemble learning.
2). How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python With Keras: builds a stacking ensemble of CNN networks.
3). How to Implement Stacked Generalization (Stacking) From Scratch With Python: uses plain Python only, without other machine learning libraries.