This post covers data preprocessing in machine learning (numerical data rather than Chinese text preprocessing), including data cleaning, data integration, data transformation, data reduction, imbalanced data, and a few related concepts. Data normalization is used especially often.

When preprocessing data, pay attention to missing values and noisy data (outliers); for both, split the problem into two questions: whether to handle them and how.

Data cleaning (handling missing values and outliers), data integration (merging sub-tables into the main table), data transformation (one-hot or label encoding, discretizing continuous values), and data reduction (a topic that could fill a chapter of its own): these steps should be second nature.

Data Cleaning

This step mainly deals with missing values and noisy data (outliers).

Before you start handling missing values it is important to identify them and know which value they are replaced with. You should be able to find this out by combining the metadata information with exploratory analysis. If values are missing completely at random, they don't give any extra information and can be omitted. On the other hand, if they're not missing at random, the fact that a value is missing is itself information and can be expressed as an extra binary feature.

For missing values, there are again two questions: whether to handle them and how. Having answered the first above, we now turn to how to handle missing values. First, note that some algorithms and libraries, such as XGBoost, can handle missing values natively. The concrete options are:

  • ignore the tuple;
  • fill in the missing value manually;
  • use a global constant to fill in the missing value;
  • use the attribute mean to fill in the missing value;
  • use the most probable value, e.g. the mode, to fill in the missing value;
  • fit a simple regression model on a few other features and use it to predict the missing value.

If I answer "it depends" when asked how to choose among these methods, nobody finds that convincing; people want to know what it depends on, and when and why to use a specific method. In my view the choice should be made empirically, based on how much is missing and how important the feature is; call it an empirical study.

If you choose an imputation strategy, there are several forms it can take:

For filling up missing values with common strategies, sklearn provides a SimpleImputer. The four main strategies are mean, most_frequent, median and constant (don’t forget to set the fill_value parameter).
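
A minimal SimpleImputer sketch with two of those strategies (the toy column and fill value are made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0], [np.nan], [3.0], [5.0]])   # toy column with one missing value

# mean / median / most_frequent / constant are the four built-in strategies
imp_mean = SimpleImputer(strategy='mean')
print(imp_mean.fit_transform(X))                # NaN -> 5.0, the mean of 7, 3, 5

imp_const = SimpleImputer(strategy='constant', fill_value=-1)
print(imp_const.fit_transform(X))               # NaN -> -1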

Besides sklearn's built-in imputer, the following approaches can also be used:

Other popular ways to impute missing data are clustering the data with the k-nearest neighbor (KNN) algorithm or interpolating the values using a wide range of interpolation methods. Both techniques are not implemented in sklearn’s preprocessing library and won’t be discussed here.
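
For completeness, newer sklearn versions do ship a KNNImputer (in sklearn.impute rather than the preprocessing module), and pandas offers a range of interpolation methods; a small sketch on made-up data:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer   # available in sklearn >= 0.22

X = pd.DataFrame({'f1': [1.0, 2.0, np.nan, 4.0],
                  'f2': [10.0, 20.0, 30.0, 40.0]})

# KNN imputation: the missing f1 value is filled from its 2 nearest rows
print(KNNImputer(n_neighbors=2).fit_transform(X))

# Interpolation along the index, one of pandas' many interpolation methods
print(X.interpolate(method='linear'))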

Next comes noisy data (outliers). My view is that you first need to recognize these as erroneous values (screened out, for example, with the $3\sigma$ rule) rather than genuine observations; they may come from human typos or recording-instrument problems, and they need to be corrected. Clustering can also be used to detect noisy data. Once detected, the problem becomes similar to missing values and the strategies above apply. Note that noisy data should only make up a small fraction of the data, otherwise it would be the dominant distribution.
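
A minimal sketch of the $3\sigma$ rule on a single column, with made-up data; flagged values are then treated like missing values:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(10, 1.5, size=30), 500))   # 500 should be flagged as an outlier

mu, sigma = s.mean(), s.std()
is_outlier = (s - mu).abs() > 3 * sigma       # the 3-sigma rule
s_clean = s.mask(is_outlier)                  # flagged values become NaN ...
print(s_clean.fillna(s_clean.mean()))         # ... and can be imputed like missing values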

Data Integration

When working with database data, you often need to handle sub-table information, which implies a main table; the sub-table usually refines some aspect of the main table, so the two need to be joined. How? Typically, aggregate the sub-table with functions such as mean, variance, std, max, min, median and mode to obtain per-key statistics, and then join the result onto the main table.
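
A minimal pandas sketch of this aggregate-then-join pattern; the table and column names are made up:

import pandas as pd

main = pd.DataFrame({'client_id': [1, 2, 3]})
sub = pd.DataFrame({'client_id': [1, 1, 2, 2, 3],
                    'loan_amount': [100, 200, 50, 70, 300]})

# Aggregate the sub-table per key, then join the statistics back onto the main table
stats = (sub.groupby('client_id')['loan_amount']
            .agg(['mean', 'std', 'max', 'min', 'median'])
            .add_prefix('loan_amount_')
            .reset_index())
main = main.merge(stats, on='client_id', how='left')
print(main)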

Polynomial features


Polynomial features are often created when we want to include the notion that there exists a nonlinear relationship between the features and the target. They are mostly used to add complexity to linear models with few features, or when we suspect the effect of one feature is dependent on another feature.
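
A small sketch with sklearn's PolynomialFeatures on made-up data; degree=2 adds the squared and interaction terms:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [1.0, 5.0]])

# degree=2 produces 1, x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))
print(poly.get_feature_names_out())   # feature names (sklearn >= 1.0)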

Categorical features

Unfortunately, sklearn's machine learning library does not support handling categorical data. Even for tree-based models, it is necessary to convert categorical features to a numerical representation.

Both nominal and ordinal data refer to categorical data: if the categories have no natural ordering the data is nominal; if they do, it is ordinal.

An ordinal feature is best described as a feature with natural, ordered categories where the distances between the categories are not known.

The key point for ordinal data is that it can be converted into numbers that preserve the order relation, e.g. with sklearn's OrdinalEncoder:

from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
X.edu_level = encoder.fit_transform(X.edu_level.values.reshape(-1, 1))

The code above cannot handle missing values; the following can be used instead:

cat = pd.Categorical(X.edu_level,
                     categories=['missing', 'low',
                                 'medium', 'high'],
                     ordered=True)
cat = cat.fillna('missing')            # fillna returns a new Categorical; reassign it
labels, unique = pd.factorize(cat, sort=True)
X.edu_level = labels

Alternative ways to handle a missing ordinal value are to put it in the most common category, or to put it in the category of the middle value when the feature is sorted.

For unordered data (nominal features), the most commonly used method is one-hot encoding.

The most popular way to encode nominal features is one-hot-encoding. Essentially, each categorical feature with n categories is transformed into n binary features.
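
A small sketch of one-hot encoding with pandas and sklearn; the column name and categories are made up (OneHotEncoder's sparse_output argument requires sklearn >= 1.2, older versions use sparse instead):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# pandas: each of the n categories becomes a binary column
print(pd.get_dummies(X, columns=['color']))

# sklearn: same idea, as a reusable transformer
enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
print(enc.fit_transform(X[['color']]))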

Numerical features

Just like categorical data can be encoded, numerical features can be 'decoded' into categorical features. The two most common ways to do this are discretization and binarization.

Discretization

Discretization, also known as quantization or binning, divides a continuous feature into a pre-specified number of categories (bins), and thus makes the data discrete.
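
sklearn implements this as KBinsDiscretizer; a small sketch with made-up data:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[18], [25], [37], [52], [70]])

# 3 equal-width bins, encoded as ordinal integers 0, 1, 2
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
print(disc.fit_transform(ages))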

Binarization

Feature binarization is the process of thresholding numerical features to get boolean values. In other words, assign a boolean value (True or False) to each sample based on a threshold. Note that binarization is an extreme form of two-bin discretization.

In general binarization is useful as a feature engineering technique for creating new features that indicate something meaningful, just like sklearn's MissingIndicator is used to mark meaningful missing values.
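
A small sketch of sklearn's Binarizer; the threshold and data are made up:

import numpy as np
from sklearn.preprocessing import Binarizer

income = np.array([[20_000], [55_000], [120_000]])

# 1 if the value is above the threshold, 0 otherwise
print(Binarizer(threshold=50_000).fit_transform(income))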

Data Transformation

In data transformation, the data are transformed or consolidated into forms appropriate for mining. To be clear, much of the same content can be expressed in different ways and placed at different stages of the pipeline, and this work is not done in one pass but iteratively, until you run out of patience and time. The most common case I run into is converting discrete variables into continuous (numerical) variables. Tree-based models can of course handle discrete variables directly; the point here is simply that this option exists. The usual techniques for this transformation are one-hot or label encoding.

Under this definition of data transformation, normalization also belongs in this section. Whether in machine learning or image processing, raw data is routinely normalized. On one hand, it helps prevent vanishing or exploding gradients if you use a sigmoid activation; on the other hand, and I think this is the more important reason, normalization puts different features on the same scale.

Nonlinear transformations

  1. Mapping to a uniform distribution. By applying a rank transformation, unusual distributions are smoothed out, and the result is less influenced by outliers than with the scaler methods. However, it does distort correlations and distances both within and across features.

Code example

{% fold open/close %}

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
                                   'spider', 'snake'],
                        'Number_legs': [4, 2, 4, 8, np.nan]})

df['default_rank'] = df['Number_legs'].rank()
df['max_rank'] = df['Number_legs'].rank(method='max')
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
df['pct_rank'] = df['Number_legs'].rank(pct=True)
df

{% endfold %}

  1. MinMaxScaler

How it is computed

# min and max below are the bounds of feature_range (default 0 and 1)
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

When to use it

The MinMaxScaler transforms features by scaling each feature to a given range. This range can be set by specifying the feature_range parameter (default at (0,1)). This scaler works better for cases where the distribution is not Gaussian or the standard deviation is very small. However, it is sensitive to outliers, so if there are outliers in the data, you might want to consider another scaler.

In sklearn the range can be specified with feature_range:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-3,3))
scaler.fit_transform(X.f3.values.reshape(-1, 1))

Data Reduction

The necessity of data reduction is rarely discussed; if a reason must be given, think of it in terms of time and space (computation and storage). What deserves more attention is how to do it.

My understanding is that reduction can be considered along two axes. Suppose a matrix A is m*n; being a two-dimensional matrix, we can work on either the rows or the columns. In machine-learning terms these are usually described from the dimension angle and the data angle, called dimension reduction and data compression respectively: the former selects features, the latter reduces the size of the data. Dimension reduction: irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed. Data compression: for example PCA, a linear dimensionality-reduction technique, to reduce the data set size.
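
A minimal PCA sketch on made-up data, keeping the 3 directions with the most variance out of 10 features:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 samples, 10 features

pca = PCA(n_components=3)               # keep the 3 components with the most variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 3)
print(pca.explained_variance_ratio_)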

Data Standardization

**What is it?**

Centering: shift the data so its mean is 0; there is no requirement on the variance.

$$ x ^ { \prime } = x - \mu $$

Standardization: the result follows the standard normal distribution N(0, 1).

$$ x ^ { \prime } = \frac { x - \overline { x } } { \sigma } $$

Sklearn's main scaler, the StandardScaler, uses a strict definition of standardization: it centers the data and scales it with the formula above, subtracting the mean and dividing by the standard deviation, so that the values end up concentrated around zero with unit spread.
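
A small sketch of StandardScaler on made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()               # (x - mean) / std, column by column
X_std = scaler.fit_transform(X)
print(X_std.mean(), X_std.std())        # approximately 0 and 1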

There are two kinds of normalization: mean normalization and min-max normalization.

mean normalization $$ x ^ { \prime } = \frac { x - \operatorname { mean } ( x ) } { \max ( x ) - \min ( x ) } $$

min-max normalization

$$ x ^ { \prime } = \frac { x - \operatorname { min } ( x ) } { \max ( x ) - \min ( x ) } $$

When to use which scaler

The MinMaxScaler works well when the distribution is not Gaussian or the standard deviation is very small, but it is sensitive to outliers, since the min and max values appear directly in the formula.

MaxAbs Scaler

$$ x ^ { \prime } = \frac { x } { \max ( \left| x \right| ) } $$

The MaxAbsScaler works very similarly to the MinMaxScaler but automatically scales the data to a [-1,1] range based on the absolute maximum. This scaler is meant for data that is already centered at zero or sparse data. It does not shift/center the data, and thus does not destroy any sparsity.

Robust Scaler

If your data contains many outliers, scaling using the mean and standard deviation of the data is likely to not work very well. In these cases, you can use the RobustScaler. It removes the median and scales the data according to the quantile range.
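
A small sketch contrasting RobustScaler (median and interquartile range) with StandardScaler on made-up data containing one outlier:

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

# (x - median) / IQR, so the outlier barely distorts the other rows
print(RobustScaler().fit_transform(X).ravel())
print(StandardScaler().fit_transform(X).ravel())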

Why?

  • It improves model accuracy. For example, with two features, one ranging from 0 to 100 and the other from -2000 to 2000, a Euclidean-distance computation is dominated by the feature with the larger range; the features are not being compared on the same scale.

  • It speeds up training. With the same example, when features x1 and x2 have very different ranges, the learning rate has to cope with both scales and the updates oscillate, so training takes longer.

  • In deep learning, normalizing the input data helps prevent exploding gradients; the sigmoid activation is the standard example.

How? The definitions above already describe how.

When it applies

Probabilistic models do not need normalization, because they do not care about the values of the variables but about their distributions and the conditional probabilities between them, e.g. decision trees and random forests. Optimization-based models such as AdaBoost, SVM, LR, KNN and KMeans do need normalization.

Mapping data to N(0,1) is one such normalization. Feature scaling is the method to limit the range of variables so that they can be compared on common grounds. There are a few main reasons:

  • Because most of the Machine Learning models are based on Euclidean Distance.

    Age: 40 and 27; Salary: 72000 and 48000

    The gaps between these two features are on very different scales, which is not what we want; we care about relative differences, not absolute ones.

  • Even when the final loss function is not based on Euclidean distance, for example a decision tree, in practice training on normalized data is faster than training on unnormalized data.

  • After normalization, the data is less prone to vanishing or exploding gradients.

  • Many models assume the data follows an N(0,1) Gaussian distribution.

Three common ways to implement it:

  • rescaling (min-max normalization)

$$ x ^ { \prime } = \frac { x - \min ( x ) } { \max ( x ) - \min ( x ) } $$

  • mean normalization

$$ x ^ { \prime } = \frac { x - \operatorname { average } ( x ) } { \max ( x ) - \min ( x ) } $$

  • standardization

$$ x ^ { \prime } = \frac { x - \overline { x } } { \sigma } $$

Data Normalization

Normalization is the process of scaling individual samples to have unit norm. In basic terms, you need to normalize data when the algorithm predicts based on the weighted relationships formed between data points. Scaling inputs to unit norms is a common operation for text classification or clustering.

One of the key differences between scaling (e.g. standardizing) and normalizing is that normalizing is a row-wise operation, while scaling is a column-wise operation.

(1) max

The max norm uses the absolute maximum and does for samples what the MaxAbsScaler does for features.

$$ x ^ { \prime } = \frac { x } { \max ( \left| x \right| ) } $$

(2) l1

The l1 norm uses the sum of the absolute values and thus gives equal penalty to all parameters, enforcing sparsity.

$$ x ^ { \prime } = \frac { x } { \sum \left| x \right| } $$

(3) l2

The l2 norm uses the square root of the sum of all the squared values. This creates smoothness and rotational invariance. Some models, like PCA, assume rotational invariance, and so l2 will perform better.

import numpy as np
x_normalized = x / np.sqrt(np.sum(x ** 2))   # divide the sample x by its l2 norm
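
The same three norms are available row-wise through sklearn's normalize function (or the Normalizer transformer); a small sketch with made-up data:

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 2.0]])

print(normalize(X, norm='l2'))   # each row gets unit l2 norm, e.g. [0.6, 0.8]
print(normalize(X, norm='l1'))   # each row sums (in absolute value) to 1
print(normalize(X, norm='max'))  # each row divided by its maximum absolute value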

Imbalanced Data

There is real craft to feature engineering in machine learning, and the part I find most interesting is generation, or you could call it abstraction. Extracting more general, derived features is what really reflects an understanding of the problem and of the features; it requires not only experience but also domain knowledge of the problem area, that's all. For example, in the "Home Credit Default Risk" Kaggle competition, the raw training data contains the credit amount and the client's annual income, and "credit_income_percent" (credit amount relative to income) is exactly this kind of derived feature.

Data Exploration using NumPy, Matplotlib and Pandas

(1) Histogram

A histogram needs two inputs: the data, and the number of bins to split it into.

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_excel("E:/First.xlsx", "Sheet1")
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.hist(df['Age'], bins=5)

# Labels and title
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('#Employee')
plt.show()

(2) Scatter plot

Two series of data are needed: the $x$ values and the $y$ values.

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# Variables to plot
ax.scatter(df['Age'], df['Sales'])
# Labels and title
plt.title('Sales and Age distribution')
plt.xlabel('Age')
plt.ylabel('Sales')
plt.show()

(3) Box plot

The box plot uses seaborn. The main things to read off the plot are the quartile lines and the median.

import seaborn as sns 
sns.boxplot(df['Age']) 
sns.despine()

Feature scaling

The next logical step in our preprocessing pipeline is to scale our features. Before applying any scaling transformations it is very important to split your data into a train set and a test set.
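
A minimal sketch of that pattern on made-up data: the scaler is fitted on the training set only and then reused on the test set:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))
y = np.random.default_rng(1).integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # statistics learned from the training set only
X_test = scaler.transform(X_test)         # ... then applied unchanged to the test set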

Standardization

Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing (non-constant) features by their standard deviation.

Depending on your needs and data, sklearn provides a bunch of scalers: StandardScaler, MinMaxScaler, MaxAbsScaler and RobustScaler.

References

1) Ultimate guide for Data Exploration in Python using NumPy, Matplotlib and Pandas