Standardization, Normalization, Regularization, Discretization and Whitening of Python Machine Learning

  • 2021-10-27 07:54:35
  • OfStack

Directory

  • 1 Standardization
  • 2 Normalization
  • 3 Regularization
  • 4 Discretization
  • 5 Whitening

The essence of machine learning is to discover the intrinsic features of a data set, features that are often concealed by external characteristics such as sample scale and distribution range. Data preprocessing is the series of operations that helps a machine learning model or algorithm find those inherent characteristics to the greatest possible extent; it mainly includes standardization, normalization, regularization, discretization and whitening.

1 Standardization

Suppose the sample set consists of points on a 2-dimensional plane, with the abscissa x distributed over the interval [0, 100] and the ordinate y distributed over the interval [0, 1]. The dynamic ranges of the x and y feature columns obviously differ greatly, and so does their influence on machine learning models such as k-nearest neighbors or k-means clustering. The purpose of standardization is to prevent a feature column with an overly large dynamic range from dominating the calculation, and it can also improve model accuracy. The essence of standardization is to subtract the mean of each feature column of the sample set and then divide by its standard deviation.
The preprocessing sub-module preprocessing of Scikit-learn provides the fast standardization function scale(), which directly returns the standardized data set. The code is as follows.


>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> d = np.array([[ 1., -5., 8.], [ 2., -3., 0.], [ 0., -1., 1.]])
>>> d_scaled = pp.scale(d)  # Standardize the data set d
>>> d_scaled
array([[ 0.        , -1.22474487,  1.40487872],
       [ 1.22474487,  0.        , -0.84292723],
       [-1.22474487,  1.22474487, -0.56195149]])
>>> d_scaled.mean(axis=0)  # In the standardized data set, the mean of each feature column is 0
array([0., 0., 0.])
>>> d_scaled.std(axis=0)  # In the standardized data set, the standard deviation of each feature column is 1
array([1., 1., 1.])
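
As a quick cross-check, the same result can be reproduced by hand with NumPy, since scale() simply subtracts each column's mean and divides by its (population) standard deviation:


>>> (d - d.mean(axis=0)) / d.std(axis=0)  # Manual standardization, identical to pp.scale(d)
array([[ 0.        , -1.22474487,  1.40487872],
       [ 1.22474487,  0.        , -0.84292723],
       [-1.22474487,  1.22474487, -0.56195149]])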

The preprocessing sub-module also provides the utility class StandardScaler, which stores the mean and standard deviation of each feature column of the training set so that the same transformation can later be applied to the test set. In addition, StandardScaler lets you choose whether to center the data and whether to scale to unit standard deviation through the with_mean and with_std parameters, as follows.


>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> X_train = np.array([[ 1., -5., 8.], [ 2., -3., 0.], [ 0., -1., 1.]])
>>> scaler = pp.StandardScaler().fit(X_train)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_  # Mean of each feature column of the training set
array([ 1., -3.,  3.])
>>> scaler.scale_  # Standard deviation of each feature column of the training set
array([0.81649658, 1.63299316, 3.55902608])
>>> scaler.transform(X_train)  # Standardized training set
array([[ 0.        , -1.22474487,  1.40487872],
       [ 1.22474487,  0.        , -0.84292723],
       [-1.22474487,  1.22474487, -0.56195149]])
>>> X_test = [[-1., 1., 0.]]
>>> scaler.transform(X_test)  # Standardize the test set with the scaling learned from the training set
array([[-2.44948974,  2.44948974, -0.84292723]])
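
In practice, the same pattern is often wrapped in a Pipeline so that the scaler's statistics are learned from the training data only and are automatically applied before the model sees the data. Below is a minimal sketch assuming a 1-nearest-neighbor classifier and made-up labels y_train; it is only an illustration, not part of the original example:


>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.neighbors import KNeighborsClassifier
>>> y_train = [0, 1, 0]  # hypothetical labels, purely for illustration
>>> pipe = make_pipeline(pp.StandardScaler(), KNeighborsClassifier(n_neighbors=1))
>>> pipe = pipe.fit(X_train, y_train)  # the scaler's mean_ and scale_ come from X_train only
>>> pipe.predict(X_test)  # X_test is standardized with the training statistics before prediction
array([0])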

2 Normalization

Standardization centers the data with each feature column's mean and scales it with the standard deviation. If instead the data is centered on the minimum of each feature column and scaled by the range (maximum minus minimum), i.e. the column minimum is subtracted and the result falls into the interval [0, 1], the process is called data normalization.
The preprocessing sub-module of Scikit-learn provides the MinMaxScaler class to implement normalization. MinMaxScaler has one important parameter, feature_range, which sets the range the data is compressed into; the default is [0, 1].


>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> X_train = np.array([[ 1., -5., 8.], [ 2., -3., 0.], [ 0., -1., 1.]])
>>> scaler = pp.MinMaxScaler().fit(X_train) #  The default data compression range is [0,1]
>>> scaler
MinMaxScaler(copy=True, feature_range=(0, 1))
>>> scaler.transform(X_train)
array([[0.5  , 0.   , 1.   ],
       [1.   , 0.5  , 0.   ],
       [0.   , 1.   , 0.125]])
>>> scaler = pp.MinMaxScaler(feature_range=(-2, 2))  # Set the data compression range to [-2, 2]
>>> scaler = scaler.fit(X_train)
>>> scaler.transform(X_train)
array([[ 0. , -2. ,  2. ],
       [ 2. ,  0. , -2. ],
       [-2. ,  2. , -1.5]])
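
The default [0, 1] result above can also be checked directly against the formula: subtract each feature column's minimum and divide by its range (maximum minus minimum):


>>> (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
array([[0.5  , 0.   , 1.   ],
       [1.   , 0.5  , 0.   ],
       [0.   , 1.   , 0.125]])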

Because normalization is very sensitive to outliers, most machine learning algorithms choose standardization for feature scaling. Standardization is usually the best choice for principal component analysis (PCA), clustering, logistic regression, support vector machines, neural networks and similar algorithms. Normalization is widely used when distance measures, gradients and covariances are not involved and the data needs to be compressed into a specific interval; for example, in digital image processing, pixel intensities are normalized so that they fall into the interval [0, 1].
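
For instance, an 8-bit gray-scale image stores intensities in [0, 255], so compressing them to [0, 1] is just a min-max scaling with a known range (a small sketch with made-up pixel values):


>>> img = np.array([[0., 64., 128.], [192., 255., 32.]])  # hypothetical 8-bit pixel intensities
>>> img / 255.0  # intensities compressed into the interval [0, 1]
array([[0.        , 0.25098039, 0.50196078],
       [0.75294118, 1.        , 0.1254902 ]])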

3 Regularization

Normalization operates on the feature columns of a data set, whereas regularization scales each individual sample to unit norm, i.e. it is a row operation on the data set. Regularization is useful if you plan to quantify the similarity between samples with operations such as the dot product (see the check after the example below).

The preprocessing sub-module preprocessing of Scikit-learn provides the fast regularization function normalize(), which directly returns the regularized data set. The normalize() function selects either the l1 norm or the l2 norm with the norm parameter, which defaults to the l2 norm. With the l1 norm, the sum of the absolute values of the elements of each sample becomes 1; with the l2 norm, the square root of the sum of squares of the elements of each sample, i.e. the length (modulus) of the sample vector, becomes 1.


>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> X_train = np.array([[ 1., -5., 8.], [ 2., -3., 0.], [ 0., -1., 1.]])
>>> pp.normalize(X_train)  # Regularize with the l2 norm; the l2 norm of each row becomes 1
array([[ 0.10540926, -0.52704628,  0.84327404],
       [ 0.5547002 , -0.83205029,  0.        ],
       [ 0.        , -0.70710678,  0.70710678]])
>>> pp.normalize(X_train, norm='l1')  # Regularize with the l1 norm; the l1 norm of each row becomes 1
array([[ 0.07142857, -0.35714286,  0.57142857],
       [ 0.4       , -0.6       ,  0.        ],
       [ 0.        , -0.5       ,  0.5       ]])
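
As promised above, once the rows have unit l2 norm, the dot product of two rows is exactly their cosine similarity, which is why regularization is convenient for comparing samples:


>>> Xn = pp.normalize(X_train)  # rows with unit l2 norm
>>> round(float(np.dot(Xn[0], Xn[1])), 4)  # dot product of unit vectors = cosine similarity of samples 0 and 1
0.497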

4 Discretization

Discretization (Discretization) converts continuous features into discrete values; a typical application is the binarization of gray-scale images. Dividing a continuous feature into K intervals is called K-bins discretization. The preprocessing sub-module of Scikit-learn provides the Binarizer class and the KBinsDiscretizer class for discretization: the former performs binarization and the latter performs K-bins discretization.


>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> X = np.array([[-2,5,11],[7,-1,9],[4,3,7]])
>>> bina = pp.Binarizer(threshold=5)  # Set the binarization threshold to 5
>>> bina.transform(X)
array([[0, 0, 1],
       [1, 0, 1],
       [0, 0, 1]])
>>> est = pp.KBinsDiscretizer(n_bins=[2, 2, 3], encode='ordinal').fit(X)
>>> est.transform(X)  # The 3 feature columns are discretized into 2, 2 and 3 bins respectively
array([[0., 1., 2.],
       [1., 0., 1.],
       [1., 1., 0.]])
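
The learned bin boundaries can be inspected through the discretizer's bin_edges_ attribute (with the default strategy='quantile' they are per-column quantiles; the exact printout may vary slightly between scikit-learn versions):


>>> est.bin_edges_  # one array of ascending bin boundaries per feature column
array([array([-2., 4., 7.]), array([-1., 3., 5.]),
       array([ 7.        ,  8.33333333,  9.66666667, 11.        ])],
      dtype=object)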

5 Whitening

The term "whitening" is hard to interpret literally; it can only be understood from the effect of the operation. Data whitening has two goals: one is to remove or reduce the correlation between feature columns, and the other is to make the variance of each feature column equal to 1. The first goal is clearly what principal component analysis (PCA) accomplishes, which projects the data onto the principal components and can drop the dimensions with small variance; the second goal is what standardization accomplishes.

Whitening comes in two varieties: PCA whitening and ZCA whitening. PCA whitening transforms each feature dimension of the original data onto the principal component axes, eliminating the correlation between features and making the variance of each principal component equal to 1. ZCA whitening transforms the result of PCA whitening back onto the original feature axes, so ZCA whitening usually does not reduce the dimensionality.

Scikit-learn does not provide a dedicated whitening method, but PCA whitening is easy to achieve with the PCA class provided by the decomposition sub-module. The whiten parameter of the PCA class controls whether the linear correlation between features is removed; it defaults to False.

Imagine a girl with a pile of blind-date profiles in hand, where each candidate is described by many feature items such as age, height, weight, annual salary, number of properties and number of cars. A whitening operation can turn this into a data set with fewer feature dimensions in which the samples can be compared directly.


>>> import numpy as np
>>> from sklearn import preprocessing as pp
>>> from sklearn.decomposition import PCA
>>> ds = np.array([
    [25, 1.85, 70, 50, 2, 1], 
    [22, 1.78, 72, 22, 0, 1], 
    [26, 1.80, 85, 25, 1, 0],
    [28, 1.70, 82, 100, 5, 2]
]) # 4 samples, 6 feature columns
>>> m = PCA(whiten=True) #  Instantiate the PCA class with whitening enabled
>>> m.fit(ds) #  Perform the principal component analysis
PCA(whiten=True)
>>> d = m.transform(ds) #  Obtain the result of the principal component analysis
>>> d #  The number of feature columns drops from 6 to 4
array([[ 0.01001541, -0.99099492, -1.12597902, -0.03748764],
       [-0.76359767, -0.5681715 ,  1.15935316,  0.67477757],
       [-0.65589352,  1.26928222, -0.45686577, -1.8639689 ],
       [ 1.40947578,  0.28988421,  0.42349164,  1.2724972 ]])
>>> d.std(axis=0) #  Standard deviation of each feature column
array([0.8660254 , 0.8660254 , 0.8660254 , 1.17790433])
>>> d = pp.scale(d) #  Standardize the whitened data
>>> d.std(axis=0) #  After standardization, the standard deviation of each feature column is 1
array([1., 1., 1., 1.])

Someone on GitHub has provided code for ZCA whitening; if you need it, visit https://github.com/mwv/zca.
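
For reference, the core idea of ZCA whitening can also be sketched in a few lines of NumPy: compute the covariance of the centered data, take its eigendecomposition, and rotate the PCA-whitened data back onto the original feature axes. The snippet below is only an illustration on made-up data, not the implementation from the repository above:


>>> rng = np.random.default_rng(0)  # hypothetical data, purely for illustration
>>> raw = rng.normal(size=(200, 3)) @ np.array([[2., 1., 0.], [0., 1., 1.], [0., 0., 3.]])
>>> Xc = raw - raw.mean(axis=0)  # center each feature column
>>> vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # eigendecomposition of the covariance
>>> W_zca = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-8)) @ vecs.T  # ZCA whitening matrix
>>> X_zca = Xc @ W_zca  # whitened data, still expressed on the original feature axes
>>> np.allclose(np.cov(X_zca, rowvar=False), np.eye(3), atol=1e-6)
True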

That covers the standardization, normalization, regularization, discretization and whitening of data in Python machine learning. For more on Python machine learning, please check the other related articles on this site!
