Introduction to Python Machine Learning (III): Data Preparation in Python

  • 2021-11-24 02:14:06
  • OfStack

Directory

1. Data Preprocessing
 1.1 Adjust the Data Scale
 1.2 Normalize Data
 1.3 Standardize Data
 1.4 Binarize Data
2. Data Feature Selection
 2.1 Univariate Feature Selection
 2.2 Recursive Feature Elimination
 2.3 Data Dimension Reduction
 2.4 Feature Importance
Summary

Coming up with features is difficult and time-consuming; it requires an understanding of the problem and domain expertise. In applied machine learning development, the most fundamental work is feature engineering.

-Andrew Ng

1. Data preprocessing

Data preprocessing needs to be carried out according to the characteristics of the data itself, for example filling in missing values, removing invalid records, and deleting redundant dimensions. These steps are closely tied to the characteristics of the data.
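As a minimal illustration of these steps (not part of the pima_data.csv workflow that follows; the tiny DataFrame and the fill strategy here are assumed purely for demonstration), missing values can be filled and a redundant column dropped with pandas:

from numpy import nan
from pandas import DataFrame

# A tiny, assumed example frame with one missing value and one redundant column
df = DataFrame({'mass': [33.6, nan, 23.3],
                'mass_copy': [33.6, nan, 23.3],
                'age': [50, 31, 32]})

df = df.drop(columns=['mass_copy'])                 # delete the redundant dimension
df['mass'] = df['mass'].fillna(df['mass'].mean())   # fill the missing value with the column mean
print(df)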

1.1 Adjust the data scale

If the attributes of a dataset are measured on different scales, rescaling the data so that all attributes use the same scale makes training a machine learning model considerably easier.

In scikit-learn, the data scale can be adjusted with the MinMaxScaler class. Bringing data measured in different units onto the same scale is helpful when classifying or grouping observations. MinMaxScaler rescales each attribute to a specified range, by default between 0 and 1; standardizing data to a mean of 0 and a variance of 1 is a different transformation, covered in section 1.2.


from numpy import set_printoptions
from pandas import read_csv
from sklearn.preprocessing import MinMaxScaler

filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Divide the data into input features and output result
array = data.values
X = array[:, 0:8]   # X: the first eight columns, the input features
Y = array[:, 8]     # Y: the last column ('class'), i.e. the result
# Fit the scaler and rescale every attribute to the range [0, 1]
transformer = MinMaxScaler(feature_range=(0, 1)).fit(X)
newX = transformer.transform(X)
# Set the print precision of the output
set_printoptions(precision=3)
print(newX)

[[0.353 0.744 0.59 ... 0.501 0.234 0.483]
[0.059 0.427 0.541 ... 0.396 0.117 0.167]
[0.471 0.92 0.525 ... 0.347 0.254 0.183]
...
[0.294 0.608 0.59 ... 0.39 0.071 0.15 ]
[0.059 0.633 0.492 ... 0.449 0.116 0.433]
[0.059 0.467 0.574 ... 0.453 0.101 0.033]]
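For reference, MinMaxScaler applies x_scaled = (x - min) / (max - min) column by column. The short sketch below reproduces that formula by hand on a small assumed toy array rather than the pima dataset:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])  # assumed toy data

manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy)

print(np.allclose(manual, scaled))  # True: both give the same column-wise scaling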

1.2 Normalized data

Normalizing data in this sense is an effective way to handle data that follow a Gaussian distribution: the transformed output has a mean of 0 and a variance of 1. It is performed with the StandardScaler class provided by scikit-learn.


from sklearn.preprocessing import StandardScaler

transformer = StandardScaler().fit(X)
# Data conversion: rescale each column to zero mean and unit variance
_newX = transformer.transform(X)
# Set the print precision of the output
set_printoptions(precision=3)
print(_newX)

[[ 0.64 0.848 0.15 ... 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 ... -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 ... -1.103 0.604 -0.106]
...
[ 0.343 0.003 0.15 ... -0.735 -0.685 -0.276]
[-0.845 0.16 -0.471 ... -0.24 -0.371 1.171]
[-0.845 -0.873 0.046 ... -0.202 -0.474 -0.871]]
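StandardScaler computes z = (x - mean) / std for each column, using the population standard deviation (ddof = 0). A minimal sketch on assumed toy data, checking the transformation by hand:

import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])  # assumed toy data

manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)   # numpy's std defaults to ddof=0, like scikit-learn
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))  # True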

1.3 Standardized data

Standardizing data in this sense rescales every row so that its vector length is 1 (each sample has a Euclidean norm of 1 in the linear-algebra sense). This is also called unit-norm processing and is well suited to sparse data (data containing many zeros). It noticeably improves the accuracy of algorithms that take weighted inputs, such as neural networks, and of distance-based algorithms such as K-nearest neighbors.

It is implemented using the Normalizer class in scikit-learn.


from sklearn.preprocessing import Normalizer

transformer = Normalizer().fit(X)
# Data conversion: rescale each row to unit length
__newX = transformer.transform(X)
# Set the print precision of the output
set_printoptions(precision=3)
print(__newX)

[[0.034 0.828 0.403 ... 0.188 0.004 0.28 ]
[0.008 0.716 0.556 ... 0.224 0.003 0.261]
[0.04 0.924 0.323 ... 0.118 0.003 0.162]
...
[0.027 0.651 0.388 ... 0.141 0.001 0.161]
[0.007 0.838 0.399 ... 0.2 0.002 0.313]
[0.008 0.736 0.554 ... 0.241 0.002 0.182]]
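With the default norm='l2', Normalizer divides each row by its Euclidean length, so every sample ends up with a vector length of 1. A minimal sketch on assumed toy data:

import numpy as np
from sklearn.preprocessing import Normalizer

toy = np.array([[3.0, 4.0], [1.0, 1.0]])  # assumed toy data

manual = toy / np.linalg.norm(toy, axis=1, keepdims=True)  # divide each row by its L2 norm
unit = Normalizer().fit_transform(toy)

print(np.allclose(manual, unit))        # True
print(np.linalg.norm(unit, axis=1))     # every row now has length 1.0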

1.4 Binarize data

Binarizing converts the data to two values using a threshold: values greater than the threshold are set to 1, and values less than or equal to the threshold are set to 0.

It is implemented using the Binarizer class in scikit-learn.


from sklearn.preprocessing import Binarizer

transformer = Binarizer(threshold=0.0).fit(X)
# Data conversion: values above the threshold become 1, the rest become 0
newX_ = transformer.transform(X)
# Set the print precision of the output
set_printoptions(precision=3)
print(newX_)

[[1. 1. 1. ... 1. 1. 1.]
[1. 1. 1. ... 1. 1. 1.]
[1. 1. 1. ... 1. 1. 1.]
...
[1. 1. 1. ... 1. 1. 1.]
[1. 1. 1. ... 1. 1. 1.]
[1. 1. 1. ... 1. 1. 1.]]
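With threshold=0.0, any value strictly greater than zero becomes 1 and zeros stay 0, which is why the columns shown in the pima output above are all ones. A minimal sketch on assumed toy data showing how a non-zero threshold changes the result:

import numpy as np
from sklearn.preprocessing import Binarizer

toy = np.array([[0.0, 0.5, 2.0], [1.5, 0.0, 3.0]])  # assumed toy data

print(Binarizer(threshold=0.0).fit_transform(toy))  # zeros stay 0, positive values become 1
print(Binarizer(threshold=1.0).fit_transform(toy))  # only values above 1.0 become 1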

2. Data feature selection

Performing feature selection before building the model helps reduce overfitting, improve the accuracy of the algorithm, and shorten the training time.

2.1 Univariate feature selection

Statistical tests can be used to analyze and select the data features that have the greatest influence on the result. In scikit-learn this is implemented with the SelectKBest class, which supports a range of statistical tests for selecting features; here the chi-squared (chi2) test is used.

The larger the chi-square value, the more the actual observations deviate from the theoretically expected values; the smaller the chi-square value, the better the observations agree with the expected values; if the two are exactly equal, the chi-square value is 0.


from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
 
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
array = data.values
X = array[:,0:8]
Y = array[:,8]
 
# Data characteristics are selected by chi-square test 
# Feature selection 
test = SelectKBest(score_func=chi2,k=4)
fit = test.fit(X,Y)
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
print(features)

After the chi-square test has been run, we obtain the score of each data feature and the four features with the highest scores:

[111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304]
[[148. 0. 33.6 50. ]
[ 85. 0. 26.6 31. ]
[183. 0. 23.3 32. ]
...
[121. 112. 26.2 30. ]
[126. 0. 30.1 47. ]
[ 93. 0. 30.4 23. ]]
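To see which columns were kept, the boolean mask from fit.get_support() can be paired with the column names; a short sketch continuing from the fit object above:

# Map the SelectKBest mask back to the column names
selected = [name for name, keep in zip(names[0:8], fit.get_support()) if keep]
print(selected)  # with the scores above this should be ['plas', 'test', 'mass', 'age']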

2.2 Recursive feature elimination

Recursive feature elimination (RFE) trains a base model over multiple rounds. After each round, the features with the smallest weight coefficients are eliminated, and the next round is trained on the remaining feature set. Based on the accuracy of the base model in each round, the features with the greatest influence on the final prediction are identified.


# Recursive feature elimination
# Feature selection
model = LogisticRegression(max_iter=3000)  # the maximum number of iterations needs to be raised manually
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Number of features:")
print(fit.n_features_)
print("Selected features:")
print(fit.support_)
print("Feature ranking:")
print(fit.ranking_)

Number of features:
3
Selected features:
[ True False False False False True True False]
Feature ranking:
[1 2 4 6 5 1 1 3]
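The support_ and ranking_ arrays become easier to read when paired with the column names; with the output shown above, preg, mass, and pedi carry rank 1 and are the selected features. A short sketch continuing from the fit object above:

# Pair each input column with its RFE ranking (rank 1 means the feature was selected)
for name, rank in zip(names[0:8], fit.ranking_):
    print(name, rank)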

2.3 Data Dimension Reduction

Common dimensionality-reduction methods include PCA (principal component analysis) and LDA (linear discriminant analysis). In clustering algorithms, PCA is usually used to reduce the dimensionality of the data, which simplifies analysis and visualization.


# Principal component analysis (data dimensionality reduction)
# Select data features by principal component analysis
pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained variance: [0.889 0.062 0.026]
[[-2.022e-03 9.781e-02 1.609e-02 6.076e-02 9.931e-01 1.401e-02
5.372e-04 -3.565e-03]
[-2.265e-02 -9.722e-01 -1.419e-01 5.786e-02 9.463e-02 -4.697e-02
-8.168e-04 -1.402e-01]
[-2.246e-02 1.434e-01 -9.225e-01 -3.070e-01 2.098e-02 -1.324e-01
-6.400e-04 -1.255e-01]]
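The three ratios can be summed to see how much of the total variance the reduced representation retains, and pca.transform projects the original eight columns onto the three principal components; a short sketch continuing from the fit above:

# Cumulative variance retained by the three principal components
print(fit.explained_variance_ratio_.sum())  # roughly 0.977 for the ratios shown above

# Project the original data into the 3-dimensional component space
reduced = pca.transform(X)
print(reduced.shape)  # (number of samples, 3)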

2.4 Feature Importance

Bagged decision trees, random forests, and extra trees (extremely randomized trees) can all be used to estimate the importance of the data features.


# Feature importance 
# Feature selection 
model = ExtraTreesClassifier()
fit = model.fit(X,Y)
print(fit.feature_importances_)

[0.109 0.234 0.101 0.077 0.076 0.14 0.121 0.142]
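Because ExtraTreesClassifier is randomized, the exact numbers vary from run to run. Pairing the importances with the column names and sorting them makes the result easier to interpret; a short sketch reusing the fitted model above:

# Sort the features by their estimated importance, highest first
ranked = sorted(zip(names[0:8], fit.feature_importances_), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("%s: %.3f" % (name, score))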

Summary

This article has covered data preparation for machine learning, including data preprocessing and data feature selection; both are preparation steps for the subsequent algorithm tuning.

