Introduction to Machine Learning with Python, Part III: Data Preparation
- 2021-11-24 02:14:06
- OfStack
Feature selection is difficult and time-consuming, and it requires both an understanding of the requirements and professional expertise. In applied machine learning, the most fundamental task is feature engineering.
-Andrew Ng
1. Data preprocessing
Data preprocessing must be tailored to the data itself: filling in missing values, removing invalid records and dropping redundant dimensions. Which of these steps is needed depends on the characteristics of the data at hand.
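The three steps named above can be sketched on a toy table. This is a minimal illustration, assuming a made-up DataFrame whose column names (`age`, `mass`, `id`) and the rule "mass must be positive" are hypothetical and not taken from the Pima dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],   # one missing value
    'mass': [22.5, 30.1, -1.0, 27.3],    # -1.0 stands for an invalid reading
    'id': [1, 2, 3, 4],                  # identifier with no predictive value
})

# 1. Fill the missing value with the column mean
df['age'] = df['age'].fillna(df['age'].mean())
# 2. Remove rows that contain an invalid measurement
df = df[df['mass'] > 0]
# 3. Drop the redundant column
df = df.drop(columns=['id'])

print(df)
```

Filling with the mean is only one possible choice; the median or a model-based estimate may suit other data better.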
1.1 Adjust the data scale
If the attributes of a dataset are measured on different scales, rescaling them so that every attribute uses the same scale makes training a machine learning model considerably easier.
In scikit-learn, the data scale can be adjusted with the MinMaxScaler class. Bringing data recorded in different units onto a common scale helps when classifying or grouping samples. MinMaxScaler rescales each attribute to a specified range, (0, 1) by default; centering the data around 0 with a variance of 1 is the job of StandardScaler, covered in section 1.2.
from numpy import set_printoptions
from pandas import read_csv
from sklearn.preprocessing import MinMaxScaler
filename = 'pima_data.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names = names)
# Split the data into input features and output labels
array = data.values
# X holds the eight input features (columns 0-7)
X = array[:,0:8]
# Y holds the last column, the class label
Y = array[:,8]
# Rescale every feature to the range [0, 1]
transformer = MinMaxScaler(feature_range=(0,1))
newX = transformer.fit_transform(X)
# Print arrays with three decimal places
set_printoptions(precision=3)
print(newX)
[[0.353 0.744 0.59 ... 0.501 0.234 0.483]
[0.059 0.427 0.541 ... 0.396 0.117 0.167]
[0.471 0.92 0.525 ... 0.347 0.254 0.183]
...
[0.294 0.608 0.59 ... 0.39 0.071 0.15 ]
[0.059 0.633 0.492 ... 0.449 0.116 0.433]
[0.059 0.467 0.574 ... 0.453 0.101 0.033]]
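To make the transformation concrete, the following minimal sketch applies the min-max formula x' = (x - min) / (max - min) column by column with NumPy and compares the result with MinMaxScaler. The small array `demo` is made up for illustration and is not part of the Pima dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 3 samples, 2 features on very different scales
demo = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

# Manual min-max scaling, column by column
col_min = demo.min(axis=0)
col_max = demo.max(axis=0)
manual = (demo - col_min) / (col_max - col_min)

# The same transformation performed by scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(demo)

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, both columns run from 0 to 1, so neither dominates simply because of its unit of measurement.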
1.2 Standardizing data (zero mean, unit variance)
Standardization is an effective way to handle data that follows a Gaussian distribution: each attribute of the output has a mean of 0 and a variance of 1. In scikit-learn it is performed with the StandardScaler class.
from sklearn.preprocessing import StandardScaler
# Fit the scaler, then transform the data
transformer = StandardScaler().fit(X)
_newX = transformer.transform(X)
# Print arrays with three decimal places
set_printoptions(precision=3)
print(_newX)
[[ 0.64 0.848 0.15 ... 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 ... -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 ... -1.103 0.604 -0.106]
...
[ 0.343 0.003 0.15 ... -0.735 -0.685 -0.276]
[-0.845 0.16 -0.471 ... -0.24 -0.371 1.171]
[-0.845 -0.873 0.046 ... -0.202 -0.474 -0.871]]
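The effect of StandardScaler can be checked by hand. The sketch below, which uses a made-up array `demo` rather than the Pima data, computes z = (x - mean) / std for each column and confirms that the scaled columns have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 samples, 2 features
demo = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

scaled = StandardScaler().fit_transform(demo)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```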
1.3 Normalizing data (unit norm)
Normalization rescales each row so that its vector length (norm) is 1. It is well suited to sparse data (data containing many zeros). Normalized data can noticeably improve the accuracy of algorithms that weight their inputs, such as neural networks, and of algorithms that rely on distances, such as K-nearest neighbors.
It is implemented with the Normalizer class in scikit-learn.
from sklearn.preprocessing import Normalizer
# Fit the normalizer, then transform the data
transformer = Normalizer().fit(X)
__newX = transformer.transform(X)
# Print arrays with three decimal places
set_printoptions(precision=3)
print(__newX)
[[0.034 0.828 0.403 ... 0.188 0.004 0.28 ]
[0.008 0.716 0.556 ... 0.224 0.003 0.261]
[0.04 0.924 0.323 ... 0.118 0.003 0.162]
...
[0.027 0.651 0.388 ... 0.141 0.001 0.161]
[0.007 0.838 0.399 ... 0.2 0.002 0.313]
[0.008 0.736 0.554 ... 0.241 0.002 0.182]]
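Unlike the two scalers above, Normalizer works row by row. As a small sketch on a made-up array `demo`, dividing each row by its Euclidean (L2) length gives the same result as Normalizer with its default `norm='l2'` setting, and every resulting row has length 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: each row is one sample
demo = np.array([[3.0, 4.0],
                 [1.0, 0.0],
                 [6.0, 8.0]])

# Divide each row by its Euclidean (L2) length
row_norms = np.linalg.norm(demo, axis=1, keepdims=True)
manual = demo / row_norms

normalized = Normalizer(norm='l2').fit_transform(demo)

print(normalized)
print(np.linalg.norm(normalized, axis=1))  # length of each row after normalization
```

Note that the rows [3, 4] and [6, 8] end up identical: normalization keeps the direction of a sample and discards its magnitude.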
1.4 Binarizing data
Binarization converts data to two values using a threshold: values greater than the threshold are set to 1, and values less than or equal to the threshold are set to 0.
It is implemented with the Binarizer class in scikit-learn.
from sklearn.preprocessing import Binarizer
# Fit the binarizer, then transform the data
transformer = Binarizer(threshold=0.0).fit(X)
newX_ = transformer.transform(X)
# Print arrays with three decimal places
set_printoptions(precision=3)
print(newX_)
[[1. 1. 1. ... 1. 1. 1.]
[1. 1. 1. ... 1. 1. 1.]
[1. 1. 1. ... 1. 1. 1.]
...
[1. 1. 1. ... 1. 1. 1.]
[1. 1. 1. ... 1. 1. 1.]
[1. 1. 1. ... 1. 1. 1.]]
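With a threshold of 0.0, every strictly positive value becomes 1, which is why the visible part of the output above consists of ones. The following sketch uses a made-up array and a hypothetical threshold of 1.5 to show both outcomes, including the fact that a value equal to the threshold maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical toy data
demo = np.array([[0.0, 1.5, 3.0],
                 [2.0, 0.5, 4.0]])

# Values strictly greater than 1.5 become 1, all others become 0
binary = Binarizer(threshold=1.5).fit_transform(demo)
print(binary)
```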
2. Data feature selection
Performing feature selection before building a model helps to reduce overfitting, improve the accuracy of the algorithm and shorten training time.
2.1 Univariate feature selection
Statistical tests can be used to identify the data features that have the greatest influence on the result. In scikit-learn this is done with the SelectKBest class, which accepts a scoring function and keeps the k highest-scoring features; the example below uses the chi-square test (chi2) as the scoring function.
The larger the chi-square value, the more the observed values deviate from the theoretically expected values; the smaller the chi-square value, the closer they agree. If the two sets of values are exactly equal, the chi-square value is 0.
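As a small worked illustration of that relationship, the chi-square statistic can be computed directly as the sum of (observed - expected)^2 / expected. The counts below are invented for the example, and this sketch only shows the basic statistic, not how scikit-learn derives the expected counts from features and class labels.

```python
import numpy as np

# Hypothetical observed counts and the counts expected in theory
observed = np.array([30.0, 20.0, 50.0])
expected = np.array([25.0, 25.0, 50.0])

chi_square = np.sum((observed - expected) ** 2 / expected)
print(chi_square)  # 2.0

# When observation and expectation agree exactly, the statistic is 0
chi_square_equal = np.sum((expected - expected) ** 2 / expected)
print(chi_square_equal)  # 0.0
```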
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
array = data.values
X = array[:,0:8]
Y = array[:,8]
# Select data features with the chi-square test
# Keep the four highest-scoring features
test = SelectKBest(score_func=chi2,k=4)
fit = test.fit(X,Y)
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
print(features)
After the chi-square test has been run, the score of each data feature and the four highest-scoring data features are printed:
[111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304]
[[148. 0. 33.6 50. ]
[ 85. 0. 26.6 31. ]
[183. 0. 23.3 32. ]
...
[121. 112. 26.2 30. ]
[126. 0. 30.1 47. ]
[ 93. 0. 30.4 23. ]]
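The transformed array no longer carries column names, so it can help to map the scores back to the feature names. The sketch below copies the scores from the output above and picks the four largest with NumPy; it is an illustration of how to read the result, not a replacement for SelectKBest.

```python
import numpy as np

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# Scores copied from the printed output above
scores = np.array([111.52, 1411.887, 17.605, 53.108,
                   2175.565, 127.669, 5.393, 181.304])

# Indices of the four largest scores, restored to original column order
top4 = np.sort(np.argsort(scores)[-4:])
selected = [names[i] for i in top4]
print(selected)  # ['plas', 'test', 'mass', 'age']
```

This matches the first printed row, [148. 0. 33.6 50.], which holds the plas, test, mass and age values of the first sample.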
2.2 Recursive feature elimination
Recursive feature elimination (RFE) trains a base model over multiple rounds. After each round, the features with the smallest weight coefficients are eliminated, and the next round of training is carried out on the reduced feature set. Repeating this process identifies the data features that have the greatest influence on the final prediction.
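# Recursive feature elimination
# Feature selection
model = LogisticRegression(max_iter=3000)  # raise the iteration limit so the solver converges
# Keep three features; recent scikit-learn versions require this argument by keyword
rfe = RFE(model, n_features_to_select=3)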
fit = rfe.fit(X,Y)
print(" Number of features: ")
print(fit.n_features_)
print(" Selected features: ")
print(fit.support_)
print(" Feature ranking: ")
print(fit.ranking_)
Number of features:
3
Selected features:
[ True False False False False True True False]
Feature ranking:
[1 2 4 6 5 1 1 3]
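The boolean mask and the ranking are easier to read once they are paired with the feature names. The following sketch copies the two arrays from the output above and uses NumPy indexing to list the selected features and the overall order; a rank of 1 marks a selected feature, and higher ranks were eliminated earlier.

```python
import numpy as np

names = np.array(['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age'])
# Values copied from the printed output above
support = np.array([True, False, False, False, False, True, True, False])
ranking = np.array([1, 2, 4, 6, 5, 1, 1, 3])

# Features kept by RFE
selected = names[support].tolist()
print(selected)  # ['preg', 'mass', 'pedi']

# All features ordered from best rank to worst (stable sort keeps ties in column order)
order = names[np.argsort(ranking, kind='stable')].tolist()
print(order)
```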
2.3 Data Dimension Reduction
Common dimensionality reduction methods include PCA (principal component analysis) and LDA (linear discriminant analysis). With clustering algorithms, PCA is commonly used to reduce the dimensionality of the data, which simplifies analysis and visualization.
# Principal component analysis (dimensionality reduction)
# Reduce the data to three principal components
pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained variance ratio: %s" % fit.explained_variance_ratio_)
print(fit.components_)
Explained variance ratio: [0.889 0.062 0.026]
[[-2.022e-03 9.781e-02 1.609e-02 6.076e-02 9.931e-01 1.401e-02
5.372e-04 -3.565e-03]
[-2.265e-02 -9.722e-01 -1.419e-01 5.786e-02 9.463e-02 -4.697e-02
-8.168e-04 -1.402e-01]
[-2.246e-02 1.434e-01 -9.225e-01 -3.070e-01 2.098e-02 -1.324e-01
-6.400e-04 -1.255e-01]]
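The explained variance ratio can be reproduced from the eigenvalues of the covariance matrix: each ratio is one eigenvalue divided by the sum of all eigenvalues. The sketch below checks this on randomly generated data, which is an assumption made for illustration and is unrelated to the Pima dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 3 features with very different variances
demo = rng.normal(size=(100, 3)) * np.array([10.0, 2.0, 0.5])

# Eigenvalues of the covariance matrix, sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(np.cov(demo, rowvar=False))[::-1]
manual_ratio = eigenvalues / eigenvalues.sum()

pca = PCA(n_components=3).fit(demo)
print(manual_ratio)
print(pca.explained_variance_ratio_)
```

Because PCA works on raw variances, a feature measured on a large scale can dominate the first component, so it is often worth scaling the data before applying PCA.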
2.4 Feature Importance
The importance of each data feature can be calculated with bagged decision trees, random forests or extremely randomized trees (extra trees).
# Feature importance
# Feature selection
model = ExtraTreesClassifier()
fit = model.fit(X,Y)
print(fit.feature_importances_)
[0.109 0.234 0.101 0.077 0.076 0.14 0.121 0.142]
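Because the trees are built with randomness, the importances printed above will differ slightly from run to run unless a seed is fixed. The sketch below uses synthetic data from `make_classification` as a stand-in for a real dataset (an assumption for illustration), fixes `random_state`, and lists the features from most to least important.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical synthetic data: 5 features, only 2 of them informative
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

# Fixing random_state makes the importances reproducible
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

importances = model.feature_importances_
# Print features from most to least important
for idx in importances.argsort()[::-1]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.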
Summary
This article covered data preparation for machine learning, including data preprocessing and feature selection, both of which lay the groundwork for the algorithm optimization that follows.