Method of python Missing Value Processing (Imputation)

  • 2021-07-06 11:20:26
  • OfStack

1. How to deal with missing values

For a variety of reasons, many real-world datasets contain missing data, often encoded as spaces, nans, or other placeholders. However, such a dataset is not compatible with scikit-learn algorithm, because most learning algorithms default that the elements in the array are all numeric values, so the prime elements have their own representative meanings.

One basic strategy for using incomplete datasets is to discard entire rows or columns of values that contain missing values, but doing so wastes a lot of valuable data. Here are some common ways to handle missing values:

1. Ignore tuples

This is usually done when category labels are missing (assuming that mining tasks involve classification), and this method is not very effective unless tuples have multiple attribute missing values. Its performance is particularly poor when the percentage of missing values per attribute varies greatly.

2. Fill in missing values manually

1 This method is time-consuming and may not work when the data set is large and many values are missing.

3. Fill the missing value with a global constant

Replace the missing attribute value with the same constant (such as "Unknown" or negative infinity). If all the missing values are replaced with "unknown", the miner may think they form an interesting concept because they all have the same value "unknown". Therefore, although this method is simple, it is unreliable to score 10 points.

4. Use the attribute mean of all samples belonging to the same class as the given tuple

For example, if customers are categorized according to credit_risk, the missing value in income is replaced with the average income of customers with the same credit for a given tuple.

5. Fill in missing values with the most likely values

It can be determined by regression, reasoning-based tools using Bayesian formalization or decision tree induction. For example, using the attributes of other customers in the data set, a decision tree can be constructed to predict the missing value of income.

Note: Missing values do not always mean data errors! ! ! ! ! ! !

2. Code implementation of missing value processing

class: The ` Imputer ` class provides basic strategies for handling missing values, such as replacing missing values with the mean, median, and mode of the row or column in which the missing values are located. This class is also compatible with different missing value encodings.

1. Fill the missing value with the mean value


import numpy as np

from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)

import numpy as np

from sklearn.preprocessing import Imputer
 
###1. Fill in missing values with mean value 
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])


X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X)) 
[[4.     2.    ]
 [6.     3.66666667]
 [7.     6.    ]]

2. The Imputer class also supports sparse matrices:


import scipy.sparse as sp
 
X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
 
imp = Imputer(missing_values=0, strategy='mean', axis=0)
 
imp.fit(X)
 
 
X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
 
print(imp.transform(X_test))

# Note that here, the missing data is encoded as 0,  This method is very suitable when there are more missing data than observed data.  

Related articles: