Judging NaN Null Values in pandas, and the Traps to Avoid
- 2021-10-13 08:01:40
- OfStack
pandas is built on numpy, so the pandas null value NaN and numpy.nan are equivalent. numpy's nan is not an empty object: it is actually a numpy.float64 object, so we must not treat it as one. In particular, using bool(np.nan) to test for a null value is wrong.
So how do we test for null values in pandas, and which traps are easy to fall into, i.e., which methods must we not use?
Ways you can test a single pandas null object:
1. pd.isnull() or pd.isna();
2. np.isnan();
3. the is expression;
4. the in expression.
Ways you cannot test a single pandas null object:
1. the == expression;
2. the bool() expression;
3. an if statement.
Example:
import pandas as pd
import numpy as np

na = np.nan
# Ways that work for judging a null value
pd.isnull(na)   # True
pd.isna(na)     # True
np.isnan(na)    # True
na is np.nan    # True
na in [np.nan]  # True
# Ways that do NOT work -- the results below are not what we might expect
na == np.nan    # False
bool(na)        # True
if na:
    print('na is not null')  # prints: na is not null
# Python's built-in any() and all() cannot be used either
any([na])  # True
all([na])  # True
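One caveat worth adding to the list above: the is test only works when the value is literally the np.nan singleton object. A NaN produced elsewhere, for example by float('nan') or by arithmetic, is a different object, so pd.isna() (or math.isnan()) is the safer check. A minimal sketch:

```python
import math

import numpy as np
import pandas as pd

other_nan = float('nan')  # a NaN that is NOT the np.nan object

print(other_nan is np.nan)    # False: a different object, so the is test fails
print(pd.isna(other_nan))     # True: pd.isna recognises any NaN value
print(math.isnan(other_nan))  # True: the standard library check works too
```

This is why identity-based checks (is, in) should be treated as fragile; value-aware functions like pd.isna() work regardless of where the NaN came from.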
Summary
numpy.nan is a non-null numpy.float64 object, so it cannot be tested with a Boolean expression such as bool() or an if statement.
To test pandas null values we can only use the pandas/numpy functions (pd.isnull, pd.isna, np.isnan) or the is expression; Python's built-in any() and all() cannot be used either.
One odd point is that a pandas null value can be tested with the is expression but not with ==. Normally, when is returns True the two references point to the same object in memory, and two references to the same object should of course compare equal, so an is result of True should imply an == result of True.
For np.nan this clearly does not hold: na is np.nan is True while na == np.nan is False. The reason is not that the references somehow hold different values; they really are the same object. Rather, NaN is defined by the IEEE 754 floating-point standard to compare unequal to everything, including itself, so even a NaN compared with itself yields False. This is exactly why == can never be used to detect NaN, which needs attention.
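The self-inequality described above can be reproduced in a few lines; this sketch shows that even the very same NaN object fails an == comparison, because IEEE 754 defines NaN as unequal to everything:

```python
import numpy as np

na = np.nan
# na and np.nan are literally the same object in memory...
print(na is np.nan)      # True
# ...yet == is still False: IEEE 754 defines NaN as unequal to
# everything, including itself, so equality can never detect NaN
print(na == np.nan)      # False
print(np.nan != np.nan)  # True: the only float value not equal to itself
```

The `x != x` trick is in fact a classic (if cryptic) NaN test in its own right, precisely because NaN is the only value for which it is True.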
Supplement: handling null values in data with pandas and numpy: judging, finding, filling and deleting
This section collects the common operations for handling null values in data.
For ease of description, the sample data used below is defined as follows:
df = pd.DataFrame([[1, np.nan], [np.nan, 4], [5, 6], [np.nan, 7]], columns=["A", "B"])
df  # the sample DataFrame df
Determine whether there is a null value in the data
The pandas isnull() function
df.isnull()  # returns a DataFrame the same shape as df, marking whether each element is null
df["A"].isnull()  # null-value status of column A
df[["A","B"]].isnull()  # null check on several columns; for this example, same result as df.isnull()
The pandas notnull() function
df.notnull()  # marks whether each element of df is NOT null
df["A"].notnull()  # non-null status of column A
df[["A","B"]].notnull()  # non-null check on several columns; same result as df.notnull() here
The numpy np.isnan() function
np.isnan(df) # Equivalent to df.isnull()
np.isnan(df["A"]) # Equivalent to df["A"].isnull()
np.isnan(df[["A","B"]]) # Equivalent to df[["A","B"]].isnull()
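One practical difference, worth noting as a caveat: the equivalence above holds only for numeric columns. np.isnan() supports only numeric dtypes and raises a TypeError on object or string data, whereas pd.isnull() also handles None and mixed-type columns. A small sketch of the difference:

```python
import numpy as np
import pandas as pd

# a column of object dtype holding mixed values
s = pd.Series(["x", None, np.nan], dtype=object)

print(pd.isnull(s).tolist())  # [False, True, True] -- handles None and NaN alike

try:
    np.isnan(s)  # np.isnan only supports numeric dtypes
except TypeError as e:
    print("np.isnan failed:", e)
```

So on real-world data with text columns, the pandas functions are the safer default.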
Count the number of null values/non-null values
df.isnull().sum()   # number of null values per column
df.notnull().sum()  # number of non-null values per column
df["A"].count()     # number of non-null values in column A
df.count()          # number of non-null values in each column
df.count(axis=1)    # number of non-null values in each row (axis=1)
df["A"].sum()       # sum of column A; NaN values are skipped by default
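The last line deserves a brief illustration: pandas aggregations such as sum() and mean() skip NaN by default (skipna=True), so the statistics are computed over the non-null values only. A quick sketch using the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [np.nan, 4], [5, 6], [np.nan, 7]], columns=["A", "B"])

print(df["A"].sum())              # 6.0 -- NaN values are skipped by default
print(df["A"].sum(skipna=False))  # nan -- include NaN and the whole sum becomes NaN
print(df["A"].mean())             # 3.0 -- the mean is over the 2 non-null values only
```

Keeping this in mind avoids surprises when a column's mean looks "too high": it is the mean of the present values, not of the full row count.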
Filter data based on null values
# select all rows where column A is null
df[df.A.isnull()]
df[df["A"].isnull()]
# select all rows where column A is not null
df[df.A.notnull()]
df[df["A"].notnull()]
# select the rows of df that contain at least one null value
df[df.isnull().any(axis=1)]
Find the indexes of null values
np.where(np.isnan(df))    # row and column indexes of the null values in df
np.where(np.isnan(df.A))  # row indexes of the null values in column A of df
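np.where() returns parallel arrays of row positions and column positions; zipping them gives (row, column) coordinate pairs, which is often the more convenient form. A sketch with the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [np.nan, 4], [5, 6], [np.nan, 7]], columns=["A", "B"])

rows, cols = np.where(np.isnan(df))  # parallel arrays of row/column positions
coords = list(zip(rows.tolist(), cols.tolist()))
print(coords)  # [(0, 1), (1, 0), (3, 0)] -- (row, column) of every null value
```

Note these are integer positions, not labels; use df.index[rows] and df.columns[cols] if label-based coordinates are needed.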
Delete null values: the dropna() function
df.dropna()  # delete rows containing null values; defaults are axis=0 (by row) and how="any": a row is dropped if it has at least one null
df.dropna(axis=1)  # delete columns containing null values
df.dropna(how="all")  # delete only rows in which every column is null
df.dropna(how="any")  # delete rows containing any null value
# delete based on null values in specific columns
df.dropna(how="any", subset=["A"])  # delete rows where column A is null
df.dropna(how="any", subset=["A","B"])  # delete rows where either column A or B is null
# apply the deletion to the original data, modifying it in place
df.dropna(how="all", subset=["A","B"], inplace=True)  # delete rows where both A and B are null, modifying df itself
Fill null values: the fillna() function
# fill with a specified number
df.fillna(0)  # fill the null values in df with 0
# fill with a statistic
df.fillna(df.mean())  # fill each column's null values with that column's mean
df.fillna(df.mean()["A"])  # fill all null values in df with the mean of column A
df.fillna(df.sum())  # fill each column's null values with that column's sum
# fill with a dictionary
values = {'A': 0, 'B': 1}  # fill nulls in column A with 0 and in column B with 1
df.fillna(value=values)
# fill null values with a specified string
df.fillna("unknown")
# different fill methods: {'backfill', 'bfill', 'pad', 'ffill', None}
# fill each null with the next non-null value below it in its column
df.fillna(method="backfill")
df.fillna(method="bfill")  # same as backfill
# fill each null with the previous non-null value above it in its column; if there is none, the null is kept
df.fillna(method="ffill")
df.fillna(method="pad")  # same as ffill
# the limit parameter sets the maximum number of nulls to fill
df.fillna(0, limit=1)  # fill at most 1 null value per column; nulls beyond the limit stay null
# the inplace parameter controls whether the original df is modified
df.fillna(0, inplace=True)  # inplace=True applies the changes to the original data
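A note for readers on newer pandas versions: the method= keyword of fillna() has been deprecated (from pandas 2.1 onward), and the dedicated ffill()/bfill() methods are the preferred way to do the same thing. A sketch of the modern equivalents:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [np.nan, 4], [5, 6], [np.nan, 7]], columns=["A", "B"])

# in recent pandas the method= keyword of fillna is deprecated;
# ffill()/bfill() are the equivalents of fillna(method="ffill"/"bfill")
print(df.ffill())  # forward-fill: each null takes the previous non-null value in its column
print(df.bfill())  # backward-fill: each null takes the next non-null value in its column
```

Both methods accept the same limit and inplace parameters as fillna(), so existing code translates directly.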