Judging NaN Null Values in pandas, and the Traps to Avoid
- 2021-10-13 08:01:40
- OfStack
pandas is built on numpy, so the pandas null value NaN and numpy.nan are equivalent. numpy's nan is not an empty object: it is actually a numpy.float64 object, so we must not treat it as one. In particular, using bool(np.nan) to test for a null value is wrong.
So how do we test for null values in pandas, and which traps are easy to fall into, i.e., which methods must we not use?
Ways you can test a single pandas null object:
1. pd.isnull() or pd.isna();
2. np.isnan();
3. the is expression;
4. the in expression.
Ways you cannot test a single pandas null object:
1. the == expression;
2. the bool() expression;
3. an if statement.
Example:
import pandas as pd
import numpy as np

na = np.nan
# Ways that work for judging a null value
pd.isnull(na)   # True
pd.isna(na)     # True
np.isnan(na)    # True
na is np.nan    # True
na in [np.nan]  # True
# Ways that do NOT work -- the results below are not what we might expect
na == np.nan    # False
bool(na)        # True
if na:
    print('na is not null')  # prints: na is not null
# Python's built-in any() and all() cannot be used either
any([na])  # True
all([na])  # True
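One caveat worth adding to the list above: the is test only works when the value is literally the np.nan singleton object. A NaN produced elsewhere, for example by float('nan') or by arithmetic, is a different object, so pd.isna() (or math.isnan()) is the safer check. A minimal sketch:

```python
import math

import numpy as np
import pandas as pd

other_nan = float('nan')  # a NaN that is NOT the np.nan object

print(other_nan is np.nan)    # False: a different object, so the is test fails
print(pd.isna(other_nan))     # True: pd.isna recognises any NaN value
print(math.isnan(other_nan))  # True: the standard library check works too
```

This is why identity-based checks (is, in) should be treated as fragile; value-aware functions like pd.isna() work regardless of where the NaN came from.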
Summary
numpy.nan is a non-null numpy.float64 object, so it cannot be tested with a Boolean expression such as bool() or an if statement.
To test pandas null values we can only use the pandas/numpy functions (pd.isnull, pd.isna, np.isnan) or the is expression; Python's built-in any() and all() cannot be used either.
One odd point is that a pandas null value can be tested with the is expression but not with ==. Normally, when is returns True the two references point to the same object in memory, and two references to the same object should of course compare equal, so an is result of True should imply an == result of True.
For np.nan this clearly does not hold: na is np.nan is True while na == np.nan is False. The reason is not that the references somehow hold different values; they really are the same object. Rather, NaN is defined by the IEEE 754 floating-point standard to compare unequal to everything, including itself, so even a NaN compared with itself yields False. This is exactly why == can never be used to detect NaN, which needs attention.
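The self-inequality described above can be reproduced in a few lines; this sketch shows that even the very same NaN object fails an == comparison, because IEEE 754 defines NaN as unequal to everything:

```python
import numpy as np

na = np.nan
# na and np.nan are literally the same object in memory...
print(na is np.nan)      # True
# ...yet == is still False: IEEE 754 defines NaN as unequal to
# everything, including itself, so equality can never detect NaN
print(na == np.nan)      # False
print(np.nan != np.nan)  # True: the only float value not equal to itself
```

The `x != x` trick is in fact a classic (if cryptic) NaN test in its own right, precisely because NaN is the only value for which it is True.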
Supplement: handling null values in data with pandas and numpy: judging, finding, filling and deleting
This section collects the common operations for handling null values in data.
For ease of description, the sample data used below is defined as follows:
df = pd.DataFrame([[1, np.nan], [np.nan, 4], [5, 6], [np.nan, 7]], columns=["A", "B"])
df  # the sample DataFrame df
Determine whether there is a null value in the data
The pandas isnull() function
df.isnull()  # returns a DataFrame the same shape as df, marking whether each element is null
df["A"].isnull()  # null-value status of column A
df[["A","B"]].isnull()  # null check on several columns; for this example, same result as df.isnull()
The pandas notnull() function
df.notnull()  # marks whether each element of df is NOT null
df["A"].notnull()  # non-null status of column A
df[["A","B"]].notnull()  # non-null check on several columns; same result as df.notnull() here
The numpy np.isnan() function
np.isnan(df) # Equivalent to df.isnull()
np.isnan(df["A"]) # Equivalent to df["A"].isnull()
np.isnan(df[["A","B"]]) # Equivalent to df[["A","B"]].isnull()
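One practical difference, worth noting as a caveat: the equivalence above holds only for numeric columns. np.isnan() supports only numeric dtypes and raises a TypeError on object or string data, whereas pd.isnull() also handles None and mixed-type columns. A small sketch of the difference:

```python
import numpy as np
import pandas as pd

# a column of object dtype holding mixed values
s = pd.Series(["x", None, np.nan], dtype=object)

print(pd.isnull(s).tolist())  # [False, True, True] -- handles None and NaN alike

try:
    np.isnan(s)  # np.isnan only supports numeric dtypes
except TypeError as e:
    print("np.isnan failed:", e)
```

So on real-world data with text columns, the pandas functions are the safer default.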
Count the number of null values/non-null values
df.isnull().sum()   # number of null values per column
df.notnull().sum()  # number of non-null values per column
df["A"].count()     # number of non-null values in column A
df.count()          # number of non-null values in each column
df.count(axis=1)    # number of non-null values in each row (axis=1)
df["A"].sum()       # sum of column A; NaN values are skipped by default
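The last line deserves a brief illustration: pandas aggregations such as sum() and mean() skip NaN by default (skipna=True), so the statistics are computed over the non-null values only. A quick sketch using the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [np.nan, 4], [5, 6], [np.nan, 7]], columns=["A", "B"])

print(df["A"].sum())              # 6.0 -- NaN values are skipped by default
print(df["A"].sum(skipna=False))  # nan -- include NaN and the whole sum becomes NaN
print(df["A"].mean())             # 3.0 -- the mean is over the 2 non-null values only
```

Keeping this in mind avoids surprises when a column's mean looks "too high": it is the mean of the present values, not of the full row count.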
Filter data based on null values
# select all rows where column A is null
df[df.A.isnull()]
df[df["A"].isnull()]
# select all rows where column A is not null
df[df.A.notnull()]
df[df["A"].notnull()]
# select the rows of df that contain at least one null value
df[df.isnull().any(axis=1)]
Find the indexes of null values
np.where(np.isnan(df))    # row and column indexes of the null values in df
np.where(np.isnan(df.A))  # row indexes of the null values in column A of df
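np.where() returns parallel arrays of row positions and column positions; zipping them gives (row, column) coordinate pairs, which is often the more convenient form. A sketch with the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [np.nan, 4], [5, 6], [np.nan, 7]], columns=["A", "B"])

rows, cols = np.where(np.isnan(df))  # parallel arrays of row/column positions
coords = list(zip(rows.tolist(), cols.tolist()))
print(coords)  # [(0, 1), (1, 0), (3, 0)] -- (row, column) of every null value
```

Note these are integer positions, not labels; use df.index[rows] and df.columns[cols] if label-based coordinates are needed.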
Delete null values: the dropna() function
df.dropna()  # delete rows containing null values; defaults are axis=0 (by row) and how="any": a row is dropped if it has at least one null
df.dropna(axis=1)  # delete columns containing null values
df.dropna(how="all")  # delete only rows in which every column is null
df.dropna(how="any")  # delete rows containing any null value
# delete based on null values in specific columns
df.dropna(how="any", subset=["A"])  # delete rows where column A is null
df.dropna(how="any", subset=["A","B"])  # delete rows where either column A or B is null
# apply the deletion to the original data, modifying it in place
df.dropna(how="all", subset=["A","B"], inplace=True)  # delete rows where both A and B are null, modifying df itself
Fill null values: the fillna() function
# fill with a specified number
df.fillna(0)  # fill the null values in df with 0
# fill with a statistic
df.fillna(df.mean())  # fill each column's null values with that column's mean
df.fillna(df.mean()["A"])  # fill all null values in df with the mean of column A
df.fillna(df.sum())  # fill each column's null values with that column's sum
# fill with a dictionary
values = {'A': 0, 'B': 1}  # fill nulls in column A with 0 and in column B with 1
df.fillna(value=values)
# fill null values with a specified string
df.fillna("unknown")
# different fill methods: {'backfill', 'bfill', 'pad', 'ffill', None}
# fill each null with the next non-null value below it in its column
df.fillna(method="backfill")
df.fillna(method="bfill")  # same as backfill
# fill each null with the previous non-null value above it in its column; if there is none, the null is kept
df.fillna(method="ffill")
df.fillna(method="pad")  # same as ffill
# the limit parameter sets the maximum number of nulls to fill
df.fillna(0, limit=1)  # fill at most 1 null value per column; nulls beyond the limit stay null
# the inplace parameter controls whether the original df is modified
df.fillna(0, inplace=True)  # inplace=True applies the changes to the original data
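A note for readers on newer pandas versions: the method= keyword of fillna() has been deprecated (from pandas 2.1 onward), and the dedicated ffill()/bfill() methods are the preferred way to do the same thing. A sketch of the modern equivalents:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [np.nan, 4], [5, 6], [np.nan, 7]], columns=["A", "B"])

# in recent pandas the method= keyword of fillna is deprecated;
# ffill()/bfill() are the equivalents of fillna(method="ffill"/"bfill")
print(df.ffill())  # forward-fill: each null takes the previous non-null value in its column
print(df.bfill())  # backward-fill: each null takes the next non-null value in its column
```

Both methods accept the same limit and inplace parameters as fillna(), so existing code translates directly.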