View and modify the data type of each column when pandas reads a CSV file


This article shows how to view and modify the data type of each column when pandas reads a CSV file. The details are as follows:

When debugging, we often need to check and modify the data types of pandas columns. Here is a short summary:

1. View:

NumPy and pandas use slightly different attributes: a NumPy array has dtype, while a DataFrame has dtypes (one entry per column):


import numpy as np
import pandas as pd

Array = np.array([1, 2, 3])              # sample NumPy array for illustration
df = pd.DataFrame({'a': [1], 'b': [2]})  # sample DataFrame for illustration

print(Array.dtype)   # Output: int64
print(df.dtypes)     # Output: the dtype of every column in df, e.g. a int64, b int64

2. Modify:


import pandas as pd
import numpy as np

# Read the GBK-encoded CSV; the ' Rise and fall ' column is read in as strings
df = pd.read_csv('000917.csv', encoding='gbk')
# Drop rows holding the literal string 'None', then cast the column to float64
df = df[df[' Rise and fall '] != 'None']
df[' Rise and fall '] = df[' Rise and fall '].astype(np.float64)

print(df[df[' Rise and fall '] > 5])

PS: Changing the data type of a column in pandas

Let's look at a very simple example first:


a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)
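
A quick check of df.dtypes here (a sketch, assuming the df just created) shows that every column is stored as object:


print(df.dtypes)
# 0    object
# 1    object
# 2    object
# dtype: object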

Is there a way to convert the columns to the appropriate types? For example, in the example above, how do I turn columns 2 and 3 into floating-point numbers? Can the types be specified when the data is converted to DataFrame format? Or is it better to create the DataFrame first and then change the type of each column afterwards? Ideally this should be done dynamically, since there can be hundreds of columns and spelling out which column gets which type is too cumbersome. You can assume that each column contains values of the same type.

Solution

The available methods are listed briefly below.

When creating the DataFrame

When creating a DataFrame, you can specify the types directly through the dtype parameter:


df = pd.DataFrame(a, dtype='float')  # Example 1: cast all columns to float
df = pd.DataFrame(data=d, dtype=np.int8)  # Example 2: 'd' is some existing data source
df = pd.read_csv("somefile.csv", dtype={'column_name': str})  # Example 3: per-column types for read_csv

For a single column or Series

Here is an example of a Series of strings whose dtype is object:


>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

Convert to a numeric value using to_numeric. By default, it cannot handle the alphabetic string 'pandas':


>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

You can cast an invalid value to NaN as follows:


>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

The third option is to ignore the operation when an invalid value is encountered:


>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

For multiple columns or the entire DataFrame

If you want to apply this operation to multiple columns, converting each column in turn is cumbersome; instead, you can use DataFrame.apply to process every column at once.

Given the DataFrame:


>>> a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
>>> df = pd.DataFrame(a, columns=['col1','col2','col3'])
>>> df
  col1 col2  col3
0    a  1.2   4.2
1    b   70  0.03
2    x    5     0

Then you can write:


df[['col2','col3']] = df[['col2','col3']].apply(pd.to_numeric)

Then 'col2' and 'col3' have the float64 type as required.
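
A quick check (assuming the df from the example above) confirms the conversion:


print(df.dtypes)
# col1     object
# col2    float64
# col3    float64
# dtype: object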

However, you may not know in advance which columns can be reliably converted to a numeric type. In this case, set the errors parameter so that columns which cannot be parsed are simply left alone, as in the sketch below.


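A minimal sketch, assuming the df built above; errors='ignore' leaves unparseable columns untouched:


# Try to convert every column; columns that cannot be parsed are returned unchanged
df = df.apply(pd.to_numeric, errors='ignore')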

This function will then be applied to the entire DataFrame, columns that can be converted to numeric types will be converted, and columns that cannot (for example, they contain non-numeric strings or dates) will be left alone.

Additionally, pd.to_datetime and pd.to_timedelta convert data to datetime and timedelta types.
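
A minimal sketch of both conversions (the sample values are made up for illustration):


dates = pd.to_datetime(pd.Series(['2021-07-10', '2021-07-11']))  # dtype: datetime64[ns]
spans = pd.to_timedelta(pd.Series(['1 days', '2 hours']))        # dtype: timedelta64[ns]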

Soft conversion: automatic type inference

Version 0.21.0 introduced the infer_objects() method, which converts DataFrame columns with an object dtype to more specific types.

For example, create a DataFrame with two object columns, one holding integers and the other holding integers stored as strings:


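A minimal sketch, assuming two columns named 'a' and 'b' with made-up sample values:


df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')
print(df.dtypes)
# a    object
# b    object
# dtype: object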

You can then change the type of column 'a' to int64 using infer_objects():


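Continuing the sketch above:


df = df.infer_objects()
print(df.dtypes)
# a     int64
# b    object
# dtype: object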

Because the values in 'b' are strings rather than integers, column 'b' keeps the object dtype.

astype cast

If you want to force a column to a specific type, you can use astype, for example df.astype(int).

Examples are as follows:


a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[16]: 
  one  two three
0   a  1.2   4.2
1   b   70  0.03
2   x    5     0
df.dtypes
Out[17]: 
one      object
two      object
three    object
df[['two', 'three']] = df[['two', 'three']].astype(float)
df.dtypes
Out[19]: 
one       object
two      float64
three    float64

Summary

That is all about viewing and modifying the data type of each column when pandas reads a CSV file. I hope it helps; if you have any questions, leave a message and we will reply as soon as possible!

