Discussion on the Speed Comparison of pandas DataFrame Data Processing Methods
- 2021-10-24 23:23:39
- OfStack
Data manipulation mainly comes down to adding, deleting, and modifying values. Here we compare the huge differences in processing time among several ways of writing the same operation.
The data volume is about 5 million rows, and the file size is about 100 MB.
1. iloc
iloc is an extremely slow way to write. Here we loop over every row of each csv file and process it with iloc. Sample code:
for index in range(len(df)):
    df['attr'].iloc[index] = xxx  # xxx stands for the new value
It takes about 5 hours to process 5 million rows this way, which is painfully slow.
2. at
Compared with iloc, at gives a big performance improvement; it is still a for loop. Example code:
for i in range(len(df)):
    if df.at[i, 'attr'] > 0:
        sum_positive += df.at[i, 'attr']
    else:
        sum_negative += df.at[i, 'attr']
In my program, at and iloc are interchangeable; switching to at brings the runtime down to about 10 minutes, which is much better but still not fast enough.
3. apply(lambda x: ...)
I tried apply because the for loop was too slow and I wanted to optimize the loop away. Some people online claim that apply can greatly improve speed, but after testing I found that in my program apply performed about the same as the for loop.
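For reference, a minimal sketch of what such an apply version could look like (the article does not show its actual apply code, so this is a plausible reconstruction, not the author's exact program):
# Hypothetical reconstruction: apply still invokes a Python lambda once
# per element, so it is essentially a for loop in disguise.
sum_positive = df['attr'].apply(lambda x: x if x > 0 else 0).sum()
sum_negative = df['attr'].apply(lambda x: x if x < 0 else 0).sum()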
4. Operate directly on the Series
This is the real way to optimize away the for loop. The at program above can be rewritten as:
sum_positive += df['attr'][df.attr > 0].sum()
sum_negative += df['attr'][df.attr < 0].sum()
Switching the program to Series-level processing is much faster: the same 5 million rows of data now run in about 37 seconds, which basically meets expectations.
Two quick notes on DataFrame attribute filtering, i.e. the df.attr > 0 part above. First, pandas really is powerful and convenient.
Second, when filtering on a column, convert the comparison value rather than the column itself. For example, if the column holds strings, do not write float(df.attr) > 0; write df.attr > str(0), because df.attr is a whole Series and cannot be cast as if it were a single value.
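A minimal self-contained sketch of this point (the toy frame and its values are hypothetical, not from the article):
import pandas as pd

# Hypothetical toy frame: 'attr' holds numbers stored as strings.
demo = pd.DataFrame({'attr': ['1', '-2', '3']})

mask = demo.attr > str(0)   # works: element-wise comparison against '0'
# float(demo.attr) > 0      # fails: a whole Series cannot be cast to float
print(demo['attr'][mask])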
Supplement: Single-Element Read and Write Efficiency in a pandas DataFrame
Objective
When using pandas to process financial data and build models, one often needs to walk a DataFrame in time-series order, read the value at a specific position, and test or modify it. Empirically, this is much slower than the same operation on a plain 2-D list or np.array, probably because of the lookups through index and columns (both are dictionary-like rather than contiguous arrays, so every access pays for a lookup) and because of the DataFrame's internal memory layout; this is worth studying in depth later.
Here, a set of numerical experiments compares the efficiency of several methods.
Generate data
A 2-dimensional array of random numbers serves as the DataFrame data; without loss of generality, the column and row names are set to sequentially numbered string tags.
import numpy as np
import pandas as pd
from copy import deepcopy
from time import time

np.random.seed(20000)
I = 900
df = pd.DataFrame(np.random.standard_normal((I, I)),
                  columns=['c'+str(_) for _ in range(I)],
                  index=['i'+str(_) for _ in range(I)])
Then the lookup positions are randomly generated within the valid range. For ease of comparison, the random coordinates are also mapped to their corresponding string names:
columns_num = np.floor(np.random.uniform(0, 1, I) * I).astype(int)
index_num = np.floor(np.random.uniform(0, 1, I) * I).astype(int)
columns_str = ['c'+str(_) for _ in columns_num]
index_str = ['i'+str(_) for _ in index_num]
Read test
First, the traditional method: locate directly by the names in columns and index.
t0 = time()
for m in columns_str:
    for n in index_str:
        c = df[m][n]
print(time()-t0)
6.789840459823608
Taking the column by name first and then indexing into its values by integer position is more than three times faster:
t0 = time()
for m in columns_str:
    for n in index_num:
        c = df[m].values[n]
print(time()-t0)
1.9697318077087402
The loc method is similar in style and speed to direct lookup by names in columns and index:
t0 = time()
for m in columns_str:
    for n in index_str:
        c = df.loc[n, m]
print(time()-t0)
5.661889314651489
The at method is somewhat faster than loc; after all, loc also has to support slicing:
t0 = time()
for m in columns_str:
    for n in index_str:
        c = df.at[n, m]
print(time()-t0)
3.3770089149475098
Now assume we already know the integer coordinates, and compare the positional methods.
Going through values is still slow; it seems that extracting values from df on every access costs time:
t0 = time()
for m in columns_num:
    for n in index_num:
        c = df.values[n][m]
print(time()-t0)
6.041872024536133
Trying iloc once: it makes little difference.
t0 = time()
for m in columns_num:
    for n in index_num:
        c = df.iloc[n, m]
print(time()-t0)
iat for comparison; the improvement is not big, which is a little disappointing:
t0 = time()
for m in columns_num:
    for n in index_num:
        c = df.iat[n, m]
print(time()-t0)
Finally, the most efficient method: extract the 2-dimensional array once and then index into it directly.
t0 = time()
b = df.values
for m in columns_num:
    for n in index_num:
        c = b[n][m]
print(time()-t0)
Modification test
Repeat the same process, this time setting the corresponding value to 0 as a simple test. Don't forget to back up the original data first.
Locate by the names in columns and index:
df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
    for n in index_str:
        df_backup[m][n] = 0.0
print(time()-t0)
41.99269938468933
Take the column by name first, then write by integer position into its values:
df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
    for n in index_num:
        df_backup[m].values[n] = 0.0
print(time()-t0)
The loc method:
df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
    for n in index_str:
        df_backup.loc[n, m] = 0.0
print(time()-t0)
The at method is much faster than loc when modifying values:
df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
    for n in index_str:
        df_backup.at[n, m] = 0.0
print(time()-t0)
Writing through values also does well, close to the read case; again, extracting values from the DataFrame each time seems to be what costs:
df_backup = deepcopy(df)
t0 = time()
for m in columns_num:
    for n in index_num:
        df_backup.values[n][m] = 0.0
print(time()-t0)
The iloc method:
df_backup = deepcopy(df)
t0 = time()
for m in columns_num:
    for n in index_num:
        df_backup.iloc[n, m] = 0.0
print(time()-t0)
122.33384037017822
The iat method:
df_backup = deepcopy(df)
t0 = time()
for m in columns_num:
    for n in index_num:
        df_backup.iat[n, m] = 0.0
print(time()-t0)
Extract the 2-dimensional array once, then write by position:
df_backup = deepcopy(df)
t0 = time()
b = df_backup.values
for m in columns_num:
    for n in index_num:
        b[n][m] = 0.0
print(time()-t0)
0.4298992156982422
Summary
Direct access through values is clearly the most efficient, hence this systematic comparison for the record. The code is a bit verbose, but that makes the experiments easy to copy and rerun. In modeling-level code I am still used to the second method (column name first, then position in values), mainly for readability, maintenance, and ease of modification: the code tells me at a glance which key is being accessed, which is intuitive and easy to read.
In the past, to speed up the code, I extracted the 2-dimensional array first, but keeping track of the columns was painful, and I had to translate the positions back every time I reread the code. One could also wrap the data in classes, but that does not integrate well with pandas, and I have no solution that balances modeling and research efficiency well. Later I will find time to study the internal mechanism of DataFrame.
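One possible middle ground, added here as a minimal sketch (not from the original article): build the name-to-position mappings once, so the fast raw-array access keeps readable names. The labels 'i3' and 'c5' below are arbitrary examples.
# Hedged sketch, assuming the df from the experiments above.
col_pos = {name: j for j, name in enumerate(df.columns)}
row_pos = {name: i for i, name in enumerate(df.index)}

b = df.values                        # extract the 2-D array once
c = b[row_pos['i3'], col_pos['c5']]  # fast access, readable keys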