Discussion on the Speed Comparison of pandas DataFrame Data Processing Methods

  • 2021-10-24 23:23:39
  • OfStack

Data modification mainly comes down to additions, deletions and updates. Here we compare the huge differences in processing time among several ways of writing the same operation.

The data volume is about 5 million rows, and the file size is about 100 MB.

1.iloc

iloc is an extremely slow way to write this. Here we loop over every row of each csv file and process it with iloc. Sample code:


for index in range(len(df)):
  df['attr'].iloc[index] = xxx  # chained assignment, one row at a time

It takes about 5 hours to process 5 million rows of data using this method, which is really slow.

2.at

Compared with iloc, at is a big performance improvement; it is still processed with a for loop. Example code:


for i in range(len(df)):
  if df.at[i, 'attr'] > 0:
    sum_positive += df.at[i, 'attr']
  else:
    sum_negative += df.at[i, 'attr']

In my program, at and iloc are interchangeable; switching to at brings the runtime down to about 10 minutes, a big improvement, but still not fast enough.

3.apply(lambda x:...)

apply is worth mentioning because the for loop felt too slow and I wanted to optimize it away. Some people online say apply can greatly improve speed; however, after testing, it turned out that in my program apply and the for loop performed about the same.
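For reference, a minimal sketch (toy data, using the same hypothetical column name 'attr' as the earlier examples) of an apply-based version; apply still invokes a Python function per element, which is why it performs about the same as the explicit loop:

```python
import pandas as pd

df = pd.DataFrame({'attr': [3.0, -1.5, 2.5, -0.5]})

# apply calls the lambda once per element, so under the hood it is
# another interpreted loop, not true vectorization.
pos = df['attr'].apply(lambda x: x if x > 0 else 0.0).sum()
print(pos)  # 5.5
```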

4. Operate directly on the Series

This is the real way to optimize the for loop. Taking the above at program as an example, it can be rewritten as:


sum_positive += df['attr'][df.attr > 0].sum()
sum_negative += df['attr'][df.attr < 0].sum()

Switching the program to Series operations is much faster: the same 5 million rows of data now run in about 37 seconds, which basically meets expectations.
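As a quick sanity check, the vectorized sums agree with the loop version; a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'attr': [3.0, -1.5, 2.5, -0.5]})

# loop version (the slow at-based approach)
sum_positive_loop = 0.0
sum_negative_loop = 0.0
for i in range(len(df)):
    if df.at[i, 'attr'] > 0:
        sum_positive_loop += df.at[i, 'attr']
    else:
        sum_negative_loop += df.at[i, 'attr']

# vectorized Series version
sum_positive = df['attr'][df.attr > 0].sum()
sum_negative = df['attr'][df.attr < 0].sum()

assert sum_positive == sum_positive_loop  # 5.5
assert sum_negative == sum_negative_loop  # -2.0
```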

A couple of notes on DataFrame attribute filtering, i.e. the df.attr > 0 part above. First of all, pandas really is powerful and convenient for this.

Secondly, when filtering, we should not try to convert the column itself; we should convert the value being compared. For example, do not write float(df.attr) > 0; write df.attr > str(0) instead (when the column holds strings), because float() cannot be applied to a whole Series like df.attr.

Supplement: DataFrame Single Data Extraction Efficiency and Modification Efficiency in pandas

Objectives

When using pandas to process financial data and build models, it is often necessary to walk DataFrame data in time-series order, reading the value at a specific position and judging or modifying it. Empirically, this is much slower than the same operation on a plain 2D list or np.array, probably because of the lookups through index and columns (both are dictionary-like rather than contiguous arrays, so every access pays a lookup cost) and because of the memory layout of the data inside the DataFrame; this is worth studying in depth when there is a chance.

Here, a set of numerical experiments is run to compare the efficiency of several methods.

Generate data

A 2-dimensional array of random numbers serves as the DataFrame data; without loss of generality, the column and row names are set to ordered string labels.


import numpy as np
import pandas as pd

from copy import deepcopy
from time import time

np.random.seed(20000)
I = 900
df = pd.DataFrame(np.random.standard_normal((I, I)),
                  columns=['c'+str(_) for _ in range(I)],
                  index=['i'+str(_) for _ in range(I)])

Then the target positions are generated at random within the valid range. For ease of comparison, the random coordinates are also mapped to the corresponding string names.


columns_num = np.floor(np.random.uniform(0, 1, I) * I).astype(int)
index_num = np.floor(np.random.uniform(0, 1, I) * I).astype(int)

columns_str = ['c'+str(_) for _ in columns_num]
index_str = ['i'+str(_) for _ in index_num]

Read test

First, the traditional method: locate directly by the names in columns and index (chained indexing).


t0 = time()
for m in columns_str:
  for n in index_str:
    c = df[m][n]
print(time()-t0)

6.789840459823608

Taking the column by name first and then indexing into its values by row position is noticeably faster, more than three times here:


t0 = time()
for m in columns_str:
  for n in index_num:
    c = df[m].values[n]
print(time()-t0)

1.9697318077087402

The loc method's speed is similar to directly indexing by the names in columns and index.


t0 = time()
for m in columns_str:
  for n in index_str:
    c = df.loc[n, m]
print(time()-t0)

5.661889314651489

The at method is a bit faster than loc; after all, loc also has to support slicing.


t0 = time()
for m in columns_str:
  for n in index_str:
    c = df.at[n, m]
print(time()-t0)

3.3770089149475098

Now suppose we already know the integer coordinates, and compare:

Going through values each time is still slow; it seems that extracting values from the df on every access is what costs the time.


t0 = time()
for m in columns_num:
  for n in index_num:
    c = df.values[n][m]
print(time()-t0)

6.041872024536133

Trying iloc once: it makes little difference.


t0 = time()
for m in columns_num:
  for n in index_num:
    c = df.iloc[n, m]
print(time()-t0)

iat, for comparison: the improvement is not big, which is a little disappointing.


t0 = time()
for m in columns_num:
  for n in index_num:
    c = df.iat[n, m]
print(time()-t0)

Finally, the most efficient method: extract the 2-dimensional array once, then index into it.


t0 = time()
b = df.values
for m in columns_num:
  for n in index_num:
    c = b[n][m]
print(time()-t0)

Modification test

Repeat the process, this time setting each target value to 0.0 as a simple test. Don't forget to back up the original data first.

Locate by the names in columns and index


df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
  for n in index_str:
    df_backup[m][n] = 0.0
print(time()-t0)

41.99269938468933

Take the column by name, then assign by row position through its values


df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
  for n in index_num:
    df_backup[m].values[n] = 0.0
print(time()-t0)

loc method


df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
  for n in index_str:
    df_backup.loc[n, m] = 0.0
print(time()-t0)

The at method is much faster than loc at modifying values.


df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
  for n in index_str:
    df_backup.at[n, m] = 0.0
print(time()-t0)

Modification through values is also good, close to the read timing; again the per-access extraction of values seems to be the cost.


df_backup = deepcopy(df)
t0 = time()
for m in columns_num:
  for n in index_num:
    df_backup.values[n][m] = 0.0
print(time()-t0)

iloc method


df_backup = deepcopy(df)
t0 = time()
for m in columns_num:
  for n in index_num:
    df_backup.iloc[n, m] = 0.0
print(time()-t0)

122.33384037017822

iat method


df_backup = deepcopy(df)
t0 = time()
for m in columns_num:
  for n in index_num:
    df_backup.iat[n, m] = 0.0
print(time()-t0)

Extract the 2-dimensional array once, then assign into it


df_backup = deepcopy(df)
t0 = time()
b = df_backup.values
for m in columns_num:
  for n in index_num:
    b[n][m] = 0.0
print(time()-t0)

0.4298992156982422

Summary

Reading values directly is clearly the most efficient, so this systematic comparison is worth recording. The code is a bit repetitive, but that makes the experiments easy to copy and rerun. In modeling-level code I am still used to the second method, mainly for readability, maintenance and modification: the code tells me what each key refers to, which is intuitive and easy to read.

In the past, to improve runtime efficiency, I extracted the 2-dimensional array first, but tracking the columns was painful and required a mental translation every time I reread the code. One could also wrap the data in classes, but that does not integrate well with pandas, and from the standpoint of modeling and research efficiency there is no good solution yet. Later I will find time to study the internal mechanism of DataFrame.

