# A Speed Comparison of pandas DataFrame Data Processing Methods

- 2021-10-24 23:23:39
- OfStack

Data manipulation boils down to adding, deleting, and modifying values. Here we compare the huge differences in processing time among several ways of writing the same operations.

The data set is about 5 million rows, and the file size is roughly 100 MB.

## 1.iloc

iloc is an extremely slow way to write this. Here we loop over every row of the CSV file and process it with iloc. The sample code is as follows:

```
for index in range(len(df)):
    # per-row assignment through positional indexing -- the slow pattern
    df['attr'].iloc[index] = xxx
```

Processing 5 million rows this way takes about 5 hours, which is painfully slow.
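A runnable toy version of the same row-by-row pattern, assuming a numeric column named attr and using a constant in place of the elided xxx (both the data and the constant are illustrative, not from the original program):

```python
import pandas as pd

df = pd.DataFrame({'attr': [1.5, -2.0, 3.0]})  # toy stand-in for the 5-million-row file

# positional write, one cell at a time -- the pattern that is slow at scale
col = df.columns.get_loc('attr')
for index in range(len(df)):
    df.iloc[index, col] = 0.0
```

Writing through `df.iloc[row, col_position]` avoids the chained-indexing ambiguity while keeping the same slow per-row loop.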

## 2.at

Compared with iloc, at brings a big performance improvement; it is still for-loop processing. The example code is as follows:

```
sum_positive = 0
sum_negative = 0
for i in range(len(df)):
    if df.at[i, 'attr'] > 0:
        sum_positive += df.at[i, 'attr']
    else:
        sum_negative += df.at[i, 'attr']
```

In my program at and iloc are interchangeable. Switching to at cut the runtime to about 10 minutes, a big improvement, but still not fast enough.

## 3.apply(lambda x:...)

I tried apply because the for loop was too slow and I wanted to optimize the loop away. People online claim apply can greatly improve speed; however, after testing, I found that in my program apply and the for loop performed about the same.
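For reference, a minimal sketch of what the apply variant might look like (toy data; the column name attr matches the earlier example, everything else is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'attr': [1.5, -2.0, 3.0, -0.5]})  # toy stand-in

# apply still invokes a Python function per element, so it is
# essentially a hidden loop -- hence the runtime similar to the for loop
sum_positive = df['attr'].apply(lambda x: x if x > 0 else 0).sum()
sum_negative = df['attr'].apply(lambda x: x if x < 0 else 0).sum()
```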

## 4. Treat directly with series

This is the real way to optimize away the for loop. Taking the at program above as an example, it can be rewritten as:

```
sum_positive += df['attr'][df.attr > 0].sum()
sum_negative += df['attr'][df.attr < 0].sum()
```

Switching to Series-level processing is much faster: the same 5 million rows now run in about 37 seconds, which basically meets expectations.

Two remarks on the attribute filtering above, that is, the `df.attr > 0` part. First, pandas really is powerful and convenient.

Second, when filtering on a column, do not convert the column; convert the value it is compared against. For example, do not write

`float(df.attr) > 0`, but rather

`df.attr > str(0)` (when the column holds strings), because `df.attr` is a whole Series and cannot be cast with `float()`.
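A small illustration of the point, assuming a column that happens to hold strings (toy data, not from the original program):

```python
import pandas as pd

df = pd.DataFrame({'attr': ['3', '-1', '2']})  # a column stored as strings

# float(df.attr) > 0 would raise TypeError: df.attr is a whole Series,
# so convert the comparison value (or, often better, the column) instead
mask = df.attr > str(0)                # lexicographic comparison against '0'
clean = df.attr.astype(float) > 0      # converting the column once is usually more robust
```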

# Supplement: Single-Value Read and Modification Efficiency in a pandas DataFrame

## Objectives

When using pandas to process financial data and build models, one often needs to walk a DataFrame in time-series order, read the value at a specific position, and test or modify it. Empirically, this is much slower than the same operation on a plain 2-D list or np.array, probably because of the label lookups on index and columns (both are dict-like rather than contiguous arrays, so every lookup costs time) and because of the DataFrame's internal memory layout; this is worth studying in depth later.

Here, a set of numerical experiments compares the efficiency of several methods.
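The timing pattern used throughout the experiments (`t0 = time()` before, `time() - t0` after) can be wrapped in a small helper; the name timed is mine, a sketch rather than part of the original code:

```python
from time import time

def timed(fn):
    """Run fn once and return (elapsed_seconds, result)."""
    t0 = time()
    result = fn()
    return time() - t0, result

# usage: time an arbitrary zero-argument callable
elapsed, total = timed(lambda: sum(range(1000)))
```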

## Generate data

Use a 2-D array of random numbers as the DataFrame data; without loss of generality, set the column and row names to ordered label strings.

```
import numpy as np
import pandas as pd
from copy import deepcopy
from time import time

np.random.seed(20000)
I = 900
df = pd.DataFrame(np.random.standard_normal((I, I)),
                  columns=['c'+str(_) for _ in range(I)],
                  index=['i'+str(_) for _ in range(I)])
```

Then generate random positions within range. For ease of comparison, also map the random integer coordinates to the corresponding string names:

```
columns_num = np.floor(np.random.uniform(0, 1, I) * I).astype(int)
index_num = np.floor(np.random.uniform(0, 1, I) * I).astype(int)
columns_str = ['c'+str(_) for _ in columns_num]
index_str = ['i'+str(_) for _ in index_num]
```

## Read test

First, the traditional method: locate values directly by the names in columns and index.

```
t0 = time()
for m in columns_str:
    for n in index_str:
        c = df[m][n]
print(time()-t0)
6.789840459823608
```

Taking the column by name first and then indexing by row position in its values is noticeably faster, roughly 3x here:

```
t0 = time()
for m in columns_str:
    for n in index_num:
        c = df[m].values[n]
print(time()-t0)
1.9697318077087402
```

The loc method is similar in both style and speed to indexing by names in columns and index directly:

```
t0 = time()
for m in columns_str:
    for n in index_str:
        c = df.loc[n, m]
print(time()-t0)
5.661889314651489
```

The at method is a bit faster than loc; after all, loc also supports slicing, so it does more work:

```
t0 = time()
for m in columns_str:
    for n in index_str:
        c = df.at[n, m]
print(time()-t0)
3.3770089149475098
```

Now suppose we already know the integer row and column coordinates, and compare:

Going through values each time is still slow; it seems extracting values from df on every access takes time:

```
t0 = time()
for m in columns_num:
    for n in index_num:
        c = df.values[n][m]
print(time()-t0)
6.041872024536133
```

Trying iloc makes little difference:

```
t0 = time()
for m in columns_num:
    for n in index_num:
        c = df.iloc[n, m]
print(time()-t0)
```

iat, for comparison; the improvement is not big, which is a little disappointing:

```
t0 = time()
for m in columns_num:
    for n in index_num:
        c = df.iat[n, m]
print(time()-t0)
```

Finally, the most efficient method: extract the 2-D array once, then index into it:

```
t0 = time()
b = df.values
for m in columns_num:
    for n in index_num:
        c = b[n][m]
print(time()-t0)
```

## Modification test

Repeat the process, setting each visited value to 0 as a simple modification test. Don't forget to back up the original data first.

Locate by the names in columns and index:

```
df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
    for n in index_str:
        df_backup[m][n] = 0.0
print(time()-t0)
41.99269938468933
```

Take the column by name, then the row coordinate in values:

```
df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
    for n in index_num:
        df_backup[m].values[n] = 0.0
print(time()-t0)
```

The loc method:

```
df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
    for n in index_str:
        df_backup.loc[n, m] = 0.0
print(time()-t0)
```

The at method is much faster than loc for modifying values:

```
df_backup = deepcopy(df)
t0 = time()
for m in columns_str:
    for n in index_str:
        df_backup.at[n, m] = 0.0
print(time()-t0)
```

Modifying through values is also decent, and close to the read timing; again, extracting values on every access seems to be the cost:

```
df_backup = deepcopy(df)
t0 = time()
for m in columns_num:
    for n in index_num:
        df_backup.values[n][m] = 0.0
print(time()-t0)
```

The iloc method:

```
df_backup = deepcopy(df)
t0 = time()
for m in columns_num:
    for n in index_num:
        df_backup.iloc[n, m] = 0.0
print(time()-t0)
122.33384037017822
```

The iat method:

```
df_backup = deepcopy(df)
t0 = time()
for m in columns_num:
    for n in index_num:
        df_backup.iat[n, m] = 0.0
print(time()-t0)
```

Extract the 2-D array once, then modify in place:

```
df_backup = deepcopy(df)
t0 = time()
b = df_backup.values
for m in columns_num:
    for n in index_num:
        b[n][m] = 0.0
print(time()-t0)
0.4298992156982422
```

## Summary

For raw efficiency, going through values directly is clearly best, hence this systematic record. The code is a bit repetitive, but that makes the experiments easy to copy and rerun. In modeling-level code I am still used to the second method (column name plus values), mainly for readability, maintenance, and modification: the code tells me which key is being accessed, which is intuitive and easy to read.

In the past, to improve runtime efficiency, I would extract the 2-D array first, but handling columns then became awkward, and I had to translate the coordinates back whenever I reread the code. One could also wrap the data in classes, but that does not integrate well with pandas, and there is no good solution that balances modeling convenience and research efficiency. Later I will find time to study the internal mechanism of DataFrame.