Using Python to quickly open a very large Excel file with a million rows

  • 2021-09-12 01:40:33
  • OfStack

A user on Zhihu asked for help: when he tried to open an Excel file of about 20 MB, whether with pandas' read_excel or directly with the xlrd or openpyxl module, it was unbearably slow, taking about a minute.

Is that really what happens? My first guess was that this user did not enable read-only mode when using the openpyxl module. To make testing easier, first generate an Excel file with 1 million rows of data using the following code.


>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> sh = wb.active
>>> sh.append(['id', 'Language', 'Mathematics', 'English', 'Physics'])
>>> for i in range(1000000):  # write 1,000,000 rows of data
	sh.append([i+1, 90, 100, 95, 99])

>>> wb.save(r'd:\bigxlsx.xlsx')
>>> import os
>>> os.path.getsize(r'd:\bigxlsx.xlsx')  # file size: about 20 MB
20230528

Next, define a function that opens the file with the openpyxl module, and compare how long it takes with read-only mode off and on.


>>> from openpyxl import load_workbook
>>> import time
>>> def read_xlsx(read_only):
	t0 = time.time()
	wb = load_workbook(r'd:\bigxlsx.xlsx', read_only=read_only)
	t1 = time.time()  # the timing covers only load_workbook()
	sh = wb.active  # get the active worksheet of the loaded workbook
	print(wb.sheetnames)
	print(sh.cell(row=1, column=1).value)
	print(sh.cell(row=100, column=3).value)
	print('Elapsed %0.3f seconds' % (t1 - t0))

>>> read_xlsx(True)
['Sheet']
id
100
Elapsed 0.404 seconds
>>> read_xlsx(False)
['Sheet']
id
100
Elapsed 67.817 seconds

Run the test and, sure enough, without read-only mode it really does take more than a minute, while in read-only mode it takes only 0.4 seconds.

However, don't celebrate too soon. The openpyxl module has no function like pandas.read_excel() that reads all the data into one data structure; it can only return data after you navigate to specific rows, columns or cells. To read all the data into an array or DataFrame with openpyxl, you have to traverse every row and column, which is still a very time-consuming operation, as the sketch below shows.
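For reference, here is a minimal sketch (not part of the original test, and reusing the load_workbook and time imports from above) of what such a full traversal looks like with openpyxl's iter_rows() in read-only mode; expect it to take far longer than merely opening the workbook:


>>> def read_all_rows():
	t0 = time.time()
	wb = load_workbook(r'd:\bigxlsx.xlsx', read_only=True)
	sh = wb.active
	# values_only=True yields plain cell values instead of Cell objects
	rows = [row for row in sh.iter_rows(values_only=True)]
	wb.close()  # a read-only workbook keeps the file handle open until closed
	print(len(rows))
	print('Elapsed %0.3f seconds' % (time.time() - t0))
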

So, does pandas.read_excel() support a read-only mode as well? Unfortunately, read_excel() has no parameter like read_only. Although read_excel() can accept a file path, a file object, a file-like object, or even the binary contents of the file, parsing these 1 million rows still takes about 80 seconds even when the contents are passed in from memory. The following code verifies this point.


>>> import pandas as pd
>>> import io
>>> def read_excel_by_pandas():
	with open(r'd:\bigxlsx.xlsx', 'rb') as fp:
		content = fp.read()
		t0 = time.time()
		# wrap the in-memory bytes in a file-like object for read_excel
		df = pd.read_excel(io.BytesIO(content), engine='openpyxl')
		t1 = time.time()
	print(df.head())
	print(df.tail())
	print('Elapsed %0.3f seconds' % (t1 - t0))

>>> read_excel_by_pandas()
   id  Language  Mathematics  English  Physics
0   1        90          100       95       99
1   2        90          100       95       99
2   3        90          100       95       99
3   4        90          100       95       99
4   5        90          100       95       99
             id  Language  Mathematics  English  Physics
999995   999996        90          100       95       99
999996   999997        90          100       95       99
999997   999998        90          100       95       99
999998   999999        90          100       95       99
999999  1000000        90          100       95       99
Elapsed 81.369 seconds

Conclusion: when dealing with very large Excel files, the read-only mode of the openpyxl module lets you open the file quickly and fetch the values of specific cells, but don't try to read all the data into a data structure of your own; that will still take a long time, and pandas can't help here either.
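That said, if you only need part of the sheet, read-only mode stays fast: iter_rows() accepts row bounds, so you can pull a small slice without walking the whole file. A minimal sketch under the same setup (the row range is just an example):


>>> def read_first_rows(n=10):
	t0 = time.time()
	wb = load_workbook(r'd:\bigxlsx.xlsx', read_only=True)
	sh = wb.active
	# only the requested row range is streamed from the file
	rows = [row for row in sh.iter_rows(min_row=1, max_row=n, values_only=True)]
	wb.close()
	print(rows[0])
	print('Elapsed %0.3f seconds' % (time.time() - t0))
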

