Four different ways to read files using Python
- 2020-06-01 10:14:17
- OfStack
preface
Python can read files in several ways, but with a large file the choice of method makes a real difference in speed and memory use. The details follow.
scenario
Read one large 2.9 GB file line by line
CPU: i7 6820HQ, RAM: 32 GB
methods
Each line is split once after it is read.
All of the following methods open the file with the with ... as statement.
The with statement suits resource access: it guarantees that the necessary cleanup runs whether or not an exception occurs during use, for example closing a file automatically after use, or acquiring and releasing a lock automatically in threaded code.
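As a small illustration of that cleanup guarantee, the sketch below raises an exception inside a with block and then checks whether the file was closed. The temporary file and the names `fh_ref` and `closed_after_error` are made up for this demonstration and are not part of the original test.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("a|b|c\n")

fh_ref = None
try:
    with open(path, "r") as fh:
        fh_ref = fh          # keep a reference so we can inspect it later
        fh.readline()
        raise ValueError("simulated failure while processing")
except ValueError:
    pass

# The with block closed the file even though an exception was raised.
closed_after_error = fh_ref.closed
print(closed_after_error)

os.remove(path)
```

Without with, the same guarantee would need an explicit try/finally around fh.close().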
Method 1: the most common way to read a file
with open(file, 'r') as fh:
    for line in fh.readlines():
        line.split("|")
Result: 15.4346568584 seconds
The system monitor shows memory use soaring from 4.8 GB to 8.4 GB. fh.readlines() stores every line it reads in memory, so this method is suitable only for small files.
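One way to see this effect on a smaller scale is to compare peak allocations with the standard tracemalloc module. This is only a rough sketch: the sample file, its size, and the helper `peak_bytes` are assumptions made for illustration, and tracemalloc counts Python-level allocations rather than the process memory a system monitor reports.

```python
import os
import tempfile
import tracemalloc

# Hypothetical sample data: 50,000 short pipe-delimited lines.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    for i in range(50_000):
        out.write(f"{i}|alpha|beta|gamma\n")

def peak_bytes(func):
    """Return the peak traced allocation (in bytes) while func() runs."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def with_readlines():
    with open(path, "r") as fh:
        for line in fh.readlines():   # builds the full list of lines first
            line.split("|")

def with_iteration():
    with open(path, "r") as fh:
        for line in fh:               # holds one line at a time
            line.split("|")

peak_readlines = peak_bytes(with_readlines)
peak_iteration = peak_bytes(with_iteration)
print(f"readlines(): {peak_readlines:,} bytes peak")
print(f"iteration:   {peak_iteration:,} bytes peak")

os.remove(path)
```

The exact numbers will vary by Python version and platform; the point is only the gap between the two peaks.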
Method 2
with open(file, 'r') as fh:
    line = fh.readline()
    while line:
        line.split("|")
        line = fh.readline()
Result: 22.3531990051 seconds
Memory use barely changes, because only one line is held in memory at a time, but the run is noticeably slower than Method 1, which makes it inefficient when further processing is needed.
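The readline() loop relies on one detail: readline() returns an empty string only at end of file, while a blank line comes back as "\n" and so does not stop the loop. A possible way to write the same loop without repeating the readline() call is the two-argument form of iter(); the sample file below is invented for the demonstration, and this variant was not part of the timing test.

```python
import os
import tempfile

# Hypothetical sample file with a blank line in the middle.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("a|b\nc|d\n\ne|f\n")

rows = []
with open(path, "r") as fh:
    # iter(callable, sentinel) keeps calling fh.readline() until it returns "".
    for line in iter(fh.readline, ""):
        rows.append(line.rstrip("\n").split("|"))

print(rows)

os.remove(path)
```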
Method 3
with open(file) as fh:
    for line in fh:
        line.split("|")
Result: 13.9956979752 seconds
Memory use is almost unchanged, and this method is faster than Method 2.
for line in fh treats the file object fh as an iterable, which automatically uses buffered I/O and memory management, so you don't have to worry about large files. This is the Pythonic way!
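Because the file object is its own iterator, it can be advanced by hand with next() and then passed to a for loop or a generator, which continues from the same position. A minimal sketch, assuming a made-up sample file with a header row and a hypothetical helper `fields`:

```python
import os
import tempfile

# Hypothetical sample file with a header row.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("id|name\n1|ann\n2|bob\n")

def fields(fh, sep="|"):
    """Lazily yield the split fields of each remaining line."""
    for line in fh:
        yield line.rstrip("\n").split(sep)

with open(path, "r") as fh:
    is_own_iterator = iter(fh) is fh           # file objects return themselves
    header = next(fh).rstrip("\n").split("|")  # consume the first line by hand
    # The generator resumes where next() stopped; no list of lines is built.
    row_count = sum(1 for _ in fields(fh))

print(is_own_iterator, header, row_count)

os.remove(path)
```

Chaining generators like this keeps memory use flat no matter how large the file is, since each stage handles one line at a time.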
Method 4: the fileinput module
import fileinput
for line in fileinput.input(file):
    line.split("|")
Result: 26.1103110313 seconds
Memory use grew by 200-300 MB, and this was the slowest of the four methods.
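fileinput is designed for looping over several input files as one stream while tracking the current file name and line numbers, and that bookkeeping may account for part of the overhead seen here. The sketch below shows that use case; the two small files and the `seen` list are invented for the example.

```python
import fileinput
import os
import shutil
import tempfile

# Hypothetical input: two small files in a temporary directory.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "1|x\n2|y\n"), ("b.txt", "3|z\n")]:
    p = os.path.join(tmpdir, name)
    with open(p, "w") as out:
        out.write(text)
    paths.append(p)

seen = []
# fileinput.input() returns a FileInput object usable as a context manager.
with fileinput.input(files=paths) as stream:
    for line in stream:
        seen.append((
            os.path.basename(stream.filename()),  # file currently being read
            stream.filelineno(),                  # line number within that file
            stream.lineno(),                      # cumulative line number
            line.rstrip("\n").split("|"),
        ))

for entry in seen:
    print(entry)

shutil.rmtree(tmpdir)
```

For a single large file where none of this tracking is needed, plain iteration over the file object does the same work with less machinery.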
conclusion
The results above are for reference only. Method 3 is generally regarded as the best way to read a large file, but actual results depend on the performance of the machine and the complexity of the data processing.
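Since the numbers depend on the machine, it may be worth repeating the comparison locally. The harness below is one possible sketch, not the script used for the figures above: the function names `method1` to `method4`, the generated sample file, and its size are all assumptions, and a realistic test would point `path` at a genuinely large file.

```python
import fileinput
import os
import tempfile
import time

# Hypothetical sample data; replace with a real large file for meaningful timings.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    for i in range(100_000):
        out.write(f"{i}|alpha|beta|gamma\n")

def method1(p):
    n = 0
    with open(p, "r") as fh:
        for line in fh.readlines():      # reads everything into a list
            line.split("|")
            n += 1
    return n

def method2(p):
    n = 0
    with open(p, "r") as fh:
        line = fh.readline()
        while line:                      # "" signals end of file
            line.split("|")
            n += 1
            line = fh.readline()
    return n

def method3(p):
    n = 0
    with open(p, "r") as fh:
        for line in fh:                  # iterate over the file object
            line.split("|")
            n += 1
    return n

def method4(p):
    n = 0
    with fileinput.input(files=[p]) as stream:
        for line in stream:
            line.split("|")
            n += 1
    return n

counts = {}
for func in (method1, method2, method3, method4):
    start = time.perf_counter()
    counts[func.__name__] = func(path)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {elapsed:.4f} s, {counts[func.__name__]} lines")

os.remove(path)
```

Each function returns the number of lines it processed, which gives a quick check that all four methods read the same data.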