Python: quickly dumping a large txt file to csv

  • 2021-01-06 00:40:08
  • OfStack

The txt file here is space-delimited, while a csv file is comma-delimited, so converting txt to csv means turning the spaces between fields into commas.
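
For example (the values in this row are made up purely for illustration), a space-delimited row becomes a comma-delimited one:

row_txt = "1001 2002 1 20210105"     # a hypothetical row from the space-delimited txt file
row_csv = ",".join(row_txt.split())  # the same fields joined by commas
print(row_csv)                       # prints: 1001,2002,1,20210105

Doing this for the whole file at once with numpy and pandas: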


import numpy as np
import pandas as pd

# Load the whole whitespace-delimited txt file into memory as a numpy array
data_txt = np.loadtxt('datas_train.txt')
# Wrap it in a DataFrame and write it back out comma-separated
data_txtDF = pd.DataFrame(data_txt)
data_txtDF.to_csv('datas_train.csv', index=False)

datas_train.txt is a bit under 100 MB with about 5.6 million rows, and the conversion finishes in under 3 minutes.

Then I tried a 56-million-row txt file totalling about 1.2 GB; with the code above, the computer simply froze.

The reason is that the code above loads the entire txt file into memory before converting it, which exhausts the machine's memory.
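
A back-of-the-envelope estimate shows why (assuming four numeric columns stored as float64, which is what np.loadtxt parses everything into by default):

rows, cols = 56_000_000, 4        # 5600W rows, four columns
bytes_needed = rows * cols * 8    # 8 bytes per float64 value
print(bytes_needed / 1024 ** 3)   # roughly 1.7 GB for the bare array, before any parsing overhead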

So I switched to processing the data in chunks. The implementation is as follows:


import pandas as pd

# Build an iterator over the file instead of loading it all at once.
# read_table defaults to tab-separated input; pass sep=' ' or delim_whitespace=True
# if the file is space-delimited.
train_data = pd.read_table('big_data.txt', iterator=True, header=None)

while True:
    try:
        # Pull the next 5.6 million rows into memory
        chunk = train_data.get_chunk(5600000)
        chunk.columns = ['user_id', 'spu_id', 'buy_or_not', 'date']
        # Append the chunk to the csv without repeating the header
        chunk.to_csv('big_data111.csv', mode='a', header=False, index=False)
    except StopIteration:
        # The iterator is exhausted: all rows have been written
        break

Here I read the data in chunks of 5.6 million rows each; the whole file is consumed in 11 reads, and it is very fast: the conversion took about 5 minutes.
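
An equivalent variant (a sketch under the same assumptions about the file layout) passes chunksize to read_table and iterates with a for loop, so no explicit get_chunk() or StopIteration handling is needed:

import pandas as pd

# chunksize turns read_table into an iterable of DataFrames, 5.6 million rows at a time
for chunk in pd.read_table('big_data.txt', chunksize=5600000, header=None):
    chunk.columns = ['user_id', 'spu_id', 'buy_or_not', 'date']
    chunk.to_csv('big_data111.csv', mode='a', header=False, index=False)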

Note that the argument to get_chunk() is a number of rows, not a number of bytes.
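
A quick way to confirm this (a throwaway check against the same file):

import pandas as pd

reader = pd.read_table('big_data.txt', iterator=True, header=None)
small = reader.get_chunk(5)
print(len(small))   # 5 -- five rows, no matter how many bytes those rows occupy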

