Python: read a CSV file, remove a column, and write a new file

  • 2020-06-23 00:41:59
  • OfStack

The problem can be solved in two ways, both of which are well-known solutions found online.

Scenario description:

A data file, saved as plain text, currently has three columns: user_id, plan_id, mobile_id. The goal is to produce a new file containing only mobile_id and plan_id.
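For concreteness, a few hypothetical lines of such an input file might look like this (the values are invented purely for illustration):

user_id,plan_id,mobile_id
1001,2001,13800000001
1002,2002,13800000002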

The solution

Option 1: Open the files directly with Python's built-in open(), process each line in a for loop, keep the wanted columns, and write them to the new file.

The code is as follows:


def readwrite1(input_file, output_file):
    f = open(input_file, 'r')
    out = open(output_file, 'w')
    for line in f.readlines():
        a = line.rstrip("\n").split(",")
        # keep only mobile_id (column 3) and plan_id (column 2)
        x = a[2] + "," + a[1] + "\n"
        out.writelines(x)
    f.close()
    out.close()
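As a side note, a slightly more idiomatic variant of the same logic uses with statements (so both files are closed even if an error occurs) and the standard csv module; this is only a sketch and was not part of the timing comparison below:

import csv

def readwrite1_csv(input_file, output_file):
    # open both files via context managers; newline='' is the csv-module convention
    with open(input_file, 'r', newline='') as f, open(output_file, 'w', newline='') as out:
        reader = csv.reader(f)
        writer = csv.writer(out)
        for row in reader:
            # row = [user_id, plan_id, mobile_id]; keep mobile_id and plan_id
            writer.writerow([row[2], row[1]])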

Option 2: Read the data into a pandas DataFrame with read_csv, select the columns you need, and write the result straight to the new file with to_csv.

The code is as follows:


import pandas as pd

def readwrite2(input_file, output_file):
    date_1 = pd.read_csv(input_file, header=0, sep=',')
    # select only the two wanted columns and write them out
    date_1[['mobile_id', 'plan_id']].to_csv(output_file, sep=',', header=True, index=False)
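If the only goal is to drop a column, pandas can also avoid reading the unwanted column at all via the usecols argument of read_csv. A minimal sketch, assuming the same column names as above:

import pandas as pd

def readwrite2_usecols(input_file, output_file):
    # read only the two needed columns; reorder to mobile_id, plan_id before writing
    df = pd.read_csv(input_file, sep=',', usecols=['mobile_id', 'plan_id'])
    df[['mobile_id', 'plan_id']].to_csv(output_file, sep=',', header=True, index=False)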

From a code-readability perspective, the pandas version is clearer.

Now let's compare the execution efficiency of the two approaches.


import time

def getRunTimes(fun, input_file, output_file):
    begin_time = int(round(time.time() * 1000))
    fun(input_file, output_file)
    end_time = int(round(time.time() * 1000))
    print("Read and write running time:", (end_time - begin_time), "ms")

getRunTimes(readwrite1, input_file, output_file)   # plain file read/write with a for loop
getRunTimes(readwrite2, input_file, output_file1)  # read and write via a pandas DataFrame
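As an aside, time.perf_counter() is usually a more precise way to time code than time.time(); a minimal sketch of the same wrapper (not what produced the numbers below):

import time

def get_run_times(fun, input_file, output_file):
    begin = time.perf_counter()
    fun(input_file, output_file)
    end = time.perf_counter()
    print("Read and write running time:", round((end - begin) * 1000), "ms")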

Read and write runtime: 976 ms

Read and write runtime: 777 ms

input_file contains about 270,000 rows. Even at that size, the DataFrame version is already somewhat faster than the for loop. Does the gap become more obvious as the data volume grows?

Repeating the test with more records in input_file gives the following results (times in ms):

input_file rows    readwrite1 (ms)    readwrite2 (ms)
270,000            976                777
550,000            1989               1509
1,100,000          4312               3158

Judging from the test results above, the DataFrame approach is roughly 30% more efficient than the plain for loop.

