Python: reading a CSV file, removing a column, and writing a new file
- 2020-06-23 00:41:59
- OfStack
The problem was solved in two ways, both of which are existing solutions found online.
Scene description:
A data file, saved as text, has three columns: user_id, plan_id, mobile_id. The goal is to produce a new file containing only mobile_id and plan_id.
The solution
Scheme 1: open the file with Python's built-in open, process each line in a for loop, and write the selected columns to the new file.
The code is as follows:
def readwrite1(input_file, output_file):
    f = open(input_file, 'r')
    out = open(output_file, 'w')
    for line in f.readlines():
        a = line.split(",")
        # keep mobile_id (3rd column) and plan_id (2nd column);
        # strip() removes the trailing newline from the last column
        x = a[2].strip() + "," + a[1] + "\n"
        out.writelines(x)
    f.close()
    out.close()
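Splitting on a bare comma breaks if any field ever contains a quoted comma. A minimal alternative sketch using the standard-library csv module (the function name `remove_column` and the `keep` parameter are my own) handles quoting correctly and selects columns by header name instead of position:

```python
import csv

def remove_column(input_file, output_file, keep=("mobile_id", "plan_id")):
    """Copy only the columns named in `keep`, preserving the order given."""
    with open(input_file, newline="") as f, open(output_file, "w", newline="") as out:
        reader = csv.reader(f)
        writer = csv.writer(out)
        header = next(reader)                      # first row holds the column names
        idx = [header.index(name) for name in keep]
        writer.writerow(keep)                      # write the new, smaller header
        for row in reader:
            writer.writerow([row[i] for i in idx])
```

Looking columns up by name also means the function keeps working if the column order in the source file changes.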
Scheme 2: read the data into a pandas DataFrame, select the wanted columns, and write them to the new file with DataFrame.to_csv.
The code is as follows:
import pandas as pd

def readwrite2(input_file, output_file):
    date_1 = pd.read_csv(input_file, header=0, sep=',')
    date_1[['mobile_id', 'plan_id']].to_csv(output_file, sep=',', header=True, index=False)
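pandas can also skip the unwanted column at read time with the `usecols` parameter of `read_csv`, so user_id is never loaded into memory at all. A small sketch (the name `readwrite3` is my own; column names follow the scene description):

```python
import pandas as pd

def readwrite3(input_file, output_file):
    # Read only the two wanted columns; usecols filters during parsing.
    df = pd.read_csv(input_file, usecols=["mobile_id", "plan_id"])
    # Reorder explicitly, since usecols keeps the file's column order.
    df[["mobile_id", "plan_id"]].to_csv(output_file, index=False)
```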
In terms of code, the pandas version is clearly more readable.
Now let's compare execution speed.
import time

def getRunTimes(fun, input_file, output_file):
    begin_time = int(round(time.time() * 1000))
    fun(input_file, output_file)
    end_time = int(round(time.time() * 1000))
    print("Read and write running time:", (end_time - begin_time), "ms")
getRunTimes(readwrite1, input_file, output_file)   # plain file I/O
getRunTimes(readwrite2, input_file, output_file1)  # pandas DataFrame
Read and write runtime: 976 ms
Read and write runtime: 777 ms
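Rounding `time.time()` to milliseconds is coarse. A hedged alternative timer (the name `get_run_times` is my own) uses `time.perf_counter`, which is monotonic and higher-resolution, and returns the elapsed time so it can be tabulated:

```python
import time

def get_run_times(fun, *args):
    # perf_counter is monotonic and has better resolution than time.time()
    begin = time.perf_counter()
    fun(*args)
    elapsed_ms = (time.perf_counter() - begin) * 1000
    print(f"Read and write running time: {elapsed_ms:.0f} ms")
    return elapsed_ms
```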
input_file holds about 270,000 rows, and the DataFrame approach is already somewhat faster than the for loop. Does the gap widen as the data volume grows?
Increasing the number of records in input_file gives the following results (times in ms):
| input_file rows | readwrite1 (ms) | readwrite2 (ms) |
| --- | --- | --- |
| 270,000 | 976 | 777 |
| 550,000 | 1989 | 1509 |
| 1,100,000 | 4312 | 3158 |
Judging from the above test results, the DataFrame approach is roughly 30% faster, and its advantage grows with the number of rows.