Python: batch-querying and deduplicating Chinese character data in CSV files

  • 2020-11-03 22:30:04
  • OfStack

A CSV file, opened in Notepad, is just comma-separated text. The Python code below is commented in detail so that readers of any level can follow it.

1. Extract the specified column and save it to a new CSV file.


# -*- coding: utf-8 -*- 
''' 
Author: Good_Night 
Time: 2018/1/30 03:50 
Edition: 1.0 
''' 
# Import the csv module 
import csv 
 
# Create a temporary file temp.csv to hold the required column 
temp_file = open("temp.csv", "w", newline='') # Without newline='', every written row is followed by a blank line 
temp_csv_writer = csv.writer(temp_file, dialect="excel") 
# Read input.csv, keeping only the specified column 
with open('input.csv') as file: 
  temp_readcsv = csv.reader(file, delimiter=',') 
  for row in temp_readcsv: # Iterate over every row of input.csv 
    temp = [row[3]] # Take the specified column (index 3, i.e. the 4th column) 
#    print(row[3])  # print() every value of column index 3 in input.csv 
    temp_csv_writer.writerow(temp) # Write the column value to temp.csv, one row per loop 
temp_file.close() 
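The extraction step above can also be wrapped in a reusable function. This is only a sketch: the file names and the 0-based column index are placeholder assumptions, not part of the original code.

```python
import csv

def extract_column(in_path, out_path, col=3):
    """Copy one column (0-based index) of in_path into a new one-column CSV."""
    with open(in_path, newline='', encoding='utf-8') as src, \
         open(out_path, 'w', newline='', encoding='utf-8') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if len(row) > col:  # skip short or blank rows instead of raising IndexError
                writer.writerow([row[col]])
```

The length check makes the function tolerant of ragged rows, which the original `row[3]` access is not.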

2. Count how many times each row's value appears in the specified column and save the counts to a new CSV file.


# -*- coding: utf-8 -*- 
''' 
Author: Good_Night 
Time: 2018/1/30 03:50 
Edition: 1.0 
''' 
# Import the csv module 
import csv 
 
# Create a temporary file temp.csv to hold the required column 
temp_file = open("temp.csv", "w", newline='') # Without newline='', every written row is followed by a blank line 
temp_csv_writer = csv.writer(temp_file, dialect="excel") 
# Read input.csv, keeping only the specified column 
with open('input.csv') as file: 
  temp_readcsv = csv.reader(file, delimiter=',') 
  for row in temp_readcsv: # Iterate over every row of input.csv 
    temp = [row[3]] # Take the specified column (index 3, i.e. the 4th column) 
#    print(row[3])  # print() every value of column index 3 in input.csv 
    temp_csv_writer.writerow(temp) # Write the column value to temp.csv, one row per loop 
temp_file.close() 
 
# Using the temporary file, count occurrences and generate out.csv 
flag = 0 # Temporary variable 
out1 = [] # List holding each row of the specified column 
time = [] # List holding the occurrence count of each row 
out_file = open("out.csv", "w", newline='') # Without newline='', every written row is followed by a blank line 
out_csv_writer = csv.writer(out_file, dialect="excel") 
out_csv_writer.writerow(["TIMES"]) 
# Read temp.csv, which now holds only the one column 
with open('temp.csv') as file2: 
  out_readcsv = csv.reader(file2, delimiter=',') 
  for St in out_readcsv: # Iterate over each row of the column 
    out1.append(St) # append() collects the rows into the list out1, turning the column into a one-dimensional array 
#  print(out1[1]) # out1[n] is the nth element, i.e. line n of the original column 
  for i in range(len(out1)): # len() gives the number of elements in out1, which fixes the loop count 
#    print(out1[i]) # Print all elements of out1 to verify the loop 
    flag = out1.count(out1[i]) # count() returns how many times the i-th element occurs among all elements of out1 
    time.append(flag) # Save each element's occurrence count, in order, into time[] 
#  print(time) # Print all the counts to check for errors 
  for j in range(len(out1)): # len() gives the element count of out1, used as the index into time[] 
    times = [time[j]] # Take the occurrence count of each element 
    out_csv_writer.writerow(times) # Write it to out.csv 
    print(times) # Print the counts as they are written 
out_file.close() 

Because this is batch processing, what gets written is the occurrence count of every row, duplicates included. There is a small bug here (readers of the code will spot it): no deduplication is done. For example, if "a" appears on lines 1 and 3, the output has "a" on line 1 with count 2 and "a" again on line 3 with count 2. That is the price of skipping deduplication: repeated data is reported repeatedly. Still, with a slight modification, the code can look up the occurrence count of a single value.
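The "slight modification" for looking up a single value can be sketched as a small helper. The file name, column index, and target value here are hypothetical, not from the original code.

```python
import csv

def count_value(path, target, col=0):
    """Return how many rows of `path` have `target` in the given 0-based column."""
    with open(path, newline='', encoding='utf-8') as f:
        return sum(1 for row in csv.reader(f)
                   if len(row) > col and row[col] == target)
```

Called as `count_value('temp.csv', '某值')`, it returns one number instead of writing a count per row.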

3. Count how many times each value appears in the specified column, deduplicate, and save to a new CSV file.

Deduplicating numbers or ordinary characters can usually be done with a direct function call, but for Chinese characters this code falls back to comparing values in a loop, which makes the approach below quite general.


# -*- coding: utf-8 -*- 
''' 
Author: Good Night 
Time: 2018/2/7 18:50 
Edition: 2.0 
''' 
# Import the csv module 
import csv 
 
# Create a temporary file temp.csv to hold the required column 
temp_file = open("temp.csv", "w", newline='') # Without newline='', every written row is followed by a blank line 
temp_csv_writer = csv.writer(temp_file, dialect="excel") 
# Read input.csv, keeping only the specified column 
with open('input.csv') as file: 
  temp_readcsv = csv.reader(file, delimiter=',') 
  for row in temp_readcsv: # Iterate over every row of input.csv 
    temp = [row[3]] # Take the specified column (index 3, i.e. the 4th column) 
#    print(row[3]) # print() every value of column index 3 in input.csv 
    temp_csv_writer.writerow(temp) # Write the column value to temp.csv, one row per loop 
temp_file.close() 
 
# Using the temporary file, count occurrences and generate out.csv 
out1 = [] # List holding each row of the specified column 
out_time = [] # List holding the occurrence count of each row 
out_file = open("out.csv", "w", newline='') # Without newline='', every written row is followed by a blank line 
out_csv_writer = csv.writer(out_file, dialect="excel") 
out_csv_writer.writerow(["ID", "TIMES"]) # Write the header: value, occurrence count 
# Read temp.csv, which now holds only the one column 
with open('temp.csv') as file2: 
  out_readcsv = csv.reader(file2, delimiter=',') 
  for St in out_readcsv: # Iterate over each row of the column 
    out1.append(St) # append() collects the rows into the list out1, turning the column into a one-dimensional array 
  print(out1)  # Print out1 to verify: out1[n] is line n of the original column 
 
# Iterating over a list does not track items as the list changes: iteration always 
# walks positions 0, 1, 2, ... When an item is deleted, every item after it shifts 
# forward one position, so a forward delete-while-iterating loop skips whatever item 
# moved into the deleted slot, and that skipped item may NOT be a duplicate. 
  # The fix used here is to reverse the list and delete from the back: remove() 
  # deletes the first match in the reversed list, i.e. the LAST occurrence in the 
  # original order, so the first occurrence is always kept. 
  # Any item the iteration skips after such a delete is itself a duplicate, so 
  # deduplication is not affected. The core problem with deleting in place during 
  # forward iteration is not that items are skipped, but that the skipped items 
  # may be non-duplicates. 
  for i in out1: 
    a = out1.count(i) # Count the element's occurrences 
    out_time.append(a) # Record the occurrence count 
#    print(i, a) 
    if a > 1: 
      out1.reverse() # Reverse the list so deletion runs from the back 
      for k in range(1, a): 
        out1.remove(i) # Delete from the back towards the front, keeping only the first occurrence! 
      out1.reverse() # Reverse back so the next pass walks the original order again 
  print(out1) # out1 is now the deduplicated list 
  print(out_time) # The occurrence count of each element 
  for j in range(len(out1)): # len() gives the element count of out1, used as the index into out_time[] 
    out_row = [out1[j][0], out_time[j]]  # Take each element (each row is a one-item list) and its count 
    out_csv_writer.writerow(out_row) # Write them to out.csv 
out_file.close() 

Note: this version deduplicates, so there is no need to worry about duplicate rows appearing in the output.
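For what it's worth, Python strings, Chinese characters included, are hashable, so `collections.Counter` can count them directly. A minimal alternative sketch of this whole step, with hypothetical file names and a 0-based column index:

```python
import csv
from collections import Counter

def count_column(in_path, out_path, col=0):
    """Count each distinct value in one column and write value/count pairs."""
    with open(in_path, newline='', encoding='utf-8') as f:
        values = [row[col] for row in csv.reader(f) if len(row) > col]
    counts = Counter(values)  # keys keep first-seen order (Python 3.7+)
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["ID", "TIMES"])
        for value, n in counts.items():
            writer.writerow([value, n])
    return counts
```

This avoids mutating a list while iterating over it, and runs in linear time rather than the quadratic `count()`-per-element loop above.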

Python handles this kind of data quite quickly, roughly 10,000 rows per second...

