Python: batch-querying Chinese characters and reprocessing CSV files
- 2020-11-03 22:30:04
- OfStack
A CSV file, opened in Notepad, is just comma-separated text. The Python code is shown below, with detailed comments so that readers of any level can follow it.
1. Query the specified column and save it to a new CSV file.
# -*- coding: utf-8 -*-
'''
Author: Good_Night
Time: 2018/1/30 03:50
Edition: 1.0
'''
# Import the required csv library
import csv

# Create a temporary file temp.csv to hold the required column.
# Without newline='', csv.writer emits a blank line after every row on Windows.
temp_file = open("temp.csv", "w", newline='')
temp_csv_writer = csv.writer(temp_file, dialect="excel")

# Read input.csv, keeping only the specified column
with open('input.csv') as file:
    temp_readcsv = csv.reader(file, delimiter=',')
    for row in temp_readcsv:            # iterate over every row of input.csv
        temp = [row[3]]                 # take the data in column index 3
        # print(row[3])                 # print that column's data for debugging
        temp_csv_writer.writerow(temp)  # write the column value as one row of temp.csv
temp_file.close()
2. Count how many times each row of the specified column appears across all rows, and save the result to a new CSV file.
# -*- coding: utf-8 -*-
'''
Author: Good_Night
Time: 2018/1/30 03:50
Edition: 1.0
'''
# Import the required csv library
import csv

# Create a temporary file temp.csv to hold the required column.
# Without newline='', csv.writer emits a blank line after every row on Windows.
temp_file = open("temp.csv", "w", newline='')
temp_csv_writer = csv.writer(temp_file, dialect="excel")

# Read input.csv, keeping only the specified column
with open('input.csv') as file:
    temp_readcsv = csv.reader(file, delimiter=',')
    for row in temp_readcsv:            # iterate over every row of input.csv
        temp = [row[3]]                 # take the data in column index 3
        # print(row[3])                 # print that column's data for debugging
        temp_csv_writer.writerow(temp)  # write the column value as one row of temp.csv
temp_file.close()

# Match the data against the temporary file, count occurrences, and generate out.csv
flag = 0   # temporary variable holding the current count
out1 = []  # holds each row of the specified column
time = []  # holds the number of occurrences of each row
out_file = open("out.csv", "w", newline='')  # again, newline='' avoids blank lines
out_csv_writer = csv.writer(out_file, dialect="excel")
out_csv_writer.writerow(["TIMES"])

# Read temp.csv, which now contains only the specified column
with open('temp.csv') as file2:
    out_readcsv = csv.reader(file2, delimiter=',')
    for St in out_readcsv:  # iterate over each row of the column
        out1.append(St)     # append each row to out1, turning the column into a 1-D list
        # print(out1[1])    # out1[n] is the element from line n of the original column
for i in range(len(out1)):      # len() gives the element count, i.e. the loop count
    # print(out1[i])            # print every element to verify the loop
    flag = out1.count(out1[i])  # count() returns how often out1[i] occurs in the list
    time.append(flag)           # save the counts, in order, into time[]
# print(time)                   # print all counts to check for errors
for j in range(len(out1)):      # use len(out1) to index into time[]
    times = [time[j]]           # the occurrence count for element j
    out_csv_writer.writerow(times)  # write it to out.csv
    print(times)                    # print the count as it is written
out_file.close()
Because this is batch processing, it writes the occurrence count of every row, duplicates included. There is a small BUG here that readers of the code will notice: no deduplication is done. For example, if "a" appears on lines 1 and 3, the output shows "a" on line 1 with count 2 and "a" again on line 3 with count 2, so duplicate data is reported more than once. With a slight modification, though, the script can look up the count of one specific value.
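For instance, looking up the count of a single value takes only a few lines with collections.Counter. This is a minimal sketch, not part of the original scripts; the file name and column index 3 follow the examples above, and `column_counts` is a hypothetical helper name:

```python
import csv
from collections import Counter

def column_counts(path, col=3):
    """Count how often each value occurs in one column of a CSV file."""
    with open(path, newline='') as f:
        return Counter(row[col] for row in csv.reader(f))

# column_counts('input.csv')['some_value'] gives the count of one specific
# value in column index 3 (0 if the value never appears).
```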
3. Count how many times each row of the specified column appears across all rows, deduplicate, and save the result to a new CSV file.
Deduplicating numbers or plain characters can generally be done by calling a ready-made function, while deduplicating Chinese characters is usually done by looping comparisons. The approach below is general enough to cover both.
# -*- coding: utf-8 -*-
'''
Author: Good Night
Time: 2018/2/7 18:50
Edition: 2.0
'''
# Import the required csv library
import csv

# Create a temporary file temp.csv to hold the required column.
# Without newline='', csv.writer emits a blank line after every row on Windows.
temp_file = open("temp.csv", "w", newline='')
temp_csv_writer = csv.writer(temp_file, dialect="excel")

# Read input.csv, keeping only the specified column
with open('input.csv') as file:
    temp_readcsv = csv.reader(file, delimiter=',')
    for row in temp_readcsv:            # iterate over every row of input.csv
        temp = [row[3]]                 # take the data in column index 3
        # print(row[3])                 # print that column's data for debugging
        temp_csv_writer.writerow(temp)  # write the column value as one row of temp.csv
temp_file.close()

# Match the data against the temporary file, count occurrences, and generate out.csv
out1 = []      # holds each row of the specified column
out_time = []  # holds the number of occurrences of each row
out_file = open("out.csv", "w", newline='')  # again, newline='' avoids blank lines
out_csv_writer = csv.writer(out_file, dialect="excel")
out_csv_writer.writerow(["ID", "TIMES"])  # header: the data and its occurrence count

# Read temp.csv, which now contains only the specified column
with open('temp.csv') as file2:
    out_readcsv = csv.reader(file2, delimiter=',')
    for St in out_readcsv:  # iterate over each row of the column
        out1.append(St)     # append each row to out1, turning the column into a 1-D list
    print(out1)             # out1[n] is the element from line n of the original column

# Why delete in reverse? A list is always iterated by position 0, 1, 2, ...
# no matter how the list changes underneath. If item 0 is deleted while the
# iteration stands on it, the old item 1 becomes the new item 0, iteration
# moves on to the new item 1 (the old item 2), and the old item 1 is skipped.
# Deleting *behind* the current position avoids this: when the current item
# finds a duplicate later in the list, that later copy is removed, so the
# only items ever skipped are the duplicates we wanted gone. Forward deletion
# can skip non-duplicates; backward deletion only skips copies, so
# deduplication is complete and no unique item is missed.
for i in out1:
    a = out1.count(i)       # how many times does this element occur?
    out_time.append(a)      # record the count
    # print(i, a)
    if a > 1:
        out1.reverse()      # reverse the list so removal works from the back
        for k in range(1, a):
            out1.remove(i)  # delete a-1 trailing copies, keeping only the first one
        out1.reverse()      # flip back so the next pass sees the original order
print(out1)      # out1 is now the deduplicated list
print(out_time)  # the occurrence count of each remaining element
for j in range(len(out1)):              # use len(out1) to index into out_time[]
    out_row = [out1[j], out_time[j]]    # the element and its corresponding count
    out_csv_writer.writerow(out_row)    # write it to out.csv
out_file.close()
To emphasize: this version deduplicates, so there is no need to worry about duplicate data appearing in the output.
Python processes this kind of data quite quickly, roughly 10,000 rows per second...
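For reference, the whole third script can be condensed with collections.Counter, which compares Chinese strings the same as any other value and preserves first-seen order on Python 3.7+, matching the deduplicated output above. This is a sketch rather than the author's original method; `count_column` is a hypothetical helper, and the file names and column index 3 follow the examples above:

```python
import csv
from collections import Counter

def count_column(in_path, out_path, col=3):
    """Count distinct values in one CSV column and write ID,TIMES rows.

    Counter remembers insertion order, so the output rows appear in the
    same first-seen order as the deduplicated list in the script above.
    """
    with open(in_path, newline='') as f:
        counts = Counter(row[col] for row in csv.reader(f))
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(["ID", "TIMES"])
        for value, times in counts.items():
            writer.writerow([value, times])

# count_column('input.csv', 'out.csv') reproduces the third script's output
```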