Analysis of Common Data File Processing Methods in Python
- 2021-12-11 07:50:19
- OfStack
0. Preface
Although Python executes slowly, it is fast to program in and has a rich ecosystem of third-party packages.
So when it comes to batch file processing, Python is still the tool of choice.
1. Dynamic filename
In batch file processing, file names often differ only by a number, so dynamic file names can be produced by formatting different numbers into a string template.
file_num = 324
# file_num = 1
for i in range(file_num):
    file_name = "Normal data\\{}.Normal.txt".format(i + 1)
    ...
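For instance, with a hypothetical naming scheme data_1.txt, data_2.txt, ..., the same pattern can be sketched as:

```python
# Build a list of numbered file names; the scheme here is made up for illustration
file_num = 3  # small count for illustration
names = ["data_{}.txt".format(i + 1) for i in range(file_num)]
```
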
2. Convert files to csv format
To save storage space, data providers often store data as txt files in a custom format, which is not convenient for processing. The comma-separated csv format, by contrast, can be read easily by numpy, pandas, and other data-processing packages.
First, read the file line by line to get each row of data (most data files use the same format on every line; if there is only a single line, you can read it all at once or character by character). Then strip the newline character from each line with line.replace('\n', '') so that the final csv file contains no blank rows.
Use line.split(':') to break each line into fields.
Write each whole row with csv.writer.
import csv

outFile = open(file_path + outFile_name, 'w', encoding='utf-8', newline='')
csv_writer = csv.writer(outFile)
with open(file_path + file_name, "r") as f:
    index = 0
    for line in f:
        # Write the header instead of the first input line
        if index == 0:
            csv_writer.writerow(['T', 'TimeStamp', 'RangeReport', 'TagID', 'AnchorID',
                                 'ranging', 'check', 'SerialNumber', 'DataID'])
            index = index + 1
            continue
        line = line.replace('\n', '')  # remove the newline to avoid blank rows
        fields = line.split(':')       # split the line into fields
        csv_writer.writerow(fields)
outFile.close()
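To illustrate the split step, here is a made-up line in the colon-separated format assumed above (the field values are invented, but the field count matches the 9-column header):

```python
# hypothetical raw line matching the 9-column header above
line = "T:123456:RR:0x01:0x00:3.2:OK:42:7\n"
fields = line.replace('\n', '').split(':')
```
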
3. Preliminary processing of csv files
The csv files we obtain at first are often not in the shape we want and need some simple processing.
For example, suppose we want to merge every 4 rows of data into 1 row.
Use pandas to read the csv file into a DataFrame df. Then prepare a file in the target format, containing the header and 1 row of sample data, and read it into a second DataFrame df2.
You can delete a specified column with
del df['T']
and you can set 1 row of data in the new table by assigning a list to
df2.loc[row]
import pandas as pd

df = pd.read_csv(file_path + file_name)
# Delete unneeded columns
del df['T']
del df['RangeReport']
del df['TagID']
# Optionally check that the 4 rows belonging to each DataID have
# consistent SerialNumber and AnchorID values
# SerialNumberBegin = df['SerialNumber'][0]
# DataIDBegin = df['DataID'][0]
# for row in range(df.shape[0]):
#     c = df['SerialNumber'][row] != (SerialNumberBegin + int(row / 4)) % 256
#     d = df['DataID'][row] != DataIDBegin + int(row / 4)
#     e = df['AnchorID'][row] != row % 4
#     if c | d | e:
#         print('err')
del df['AnchorID']
# print(type(df['TimeStamp'][0]))
# Merge every 4 rows into 1 row of the target table
df2 = pd.read_csv(file_path + "Merge format.csv")
for row in range(int(df.shape[0] / 4)):
    # Target layout: DataID, SerialNumber, TimeStamp0..3,
    # then (ranging, check) for anchors 0..3
    row_data = [0] * 14
    row_data[0] = df['DataID'][row * 4]
    row_data[1] = df['SerialNumber'][row * 4]
    for a in range(4):
        row_data[2 + a] = df['TimeStamp'][row * 4 + a]
        row_data[6 + 2 * a] = df['ranging'][row * 4 + a]
        row_data[7 + 2 * a] = df['check'][row * 4 + a]
    df2.loc[row] = row_data
df2.to_csv(file_path + contact_name)
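The same 4-rows-into-1 reshaping can also be expressed with pandas's pivot. A minimal sketch on made-up data (column names borrowed from the code above, values invented):

```python
import pandas as pd

# hypothetical data: two DataIDs with four AnchorIDs each
df = pd.DataFrame({
    'DataID':   [1, 1, 1, 1, 2, 2, 2, 2],
    'AnchorID': [0, 1, 2, 3, 0, 1, 2, 3],
    'ranging':  [10, 11, 12, 13, 20, 21, 22, 23],
})
# One row per DataID, one column per AnchorID
wide = df.pivot(index='DataID', columns='AnchorID', values='ranging')
wide.columns = ['ranging{}'.format(c) for c in wide.columns]
```

This avoids the explicit loop, at the cost of pivoting one value column at a time.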
4. Get some data
You can use
df0 = df.iloc[:, 3:7]
or
df0 = df[["check0","check1","check2","check3"]]
to select some columns of a table.
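On a small made-up table, both selection styles return the same columns:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2],
                   'check0': [3], 'check1': [4],
                   'check2': [5], 'check3': [6]})
by_position = df.iloc[:, 2:6]                              # positional slice
by_name = df[['check0', 'check1', 'check2', 'check3']]     # by column name
```
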
5. Format conversion between data
Data is generally converted among list, numpy, and pandas types.
When building data yourself, it is often convenient to start with a list:
y_show = []
y_show.append(n_clusters_)
After adjusting the dimensions (the list can be 1-dimensional or multi-dimensional), convert it to numpy or pandas.
Conversion to numpy works as follows:
y = np.array(y_show)
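A round trip among the three types, as a minimal sketch:

```python
import numpy as np
import pandas as pd

data = [[1, 2], [3, 4]]                        # plain Python list
arr = np.array(data)                           # list -> numpy
frame = pd.DataFrame(arr, columns=['a', 'b'])  # numpy -> pandas
back = frame.to_numpy()                        # pandas -> numpy
as_list = back.tolist()                        # numpy -> list
```
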
6. Treatment of outliers and overlapping points
The DBSCAN algorithm is used for clustering; detailed descriptions of the algorithm are easy to find.
It has two important parameters: the clustering radius (eps) and the minimum number of neighbors (min_samples).
Specifying a larger radius and a larger neighbor count filters out discrete (outlier) points.
Specifying a smaller radius separates out overlapping points and near-duplicates.
The code below takes an n*m numpy matrix as input and clusters the n points in m-dimensional space.
A single fit produces labels, which assigns each point an int label: -1 marks outliers, while 0, 1, 2, ... identify clusters; the indices of points sharing a label form one cluster.
from sklearn.cluster import DBSCAN
import numpy as np

y = df[["d0", "d1", "d2", "d3"]].to_numpy()
db = DBSCAN(eps=3, min_samples=2).fit(y)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Count the number of clusters, excluding the -1 outlier label
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
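To see the labeling behavior on concrete numbers, here is a small sketch with made-up 2-D points: two tight groups and one isolated point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

pts = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],   # group 1
                [5.0, 5.0], [5.0, 5.1],               # group 2
                [100.0, 100.0]])                      # isolated point
labels = DBSCAN(eps=0.5, min_samples=2).fit(pts).labels_
# The isolated point gets the label -1; the two groups get 0 and 1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```
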
7. Data plotting
Drawing in 2-D is relatively simple; only 3-D drawing is covered here.
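As a sketch, a minimal 3-D scatter plot with matplotlib might look like the following. The data here is synthetic, and the Agg backend and output filename are assumptions so the script runs without a display:

```python
import os
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

# synthetic data standing in for real measurements
x = np.linspace(0, 10, 50)
y = np.sin(x)
z = np.cos(x)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # 3-D axes
ax.scatter(x, y, z)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
fig.savefig('scatter3d.png')
saved = os.path.exists('scatter3d.png')
```
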
8. Matrix operations with numpy
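A few common numpy matrix operations, as a minimal sketch on made-up 2x2 matrices:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b           # element-wise product
matmul = a @ b                # matrix product
transposed = a.T              # transpose
inverse = np.linalg.inv(a)    # inverse (a must be non-singular)
identity_check = a @ inverse  # numerically close to the identity
```
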
9. Save the file
You can save files with csv.writer's writerow, as shown in section 2 above.
You can also save files with numpy or pandas.
If you use pandas's df.to_csv(file_path + file_name) directly, the saved file will contain an extra index column. This can be suppressed with the parameter index=False.
For further options, see the documentation of pd.DataFrame.to_csv.
You can also use numpy to save numpy data to a file in a specified format. Be sure to specify the format here, otherwise the values may be saved as an undesired type (for example, as floats).
np.savetxt(file_path + "Abnormal data.txt", np.array(y_show, dtype=np.int16), fmt="%d")
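A quick round trip shows the format string keeping the values integral (the output filename here is made up):

```python
import numpy as np

y_show = [[1, 2, 3], [4, 5, 6]]
# fmt="%d" writes the values as integers rather than floats
np.savetxt('demo_out.txt', np.array(y_show, dtype=np.int16), fmt='%d')
loaded = np.loadtxt('demo_out.txt', dtype=np.int16)
```
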