Analysis of Common Data File Processing Methods in Python

  • 2021-12-11 07:50:19
  • OfStack

0. Preface

Although Python runs slowly, its development speed and the richness of its third-party packages are excellent.
When it comes to batch file processing, Python is still the natural choice.

1. Dynamic filename

In batch file processing, file names often differ only by a number, so dynamic file names can be built by formatting different numbers into a string template.


file_num = 324
# file_num = 1
for i in range(file_num):
    file_name = "Normal data\\{}.Normal.txt".format(i + 1)
    ...
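
Equivalently, a short sketch using an f-string and os.path.join; the directory name here is only a placeholder for wherever the files actually live:


import os

data_dir = "Normal data"  # placeholder directory name
for i in range(file_num):
    file_name = os.path.join(data_dir, f"{i + 1}.Normal.txt")
    # ... open and process file_name here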

2. Convert files to csv format

To save storage space, data providers often store data in txt files in their own specific format, which is not always convenient to process. The comma-separated csv format, by contrast, can be read easily by numpy, pandas and other data processing packages.
First, read the file line by line (most data files have the same format on every line; if there is only one line of data, you can read it whole or character by character), then strip the trailing newline of each line with line.replace('\n', '') so that the final csv file contains no blank lines.
Use line.split(':') to split each line into fields.
Then write the whole row with csv.writer.


import csv

outFile = open(file_path + outFile_name, 'w', encoding='utf-8', newline='')
csv_writer = csv.writer(outFile)
with open(file_path + file_name, "r") as f:
    index = 0
    for line in f:
        # Write the csv header and skip the first line of the source file
        if index == 0:
            csv_writer.writerow(['T', 'TimeStamp', 'RangeReport', 'TagID', 'AnchorID',
                                 'ranging', 'check', 'SerialNumber', 'DataID'])
            index = index + 1
            continue
        line = line.replace('\n', '')
        fields = line.split(':')
        csv_writer.writerow(fields)
outFile.close()

3. Preliminary processing of csv files

The csv files we get at first are often not in the form we want and need some simple processing.
For example, suppose I want to merge every 4 rows of data into 1 row.
Use pandas to read the csv file into a DataFrame df. Then build a small template file that already has the desired header and 1 row of data, and read it into another DataFrame df2.
You can use


del df['T']

to delete a specified column,

and you can use


df2.loc[row] = list

to write one row of data into the new table.


import pandas as pd

df = pd.read_csv(file_path + file_name)
# Delete some unwanted columns
del df['T']
del df['RangeReport']
del df['TagID']

# Check that each DataID maps to the expected SerialNumber and AnchorID
# SerialNumberBegin = df['SerialNumber'][0]
# DataIDBegin = df['DataID'][0]
# for row in range(df.shape[0]):
#     c = df['SerialNumber'][row] != (SerialNumberBegin + int(row / 4)) % 256
#     d = df['DataID'][row] != DataIDBegin + int(row / 4)
#     e = df['AnchorID'][row] != row % 4
#     if c | d | e:
#         print('err')
del df['AnchorID']

# print(type(df['TimeStamp'][0]))
# Merge every 4 rows of df into 1 row of df2
df2 = pd.read_csv(file_path + "Merge format.csv")
for row in range(int(df.shape[0] / 4)):
    row_data = [3304, 229, 90531088, 90531088, 90531088, 90531088, 760, 760, 760, 760, 760, 760, 760, 760]
    # DataID,SerialNumber,TimeStamp0,TimeStamp1,TimeStamp2,TimeStamp3,ranging0,check0,ranging1,check1,ranging2,check2,ranging3,check3
    row_data[0] = df['DataID'][row*4]
    row_data[1] = df['SerialNumber'][row*4]
    row_data[2] = df['TimeStamp'][row*4+0]
    row_data[3] = df['TimeStamp'][row*4+1]
    row_data[4] = df['TimeStamp'][row*4+2]
    row_data[5] = df['TimeStamp'][row*4+3]
    row_data[6] = df['ranging'][row*4+0]
    row_data[7] = df['check'][row*4+0]
    row_data[8] = df['ranging'][row*4+1]
    row_data[9] = df['check'][row*4+1]
    row_data[10] = df['ranging'][row*4+2]
    row_data[11] = df['check'][row*4+2]
    row_data[12] = df['ranging'][row*4+3]
    row_data[13] = df['check'][row*4+3]

    df2.loc[row] = row_data
df2.to_csv(file_path + contact_name)

4. Get part of the data

You can use


df0 = df.iloc[:, 3:7]

Or


df0 = df[["check0","check1","check2","check3"]]

to select some columns of a table.
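
Put together, a small sketch; it assumes the merged csv from section 3, and note that iloc selects by position with an exclusive end index:


import pandas as pd

df = pd.read_csv(file_path + contact_name)

# by position: columns 3, 4, 5 and 6
df0 = df.iloc[:, 3:7]

# by column name
df1 = df[["check0", "check1", "check2", "check3"]]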

5. Format conversion between data types

Generally, data is converted back and forth among list, numpy and pandas types.
When building your own data, you often start from a list and append to it:


y_show = []
y_show.append(n_clusters_)

The list can be 1-dimensional or nested (multi-dimensional); once its shape is right, it can be converted to numpy or pandas.
Converting to numpy looks like this:


y = np.array(y_show)
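
A minimal sketch of the round trip between the three types; the values and column names are only illustrative:


import numpy as np
import pandas as pd

y_show = [[1, 2], [3, 4]]                  # a nested (2-dimensional) list
y = np.array(y_show)                       # list -> numpy array
df = pd.DataFrame(y, columns=["a", "b"])   # numpy -> pandas DataFrame
back = df.to_numpy()                       # pandas -> numpy
as_list = back.tolist()                    # numpy -> nested list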

6. Treatment of outliers and overlapping points

The DBSCAN algorithm is used for clustering; a detailed description of the algorithm is easy to find elsewhere.
It has two important parameters: the clustering radius (eps) and the minimum number of neighbors (min_samples).
A larger radius together with a larger neighbor count filters out discrete outliers.
A smaller radius filters out overlapping and near-duplicate points.
The code below takes an n*m numpy matrix as input and clusters the n points in m-dimensional space.
The fit produces labels, an array with one integer per point: -1 marks outliers, and 0, 1, 2, ... identify the cluster each point belongs to; from it you can build a map from each cluster label to the list of indices of its points.


import numpy as np
from sklearn.cluster import DBSCAN

y = df[["d0", "d1", "d2", "d3"]].to_numpy()

db = DBSCAN(eps=3, min_samples=2).fit(y)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Count the number of clusters, excluding the noise label -1
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
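
If you want the label-to-indices map described above, a minimal sketch is:


clusters = {}
for label in set(labels):
    # indices of the points assigned to this label; -1 collects the outliers
    clusters[label] = np.where(labels == label)[0].tolist()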

7. Data plotting

Drawing in 2-D is relatively simple, so only 3-D plotting is shown here.


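A minimal 3-D scatter sketch with matplotlib; the column names d0, d1, d2 and the coloring by the DBSCAN labels are assumptions carried over from section 6, so adjust them to your own data:


import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3-D projection on older matplotlib

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# three coordinates per point, colored by cluster label
ax.scatter(df["d0"], df["d1"], df["d2"], c=labels, s=10)
ax.set_xlabel("d0")
ax.set_ylabel("d1")
ax.set_zlabel("d2")
plt.show()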

8. Matrix operations with numpy


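The basic matrix operations in numpy cover array creation, element-wise arithmetic, matrix products, transpose, and inverse; a minimal sketch with illustrative values:


import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A + B                 # element-wise addition
D = A * B                 # element-wise multiplication
E = A @ B                 # matrix product (equivalent to np.dot(A, B))
At = A.T                  # transpose
Ainv = np.linalg.inv(A)   # inverse (A must be square and non-singular)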

9. Save the file

You can save files with csv.writer's writerow, as shown earlier.
You can also save files with numpy or pandas.
If you use pandas, you can call to_csv directly.


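For example (df2 and contact_name come from section 3):


df2.to_csv(file_path + contact_name)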

Saved this way, the file gains an extra index column; it can be suppressed with the parameter index=False.
For other options, see the documentation of pd.to_csv.

You can also use numpy to save an array to a file in a specified format. Be sure to specify the format, otherwise the values may be written in an undesirable form.


np.savetxt(file_path + "Abnormal data.txt", np.array(y_show, dtype=np.int16), fmt="%d")
