Reading a file in Python using list or dict field schemas
- 2020-05-19 05:08:12
- OfStack
Preface
Python is a great tool for handling text data. Reading, splitting, filtering, and converting text are all extremely simple, so developers don't have to worry about complicated file-handling code (as opposed to Java, hehe). Much of the complex text processing and computation in the blogger's own work, including Streaming programs written for Hadoop, is done in Python.
In text processing, the first step is loading a file into memory, which involves mapping each column of the file to a specific variable. The clumsiest way to do this is to refer to each field by its index, like this:
# fields is the list obtained by reading one line and splitting it on the delimiter
user_id = fields[0]
user_name = fields[1]
user_type = fields[2]
If you read data this way, then whenever the column order of the file changes, or columns are added or removed, maintaining the code becomes a nightmare. Code like this must be eliminated.
This article recommends two more elegant ways to read the data. Both work by configuring a field schema and then reading according to that schema. The schema takes one of two forms: a dictionary schema or a list schema.
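To make the maintenance problem concrete, here is a minimal sketch. The sample rows and the `parse_by_index` helper are made up for illustration and are not from the original code. It shows that index-based parsing does not fail when a column is inserted; it silently returns the wrong values.

```python
# Hypothetical sample rows: the second layout has a new "region" column in front.
old_row = "1\tname1\t0"
new_row = "cn\t1\tname1\t0"


def parse_by_index(line):
    """Index-based parsing, as in the snippet above (illustrative helper)."""
    fields = line.split("\t")
    user_id = fields[0]
    user_name = fields[1]
    user_type = fields[2]
    return user_id, user_name, user_type


print(parse_by_index(old_row))  # ('1', 'name1', '0')
print(parse_by_index(new_row))  # ('cn', '1', 'name1'): every variable is now wrong
```

Every place in the code that hard-codes an index has to be found and updated by hand, which is why the positions are better kept in one configurable schema.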
Read the file and split it into a list of field data by delimiter
First, read the file, split the data for each row by a delimiter, and return a list of fields for subsequent processing.
The code is as follows:
def read_file_data(filepath):
    '''Read the file at the given path line by line.
    @param filepath: path of the file to read
    @return: generator that yields, for each non-empty line, the list of tab-separated fields
    '''
    # The with statement closes the file even if the caller stops iterating early
    with open(filepath, 'r') as fin:
        for line in fin:
            # Strip only the trailing newline; line[:-1] would cut a real
            # character off a last line that has no newline
            line = line.rstrip('\n')
            if not line:
                continue
            # Yield the list of fields for the current line
            yield line.split("\t")
The function uses the yield keyword to hand back the split data one row at a time, so the calling program can use
for fields in read_file_data(fpath):
to read each row.
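As a rough illustration of what the generator buys you, the sketch below writes a small temporary file and pulls rows from it one at a time. The `read_rows` name, its `delimiter` parameter, and the sample data are assumptions made for this example, not part of the original file_util.py.

```python
import os
import tempfile


def read_rows(filepath, delimiter="\t"):
    """Yield one list of fields per non-empty line (hypothetical helper)."""
    with open(filepath, "r") as fin:
        for line in fin:
            line = line.rstrip("\n")
            if line:
                yield line.split(delimiter)


# Write a small sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1\tname1\t0\n\n2\tname2\t1\n")
    sample_path = tmp.name

rows = read_rows(sample_path)  # nothing is read yet: generators are lazy
first = next(rows)             # reads only up to the first non-empty line
rest = list(rows)              # consumes the remaining lines and closes the file
os.remove(sample_path)

print(first)  # ['1', 'name1', '0']
print(rest)   # [['2', 'name2', '1']]
```

Because only one row is held in memory at a time, the same loop works for files far larger than available RAM.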
Mapping to a schema, method 1: assemble the row data using a configured dictionary schema
This method configures a dictionary of the form {"field name": field position} as the data schema, assembles each list of fields according to that schema, and so lets you access the data as a dictionary.
Functions used:
@staticmethod
def map_fields_dict_schema(fields, dict_schema):
    """Map the field values to field names according to the schema.
    For example, with fields ['a','b','c'] and schema {'name':0, 'age':1}, return {'name':'a','age':'b'}
    @param fields: list of field values, usually obtained by splitting one line on tabs
    @param dict_schema: dictionary whose keys are field names and whose values are field positions
    @return: dictionary whose keys are field names and whose values are field values
    """
    pdict = {}
    # items() works on both Python 2 and 3; iteritems() exists only on Python 2
    for fstr, findex in dict_schema.items():
        pdict[fstr] = str(fields[int(findex)])
    return pdict
With this method and the previous one, you can read the data in the following way:
# coding:utf8
"""
@author: www.crazyant.net
Test: load the row data using a dictionary schema
Advantage: for files with many columns, you only need to configure the fields you want to read
Disadvantage: if you need many fields, configuring the position of each one is tedious
"""
import file_util
import pprint

# The dictionary schema to read with; you can configure only the columns you care about
dict_schema = {"userid": 0, "username": 1, "usertype": 2}
for fields in file_util.FileUtil.read_file_data("userfile.txt"):
    # Map the list of fields according to the dictionary schema
    dict_fields = file_util.FileUtil.map_fields_dict_schema(fields, dict_schema)
    pprint.pprint(dict_fields)
Output results:
{'userid': '1', 'username': 'name1', 'usertype': '0'}
{'userid': '2', 'username': 'name2', 'usertype': '1'}
{'userid': '3', 'username': 'name3', 'usertype': '2'}
{'userid': '4', 'username': 'name4', 'usertype': '3'}
{'userid': '5', 'username': 'name5', 'usertype': '4'}
{'userid': '6', 'username': 'name6', 'usertype': '5'}
{'userid': '7', 'username': 'name7', 'usertype': '6'}
{'userid': '8', 'username': 'name8', 'usertype': '7'}
{'userid': '9', 'username': 'name9', 'usertype': '8'}
{'userid': '10', 'username': 'name10', 'usertype': '9'}
{'userid': '11', 'username': 'name11', 'usertype': '10'}
{'userid': '12', 'username': 'name12', 'usertype': '11'}
Mapping to a schema, method 2: assemble the row data using a configured list schema
If you need to read all the columns in the file, or the first several columns, the dictionary schema becomes cumbersome: you have to configure the index position of every field, and those positions simply count up from 0. This is low-level manual work and should be eliminated.
The list schema was made for this case: the configured list schema is first converted into a dictionary schema, and the data is then loaded through the dictionary schema.
The code that converts the schema and reads by list schema:
@staticmethod
def transform_list_to_dict(para_list):
    """Convert ['a', 'b'] into the form {'a':0, 'b':1}
    @param para_list: list of the field names of the columns, in order
    @return: dictionary mapping each field name to its position
    """
    res_dict = {}
    for idx, name in enumerate(para_list):
        res_dict[str(name).strip()] = idx
    return res_dict

@staticmethod
def map_fields_list_schema(fields, list_schema):
    """Map the field values to field names according to the schema.
    For example, with fields ['a','b','c'] and schema ['name', 'age'], return {'name':'a','age':'b'}
    @param fields: list of field values, usually obtained by splitting one line on tabs
    @param list_schema: list of column names, in column order
    @return: dictionary whose keys are field names and whose values are field values
    """
    dict_schema = FileUtil.transform_list_to_dict(list_schema)
    return FileUtil.map_fields_dict_schema(fields, dict_schema)
In use, you can configure the schema as a list, which is more concise because no indexes need to be configured:
# coding:utf8
"""
@author: www.crazyant.net
Test: load the row data using a list schema
Advantage: if you read all the columns, you simply write out the field name of each column in order
Disadvantage: you cannot pick out only the fields you care about; every column up to the last one you need must be listed
"""
import file_util
import pprint

# The list schema to read with; it can cover only the leading columns, or all of them
list_schema = ["userid", "username", "usertype"]
for fields in file_util.FileUtil.read_file_data("userfile.txt"):
    # Map the list of fields according to the list schema
    dict_fields = file_util.FileUtil.map_fields_list_schema(fields, list_schema)
    pprint.pprint(dict_fields)
The output is exactly the same as with the dictionary schema.
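For reference, the list-to-dictionary conversion done by transform_list_to_dict can also be written with `enumerate`, and for this particular use the built-in `zip` gives the same result without an intermediate dictionary. This is only a sketch of an alternative, not the author's method, and the sample row is made up.

```python
list_schema = ["userid", "username", "usertype"]
fields = "3\tname3\t2\textra".split("\t")  # hypothetical row with one extra column

# Same conversion as transform_list_to_dict: field name -> position.
dict_schema = {name.strip(): idx for idx, name in enumerate(list_schema)}
by_index = {name: fields[idx] for name, idx in dict_schema.items()}

# Alternative: zip stops at the shorter sequence, so the extra column is ignored.
by_zip = dict(zip(list_schema, fields))

print(dict_schema)         # {'userid': 0, 'username': 1, 'usertype': 2}
print(by_index == by_zip)  # True
```

One difference worth knowing: if a row has fewer fields than the schema, the index-based version raises IndexError, while the zip version quietly returns a shorter dictionary.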
Full code of file_util.py
Here is the complete code of file_util.py, which you can put in your own utility library:
# -*- encoding:utf8 -*-
'''
@author: www.crazyant.net
@version: 2014-12-5
'''


class FileUtil(object):
    '''Common helper methods for files and paths
    '''

    @staticmethod
    def read_file_data(filepath):
        '''Read the file at the given path line by line.
        @param filepath: path of the file to read
        @return: generator that yields, for each non-empty line, the list of tab-separated fields
        '''
        # The with statement closes the file even if the caller stops iterating early
        with open(filepath, 'r') as fin:
            for line in fin:
                # Strip only the trailing newline; line[:-1] would cut a real
                # character off a last line that has no newline
                line = line.rstrip('\n')
                if not line:
                    continue
                # Yield the list of fields for the current line
                yield line.split("\t")

    @staticmethod
    def transform_list_to_dict(para_list):
        """Convert ['a', 'b'] into the form {'a':0, 'b':1}
        @param para_list: list of the field names of the columns, in order
        @return: dictionary mapping each field name to its position
        """
        res_dict = {}
        for idx, name in enumerate(para_list):
            res_dict[str(name).strip()] = idx
        return res_dict

    @staticmethod
    def map_fields_list_schema(fields, list_schema):
        """Map the field values to field names according to the schema.
        For example, with fields ['a','b','c'] and schema ['name', 'age'], return {'name':'a','age':'b'}
        @param fields: list of field values, usually obtained by splitting one line on tabs
        @param list_schema: list of column names, in column order
        @return: dictionary whose keys are field names and whose values are field values
        """
        dict_schema = FileUtil.transform_list_to_dict(list_schema)
        return FileUtil.map_fields_dict_schema(fields, dict_schema)

    @staticmethod
    def map_fields_dict_schema(fields, dict_schema):
        """Map the field values to field names according to the schema.
        For example, with fields ['a','b','c'] and schema {'name':0, 'age':1}, return {'name':'a','age':'b'}
        @param fields: list of field values, usually obtained by splitting one line on tabs
        @param dict_schema: dictionary whose keys are field names and whose values are field positions
        @return: dictionary whose keys are field names and whose values are field values
        """
        pdict = {}
        # items() works on both Python 2 and 3; iteritems() exists only on Python 2
        for fstr, findex in dict_schema.items():
            pdict[fstr] = str(fields[int(findex)])
        return pdict
Conclusion
That is all for this article. I hope it is of some help to anyone learning or using Python. If you have any questions, feel free to leave a message.