Reading files in Python with list or dict field schemas

  • 2020-05-19 05:08:12
  • OfStack

Preface

Python is a great tool for handling text data. Reading, splitting, filtering, and converting are all extremely simple, so developers don't have to worry about tedious file-handling code (unlike Java, hehe). Much of the complex text processing and computation in my own work, including Streaming programs on Hadoop, is done in Python.

In text processing, the first step is loading a file into memory, which means mapping each column of the file to a specific variable. The crudest way to do this is to refer to each field by its index, like this:


# fields is the list obtained by reading one line and splitting it on the delimiter
user_id = fields[0]
user_name = fields[1]
user_type = fields[2]

With this approach, as soon as the column order changes or columns are added or removed, maintaining the code becomes a nightmare. Code like this must be eliminated.
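To make the problem concrete, here is a minimal sketch of what happens when a column is inserted in front of the user name. The sample rows and the inserted date column are invented for illustration; only the index-based access mirrors the snippet above.

```python
# Hypothetical sample rows; the inserted date column is invented for illustration.
old_row = "1\tname1\t0"
new_row = "1\t2014-12-05\tname1\t0"  # a column was inserted at position 1


def parse_by_index(line):
    """Index-based parsing, as in the snippet above."""
    fields = line.split("\t")
    return {"user_id": fields[0], "user_name": fields[1], "user_type": fields[2]}


old_user = parse_by_index(old_row)
new_user = parse_by_index(new_row)

print(old_user["user_name"])  # name1
print(new_user["user_name"])  # 2014-12-05: the wrong column, and no error is raised
```

The failure is silent, which is what makes hard-coded indices scattered through a program so costly to maintain.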

This article recommends two more elegant ways to read data. Both configure a field schema first and then read according to it. The schema takes one of two forms: a dictionary schema or a list schema.

Read the file and split it into a list of field data by delimiter

First, read the file, split the data for each row by a delimiter, and return a list of fields for subsequent processing.

The code is as follows:


def read_file_data(filepath):
    '''Read a file line by line and split each line on tabs.
    @param filepath: path of the file to read
    @return: generator yielding, for each non-empty line, the list of fields split on \t
    '''
    # The with block closes the file when iteration finishes or the generator is closed
    with open(filepath, 'r') as fin:
        for line in fin:
            # Strip only the newline; line[:-1] would cut a real character
            # from a last line that has no trailing newline
            line = line.rstrip('\n')
            if not line:
                continue
            # Yield the split field list of the current row
            yield line.split('\t')

The yield keyword makes the function hand back one row's split fields at a time, so the calling program can read each row with: for fields in read_file_data(fpath)
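The lazy behavior of such a generator can be sketched as follows. The function read_rows is a simplified stand-in for read_file_data, and the sample file contents are made up so that the example is self-contained.

```python
import os
import tempfile


def read_rows(filepath):
    """Yield the tab-separated fields of each non-empty line (simplified stand-in)."""
    with open(filepath, "r") as fin:
        for line in fin:
            line = line.rstrip("\n")
            if line:
                yield line.split("\t")


# Write a small, made-up sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1\tname1\t0\n\n2\tname2\t1\n")
    sample_path = tmp.name

rows = read_rows(sample_path)  # nothing is read yet: this only creates the generator
first = next(rows)             # opens the file and produces the first row
rest = list(rows)              # consumes the remaining rows, then the file is closed
os.remove(sample_path)

print(first)  # ['1', 'name1', '0']
print(rest)   # [['2', 'name2', '1']]
```

Because rows are produced on demand, the whole file never has to fit in memory at once.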

Mapping to a schema, method 1: assemble the field list using a configured dictionary schema

This method configures a dictionary of the form {"field name": field position} as the data schema, assembles each list of fields read from the file according to that schema, and so makes the data accessible as a dictionary.

Functions used:


@staticmethod
def map_fields_dict_schema(fields, dict_schema):
    """Map field names to field values according to the schema.
    For example, with fields ['a','b','c'] and schema {'name':0, 'age':1}, returns {'name':'a','age':'b'}
    @param fields: list of field values, usually obtained by splitting one line on \t
    @param dict_schema: dictionary whose keys are field names and whose values are field positions
    @return: dictionary whose keys are field names and whose values are field values
    """
    pdict = {}
    # items() works on both Python 2 and Python 3; iteritems() exists only on Python 2
    for fstr, findex in dict_schema.items():
        pdict[fstr] = str(fields[int(findex)])
    return pdict

Combining this method with the previous one, you can read the data as follows:


# coding:utf8
"""
@author: www.crazyant.net
Test loading the field list with a dictionary schema
Advantage: for files with many columns, you configure only the columns you want to read
Disadvantage: when many fields are needed, configuring the position of each one is tedious
"""
import file_util
import pprint

# Dictionary schema to read with; only the columns you care about need to be configured
dict_schema = {"userid":0, "username":1, "usertype":2}
for fields in file_util.FileUtil.read_file_data("userfile.txt"):
    # Map the field list using the dictionary schema
    dict_fields = file_util.FileUtil.map_fields_dict_schema(fields, dict_schema)
    pprint.pprint(dict_fields)

Output results:


{'userid': '1', 'username': 'name1', 'usertype': '0'}
{'userid': '2', 'username': 'name2', 'usertype': '1'}
{'userid': '3', 'username': 'name3', 'usertype': '2'}
{'userid': '4', 'username': 'name4', 'usertype': '3'}
{'userid': '5', 'username': 'name5', 'usertype': '4'}
{'userid': '6', 'username': 'name6', 'usertype': '5'}
{'userid': '7', 'username': 'name7', 'usertype': '6'}
{'userid': '8', 'username': 'name8', 'usertype': '7'}
{'userid': '9', 'username': 'name9', 'usertype': '8'}
{'userid': '10', 'username': 'name10', 'usertype': '9'}
{'userid': '11', 'username': 'name11', 'usertype': '10'}
{'userid': '12', 'username': 'name12', 'usertype': '11'}
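One case the mapping function does not handle is a row with fewer fields than the schema expects: the lookup fields[int(findex)] then raises IndexError. The sketch below shows one possible tolerant variant. It is not part of the original file_util.py; the name map_fields_dict_schema_safe and its default parameter are hypothetical.

```python
def map_fields_dict_schema_safe(fields, dict_schema, default=None):
    """Map field names to values, using default when a position is missing from the row.

    Hypothetical helper: not part of the original file_util.py.
    """
    result = {}
    for name, index in dict_schema.items():
        index = int(index)
        # Only index into the row when the position actually exists
        result[name] = str(fields[index]) if 0 <= index < len(fields) else default
    return result


dict_schema = {"userid": 0, "username": 1, "usertype": 2}
full_row = map_fields_dict_schema_safe(["1", "name1", "0"], dict_schema)
short_row = map_fields_dict_schema_safe(["2", "name2"], dict_schema, default="")

print(full_row)   # {'userid': '1', 'username': 'name1', 'usertype': '0'}
print(short_row)  # {'userid': '2', 'username': 'name2', 'usertype': ''}
```

Whether a short row should be filled with a default, skipped, or treated as an error depends on the data, so this is only one of several reasonable choices.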

Mapping to a schema, method 2: assemble the field list using a configured list schema

If you need to read all the columns in the file, or the first several, a dictionary schema becomes cumbersome: you have to configure the index of every field even though those indices simply count up from 0. That is drudgery and should be eliminated.

The list schema exists for exactly this case. The configured list is first converted into a dictionary schema, and the data is then loaded through that dictionary.

The code for converting the schema and for reading with a list schema:


@staticmethod
def transform_list_to_dict(para_list):
    """Convert ['a', 'b'] into {'a':0, 'b':1}.
    @param para_list: list of the field names of each column, in column order
    @return: dictionary mapping each field name to its position
    """
    res_dict = {}
    for idx, name in enumerate(para_list):
        res_dict[str(name).strip()] = idx
    return res_dict

@staticmethod
def map_fields_list_schema(fields, list_schema):
    """Map field names to field values according to the schema.
    For example, with fields ['a','b','c'] and schema ['name', 'age'], returns {'name':'a','age':'b'}
    @param fields: list of field values, usually obtained by splitting one line on \t
    @param list_schema: list of column names, in column order
    @return: dictionary whose keys are field names and whose values are field values
    """
    dict_schema = FileUtil.transform_list_to_dict(list_schema)
    return FileUtil.map_fields_dict_schema(fields, dict_schema)
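Because a list schema names the columns in order, the same mapping can also be built by pairing names with values directly. The following is a minimal sketch of that idea using the built-in zip; it is not the implementation used in file_util.py, and the function name map_fields_with_zip is hypothetical.

```python
def map_fields_with_zip(fields, list_schema):
    """Pair each column name with the value at the same position (hypothetical helper)."""
    return {str(name).strip(): str(value) for name, value in zip(list_schema, fields)}


list_schema = ["userid", "username"]
mapped = map_fields_with_zip(["1", "name1", "0"], list_schema)
short = map_fields_with_zip(["1"], list_schema)

print(mapped)  # {'userid': '1', 'username': 'name1'}
print(short)   # {'userid': '1'}
```

Note that zip stops at the shorter of the two sequences, so extra trailing columns are ignored and a short row yields fewer keys instead of raising an error. That differs from the index-based version, which raises IndexError on a short row.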

In use, the schema is configured as a list, which is more concise because no indices need to be configured:


# coding:utf8
"""
@author: www.crazyant.net
Test loading the field list with a list schema
Advantage: to read all the columns, you simply write the field names in column order
Disadvantage: you cannot pick out only the fields you care about; every column up to the last one you need must be listed
"""
import file_util
import pprint

# List schema to read with; it can cover only the leading columns, or all of them
list_schema = ["userid", "username", "usertype"]
for fields in file_util.FileUtil.read_file_data("userfile.txt"):
    # Map the field list using the list schema
    dict_fields = file_util.FileUtil.map_fields_list_schema(fields, list_schema)
    pprint.pprint(dict_fields)

The results are exactly the same as in dictionary mode.
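The equivalence of the two approaches can be checked with a small self-contained sketch. The helper names map_by_dict and list_to_dict_schema are compact, hypothetical stand-ins for the FileUtil methods, and the rows are made up.

```python
def map_by_dict(fields, dict_schema):
    """Stand-in for map_fields_dict_schema."""
    return {name: str(fields[int(index)]) for name, index in dict_schema.items()}


def list_to_dict_schema(list_schema):
    """Stand-in for transform_list_to_dict."""
    return {str(name).strip(): index for index, name in enumerate(list_schema)}


rows = [["1", "name1", "0"], ["2", "name2", "1"]]  # made-up rows
dict_schema = {"userid": 0, "username": 1, "usertype": 2}
list_schema = ["userid", "username", "usertype"]

converted = list_to_dict_schema(list_schema)
by_dict = [map_by_dict(row, dict_schema) for row in rows]
by_list = [map_by_dict(row, converted) for row in rows]

print(converted == dict_schema)  # True
print(by_dict == by_list)        # True
```

The list schema is just a shorthand for a dictionary schema whose positions count up from 0, which is why the mapped rows are identical.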

file_util.py full code

Here is the complete code of file_util.py, which you can add to your own common library:


# -*- encoding:utf8 -*-
'''
@author: www.crazyant.net
@version: 2014-12-5
'''

class FileUtil(object):
    '''Common helper methods for files and paths
    '''
    @staticmethod
    def read_file_data(filepath):
        '''Read a file line by line and split each line on tabs.
        @param filepath: path of the file to read
        @return: generator yielding, for each non-empty line, the list of fields split on \t
        '''
        # The with block closes the file when iteration finishes or the generator is closed
        with open(filepath, 'r') as fin:
            for line in fin:
                # Strip only the newline; line[:-1] would cut a real character
                # from a last line that has no trailing newline
                line = line.rstrip('\n')
                if not line:
                    continue
                # Yield the split field list of the current row
                yield line.split('\t')

    @staticmethod
    def transform_list_to_dict(para_list):
        """Convert ['a', 'b'] into {'a':0, 'b':1}.
        @param para_list: list of the field names of each column, in column order
        @return: dictionary mapping each field name to its position
        """
        res_dict = {}
        for idx, name in enumerate(para_list):
            res_dict[str(name).strip()] = idx
        return res_dict

    @staticmethod
    def map_fields_list_schema(fields, list_schema):
        """Map field names to field values according to the schema.
        For example, with fields ['a','b','c'] and schema ['name', 'age'], returns {'name':'a','age':'b'}
        @param fields: list of field values, usually obtained by splitting one line on \t
        @param list_schema: list of column names, in column order
        @return: dictionary whose keys are field names and whose values are field values
        """
        dict_schema = FileUtil.transform_list_to_dict(list_schema)
        return FileUtil.map_fields_dict_schema(fields, dict_schema)

    # This method must be indented inside the class; otherwise
    # FileUtil.map_fields_dict_schema does not exist
    @staticmethod
    def map_fields_dict_schema(fields, dict_schema):
        """Map field names to field values according to the schema.
        For example, with fields ['a','b','c'] and schema {'name':0, 'age':1}, returns {'name':'a','age':'b'}
        @param fields: list of field values, usually obtained by splitting one line on \t
        @param dict_schema: dictionary whose keys are field names and whose values are field positions
        @return: dictionary whose keys are field names and whose values are field values
        """
        pdict = {}
        # items() works on both Python 2 and Python 3; iteritems() exists only on Python 2
        for fstr, findex in dict_schema.items():
            pdict[fstr] = str(fields[int(findex)])
        return pdict
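For comparison, Python's standard library offers similar behavior to the list schema: csv.DictReader accepts a fieldnames list and a delimiter. The sketch below is an alternative to FileUtil, not part of the original article's code, and the in-memory sample data is made up to stand in for userfile.txt.

```python
import csv
import io

# Made-up data standing in for userfile.txt
sample = io.StringIO("1\tname1\t0\n2\tname2\t1\n")

# fieldnames plays the role of the list schema; delimiter selects tab-separated input
reader = csv.DictReader(sample, fieldnames=["userid", "username", "usertype"], delimiter="\t")
records = [dict(row) for row in reader]

print(records[0])  # {'userid': '1', 'username': 'name1', 'usertype': '0'}
```

Unlike the plain split on tabs used in this article, the csv module also interprets quote characters, so the two approaches can differ on data containing double quotes.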

Conclusion

That is all for this article. I hope it is of some help to everyone learning or using Python. If you have any questions, feel free to leave a comment.

