Python example: determining file and string encoding types

2020-06-19 10:42:59
OfStack

Python code usually determines file and string encoding types with the chardet toolkit, which recognizes most encodings. However, a few days ago, when reading a txt file saved by Windows Notepad, chardet identified GBK as KOI8-ES7en, and I found no way around it.

So I wrote a simple encoding detection method. The code is as follows:

coding.py


# Note: UTF-8 is compatible with ASCII. GB18030 is compatible with GBK, GBK with GB2312, and GB2312 with ASCII.
CODES = ['UTF-8', 'UTF-16', 'GB18030', 'BIG5']
# UTF-8 BOM prefix bytes
UTF_8_BOM = b'\xef\xbb\xbf'

# Get the encoding type of a file
def file_encoding(file_path):
    """
    Get the encoding type of a file.
    :param file_path: the file path
    :return: the encoding name reported by string_encoding
    """
    with open(file_path, 'rb') as f:
        return string_encoding(f.read())

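# Get the encoding type of byte data
def string_encoding(b: bytes):
    """
    Get the encoding type of byte data.
    :param b: bytes of data
    :return: the encoding name, or a placeholder string if no candidate matches
    """
    # Try each candidate encoding in order
    for code in CODES:
        # Without a BOM, UTF-16 decodes most even-length byte strings,
        # so only accept it when a UTF-16 BOM is present
        if code == 'UTF-16' and not b.startswith((b'\xff\xfe', b'\xfe\xff')):
            continue
        try:
            b.decode(encoding=code)
        except UnicodeDecodeError:
            continue
        if code == 'UTF-8' and b.startswith(UTF_8_BOM):
            return 'UTF-8-SIG'
        return code
    return 'Unknown character encoding type'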

Note: the file_encoding method determines the encoding type of a file and takes the file path as its parameter. The string_encoding method determines the encoding type of a string and takes the string's byte data as its parameter.
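
A caveat with trying encodings in order is that a successful decode only means the bytes are valid in that encoding, not that the encoding is the right one. The following is a minimal sketch of this, not part of coding.py; the names decodes_as and gbk_bytes are made up for illustration. It checks which candidate encodings accept a short GBK byte string:

```python
# GBK bytes for a short Chinese string, as a Windows Notepad ANSI file might contain.
gbk_bytes = '中文'.encode('gbk')  # b'\xd6\xd0\xce\xc4'

def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes under encoding without raising an error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Check each candidate independently instead of stopping at the first match.
results = {enc: decodes_as(gbk_bytes, enc) for enc in ('UTF-8', 'UTF-16', 'GB18030')}
for enc, ok in results.items():
    print(enc, ok)
```

UTF-8 rejects these bytes, but UTF-16 accepts them even though there is no BOM, because most even-length byte strings form valid UTF-16 code units. This is why the order of the candidate list matters, and why requiring a BOM before trusting a UTF-16 result is a reasonable precaution.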

Usage example:


import coding
file_name = input('Please enter the path of the file to be identified:\n')
encoding = coding.file_encoding(file_name)
print(encoding)
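
Once the encoding name is known, it can be passed to open() to read the text. The sketch below is an illustration under assumptions rather than part of the original code: it writes a temporary UTF-8 file with a BOM and reads it back twice, to show why returning 'UTF-8-SIG' instead of 'UTF-8' is useful for such files. The sample text and temporary file are hypothetical.

```python
import os
import tempfile

text = 'encoding test 中文'

# Write a UTF-8 file that starts with a BOM, as some Windows editors do.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xef\xbb\xbf' + text.encode('utf-8'))

try:
    # Plain UTF-8 keeps the BOM as a leading U+FEFF character.
    with open(path, encoding='UTF-8') as f:
        plain = f.read()
    # UTF-8-SIG strips the BOM while decoding.
    with open(path, encoding='UTF-8-SIG') as f:
        sig = f.read()
finally:
    os.remove(path)

print(plain.startswith('\ufeff'))
print(sig == text)
```

Reading with 'UTF-8' leaves an invisible U+FEFF at the start of the text, which can break string comparisons or parsing of the first line. 'UTF-8-SIG' returns the text without it.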