Example of detecting file and string encoding types in Python
- 2020-06-19 10:42:59
- OfStack
In Python, file and string encoding types are usually detected with the chardet toolkit, which recognizes most encodings. However, when I read a txt file saved by Windows Notepad a few days ago, chardet identified GBK as KOI8-R, and I found no way around it.
So I wrote a simple encoding detection method. The code is as follows:
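The mix-up is easier to understand once you see that the same bytes can be valid in several encodings at once. The sketch below is only an illustration, not chardet's algorithm; it assumes the Notepad file held GBK-encoded Chinese text, and the sample string is made up.

```python
# Bytes as they would appear in a GBK-encoded text file (sample text is an assumption).
raw = '编码识别'.encode('gbk')

# A single-byte codec such as KOI8-R assigns a character to every byte value,
# so decoding never raises an error; it simply produces the wrong text.
as_koi8 = raw.decode('koi8_r')
as_gbk = raw.decode('gbk')

print(repr(as_gbk))       # the original Chinese text
print(repr(as_koi8))      # one unrelated character per byte
print(as_koi8 == as_gbk)  # the two readings differ
```

Because both decodes succeed, a detector has to guess from byte statistics alone, and on a short file that guess can go wrong.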
coding.py
# Note: UTF-8 is compatible with ASCII. GB18030 is compatible with GBK, GBK with GB2312, and GB2312 with ASCII.
# Candidate encodings, tried in this order
CODES = ['UTF-8', 'UTF-16', 'GB18030', 'BIG5']
# Prefix bytes of the UTF-8 BOM
UTF_8_BOM = b'\xef\xbb\xbf'
# Get the encoding type of a file
def file_encoding(file_path):
    """
    Get the encoding type of a file.
    :param file_path: path of the file
    :return: name of the detected encoding
    """
    with open(file_path, 'rb') as f:
        return string_encoding(f.read())
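# Get the encoding type of byte data
def string_encoding(b: bytes):
    """
    Get the encoding type of byte data.
    :param b: byte data
    :return: name of the detected encoding
    """
    # Try each candidate encoding in order
    for code in CODES:
        # Without a BOM, almost any even-length data decodes as UTF-16, so require one
        if code == 'UTF-16' and not b.startswith((b'\xff\xfe', b'\xfe\xff')):
            continue
        try:
            b.decode(encoding=code)
        except UnicodeDecodeError:
            continue
        if code == 'UTF-8' and b.startswith(UTF_8_BOM):
            return 'UTF-8-SIG'
        return code
    return 'Unknown character encoding type'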
Note: the file_encoding method determines the encoding type of a file; its parameter is the file path. The string_encoding method determines the encoding type of a string; its parameter is the byte data corresponding to the string.
Usage example:
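Trial decoding of this kind depends on the order of the candidates, because a permissive encoding accepts bytes that were written in a stricter one. The sketch below illustrates this with a hypothetical helper, decodes_as, which is not part of coding.py; the sample text is also an assumption.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes under encoding without an error (hypothetical helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

utf8_bytes = '你好'.encode('utf-8')
gbk_bytes = '你好'.encode('gbk')

# GBK bytes are not valid UTF-8, so trying the strict UTF-8 check first is safe.
gbk_as_utf8 = decodes_as(gbk_bytes, 'utf-8')
# UTF-8 bytes can still be valid GB18030, so GB18030 must not be tried first.
utf8_as_gb18030 = decodes_as(utf8_bytes, 'gb18030')
# Even-length data without a BOM usually decodes as UTF-16 as well.
gbk_as_utf16 = decodes_as(gbk_bytes, 'utf-16')

print(gbk_as_utf8, utf8_as_gb18030, gbk_as_utf16)
```

A successful decode therefore only shows that the bytes are legal in that encoding, not that the text was written in it. This is why the strictest candidates belong at the front of the list, and why a UTF-16 match is only trustworthy when a BOM is present.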
import coding
file_name = input('Please enter the path of the file to be identified:\n')
encoding = coding.file_encoding(file_name)
print(encoding)
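To try the approach without typing a path, the sketch below writes a few temporary files in known encodings and detects each one. It uses a condensed stand-in for the functions in coding.py so that it runs on its own. The names CANDIDATES, detect, and detect_file, the shortened candidate list, and the sample text are all assumptions made for illustration.

```python
import os
import tempfile

# Condensed candidate list (assumption): strictest encoding first.
CANDIDATES = ['UTF-8', 'GB18030', 'BIG5']

def detect(data: bytes) -> str:
    """Hypothetical condensed stand-in for string_encoding."""
    for name in CANDIDATES:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        if name == 'UTF-8' and data.startswith(b'\xef\xbb\xbf'):
            return 'UTF-8-SIG'
        return name
    return 'unknown'

def detect_file(path: str) -> str:
    """Hypothetical condensed stand-in for file_encoding."""
    with open(path, 'rb') as f:
        return detect(f.read())

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for enc in ('utf-8', 'utf-8-sig', 'gbk'):
        path = os.path.join(tmp, enc + '.txt')
        # Write the same sample text in each encoding, then detect it from the raw bytes.
        with open(path, 'w', encoding=enc) as f:
            f.write('文件编码测试')
        results[enc] = detect_file(path)

print(results)
```

The GBK file is reported as GB18030 because GB18030 is a superset of GBK, so text read with the reported encoding still decodes correctly.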