Using chardet in Python to detect the encoding of strings

2020-04-02 14:39:28
OfStack

This example shows how to use chardet in Python to detect the encoding of a string. It is shared here for your reference. The details are as follows:

Recently, while using Python to scrape some data from the Internet, I ran into encoding problems, which were a real headache. This article summarizes the solutions.

On Linux, you can view a file's encoding in vim with the command :set fileencoding
chardet is a powerful encoding detection package for Python, and it is very simple to use. On Linux, install it with pip install chardet


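import chardet
# Read the raw bytes: chardet.detect() works on undecoded byte strings.
with open('file', 'rb') as f:
    fencoding = chardet.detect(f.read())
# Parenthesized so the line also runs on Python 3.
print(fencoding)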

The value of fencoding has the form {'confidence': 0.96630842899499614, 'encoding': 'GB2312'}. chardet can only report the most probable encoding together with a confidence value, but the result is usually fairly accurate. The input parameter must be of type str (a byte string in Python 2).
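
Before trusting the result, it can help to check the confidence value. The following is only a minimal sketch and is not part of chardet: the helper name choose_encoding, the 0.8 threshold, and the UTF-8 fallback are assumptions made for illustration. It relies only on the 'encoding' and 'confidence' keys shown above, and on the fact that chardet reports an encoding of None when it cannot make a guess.

```python
def choose_encoding(detection, min_confidence=0.8, fallback='utf-8'):
    """Return the detected encoding, or `fallback` if the guess is weak.

    `detection` is a dict shaped like the output of chardet.detect().
    The threshold and the fallback are illustrative, not chardet defaults.
    """
    encoding = detection.get('encoding')
    confidence = detection.get('confidence') or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# The dictionary from the example output above.
result = {'confidence': 0.96630842899499614, 'encoding': 'GB2312'}
print(choose_encoding(result))
# A result with no usable guess falls back to the assumed default.
print(choose_encoding({'confidence': 0.0, 'encoding': None}))
```

A suitable threshold depends on the data; short inputs tend to give less reliable guesses than long ones.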

Once the encoding of a str is known, the decode and encode methods can be used to convert it to another encoding.

The general process is to call decode on the str, passing its current encoding, to obtain a unicode object, and then to call encode on that unicode object to produce a str in the target encoding. In Python 2, str and unicode are two different types.
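
The decode-then-encode process can be sketched as follows. This is an illustration only: the function name transcode and the sample bytes are assumptions, and the code is written so that it also runs on Python 3, where the article's str corresponds to bytes and unicode corresponds to str.

```python
def transcode(raw, source_encoding, target_encoding):
    """Decode bytes from one encoding, then encode them into another."""
    text = raw.decode(source_encoding)     # bytes -> unicode text
    return text.encode(target_encoding)    # unicode text -> bytes


# The two characters U+4E2D U+6587, encoded as GBK.
gbk_bytes = b'\xd6\xd0\xce\xc4'
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')
print(repr(utf8_bytes))
```

The intermediate unicode value is what makes the conversion possible: there is no direct step from one byte encoding to another.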

Generally speaking, the default encoding on Windows is GBK (on Simplified Chinese systems), and the default encoding on Linux is UTF-8.
Three concepts matter in Python programming: the system encoding, the Python encoding, and the file encoding.

System encoding: the default encoding the editor uses when writing source code. It determines how the characters in the source file are encoded into the byte stream that is stored on disk. On Linux, check it with the locale command.
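
The same information can also be read from inside Python. This is a small sketch using the standard locale and sys modules; the variable names are illustrative, and the values printed depend entirely on the machine it runs on.

```python
import locale
import sys

# Encoding that the operating-system locale prefers for text data.
locale_encoding = locale.getpreferredencoding(False)

# Encoding of the stream attached to standard output; it can be None
# when the output is piped or redirected.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

print('locale: %s' % locale_encoding)
print('stdout: %s' % stdout_encoding)
```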

Python encoding: the decoding method configured in Python. If none is set, Python 2 defaults to ASCII decoding. This is fine as long as the Python source file contains no Chinese (or other non-ASCII) characters.
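
The effect of ASCII decoding on Chinese text can be reproduced by calling decode('ascii') explicitly. The sketch below is an illustration with made-up variable names; in Python 2 an implicit conversion from str to unicode fails in the same way when the default encoding is ASCII.

```python
# UTF-8 bytes for the character U+4E2D; every byte is outside the ASCII range.
raw = b'\xe4\xb8\xad'

try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    print('ASCII decoding failed: %s' % exc)

# Naming the actual encoding succeeds.
text = raw.decode('utf-8')
print(repr(text))
```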

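Setting method: put # -*- coding: utf-8 -*- at the beginning of the source file (it must be on the first or second line) so that the source file is decoded as UTF-8, or change the default encoding at run time (Python 2 only):


import sys
# Python 2 only: site.py deletes sys.setdefaultencoding at startup,
# so the module must be reloaded before the function can be called.
reload(sys)
sys.setdefaultencoding('UTF-8')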

File encoding: the encoding of a text file. On Linux, view it in vim with :set fileencoding.

In general, garbled output appears when the bytes being output are not in the encoding that the system uses to decode them.
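
A mismatch of this kind is easy to reproduce. In the sketch below (variable names are illustrative), GBK bytes are decoded once with the wrong codec and once with the right one; the 'replace' error handler is used so that the wrong decoding produces visible replacement characters instead of raising an exception.

```python
text = u'\u4e2d\u6587'            # two Chinese characters
gbk_bytes = text.encode('gbk')    # what a GBK-encoded source would provide

# Wrong codec: byte sequences that are invalid UTF-8 become U+FFFD.
garbled = gbk_bytes.decode('utf-8', 'replace')

# Matching codec: the original text is recovered.
restored = gbk_bytes.decode('gbk')

print(repr(garbled))
print(repr(restored))
```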

For example, take print s, where s is of type str. On Linux the default system encoding is UTF-8, so s should be encoded as UTF-8 before it is output. If s is encoded as GBK, it should be output like this: print s.decode('GBK').encode('utf8'). The Chinese text is then displayed correctly.

The same goes for Windows. The default encoding on Windows is GBK, so s must be encoded as GBK when it is output.

Inside a Python program, text is typically handled as unicode objects. That way, it can be encoded directly into whatever encoding the output requires.
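
One way to follow this advice is to decode at the input boundary and encode only at the output boundary. The following is a minimal sketch under that assumption; the function name merge_sources and the sample data are hypothetical.

```python
def merge_sources(sources, output_encoding):
    """Decode each (raw_bytes, encoding) pair, join the text, encode once."""
    # Decode on input so that everything in between is unicode text.
    texts = [raw.decode(encoding) for raw, encoding in sources]
    merged = u' '.join(texts)
    # Encode only when the result leaves the program.
    return merged.encode(output_encoding)


sources = [
    (b'\xd6\xd0\xce\xc4', 'gbk'),             # U+4E2D U+6587 in GBK
    (b'\xe7\xbc\x96\xe7\xa0\x81', 'utf-8'),   # U+7F16 U+7801 in UTF-8
]
output = merge_sources(sources, 'utf-8')
print(repr(output))
```

Because the joining happens on unicode text, inputs in different encodings can be combined safely, and the same function can target GBK for a Windows console by passing 'gbk' as the output encoding.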

I hope this article has helped you with your Python programming.