Tutorial: dynamically detecting character encodings in Python with chardet
- 2020-06-07 04:48:37
- OfStack
Preface
On the Internet, every page is stored in some character encoding, but how can our code find out which encoding a given page uses? chardet is a good solution to this problem.
1. chardet
chardet is a library from the Python community that lets code dynamically detect the character encoding of a page or file. Its interface is simple and easy to use.
Project homepage: https://github.com/chardet/chardet
Local download address: http://xiazai.ofstack.com/201707/yuanma/chardet(ofstack.com).rar
Documentation homepage: http://chardet.readthedocs.io/en/latest/usage.html
2. Usage examples
Note: the author used Python 3.5+.
Case 1: Detecting the encoding of a particular page
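import chardet
import urllib.request

# detect() works on raw bytes, so read the response without decoding it.
with urllib.request.urlopen('http://www.baidu.com/') as response:
    test_data = response.read()
print(chardet.detect(test_data))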
Output:
{'confidence': 0.99, 'encoding': 'utf-8'}
The result reports a confidence of 99% that the page is encoded as utf-8.
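Usage note: detect() is chardet's key function. It takes a bytes object and returns a dictionary containing the detected encoding and a confidence value.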
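The dictionary returned by detect() is usually used to decode the bytes it was computed from. The following is a minimal sketch of that step. The helper name decode_with_result, the 0.5 confidence threshold, and the UTF-8 fallback are assumptions made for illustration and are not part of chardet.

```python
def decode_with_result(data, result, min_confidence=0.5, fallback="utf-8"):
    """Decode bytes using a detect()-style result dictionary.

    `result` is expected to look like {'encoding': ..., 'confidence': ...}.
    The threshold and fallback are illustrative choices, not chardet defaults.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # The encoding can be None when nothing could be detected,
    # so fall back in that case or when the confidence is too low.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    # errors="replace" avoids an exception if the guess turns out to be wrong.
    return data.decode(encoding, errors="replace")
```

With chardet installed, a typical call would be decode_with_result(test_data, chardet.detect(test_data)), where test_data holds the raw bytes of the page.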
Case 2: Detecting the encoding incrementally
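import urllib.request
from chardet.universaldetector import UniversalDetector

usock = urllib.request.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    # Feed the detector one line at a time.
    detector.feed(line)
    # Stop as soon as the detector has seen enough data.
    if detector.done:
        break
# close() finalizes the prediction before the result is read.
detector.close()
usock.close()
print(detector.result)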
Output:
{'confidence': 0.99, 'encoding': 'utf-8'}
Note: To improve the accuracy of the prediction, call detector.feed() repeatedly to supply the data piece by piece. Once the detector has seen enough data (detector.done becomes True), stop feeding it, call detector.close(), and read the prediction from detector.result.
To reuse the same detector object for another input, call detector.reset() first so that it returns to its initial state.
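Reusing one detector across several files could look roughly like the sketch below. The function name detect_files and the chunk size are assumptions made for illustration. The detector argument is expected to be a UniversalDetector instance, and only the reset(), feed(), done, close() and result members shown in Case 2 are used.

```python
def detect_files(paths, detector, chunk_size=4096):
    """Detect the encoding of each file in `paths`, reusing one detector.

    `detector` is assumed to behave like chardet's UniversalDetector.
    Returns a dictionary mapping each path to a copy of its result.
    """
    results = {}
    for path in paths:
        # Clear any state left over from the previous file.
        detector.reset()
        with open(path, "rb") as handle:
            # Read fixed-size chunks until the end of the file.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                detector.feed(chunk)
                # Stop early once the detector is confident enough.
                if detector.done:
                    break
        # Finalize the prediction for this file before reading it.
        detector.close()
        results[path] = dict(detector.result)
    return results
```

With chardet installed, this could be called as detect_files(['somefile', 'someotherfile'], UniversalDetector()).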
Case 3: Detecting file encodings from the command line (available once chardet is installed)
% chardetect somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0
The chardetect command lets you detect file encodings directly from the shell, without writing any Python code.
3. Summary
chardet is an easy-to-use yet powerful Python package, and it is likely to come in handy whenever you work with content from the web. If you have any questions, please send me your feedback.