Tutorial: detecting character encodings dynamically in Python with chardet

  • 2020-06-07 04:48:37
  • OfStack

Preface

Every page on the Internet is stored in some character encoding, but how can our code find out which encoding a given page uses? chardet is a good solution to this problem.

1. chardet

chardet is a library from the Python community that detects, at run time, the character encoding of a page or file. Its interface is simple and easy to use.

Project homepage: https://github.com/chardet/chardet

Local download address: http://xiazai.ofstack.com/201707/yuanma/chardet(ofstack.com).rar

Documentation homepage: http://chardet.readthedocs.io/en/latest/usage.html

2. Usage examples

Note: the author uses Python 3.5+.

Case 1: Detect the encoding of a particular page


import chardet
import urllib.request

# Download the raw page bytes; detect() expects bytes, not str.
raw_data = urllib.request.urlopen('http://www.baidu.com/').read()
print(chardet.detect(raw_data))

Output:


{'confidence': 0.99, 'encoding': 'utf-8'}

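The detected encoding is utf-8, with a confidence of 0.99 (99%).

Usage note: detect() is the key function. It takes a bytes object and returns a dict containing the detected encoding and a confidence value between 0 and 1.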
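The dict returned by detect() can be used directly to decode the bytes. The following is a minimal sketch rather than part of chardet itself: the helper name decode_bytes and its min_confidence and fallback parameters are assumptions made for illustration.

```python
import chardet


def decode_bytes(raw, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using chardet's guess, or a fallback encoding.

    min_confidence and fallback are illustrative defaults chosen for this
    sketch; they are not chardet settings.
    """
    guess = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    encoding = guess["encoding"]
    # chardet reports encoding=None when it cannot decide (e.g. empty input).
    if encoding is None or guess["confidence"] < min_confidence:
        encoding = fallback
    # errors="replace" keeps a wrong guess from raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")


print(decode_bytes(b"plain ASCII text"))
```

Falling back to a fixed encoding when the confidence is low is only one possible policy; which threshold makes sense depends on the data being processed.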

Case 2: Detect the encoding incrementally


import urllib.request
from chardet.universaldetector import UniversalDetector

usock = urllib.request.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    # Feed the page line by line and stop once the detector is confident.
    detector.feed(line)
    if detector.done:
        break
# close() finalizes the prediction; call it before reading the result.
detector.close()
usock.close()
print(detector.result)

Output:


{'confidence': 0.99, 'encoding': 'utf-8'}

Note: detector.feed() lets you supply the input piece by piece. Once the detector has seen enough data, detector.done becomes True and you can stop feeding it; calling detector.close() then finalizes the prediction, which is available in detector.result.
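The same feed/done/close pattern works for any binary stream, not only a network response. Here is a small sketch that reads fixed-size chunks instead of lines; the function name detect_stream and the chunk_size default are assumptions for illustration, and io.BytesIO stands in for a real file opened in 'rb' mode.

```python
import io

from chardet.universaldetector import UniversalDetector


def detect_stream(stream, chunk_size=4096):
    """Feed a binary stream to chardet chunk by chunk and return its result."""
    detector = UniversalDetector()
    # iter(callable, sentinel) keeps calling read() until it returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        detector.feed(chunk)
        if detector.done:  # the detector is already confident, stop reading
            break
    detector.close()  # finalize the prediction before reading the result
    return detector.result


# An in-memory stream used as a stand-in for an open file.
sample = io.BytesIO(b"line of plain text\n" * 1000)
print(detect_stream(sample))
```

Reading in chunks keeps memory use bounded for large files, since the loop can stop long before the end of the stream.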

To reuse the same detector for another input, call detector.reset() first; this clears its state so that it can start a fresh detection.
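As a rough illustration of reuse, the sketch below runs one UniversalDetector over several inputs and calls reset() before each of them. The helper name detect_each is hypothetical, and the byte strings are made-up sample data.

```python
from chardet.universaldetector import UniversalDetector


def detect_each(blobs):
    """Detect the encoding of each bytes object with a single detector."""
    detector = UniversalDetector()
    results = []
    for blob in blobs:
        detector.reset()   # clear any state left over from the previous input
        detector.feed(blob)
        detector.close()   # finalize the prediction for this input
        # Copy the dict so later iterations cannot affect stored results.
        results.append(dict(detector.result))
    return results


for result in detect_each([b"\xef\xbb\xbfstarts with a UTF-8 BOM", b"plain ascii"]):
    print(result)
```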

Case 3: After installing chardet, you can detect file encodings from the command line with chardetect


% chardetect somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0

File encodings can thus be checked directly from the system shell, without writing any Python code.
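If the same kind of report is needed from inside a Python program, it can be approximated with detect(). This is only a sketch that mimics the output format shown above, not the implementation of chardetect; the function name describe_files and the temporary sample file are assumptions for illustration.

```python
import os
import tempfile

import chardet


def describe_files(paths):
    """Return one 'path: encoding with confidence x' line per file."""
    lines = []
    for path in paths:
        # Read in binary mode: chardet works on bytes, not decoded text.
        with open(path, "rb") as handle:
            guess = chardet.detect(handle.read())
        lines.append("{}: {} with confidence {}".format(
            path, guess["encoding"], guess["confidence"]))
    return lines


# Demo on a throwaway file so the sketch does not depend on existing files.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "somefile.txt")
    with open(sample_path, "wb") as handle:
        handle.write(b"plain ASCII content\n")
    report = describe_files([sample_path])

print("\n".join(report))
```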

3. Summary

chardet is an easy-to-use and powerful Python package, and it is likely to come in handy whenever you work with content from the web. If you have any questions, please send me your feedback.

