Python: a complete fix for garbled Chinese text when scraping web pages

2020-05-12 02:51:13
OfStack

Over the past few days I was scraping a certain website. Most of the pages came through fine, but a small number came out garbled. After a few days of debugging I finally found the cause: illegal characters in the page content. I am recording the fix here.

1. Under normal circumstances, you can use


import chardet

# strs holds the raw bytes of the downloaded file or page
thischarset = chardet.detect(strs)["encoding"]

to detect the encoding of the file or page.

Alternatively, simply read the charset=xxxx declaration from the page itself.
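Reading the declared charset can be sketched as follows. This is only a minimal illustration in Python 3 using the standard library; the helper name declared_charset, the 4096-byte search window and the UTF-8 fallback are assumptions of this sketch, not something from the original page.

```python
import re

# Matches charset=xxxx in a <meta> tag, with or without quotes.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE)

def declared_charset(raw_page, default="utf-8"):
    """Hypothetical helper: return the charset declared in the page, or default."""
    # The declaration normally sits in <head>, so only the start is searched.
    match = _CHARSET_RE.search(raw_page[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default

sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=gb2312"></head></html>')
print(declared_charset(sample))
```

Note that the declared charset is only what the page claims; the content can still contain bytes that are invalid in that encoding, which is the case covered next.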

2. When the content contains special characters, decoding with the specified encoding can still go wrong. If the garbling is caused by illegal characters (byte sequences that are invalid in that encoding), you can deal with it by ignoring them.


# Decode, dropping any bytes that are not valid UTF-8, then re-encode
strs = strs.decode("UTF-8","ignore").encode("UTF-8")

The second parameter of decode specifies how illegal characters are handled when they are encountered.

Its default value is "strict", which raises an exception (UnicodeDecodeError).
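The difference between the error-handling modes can be seen in a small experiment. This is a sketch in Python 3; the sample bytes are made up for illustration, with a stray 0xff byte standing in for the illegal character.

```python
# UTF-8 text with one stray byte in the middle (0xff is never valid in UTF-8).
raw = "中文".encode("utf-8") + b"\xff" + "内容".encode("utf-8")

try:
    raw.decode("utf-8")  # same as errors="strict", the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", "ignore")    # drops the bad byte
replaced = raw.decode("utf-8", "replace")  # inserts U+FFFD in its place

print(strict_failed)
print(ignored)
print(replaced)
```

"ignore" silently discards data, so "replace" may be preferable when you want to see where the bad bytes were.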