Solving garbled Chinese text when scraping web pages with Python
- 2020-05-12 02:51:13
- OfStack
Over the past few days I was scraping a certain website. Most pages came through fine, but a small number came out garbled. After several days of debugging I finally found that the cause was a few illegal characters in the content, so I am recording the fix here.
1. Under normal circumstances, you can use
import chardet
thischarset = chardet.detect(strs)["encoding"]
to detect the encoding of the file or page (strs here is the raw content that was fetched).
Alternatively, simply read the charset=xxxx declaration from the page itself.
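As a rough illustration of the second option, here is a minimal sketch that pulls the declared charset out of a page's raw bytes using only the standard library. It assumes Python 3; the helper name sniff_charset, the 4096-byte search window, and the "utf-8" fallback are all choices made for this example, not part of the original approach.

```python
import re

# Hypothetical helper for this example: matches charset=xxxx, with or
# without quotes, as it appears in a <meta> tag.
_CHARSET_RE = re.compile(br'charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE)

def sniff_charset(raw, default="utf-8"):
    """Return the charset declared near the top of the page, or `default`."""
    match = _CHARSET_RE.search(raw[:4096])  # declarations normally sit in <head>
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=gb2312"></head></html>')
print(sniff_charset(page))
```

A declared charset is only what the page claims, so it can still be wrong; chardet.detect remains useful as a cross-check when the two disagree.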
2. When the content contains special characters, decoding with the specified encoding still goes wrong. In other words, the garbling is caused by characters that are illegal in that encoding, and it can be handled by telling the decoder to ignore illegal characters:
strs = strs.decode("UTF-8","ignore").encode("UTF-8")
The second parameter of decode specifies how illegal characters are handled when they are encountered.
Its default is "strict", which raises an exception (UnicodeDecodeError); "ignore" silently drops the offending bytes instead.
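To make the difference between the error modes concrete, the following sketch decodes the same bytes three ways. It assumes Python 3, where the raw content is a bytes object; the sample data (a stray 0xff byte spliced into otherwise valid UTF-8) is made up for the example.

```python
# Valid UTF-8 text with one stray byte (0xff) spliced into the middle.
raw = "中文".encode("utf-8") + b"\xff" + "网页".encode("utf-8")

try:
    raw.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError as exc:
    print("strict mode raised:", exc.reason)

ignored = raw.decode("utf-8", "ignore")    # drops the illegal byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD for it

print(ignored)
print(replaced)
```

"ignore" gives the cleanest output but loses the bad bytes without a trace; "replace" leaves a visible marker, which can help when you need to know that something was dropped.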