Solving the problem of garbled text in a Python web crawler

  • 2020-05-19 05:09:56
  • OfStack

Crawler encoding problems take many forms: not only garbled Chinese text and encoding conversion, but also garbled Japanese, Korean, Russian, Tibetan, and other scripts. Because the solution is the same in every case, they are all covered together here.

Why garbled text appears in a web crawler

The encoding of the source page and the encoding used after crawling do not match.
For example, if the source page is a byte stream encoded in gbk, and after grabbing it the program writes it out encoded as utf-8, the result is inevitably garbled. Conversely, if the program directly uses the same encoding as the source page after grabbing it, there is no garbling, and converting everything to one unified character encoding afterwards causes no garbling either.
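The effect is easy to reproduce in the interpreter. A minimal Python 2.7 sketch (the literal is mine, chosen only for illustration): encode Chinese text as gbk, then decode the bytes as utf-8, and the output is garbage.

# -*- coding: utf-8 -*-
# gbk bytes decoded as utf-8: exactly the mismatch described above
gbk_bytes = u'中文'.encode('gbk')             # what a gbk-encoded source page sends
print gbk_bytes.decode('utf-8', 'replace')    # wrong codec -> replacement characters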

Take care to distinguish three encodings:

A, the encoding of the source page; B, the encoding the program uses directly after crawling; C, the unified character encoding converted to.

The solution to garbled text

First, determine encoding A of the source page. A usually appears in three locations on the page:

1. The HTTP header Content-Type

The server uses this header to tell the browser the encoding of the page content. A typical Content-Type entry reads "text/html; charset=utf-8".
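With Python 2's urllib2 (used later in this article), the charset parameter of this header can be read directly; a minimal sketch:

import urllib2

response = urllib2.urlopen('http://www.ofstack.com/')
# info() returns the response headers; getparam extracts the charset
# parameter from the Content-Type entry (returns None if it is absent)
print response.info().getparam('charset')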

2.meta charset

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
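A crawler has no browser to parse this tag for it, but a simple regular expression recovers the charset from the raw html; a minimal sketch:

import re

html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
m = re.search(r'charset=["\']?([\w-]+)', html, re.IGNORECASE)
if m:
    print m.group(1)   # prints: utf-8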

3. The document.charset definition in the page's scripts


<script type="text/javascript">
if (document.charset) {
  alert(document.charset + "!!!!");
  document.charset = 'GBK';
  alert(document.charset);
} else if (document.characterSet) {
  alert(document.characterSet + "????");
  document.characterSet = 'GBK';
  alert(document.characterSet);
}
</script>

When determining the source page's encoding, check these three locations in order, from first to last: an earlier location takes priority over a later one.
If none of the three carries encoding information, fall back to a third-party encoding-detection tool such as chardet.

Installation: pip install chardet

Official website: http://chardet.readthedocs.io/en/latest/usage.html

Judging character encodings with Python chardet

chardet makes it easy to detect the encoding of a string or file. An HTML page may carry a charset tag, but the tag is sometimes wrong; in those cases chardet helps a great deal.

A chardet example


>>> import urllib
>>> rawdata = urllib.urlopen('http://www.ofstack.com/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'confidence': 0.99, 'encoding': 'GB2312'}

chardet's detect function can be applied directly to a byte string. It returns a dictionary with two entries: the confidence of the detection and the detected encoding.
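Putting the pieces together, here is a sketch (Python 2.7; the helper name get_page_encoding is mine, not from any library) that checks the locations above in priority order. A crawler cannot execute the document.charset script from location 3, so chardet serves as the fallback here instead:

# -*- coding: utf-8 -*-
import re
import urllib2
import chardet

def get_page_encoding(url):
    # hypothetical helper: determine encoding A of a source page
    response = urllib2.urlopen(url)
    html = response.read()

    # 1. HTTP header: Content-Type: text/html; charset=...
    charset = response.info().getparam('charset')
    if charset:
        return charset, html

    # 2. meta tag: <meta ... charset=...>
    m = re.search(r'charset=["\']?([\w-]+)', html, re.IGNORECASE)
    if m:
        return m.group(1), html

    # 3. no declaration found: let chardet guess from the bytes
    return chardet.detect(html)['encoding'], html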

How should Chinese character encoding be handled when developing a crawler for your own use?

What follows applies to Python 2.7. Without any processing, all the collected text comes out garbled. The solution is to convert all the html to one unified encoding, utf-8. Along the way you will encounter windows-1252: this is a misdetection, a case where chardet's recognition training is incomplete.


>>> import chardet
>>> a = 'abc'
>>> type(a)
<type 'str'>
>>> chardet.detect(a)
{'confidence': 1.0, 'encoding': 'ascii'}

>>> a = '我'   # the Chinese character for "I", utf-8 encoded
>>> chardet.detect(a)
{'confidence': 0.73, 'encoding': 'windows-1252'}
>>> a.decode('windows-1252')
u'\xe6\u02c6\u2018'
>>> type(a.decode('windows-1252'))
<type 'unicode'>
>>> type(a.decode('windows-1252').encode('utf-8'))
<type 'str'>
>>> chardet.detect(a.decode('windows-1252').encode('utf-8'))
{'confidence': 0.87625, 'encoding': 'utf-8'}

>>> a = '我是中国人'   # "I am Chinese", utf-8 encoded
>>> type(a)
<type 'str'>
>>> chardet.detect(a)
{'confidence': 0.9690625, 'encoding': 'utf-8'}
A complete script that unifies a fetched page to utf-8:

# -*- coding: utf-8 -*-
import urllib2
import chardet

# grab the page html
html = urllib2.urlopen('http://www.ofstack.com/').read()
print html
mychar = chardet.detect(html)
print mychar
bianma = mychar['encoding']   # bianma is pinyin for "encoding"
if bianma == 'utf-8' or bianma == 'UTF-8':
    html = html.decode('utf-8', 'ignore').encode('utf-8')
else:
    html = html.decode('gb2312', 'ignore').encode('utf-8')
print html
print chardet.detect(html)
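The gb2312 fallback above is tailored to Chinese pages. A slightly more general variant (my own sketch, not from the original script) decodes with whatever encoding chardet reports:

# -*- coding: utf-8 -*-
import urllib2
import chardet

html = urllib2.urlopen('http://www.ofstack.com/').read()
detected = chardet.detect(html)['encoding'] or 'gb2312'   # fall back if detection fails
# decode with the detected codec, dropping undecodable bytes, then re-encode as utf-8
html = html.decode(detected, 'ignore').encode('utf-8')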

Encoding of the Python code file
A .py file is treated as ASCII by default. When it contains Chinese characters, Python 2 cannot interpret them as ASCII and raises the error SyntaxError: Non-ASCII character. You need to add a coding declaration on line 1 of the code file:


# -*- coding: utf-8 -*-

print '中文'   # Chinese text

String literals entered directly, as above, are handled according to the file's declared encoding, utf-8.
To store the text as unicode instead, write:

s1 = u'中文'   # the u prefix stores the string in unicode

decode is a method every string has; it converts the string to unicode, and its parameter indicates the encoding of the source string.
encode is likewise a method every string has; it converts the string to the encoding specified by its argument.
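A short round trip shows both methods in action (Python 2.7; the file needs the utf-8 coding declaration from above):

# -*- coding: utf-8 -*-
s = '中文'                                    # byte string, utf-8 per the declaration
u = s.decode('utf-8')                         # decode: byte string -> unicode
g = u.encode('gbk')                           # encode: unicode -> gbk byte string
print g.decode('gbk').encode('utf-8') == s    # True: the text survives the round trip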

Please refer to the topic "python crawling function summary" for more information.

