Writing a Python crawler from scratch: HTTP exception handling

  • 2020-04-02 14:20:01
  • OfStack

Let's start with HTTP exception handling.
URLError is raised when urlopen cannot handle a response.
Ordinary Python exceptions such as ValueError and TypeError may also be raised along the way.
HTTPError is a subclass of URLError that is raised for specific HTTP URLs.

1. URLError
Typically, a URLError is raised when there is no network connection (no route to the specified server) or when the server does not exist.
In this case, the exception has a "reason" attribute, which is a tuple
containing an error number and an error message.
Let's create urllib2_test06.py to try out the exception handling:


import urllib2

# This host is expected not to resolve, so urlopen raises URLError.
req = urllib2.Request('http://www.baibai.com')
try:
    urllib2.urlopen(req)
except urllib2.URLError as e:
    print e.reason

Press F5 to run it; the printed content is:
[Errno 11001] getaddrinfo failed
That is, the error number is 11001 and the message is "getaddrinfo failed".
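
To make the error number and message from the "reason" attribute easier to work with, here is a minimal sketch. The helper name split_reason is hypothetical, the error is built by hand so no network is needed, and the import fallback to the Python 3 module name is an assumption added so the snippet runs on either version. Depending on the Python version, "reason" may be a tuple, an exception object, or a plain string, so the helper tries each in turn.

```python
import socket

try:
    from urllib2 import URLError          # Python 2, as used in this article
except ImportError:
    from urllib.error import URLError     # assumption: Python 3 equivalent


def split_reason(err):
    """Return (errno, message) for a URLError; errno is None if unknown."""
    reason = err.reason
    if isinstance(reason, tuple) and len(reason) == 2:
        return reason[0], reason[1]
    # Socket errors expose errno/strerror; plain strings expose neither.
    number = getattr(reason, 'errno', None)
    message = getattr(reason, 'strerror', None) or str(reason)
    return number, message


# Hand-built error imitating the DNS failure printed above.
fake = URLError(socket.gaierror(11001, 'getaddrinfo failed'))
number, message = split_reason(fake)
```

In a real crawler you would call split_reason(e) inside the except urllib2.URLError block instead of building the error yourself.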

2. HTTPError

Every HTTP response from the server contains a numeric "status code".
Sometimes the status code indicates that the server cannot fulfill the request. The default handlers deal with some of these responses for you.
For example, if the response is a "redirect" that asks the client to fetch the document from another address, urllib2 handles it for you.
For the responses it cannot handle, urlopen raises an HTTPError.
Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).
The HTTP status code indicates the status of the response returned over the HTTP protocol.
For example, when a client sends a request and the requested resource is retrieved successfully, the server returns status code 200, indicating success.
If the requested resource does not exist, a 404 error is typically returned.

HTTP status codes are three-digit integers, usually divided into five classes by their first digit (1 to 5):
------------------------------------------------------------------------------------------------
200: request succeeded           Handling: get the content of the response and process it
201: request completed and a new resource was created; the URI of the new resource is available in the response entity       Handling: not encountered in a crawler
202: request accepted, but processing is not complete       Handling: block and wait
204: the server fulfilled the request but has no new information to return; a user agent does not need to update its document view       Handling: discard
300: not used directly by HTTP/1.0 applications, only as the default interpretation of 3XX responses; multiple versions of the requested resource are available       Handling: process further if the program can, otherwise discard
301: the requested resource has been assigned a permanent URL, which should be used for future access       Handling: redirect to the assigned URL
302: the requested resource temporarily resides at a different URL         Handling: redirect to the temporary URL
304: the requested resource has not been modified         Handling: discard
400: bad request         Handling: discard
401: unauthorized         Handling: discard
403: forbidden         Handling: discard
404: not found         Handling: discard
5XX: status codes starting with "5" indicate that the server encountered an error and cannot continue processing the request       Handling: discard
------------------------------------------------------------------------------------------------
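
The handling column of the table can be sketched as a small lookup. The names HANDLING and crawler_action and the action strings are hypothetical, chosen only for illustration, and returning 'unlisted' for codes the table does not mention is an assumption rather than anything the table prescribes.

```python
# Hypothetical lookup built from the "Handling" column of the table above.
HANDLING = {
    200: 'process',
    201: 'not expected',
    202: 'wait',
    204: 'discard',
    300: 'process if possible, else discard',
    301: 'redirect',
    302: 'redirect',
    304: 'discard',
    400: 'discard',
    401: 'discard',
    403: 'discard',
    404: 'discard',
}


def crawler_action(status):
    """Return the suggested handling for an HTTP status code."""
    if status in HANDLING:
        return HANDLING[status]
    if 500 <= status <= 599:
        return 'discard'      # the table treats every 5XX code the same way
    return 'unlisted'         # assumption: the table does not cover this code


action = crawler_action(302)  # 'redirect'
```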

An HTTPError instance has an integer 'code' attribute, which corresponds to the error number sent by the server.

Error Codes

Because the default handlers take care of redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that lists all the response codes used by the HTTP protocol.
When an error occurs, the server returns an HTTP error code and an error page.
You can use the HTTPError instance as the response object for the returned page.
This means that, in addition to the code attribute, it also has the read, geturl, and info methods.
Let's build urllib2_test07.py to try it out:


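import urllib2

# The URL needs a scheme (http://); this path is expected not to exist.
req = urllib2.Request('http://www.jb51.net/callmewhy')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # Only HTTPError is guaranteed to carry a status code.
    print e.code
    # print e.read()  # uncomment to print the error page body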

Press F5 to see a 404 error code, which means the page was not found.
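
To see that an HTTPError also behaves like a response object without touching the network, here is a rough sketch that builds one by hand. The URL http://example.invalid/missing and the page body are made up for illustration, and the fallback to the Python 3 module names is an assumption so the snippet runs on either version. It also looks the code up in the BaseHTTPRequestHandler.responses dictionary mentioned earlier.

```python
import io

try:
    from urllib2 import HTTPError                          # Python 2
    from BaseHTTPServer import BaseHTTPRequestHandler
except ImportError:
    from urllib.error import HTTPError                     # assumption: Python 3
    from http.server import BaseHTTPRequestHandler

# Hand-built 404 standing in for what urlopen would raise (made-up URL and body).
body = io.BytesIO(b'<html>page not found</html>')
err = HTTPError('http://example.invalid/missing', 404, 'Not Found', {}, body)

code = err.code                                    # integer status code
phrase = BaseHTTPRequestHandler.responses[code][0] # short description of the code
page = err.read()                                  # error page, read like a response
url = err.geturl()                                 # URL the error belongs to
```

In a crawler, err would be the exception caught from urlopen, and page would be the error page the server sent back.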

3. Wrapping it up

If you want to be prepared for HTTPError or URLError, there are two basic approaches. The second is recommended.
Let's build urllib2_test08.py to demonstrate the first exception handling scheme:


from urllib2 import Request, urlopen, URLError, HTTPError

req = Request('http://www.jb51.net/callmewhy')
try:
    response = urlopen(req)
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it has to be caught first.
    print "The server couldn't fulfill the request."
    print 'Error code: ', e.code
except URLError as e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    print 'No exception was raised.'

As in other languages, try/except catches the exception and prints its details.
Note that except HTTPError must come first; otherwise except URLError will also catch HTTPError.
Because HTTPError is a subclass of URLError, an except URLError clause placed first catches every URLError, including HTTPError.
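
The effect of the clause order can be checked offline with a small sketch. The helper handled_by and the URL are hypothetical, the exceptions are built by hand, and the fallback to the Python 3 module name is an assumption so the snippet runs on either version.

```python
try:
    from urllib2 import URLError, HTTPError        # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError   # assumption: Python 3


def handled_by(exc, order):
    """Raise exc and return the name of the except clause that catches it."""
    first, second = order
    try:
        raise exc
    except first:
        return first.__name__
    except second:
        return second.__name__


# Hand-built 404 (made-up URL); no request is sent.
not_found = HTTPError('http://example.invalid/missing', 404, 'Not Found', {}, None)

is_subclass = issubclass(HTTPError, URLError)
right_order = handled_by(not_found, (HTTPError, URLError))
wrong_order = handled_by(not_found, (URLError, HTTPError))
```

With HTTPError listed first, the 404 reaches the HTTPError clause; with URLError first, the same error is swallowed by the URLError clause and the HTTPError branch never runs.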

Let's build urllib2_test09.py to demonstrate the second exception handling scheme:


from urllib2 import Request, urlopen, URLError, HTTPError

req = Request('http://www.jb51.net/callmewhy')
try:
    response = urlopen(req)
except URLError as e:
    # Check for 'code' first: only HTTPError carries a status code.
    if hasattr(e, 'code'):
        print "The server couldn't fulfill the request."
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
else:
    # everything is fine
    print 'No exception was raised.'
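
One possible way to reuse the second scheme is to wrap it in a function. This is only a sketch: the name fetch, its return convention, the opener parameter, and the stand-in openers are all hypothetical, and the fallback to the Python 3 module names is an assumption. The stand-ins raise the errors directly so the logic can be exercised without a network connection.

```python
try:
    from urllib2 import urlopen, URLError, HTTPError      # Python 2
except ImportError:
    from urllib.request import urlopen                    # assumption: Python 3
    from urllib.error import URLError, HTTPError


def fetch(url, opener=urlopen):
    """Return (body, error_text); exactly one of the two is None."""
    try:
        response = opener(url)
    except URLError as e:
        if hasattr(e, 'code'):              # HTTPError: the server answered
            return None, 'HTTP error %d' % e.code
        return None, 'unreachable: %s' % (e.reason,)
    else:
        return response.read(), None


# Stand-in openers (hypothetical) that fail the way urlopen would.
def fake_404(url):
    raise HTTPError(url, 404, 'Not Found', {}, None)


def fake_dns_failure(url):
    raise URLError('getaddrinfo failed')


http_result = fetch('http://example.invalid/missing', fake_404)
dns_result = fetch('http://example.invalid/', fake_dns_failure)
```

In real use you would call fetch(url) and let it use urlopen by default.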

