Fetching Network Resources with Python's urllib2

  • 2020-04-02 13:15:21
  • OfStack

urllib2 is a Python module for fetching URLs. It offers a simple interface for retrieving resources over different protocols, and it also provides a more complex interface for handling common situations such as basic authentication, cookies, and proxies. These are provided through handler and opener objects.
urllib2 supports fetching URLs for many "URL schemes" (identified by the string before the ":" in the URL - for example, "ftp" is the scheme of "ftp://python.org/"), using the associated network protocol (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.
For straightforward situations urlopen is very easy to use, but as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference for HTTP is RFC 2616 (http://rfc.net/rfc2616.html), which is a technical document not designed for easy reading. The purpose of this HOWTO is to show how to use urllib2, with enough detail about HTTP to help you along. It is not intended to replace the urllib2 reference documentation, but to supplement it.
Fetching URLs

The simplest way to use urllib2 is as follows:


import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

Many uses of urllib2 will be this simple (note that instead of "http:" the URL could begin with "ftp:", "file:", etc.). However, this tutorial concentrates on the more complicated cases, focusing on HTTP.
HTTP is based on a request/response mechanism: the client makes a request and the server sends a response. urllib2 mirrors this with a Request object that represents the HTTP request you are making. In its simplest form you create a Request object specifying the URL you want to fetch; calling urlopen with this Request object returns a response object for the URL requested. The response is a file-like object, which means you can, for example, call .read() on it:


import urllib2
req = urllib2.Request('http://www.jb51.net')
response = urllib2.urlopen(req)
the_page = response.read()

Note that urllib2 makes use of the same interface for handling all URL schemes. For example, you can make an FTP request like so:


req = urllib2.Request('ftp://example.com/')

There are two extra things that HTTP requests allow you to do. First, you can send data to the server, most commonly form data. Second, you can send extra information ("metadata") about the data, or about the request itself, to the server; this information is sent as HTTP "headers".
Let's look at each of these in turn.
Data
Sometimes you want to send data to a URL (often the URL refers to a CGI (Common Gateway Interface) script or some other web application). With HTTP, this is usually done with a POST request, which is what your browser does when you submit an HTML form. Not all POSTs come from forms, though: you can use POST to transmit arbitrary data to your own application. In the common case of an HTML form, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done with a function from urllib, not urllib2:


import urllib
import urllib2
url = 'http://www.jb51.net'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from an HTML form - see the HTML specification, Form Submission, http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13, for details).
If you do not pass a data argument, urllib2 uses a GET request instead. One way in which GET and POST requests differ is that POST requests often have "side effects": they change the state of the system in some way (for example, placing an order with a website for something to be delivered to your door). Although the HTTP standard makes it clear that POSTs are intended to always cause side effects and GET requests never to, nothing prevents a GET request from having side effects or a POST request from having none. Data can also be passed in an HTTP GET request by encoding it into the URL itself.
Consider the following example


 >>> import urllib2 
 >>> import urllib 
 >>> data = {} 
 >>> data['name'] = 'Somebody Here' 
 >>> data['location'] = 'Northampton' 
 >>> data['language'] = 'Python' 
 >>> url_values = urllib.urlencode(data) 
 >>> print url_values 
 name=Somebody+Here&language=Python&location=Northampton 
 >>> url = 'http://www.jb51.net'
 >>> full_url = url + '?' + url_values
 >>> response = urllib2.urlopen(full_url)
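The GET/POST distinction is visible on the Request object itself: urllib2 chooses the method according to whether a data argument was supplied. A quick check (the example.com URL is just a placeholder; nothing is actually fetched, and the import fallback is only there so the sketch also runs on Python 3):

```python
try:
    from urllib2 import Request          # Python 2
except ImportError:
    from urllib.request import Request   # same class in Python 3

# No data argument: urllib2 will issue a GET
get_req = Request('http://www.example.com/')

# With a data argument: urllib2 will issue a POST
post_req = Request('http://www.example.com/', b'name=Somebody+Here')

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```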

Headers
We will discuss one particular HTTP header here, to illustrate how to add headers to your HTTP request.
Some websites dislike being browsed by programs (non-human visitors), or send different versions of their content to different browsers. By default urllib2 identifies itself as "Python-urllib/x.y" (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site or simply not work. The way a browser identifies itself is through the User-Agent header. When you create a Request object, you can pass it a dictionary of headers. The following example makes the same request as above, but identifies itself as a version of Internet Explorer.


import urllib
import urllib2
url = 'http://www.jb51.net'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
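Since nothing is sent at the moment the Request object is built, you can confirm offline that the header was attached. One detail worth knowing: urllib2 stores header names in capitalized form, so they are looked up as "User-agent" (the import fallback is only there so the sketch also runs on Python 3):

```python
try:
    from urllib2 import Request          # Python 2
except ImportError:
    from urllib.request import Request   # same class in Python 3

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = Request('http://www.example.com/',
              headers={'User-Agent': user_agent})

# Header names are stored capitalized ('User-agent'), so query them that way
print(req.get_header('User-agent'))
```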

The response also has two useful methods; see the section on info and geturl below, which comes after a look at what happens when things go wrong.
Handling Exceptions
urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError, TypeError, etc. may also be raised).
HTTPError is the subclass of URLError that is raised in the specific case of HTTP URLs.
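That class relationship can be verified directly, and it also explains why the order of except clauses matters when catching both (the import fallback is only there so the sketch also runs on Python 3):

```python
try:
    from urllib2 import URLError, HTTPError       # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError  # Python 3 location

# HTTPError is-a URLError, so an 'except URLError' clause placed first
# would also swallow every HTTPError
print(issubclass(HTTPError, URLError))  # True
```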
URLError
Often, URLError is raised because there is no network connection (no route to the specified server), or because the specified server does not exist. In this case, the exception raised has a 'reason' attribute: a tuple containing an error code and a text error message.
For example,


>>> req = urllib2.Request('http://www.jb51.net')
>>> try: urllib2.urlopen(req)
>>> except URLError, e:
>>>    print e.reason
>>>
(4, 'getaddrinfo failed')
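The same kind of failure can be provoked without any external network, which makes the behaviour easy to try locally. This is a minimal sketch assuming nothing is listening on port 1 of localhost, which is almost always true (the import fallback is only there so it also runs on Python 3):

```python
try:
    from urllib2 import urlopen, URLError  # Python 2
except ImportError:
    from urllib.request import urlopen     # Python 3 locations
    from urllib.error import URLError

def fetch_reason(url):
    # Return the URLError 'reason' for an unreachable URL, or None on success
    try:
        urlopen(url, timeout=2)
    except URLError as e:
        return e.reason
    return None

# Nothing listens on port 1, so the connection is refused locally
print(fetch_reason('http://localhost:1/'))
```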

HTTPError
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server was unable to fulfil the request. The default handlers will deal with some of these responses for you (for example, if the response is a "redirection" requesting that the client fetch the document from a different URL, urllib2 will handle that itself). For those it cannot handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
The HTTPError instance raised has an integer 'code' attribute, which corresponds to the error code sent by the server.
Error Codes
Because the default handlers deal with redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. (The dictionary is not reproduced here.)
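The full dictionary is easy to inspect yourself; note that the module was renamed to http.server in Python 3, so this sketch tries both locations:

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler  # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler     # Python 3 location

# Maps status code -> (short message, long explanation), per RFC 2616
responses = BaseHTTPRequestHandler.responses
print(responses[404][0])  # Not Found
print(responses[403][0])  # Forbidden
```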
When an error is raised, the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response object for the page returned. This means that, as well as the code attribute, it also has read, geturl, and info methods.


>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
>>>     urllib2.urlopen(req)
>>> except URLError, e:
>>>     print e.code
>>>     print e.read()
>>>


404
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
<?xml-stylesheet href="./css/ht2html.css"
  type="text/css"?>
<html><head><title>Error 404: File Not Found</title>
...... etc...

Wrapping Up
So if you want to be prepared for HTTPError or URLError, there are two basic approaches. I prefer the second one.
The first:


from urllib2 import Request, urlopen, URLError, HTTPError
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    pass

Note: except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
The second:


from urllib2 import Request, urlopen, URLError
req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
    pass

info and geturl
The response returned by urlopen (or the HTTPError instance) has two useful methods, info() and geturl().
geturl - this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect, so the URL of the page fetched may not be the same as the URL requested.
info - this returns a dictionary-like object describing the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance. Typical headers include "Content-length", "Content-type", and so on. See the Quick Reference to HTTP Headers (http://www.cs.tut.fi/~jkorpela/http.html) for a useful list of HTTP headers with brief explanations of their meaning and use.
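Both methods can be tried without a network by fetching a local file through a file:// URL; this sketch creates a temporary file purely for the demonstration (the import fallback is only there so it also runs on Python 3):

```python
import os
import tempfile

try:
    from urllib2 import urlopen         # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3 location

# Write a small local file, then fetch it back through a file:// URL
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello')
os.close(fd)

response = urlopen('file://' + path)
print(response.geturl())                  # the URL actually fetched
print(response.info()['Content-length'])  # header-style metadata: 5 bytes
body = response.read()
response.close()
os.remove(path)
```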
Openers and Handlers
When you get a URL you use a opener(an instance of urllib2.openerdirector, urllib2.openerdirector might be a little confusing in name). Normally, we
Use the default opener -- with urlopen, but you can create personalized openers, which use handler handlers and handle all the "heavy" work. Each handler knows
How to open URLs with a particular protocol, or how to handle aspects of URL opening, such as HTTP redirects or HTTP cookies.
If you want to get URLs with a particular processor you'll want to create an openers, such as getting a opener that handles cookies, or getting an opener that doesn't redirect.
To create a opener, instantiate an OpenerDirector, and the call keeps calling.add_handler(some_handler_instance).
Again, you can use build_opener, which is a much more convenient function to create the opener object with only one function call.
Build_opener adds several handlers by default, but provides a quick way to add or update the default handler.
Other handler handlers you might want to handle agents, validations, and other common but somewhat special cases.
Install_opener is used to create the (global) default opener. This means that calling urlopen will use the opener you installed.
Opener objects have an open method that can be used directly to get urls like the urlopen function: you don't usually need to call install_opener, except for convenience.
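As a concrete sketch, here is a cookie-handling opener built with build_opener and installed as the global default. Constructing and installing it needs no network access (the module names differ in Python 3, so both locations are tried):

```python
try:
    import urllib2 as urlrequest         # Python 2
    from cookielib import CookieJar
except ImportError:
    import urllib.request as urlrequest  # Python 3 locations
    from http.cookiejar import CookieJar

# build_opener adds the default handlers plus the ones we pass in;
# HTTPCookieProcessor stores cookies in the jar across requests
cookie_jar = CookieJar()
opener = urlrequest.build_opener(urlrequest.HTTPCookieProcessor(cookie_jar))

# opener.open(url) would use it directly; install_opener makes it the
# global default so that plain urlopen calls use it too
urlrequest.install_opener(opener)
print(isinstance(opener, urlrequest.OpenerDirector))  # True
```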

