Eight usage details of the urllib2 module in Python

  • 2020-04-02 14:30:20
  • OfStack

The Python standard library contains many useful utility classes, but its documentation often leaves the details of how to use them unclear. urllib2, an HTTP client library, is one example. Here is a summary of some urllib2 usage details.

1 Proxy setup

By default, urllib2 uses the environment variable http_proxy to set the HTTP proxy. If you want to control the proxy explicitly in your program, regardless of the environment variables, you can use the following approach:


import urllib2
 
enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})
 
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
 
urllib2.install_opener(opener)

One detail to note here is that urllib2.install_opener() sets the global opener of urllib2. This is convenient for later use, but it does not allow finer-grained control, such as using two different proxy settings in the same program. It is better not to change the global settings with install_opener, and to call the open method of the opener directly instead of the global urlopen function.
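
As a minimal sketch of this per-opener approach, the helper below builds independent openers and never touches the global one. The helper name make_opener and the proxy hosts are hypothetical, and the fallback import is an addition so the sketch also runs on Python 3, where urllib2 became urllib.request.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent (added for portability)


def make_opener(proxy_url=None):
    """Return an opener that sends HTTP through proxy_url, or uses no proxy if None."""
    proxies = {'http': proxy_url} if proxy_url else {}
    return urllib2.build_opener(urllib2.ProxyHandler(proxies))


# Hypothetical proxy hosts; each opener keeps its own setting.
opener_a = make_opener('http://proxy-a.example.com:8080')
opener_b = make_opener('http://proxy-b.example.com:3128')

# Call open() on the opener you want instead of urllib2.urlopen():
# response = opener_a.open('http://www.example.com/')
```

Because nothing is installed globally, code elsewhere that calls urllib2.urlopen() is unaffected by either proxy.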

2 Timeout setting

Before Python 2.6, the urllib2 API did not expose a timeout setting. The only way to set a timeout was to change the global socket timeout value.


import urllib2
import socket
 
socket.setdefaulttimeout(10) # Time out after 10 seconds
urllib2.socket.setdefaulttimeout(10) # Another way
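
Since the socket default applies to the whole process, one option is to limit the change to a block of code and restore the previous value afterwards. The following is only a sketch; the helper name default_timeout is hypothetical.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_timeout(seconds):
    """Temporarily change the global socket timeout, then restore the old value."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        socket.setdefaulttimeout(previous)


# Usage: only sockets created inside the block pick up the 10 second default.
# with default_timeout(10):
#     response = urllib2.urlopen('http://www.example.com/')
```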

Since Python 2.6, the timeout can be set directly with the timeout parameter of urllib2.urlopen().

import urllib2
response = urllib2.urlopen('http://www.google.com', timeout=10)

3 Add a specific Header to the HTTP Request

To add a Header, use the Request object:

import urllib2
 
uri = 'http://www.example.com/'  # placeholder URL; substitute your own
request = urllib2.Request(uri)
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)

Pay special attention to some headers, because the server checks them:

1. User-Agent: some servers or proxies check this value to determine whether the request was made by a browser.
2. Content-Type: when using a REST interface, the server checks this value to determine how the content in the HTTP body should be parsed.

Common Content-Type values are:

1. application/xml: used in XML RPC calls, such as RESTful/SOAP calls
2. application/json: used in JSON RPC calls
3. application/x-www-form-urlencoded: used by a browser when submitting a Web form
...

When calling RESTful or SOAP services provided by a server, a wrong Content-Type setting can cause the server to reject the request.
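
As an illustration, a JSON request with an explicit Content-Type might be built as follows. This is a sketch only: the endpoint and payload are hypothetical, and the fallback import is an addition so the example also runs on Python 3, where urllib2 became urllib.request.

```python
import json

try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent (added for portability)

# Hypothetical payload; the body must be bytes on Python 3.
payload = json.dumps({'name': 'example'}).encode('utf-8')

# Hypothetical endpoint. Supplying data makes this a POST request.
request = urllib2.Request('http://api.example.com/items', data=payload)
request.add_header('Content-Type', 'application/json')
request.add_header('User-Agent', 'fake-client')

# Inspect the request without sending it. Header names are stored
# capitalized as 'Content-type', so look them up in that form.
print(request.get_method())                 # POST
print(request.get_header('Content-type'))   # application/json

# response = urllib2.urlopen(request)
```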

4 Redirect

urllib2 follows redirects automatically for 3xx HTTP return codes by default, with no manual configuration. To detect whether a redirect has occurred, simply check whether the URL of the response and the URL of the request are the same.


import urllib2
response = urllib2.urlopen('http://www.google.cn')
redirected = response.geturl() != 'http://www.google.cn'  # True if the final URL differs
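
Comparing URLs only tells you that some redirect happened. If each hop matters, one possible approach is to override redirect_request in a handler that records the target and then delegates to the stock behaviour. The class name RecordingRedirectHandler and the hops attribute are hypothetical, and the fallback import is an addition so the sketch also runs on Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent (added for portability)


class RecordingRedirectHandler(urllib2.HTTPRedirectHandler):
    """Follow redirects as usual, but remember every hop."""

    def __init__(self):
        self.hops = []   # list of (status code, new URL) pairs

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        # Delegate to the stock behaviour so the redirect is still followed.
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, newurl)


recorder = RecordingRedirectHandler()
opener = urllib2.build_opener(recorder)

# response = opener.open('http://www.example.com/')
# redirected = bool(recorder.hops)
```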

If you don't want redirects to be followed automatically, then besides dropping down to the lower-level httplib library, you can use a custom HTTPRedirectHandler class.

import urllib2
 
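class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Returning None means the redirect is not followed; urllib2 then
    # falls back to its default error handling and raises HTTPError.
    def http_error_301(self, req, fp, code, msg, headers):
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        pass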
 
opener = urllib2.build_opener(RedirectHandler)
opener.open('http://www.google.cn')

5 Cookie

urllib2 can also handle cookies automatically once an HTTPCookieProcessor is added to the opener. If you need to get the value of a particular cookie, you can do this:


import urllib2
import cookielib
 
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.google.com')
for item in cookie:
    if item.name == 'some_cookie_item_name':
        print item.value
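
The lookup loop can be wrapped in a small helper so that a missing cookie yields a default instead of printing nothing. This is a sketch; the helper name get_cookie_value is hypothetical, and the fallback imports are an addition so it also runs on Python 3, where cookielib became http.cookiejar.

```python
try:
    import urllib2                      # Python 2
    import cookielib
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalents (added for portability)
    import http.cookiejar as cookielib


def get_cookie_value(jar, name, default=None):
    """Return the value of the first cookie called name, or default if absent."""
    for item in jar:
        if item.name == name:
            return item.value
    return default


jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# opener.open('http://www.example.com/')
# value = get_cookie_value(jar, 'some_cookie_item_name')
```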

6 Use the HTTP PUT and DELETE methods

urllib2 only supports the HTTP GET and POST methods out of the box. To use HTTP PUT and DELETE, you would normally have to use the lower-level httplib library. Nevertheless, we can make urllib2 send HTTP PUT or DELETE requests as follows:

import urllib2
 
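uri = 'http://www.example.com/resource'  # placeholder URL; substitute your own
data = 'payload'                         # placeholder request body
request = urllib2.Request(uri, data=data)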
request.get_method = lambda: 'PUT' # or 'DELETE'
response = urllib2.urlopen(request)

This is a hack, but it works without problems in practice.
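
A slightly tidier variant of the same hack is to put the override into a Request subclass. The class name MethodRequest and the example URL are hypothetical, and the fallback import is an addition so the sketch also runs on Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent (added for portability)


class MethodRequest(urllib2.Request):
    """A Request whose HTTP method can be forced, for example to PUT or DELETE."""

    def __init__(self, url, method=None, **kwargs):
        self._forced_method = method
        urllib2.Request.__init__(self, url, **kwargs)

    def get_method(self):
        # Fall back to the normal GET/POST choice when no method was forced.
        return self._forced_method or urllib2.Request.get_method(self)


request = MethodRequest('http://www.example.com/resource/1', method='DELETE')
print(request.get_method())   # DELETE

# response = urllib2.urlopen(request)
```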

7 Get the HTTP return code

For 200 OK, the getcode() method of the response object returned by urlopen gives the HTTP return code. For error return codes such as 404 or 500, however, urlopen raises an exception, and you need to check the code attribute of the exception object instead:


import urllib2
try:
    response = urllib2.urlopen('http://restrict.web.com')
except urllib2.HTTPError, e:
    print e.code
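
Both paths can be folded into one helper that always returns the status code. This is a sketch under assumptions: the name get_status and its open_func parameter (for example urllib2.urlopen, or the open method of an opener) are hypothetical, and the fallback import is an addition so it also runs on Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent (added for portability)


def get_status(open_func, url):
    """Return the HTTP status code whether or not the request succeeded."""
    try:
        response = open_func(url)
    except urllib2.HTTPError as e:
        return e.code              # error codes arrive as exceptions
    return response.getcode()      # successful responses


# status = get_status(urllib2.urlopen, 'http://www.example.com/')
```

Network failures that are not HTTP errors, such as an unreachable host, still raise and are deliberately not caught here.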

8 Debug log

When using urllib2, you can turn on the debug log as follows. The contents of sent and received packets are then printed to the screen, which is convenient for debugging and can, to some extent, save you from capturing packets.


import urllib2
 
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
 
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')
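
If the debug output is only wanted for some requests, the same handlers can be attached to a separate opener instead of being installed globally. A minimal sketch, assuming a hypothetical helper named make_opener; the fallback import is an addition so it also runs on Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent (added for portability)


def make_opener(debug=False):
    """Build an opener whose HTTP and HTTPS handlers log traffic when debug is True."""
    level = 1 if debug else 0
    return urllib2.build_opener(
        urllib2.HTTPHandler(debuglevel=level),
        urllib2.HTTPSHandler(debuglevel=level))


debug_opener = make_opener(debug=True)   # prints request and response headers
quiet_opener = make_opener()             # stays silent

# response = debug_opener.open('http://www.example.com/')
```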

