A summary of some of the usage details of the Python standard library urllib2

  • 2020-04-02 14:42:04
  • OfStack

The Python standard library contains many useful utility classes, but the official documentation does not always make the details of using them clear. urllib2, an HTTP client library, is one example. This article summarizes some of the details of using urllib2.

1. The Proxy Settings
2. The Timeout setting
3. Add a specific Header to the HTTP Request
4. Redirect
5. Cookie
6. Use the HTTP PUT and DELETE methods
7. Get the HTTP return code
8. Debug Log

The Proxy Settings

By default, urllib2 uses the environment variable http_proxy to set the HTTP proxy. If you want to control the proxy explicitly in your program, without being influenced by environment variables, you can use the following approach:


import urllib2
enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)

One detail to note here is that urllib2.install_opener() sets the global opener of urllib2. This is convenient for later use, but it rules out finer-grained control, such as using two different proxy settings in the same program. A better approach is to leave the global settings alone: skip install_opener and call the open method of the opener directly instead of the global urlopen function.
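
As a minimal sketch of that advice, the following builds one private opener per proxy and never touches the global opener. The helper name make_opener and the proxy addresses are hypothetical placeholders, and the import fallback to urllib.request is an assumption added so that the snippet also runs on Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent module


def make_opener(proxy_url=None):
    """Build a private opener; proxy_url=None means a direct connection."""
    proxies = {'http': proxy_url} if proxy_url else {}
    # Passing an explicit ProxyHandler stops build_opener from adding the
    # default one, so the http_proxy environment variable is ignored.
    return urllib2.build_opener(urllib2.ProxyHandler(proxies))


# Hypothetical proxy addresses, for illustration only.
opener_a = make_opener('http://proxy-a.example.com:8080')
opener_b = make_opener('http://proxy-b.example.com:8080')
direct = make_opener()

# Each request then goes through the opener that owns the desired settings:
# response = opener_a.open('http://www.google.com')
```

Because nothing is installed globally, urllib2.urlopen() keeps its default behavior for code that does not know about these openers.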

The Timeout setting

Before Python 2.6, the urllib2 API did not expose a timeout setting. The only way to set a timeout was to change the global default timeout of the socket module:


import urllib2
import socket
socket.setdefaulttimeout(10)  # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way to call the same function
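
Since this setting is global, it can help to limit how long it stays in effect. The following is a small sketch, not part of urllib2, of a context manager that restores the previous value afterwards; the name default_socket_timeout is hypothetical.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_socket_timeout(seconds):
    """Temporarily change the process-wide default socket timeout."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore the earlier value so unrelated sockets are not affected.
        socket.setdefaulttimeout(previous)


with default_socket_timeout(10):
    # Sockets created here, including those opened by urllib2.urlopen(),
    # start with a 10 second timeout.
    timeout_inside = socket.getdefaulttimeout()

timeout_after = socket.getdefaulttimeout()
```

The value is still shared by the whole process, so sockets created by other threads while the block is running see it as well.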

Since Python 2.6, the timeout can be set directly with the timeout parameter of urllib2.urlopen():


import urllib2
response = urllib2.urlopen('http://www.google.com', timeout=10)
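
When the timeout expires, the call raises an exception that the caller has to handle. The sketch below shows one way to do that, under the assumption that returning None on a timeout is acceptable; the helper name fetch_or_none is hypothetical, and the import fallback to urllib.request is only there so that the snippet also runs on Python 3.

```python
import socket

try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent module


def fetch_or_none(url, timeout_seconds, opener=None):
    """Return the response body, or None if the request timed out."""
    opener = opener or urllib2.build_opener()
    try:
        response = opener.open(url, timeout=timeout_seconds)
        try:
            return response.read()
        finally:
            response.close()
    except socket.timeout:
        # Raised directly when waiting for the response takes too long.
        return None
    except urllib2.URLError as e:
        # A timeout while connecting arrives wrapped in URLError.
        if isinstance(getattr(e, 'reason', None), socket.timeout):
            return None
        raise


# body = fetch_or_none('http://www.google.com', 10)
```

Other failures, such as HTTP error codes or refused connections, are deliberately left to propagate.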

Add a specific Header to the HTTP Request

To add a header, use the Request object:


import urllib2
uri = 'http://www.example.com/'  # placeholder URL
request = urllib2.Request(uri)
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)

Some headers deserve particular care, because servers check them:

User-Agent: some servers or proxies use this value to decide whether the request came from a browser.

Content-Type: when calling a REST interface, the server checks this value to decide how to parse the content of the HTTP body. Common values are:

application/xml: used for XML RPC calls, such as RESTful/SOAP calls
application/json: used for JSON RPC calls
application/x-www-form-urlencoded: used by browsers when submitting a web form

When using a RESTful or SOAP service provided by a server, a wrong Content-Type setting can cause the server to reject the request.
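
As an illustration, a JSON request with both headers set might look like the sketch below. The endpoint URL and the payload are hypothetical, and the import fallback to urllib.request is an assumption added so that the snippet also runs on Python 3.

```python
import json

try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent module

# Hypothetical endpoint and payload, for illustration only.
url = 'http://www.example.com/api/items'
payload = json.dumps({'name': 'demo'}).encode('utf-8')

request = urllib2.Request(url, data=payload)
request.add_header('Content-Type', 'application/json')
request.add_header('User-Agent', 'fake-client')

# urllib2 stores header names capitalized, for example 'Content-type'.
sent_headers = dict(request.header_items())

# Supplying data makes this a POST request:
# response = urllib2.urlopen(request)
```

Inspecting header_items() before sending is a quick way to confirm what the server will receive.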

Redirect

By default, urllib2 follows redirects automatically for the HTTP redirect return codes (301, 302, 303 and 307), with no manual configuration. To detect whether a redirect took place, check whether the URL of the response differs from the URL of the request:


import urllib2
response = urllib2.urlopen('http://www.google.cn')
redirected = response.geturl() != 'http://www.google.cn'

If you do not want automatic redirects, you can either use the lower-level httplib library or subclass HTTPRedirectHandler:


import urllib2
 
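class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Doing nothing here means the redirect is not followed.
    def http_error_301(self, req, fp, code, msg, headers):
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        pass

opener = urllib2.build_opener(RedirectHandler)
# A 301 or 302 response now raises urllib2.HTTPError instead of being followed.
opener.open('http://www.google.cn')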
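
A variation on the same idea is to override redirect_request, the hook that HTTPRedirectHandler consults for every redirect code it handles. The sketch below is one possible shape, not the only one; the names NoRedirectHandler and probe_redirect are hypothetical, and the import fallback to urllib.request is an assumption added so that the snippet also runs on Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent module


class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    """Refuse every redirect instead of following it."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means this handler will not redirect; the default
        # error handler then raises HTTPError for the 3xx return code.
        return None


def probe_redirect(url, opener):
    """Return (return code, Location header or None) without redirecting."""
    try:
        response = opener.open(url)
    except urllib2.HTTPError as e:
        return e.code, e.info().get('Location')
    try:
        return response.getcode(), None
    finally:
        response.close()


no_redirect_opener = urllib2.build_opener(NoRedirectHandler)
# code, location = probe_redirect('http://www.google.cn', no_redirect_opener)
```

With this opener the caller sees the redirect target in the Location header and can decide whether to request it.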

Cookie

urllib2 can also handle cookies automatically once an HTTPCookieProcessor is added to the opener. If you need to get the value of a particular cookie, you can do this:


import urllib2
import cookielib
 
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.google.com')
for item in cookie:
    if item.name == 'some_cookie_item_name':
        print item.value
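
The lookup loop can be wrapped in a small helper so that it returns the value instead of printing it. This is only a sketch: the name cookie_value is hypothetical, and the import fallbacks to urllib.request and http.cookiejar are assumptions added so that the snippet also runs on Python 3.

```python
try:
    import urllib2  # Python 2
    import cookielib
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalents
    import http.cookiejar as cookielib


def cookie_value(jar, name, default=None):
    """Return the value of the first cookie called name, or default."""
    for item in jar:
        if item.name == name:
            return item.value
    return default


jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# After a request, the jar holds the cookies the server set:
# opener.open('http://www.google.com')
# value = cookie_value(jar, 'some_cookie_item_name')
```

The same jar is reused by every request made through this opener, so cookies set by one response are sent with the next request.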

Use the HTTP PUT and DELETE methods

urllib2 itself only exposes the HTTP GET and POST methods. The usual way to use PUT or DELETE is the lower-level httplib library. However, a urllib2 request can still be made to send PUT or DELETE as follows:


import urllib2
 
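uri = 'http://www.example.com/resource'  # placeholder URL
data = 'payload'  # placeholder request body
request = urllib2.Request(uri, data=data)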
request.get_method = lambda: 'PUT' # or 'DELETE'
response = urllib2.urlopen(request)

This is a hack, but it works without problems in practice.
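
If the lambda assignment feels too ad hoc, the same effect can be had with a small Request subclass. The following is a sketch of that alternative; the class name MethodRequest, its http_method parameter and the URL are hypothetical, and the import fallback to urllib.request is an assumption added so that the snippet also runs on Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent module


class MethodRequest(urllib2.Request):
    """A Request whose HTTP method can be chosen explicitly."""

    def __init__(self, url, http_method=None, **kwargs):
        urllib2.Request.__init__(self, url, **kwargs)
        self._http_method = http_method

    def get_method(self):
        if self._http_method is not None:
            return self._http_method
        # Fall back to the built-in rule: POST with data, otherwise GET.
        return urllib2.Request.get_method(self)


# Hypothetical URL and body, for illustration only.
put_request = MethodRequest('http://www.example.com/items/1',
                            http_method='PUT', data=b'{"name": "demo"}')
delete_request = MethodRequest('http://www.example.com/items/1',
                               http_method='DELETE')
# response = urllib2.urlopen(put_request)
```

The subclass keeps the chosen method with the request object, which is easier to follow when requests are built in one place and sent in another.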

Get the HTTP return code

For a successful response such as 200 OK, call the getcode() method of the response object returned by urlopen to get the HTTP return code. For error return codes, however, urlopen raises an exception, and the code has to be read from the code attribute of the exception object:


import urllib2
try:
    response = urllib2.urlopen('http://restrict.web.com')
except urllib2.HTTPError, e:
    print e.code
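
Both cases can be folded into one helper that always returns the code. The sketch below assumes that only the number is of interest; the name http_status is hypothetical, and the import fallback to urllib.request is an assumption added so that the snippet also runs on Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent module


def http_status(url, opener=None):
    """Return the HTTP return code for url, error codes included."""
    opener = opener or urllib2.build_opener()
    try:
        response = opener.open(url)
    except urllib2.HTTPError as e:
        # Error return codes such as 404 or 500 arrive as exceptions.
        return e.code
    try:
        return response.getcode()
    finally:
        response.close()


# code = http_status('http://restrict.web.com')
```

Failures that produce no HTTP response at all, such as a refused connection, still raise URLError and are left for the caller to handle.
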

Debug Log

When using urllib2, you can turn on the debug log as follows. The content of the packets sent and received is then printed to the screen, which is convenient for debugging and can sometimes save you the work of capturing packets:


import urllib2
 
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
 
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')
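
To keep the debug output limited to selected requests, the handlers can go into a private opener instead of the global one. This is a sketch of that variation; the helper name make_debug_opener is hypothetical, and the import fallback to urllib.request is an assumption added so that the snippet also runs on Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent module


def make_debug_opener(debuglevel=1):
    """Build a private opener that prints the HTTP traffic it handles."""
    handlers = [urllib2.HTTPHandler(debuglevel=debuglevel)]
    # HTTPSHandler is only defined when Python has SSL support.
    https_handler_class = getattr(urllib2, 'HTTPSHandler', None)
    if https_handler_class is not None:
        handlers.append(https_handler_class(debuglevel=debuglevel))
    return urllib2.build_opener(*handlers)


debug_opener = make_debug_opener()
# response = debug_opener.open('http://www.google.com')
```

Requests made with urllib2.urlopen() elsewhere in the program stay quiet, because install_opener is never called.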

