Some common uses of the urllib library in Python 2 and 3

  • 2020-06-19 10:40:02
  • OfStack

What is the Urllib library

urllib is a module provided by Python for working with URLs. We often need this library when crawling web pages.

After the Python 3 upgrade merged urllib and urllib2 into one package, the location of many of its functions changed considerably.

urllib library reference table

Python2.X               Python3.X
urllib                  urllib.request, urllib.error, urllib.parse
urllib2                 urllib.request, urllib.error
urllib2.urlopen         urllib.request.urlopen
urllib.urlencode        urllib.parse.urlencode
urllib.quote            urllib.request.quote
urllib2.Request         urllib.request.Request
urlparse                urllib.parse
urllib.urlretrieve      urllib.request.urlretrieve
urllib2.URLError        urllib.error.URLError
cookielib.CookieJar     http.cookiejar.CookieJar

The urllib library is used to work with URLs and crawl web pages; third-party libraries such as requests and httplib2 serve a similar purpose.

In Python 2.X there are both urllib and urllib2, but in Python 3.X everything has been merged into urllib. The table above lists the common changes; based on it you can quickly write the corresponding version of a Python program.

Python 3.X also handles Chinese more gracefully than Python 2.X, so this post goes on to introduce some common uses of the urllib library with Python 3.X.

Send the request


import urllib.request
r = urllib.request.urlopen("http://www.python.org/")

First import the urllib.request module, then use urlopen() to send a request to the URL given as a parameter; it returns an http.client.HTTPResponse object.

In urlopen(), the timeout parameter sets how many seconds to wait for a response before giving up. In addition, r.info(), r.getcode() and r.geturl() return the response header information, the status code and the URL of the current page respectively.
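
A minimal sketch of these options (assuming python.org is reachable):

import urllib.request
# stop waiting after 10 seconds and inspect the response
r = urllib.request.urlopen("http://www.python.org/", timeout=10)
print(r.info())     # response header / environment information
print(r.getcode())  # HTTP status code, e.g. 200
print(r.geturl())   # URL of the page actually retrieved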

Read the response


import urllib.request
url = "http://www.python.org/"
with urllib.request.urlopen(url) as r:
 r.read()

Use r.read() to read the response content into memory; the content is the page's source code (the same as the browser's "View page source" function) and can be decoded into a string with decode().
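
A short sketch of decoding the body (assuming the page is UTF-8 encoded):

import urllib.request
url = "http://www.python.org/"
with urllib.request.urlopen(url) as r:
    # read() returns bytes; decode() turns them into a str
    html = r.read().decode('utf-8')
print(html[:200])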

Pass the URL parameter


import urllib.request
import urllib.parse
params = urllib.parse.urlencode({'q': 'urllib', 'check_keywords': 'yes', 'area': 'default'})
url = "https://docs.python.org/3/search.html?{}".format(params)
r = urllib.request.urlopen(url)

The data is given as a dictionary of strings, encoded with urlencode() and passed as the URL's query string.

The encoded params is a single string in which the dictionary's key-value pairs are joined with '&', for example: q=urllib&check_keywords=yes&area=default

The resulting URL is: https://docs.python.org/3/search.html?q=urllib&check_keywords=yes&area=default

Of course, urlopen() also accepts a URL that is built directly, so for a simple GET request you can skip urlencode() and construct the URL by hand before sending the request, as in the sketch below. The approach above, however, keeps the code modular and more elegant.
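
For comparison, a sketch of the manual alternative for the same search request:

import urllib.request
# query string spliced by hand instead of via urlencode()
url = "https://docs.python.org/3/search.html?q=urllib&check_keywords=yes&area=default"
r = urllib.request.urlopen(url)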

Passing Chinese parameters


import urllib.request
searchword = urllib.request.quote(input("Please enter the keyword to query: "))
url = "https://cn.bing.com/images/async?q={}&first=0&mmasync=1".format(searchword)
r = urllib.request.urlopen(url)

This URL uses the Bing image interface to query images for the keyword q. If the Chinese text is put into the URL directly, the request fails with an encoding error, so we use quote() to URL-encode the Chinese keyword; the corresponding unquote() can decode it again.
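
A small sketch of quote()/unquote() using the canonical urllib.parse location (urllib.request re-exports the same names):

import urllib.parse
encoded = urllib.parse.quote("爬虫")      # '%E7%88%AC%E8%99%AB'
decoded = urllib.parse.unquote(encoded)   # back to '爬虫'
print(encoded, decoded)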

Custom request header


import urllib.request
url = 'https://docs.python.org/3/library/urllib.request.html'
headers = {
 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
 'Referer': 'https://docs.python.org/3/library/urllib.html'
}
req = urllib.request.Request(url, headers=headers)
r = urllib.request.urlopen(req)

Sometimes when crawling a page, a 403 error (Forbidden) occurs, meaning access is denied. This is because the web server checks the visitor's Headers. For example, requests sent through the urllib library carry a default User-Agent of Python-urllib/X.Y, where X is Python's major version and Y its minor version. So we need to build a Request object with urllib.request.Request() and pass in a headers dictionary to simulate a browser.

The corresponding Headers information can be obtained from the "Network" tab of the browser's developer tools (the "Inspect" feature) while viewing the page, or with packet-capture tools such as Fiddler and Wireshark.

Besides the method above, you can also use urllib.request.build_opener() or req.add_header() to customize the request header, as shown in the official examples and in the sketch below.
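
A sketch of both alternatives (the URL and User-Agent are the same ones used above):

import urllib.request
url = 'https://docs.python.org/3/library/urllib.request.html'
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

# 1) add_header() sets headers on an existing Request object
req = urllib.request.Request(url)
req.add_header('User-Agent', ua)
r1 = urllib.request.urlopen(req)

# 2) build_opener() with a default header list
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', ua)]
r2 = opener.open(url)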

In Python 2.X, the urllib and urllib2 modules are usually used together: urllib.urlencode() can encode URL parameters, while urllib2.Request() builds a Request object and customizes the request header; urllib2.urlopen() is then generally used to send the request.
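
For reference, a rough Python 2.X sketch of that pattern (the search URL here is just an example):

# Python 2.X
import urllib
import urllib2

params = urllib.urlencode({'q': 'urllib'})
req = urllib2.Request("https://docs.python.org/2/search.html?" + params,
                      headers={'User-Agent': 'Mozilla/5.0'})
r = urllib2.urlopen(req)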

Send a POST request


import urllib.request
import urllib.parse
url = 'https://passport.cnblogs.com/user/signin?'
post = {
 'username': 'xxx',
 'password': 'xxxx'
}
postdata = urllib.parse.urlencode(post).encode('utf-8')
req = urllib.request.Request(url, postdata)
r = urllib.request.urlopen(req)

When we register, log in, and so on, we pass information through the POST form.

At this point, we need to analyze the page structure, build the form data post, encode it with urlencode() into a string and then encode that string as 'utf-8', because the data for POST can only be bytes or a file object. Finally, pass postdata to the Request() object and send the request with urlopen().

Download remote data locally


import urllib.request
url = "https://www.python.org/static/img/python-logo.png"
urllib.request.urlretrieve(url, "python-logo.png")

When crawling pictures, videos and other remote data, urlretrieve() can be used to download them locally.

The first parameter is the URL to download, and the second is the local path to store it under.

This sample downloads the Python official website logo into the current directory; urlretrieve() returns a tuple (filename, headers).
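
A small sketch that captures the returned tuple:

import urllib.request
url = "https://www.python.org/static/img/python-logo.png"
# urlretrieve() returns (filename, headers)
filename, headers = urllib.request.urlretrieve(url, "python-logo.png")
print(filename)                  # local path of the downloaded file
print(headers['Content-Type'])   # e.g. image/png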

Set the proxy IP


import urllib.request
url = "https://www.cnblogs.com/"
proxy_ip = "180.106.16.132:8118"
proxy = urllib.request.ProxyHandler({'http': proxy_ip})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
r = urllib.request.urlopen(url)

Sometimes, when a page is crawled too frequently, the website server blocks the client's IP. In that case, a proxy IP can be set using the method above.

First, find a usable IP from an online proxy-IP site and build a ProxyHandler() object, passing 'http' and the proxy IP as a dictionary to set the proxy server information. Then build the opener object, passing in the proxy and the HTTPHandler class. install_opener() sets the opener globally, so that when urlopen() sends a request it uses the proxy information configured earlier.

Exception handling


import urllib.request
import urllib.error
url = "http://www.balabalabala.org"
try:
 r = urllib.request.urlopen(url)
except urllib.error.URLError as e:
 if hasattr(e, 'code'):
  print(e.code)
 if hasattr(e, 'reason'):
  print(e.reason)

URL-related exceptions can be handled with the URLError class after importing urllib.error. The status code e.code exists only when an HTTPError (a subclass of URLError) occurs, so after catching URLError you need to check whether the exception has a code attribute.
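
Alternatively, HTTPError can be caught before URLError, which avoids the hasattr() checks; a sketch:

import urllib.request
import urllib.error

url = "http://www.balabalabala.org"
try:
    r = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # HTTPError is a subclass of URLError and always carries a status code
    print(e.code, e.reason)
except urllib.error.URLError as e:
    # other failures (DNS errors, refused connections, ...) only have a reason
    print(e.reason)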

The use of Cookie


import urllib.request
import http.cookiejar
url = "http://www.balabalabala.org/"
cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
urllib.request.install_opener(opener)
r = urllib.request.urlopen(url)

Cookies maintain state between sessions when pages are accessed over the stateless HTTP protocol. For example, some websites require logging in: you can log in the first time by submitting a POST form, and when crawling other pages under the same site you can use the Cookie to keep the login state instead of submitting the form every time.

First build a CookieJar() object cjar, then wrap it in an HTTPCookieProcessor() handler, pass that to build_opener() to build the opener object, set it as global with install_opener(), and finally send the request with urlopen(). A combined sketch is shown below.
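
A hedged sketch of logging in once via POST and then reusing the cookies; the login URL and form fields are placeholders, not a real site:

import urllib.request
import urllib.parse
import http.cookiejar

cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
urllib.request.install_opener(opener)

# hypothetical login form; cookies from the response are stored in cjar
postdata = urllib.parse.urlencode({'username': 'xxx', 'password': 'xxxx'}).encode('utf-8')
urllib.request.urlopen(urllib.request.Request("http://www.balabalabala.org/login", postdata))

# later requests through the same (global) opener reuse the stored cookies
r = urllib.request.urlopen("http://www.balabalabala.org/profile")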

Conclusion

