Some common uses of the urllib library in Python 2 and 3
- 2020-06-19 10:40:02
- OfStack
What is the Urllib library
Urllib is a module provided by Python for working with URLs. We often need this library when crawling web pages.
After the Python 3 upgrade merged urllib and urllib2, the locations of functions within the package changed considerably.
urllib library reference table
Python2.X | Python3.X
urllib | urllib.request, urllib.error, urllib.parse
urllib2 | urllib.request, urllib.error
urllib2.urlopen | urllib.request.urlopen
urllib.urlencode | urllib.parse.urlencode
urllib.quote | urllib.parse.quote
urllib2.Request | urllib.request.Request
urlparse | urllib.parse
urllib.urlretrieve | urllib.request.urlretrieve
urllib2.URLError | urllib.error.URLError
cookielib.CookieJar | http.cookiejar.CookieJar
The urllib library is used to work with URLs and crawl web pages in Python; third-party libraries such as requests and httplib2 serve a similar purpose.
In Python 2.X there are separate urllib and urllib2 modules, but in Python 3.X everything is consolidated into urllib. The table above lists the common changes; based on it, you can quickly port a program between the two versions.
Python 3.X also handles Chinese more gracefully than Python 2.X, so this post goes on to introduce some common uses of the urllib library under Python 3.X.
Send the request
import urllib.request
r = urllib.request.urlopen("http://www.python.org/")
First import the urllib.request module, then use urlopen() to send a request to the URL given as its argument; it returns an http.client.HTTPResponse object.
In urlopen(), the timeout parameter sets the number of seconds to wait before giving up on a response. In addition, r.info(), r.getcode(), and r.geturl() return the response header information, the status code, and the URL of the current page, respectively.
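To make these calls concrete, here is a minimal sketch; the throwaway local http.server stands in for a real site so the example runs offline, and is not part of the original article's code.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A throwaway local server stands in for a real site so the example runs offline.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"<html>hello</html>")

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:{}/".format(server.server_port)

# timeout: stop waiting for a response after 5 seconds
r = urllib.request.urlopen(url, timeout=5)
print(r.getcode())   # status code, e.g. 200
print(r.geturl())    # URL of the current page
print(r.info())      # response headers (an email.message.Message)
server.shutdown()
```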
Read the response
import urllib.request
url = "http://www.python.org/"
with urllib.request.urlopen(url) as r:
    r.read()
Use r.read() to read the response body into memory. The content is the page's source code (the same thing the browser's "view page source" feature shows), returned as bytes that can be decoded into a string with the appropriate decode() call.
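A runnable sketch of the read-and-decode step; the small local server and its UTF-8 page are stand-ins so the example works without network access.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Serve a small UTF-8 page locally so read()/decode() can be shown offline.
PAGE = "<html><body>\u4f60\u597d urllib</body></html>".encode("utf-8")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen("http://127.0.0.1:{}/".format(server.server_port)) as r:
    raw = r.read()                                 # bytes
    charset = r.headers.get_content_charset() or "utf-8"
    html = raw.decode(charset)                     # str

server.shutdown()
print(html)
```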
Pass the URL parameter
import urllib.request
import urllib.parse
params = urllib.parse.urlencode({'q': 'urllib', 'check_keywords': 'yes', 'area': 'default'})
url = "https://docs.python.org/3/search.html?{}".format(params)
r = urllib.request.urlopen(url)
Encode a dictionary of strings with urlencode() and pass the result as the URL's query string. The encoded params is a single string in which the dictionary's key=value pairs are joined by '&':
The built URL: https://docs.python.org/3/search.html?q=urllib&check_keywords=yes&area=default
Of course, urlopen() also accepts a directly built URL; for a simple GET request you can skip the urlencode() step and construct the query string by hand before sending the request. The approach above simply makes the code more modular and elegant.
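As a quick check of what urlencode() actually produces, this sketch builds the query string without sending any request:

```python
import urllib.parse

params = urllib.parse.urlencode(
    {'q': 'urllib', 'check_keywords': 'yes', 'area': 'default'})
print(params)  # q=urllib&check_keywords=yes&area=default
url = "https://docs.python.org/3/search.html?{}".format(params)
print(url)
```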
Passing Chinese parameters
import urllib.request
import urllib.parse
searchword = urllib.parse.quote(input("Please enter the keyword to query: "))
url = "https://cn.bing.com/images/async?q={}&first=0&mmasync=1".format(searchword)
r = urllib.request.urlopen(url)
The URL uses Bing's image interface to query images for the keyword q. Passing Chinese directly in a URL request causes an encoding error, so we need quote() to URL-encode the Chinese keyword; it can be decoded again with the corresponding unquote().
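The quote()/unquote() round trip can be seen without any network request; the Chinese keyword below is a made-up example standing in for user input:

```python
import urllib.parse

word = "Python 爬虫"  # a fixed keyword standing in for input()
encoded = urllib.parse.quote(word)
print(encoded)                         # Python%20%E7%88%AC%E8%99%AB
print(urllib.parse.unquote(encoded))   # decodes back to the original
url = "https://cn.bing.com/images/async?q={}&first=0&mmasync=1".format(encoded)
```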
Custom request header
import urllib.request
url = 'https://docs.python.org/3/library/urllib.request.html'
headers = {
'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
'Referer': 'https://docs.python.org/3/library/urllib.html'
}
req = urllib.request.Request(url, headers=headers)
r = urllib.request.urlopen(req)
Sometimes when crawling a page, a 403 (Forbidden) error occurs, meaning access is denied. This happens because the web server checks the visitor's Headers. For example, requests sent through the urllib library carry Python-urllib/X.Y as the default User-Agent, where X is Python's major version and Y its minor version. So we need to build a Request object with urllib.request.Request() and pass in a headers dictionary to simulate a browser.
The corresponding Headers information can be obtained from the browser's developer tools (the "Inspect" feature, then the "Network" tab) while viewing the page, or with packet-capture software such as Fiddler or Wireshark.
Besides the method above, you can also use urllib.request.build_opener() or req.add_header() to customize request headers, as shown in the official examples.
In Python 2.X, the urllib and urllib2 modules were usually used together, because urllib.urlencode() can encode URL parameters, while urllib2.Request() builds the Request object and customizes the request header; urllib2.urlopen() then sends the request.
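Both alternatives can be sketched offline; no request is sent, and the example User-Agent string here is made up:

```python
import urllib.request

url = 'https://docs.python.org/3/library/urllib.request.html'

# Alternative 1: attach a header to an existing Request object.
req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (example)')
print(req.get_header('User-agent'))  # header keys are stored capitalized

# Alternative 2: an opener whose addheaders apply to every request it sends.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (example)')]
```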
Pass the POST request
import urllib.request
import urllib.parse
url = 'https://passport.cnblogs.com/user/signin?'
post = {
    'username': 'xxx',
    'password': 'xxxx'
}
postdata = urllib.parse.urlencode(post).encode('utf-8')
req = urllib.request.Request(url, postdata)
r = urllib.request.urlopen(req)
When we register, log in, and so on, we pass information through the POST form.
At this point, we need to analyze the page structure, build the form data post, encode it with urlencode() (which returns a string), and then encode that string as 'utf-8' bytes, since the POST data can only be bytes or a file object. Finally, pass postdata to the Request() object and send the request with urlopen().
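One detail worth verifying: once a Request carries a data argument, urllib treats it as a POST. This sketch builds (but does not send) the request:

```python
import urllib.parse
import urllib.request

post = {'username': 'xxx', 'password': 'xxxx'}
postdata = urllib.parse.urlencode(post).encode('utf-8')  # must be bytes
req = urllib.request.Request('https://passport.cnblogs.com/user/signin', postdata)
print(req.get_method())  # POST
```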
Download remote data locally
import urllib.request
url = "https://www.python.org/static/img/python-logo.png"
urllib.request.urlretrieve(url, "python-logo.png")
When crawling images, videos, and other remote data, urlretrieve() can download them locally.
The first parameter is the URL to download, and the second is the local path to save it under.
This sample downloads the Python official website logo to the current directory; the call returns a tuple (filename, headers).
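The returned (filename, headers) tuple can be demonstrated offline; here a temporary file served over a file:// URL stands in for the remote image:

```python
import os
import tempfile
import urllib.request

# A temporary file stands in for the remote image so no network is needed.
src = os.path.join(tempfile.mkdtemp(), "logo.png")
with open(src, "wb") as f:
    f.write(b"\x89PNG fake image bytes")

dest = src + ".copy"
filename, headers = urllib.request.urlretrieve(
    "file://" + urllib.request.pathname2url(src), dest)
print(filename)                   # the local path the data was saved to
print(headers["Content-Length"])  # size reported in the headers
```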
Set the agent IP
import urllib.request
url = "https://www.cnblogs.com/"
proxy_ip = "180.106.16.132:8118"
proxy = urllib.request.ProxyHandler({'http': proxy_ip})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
r = urllib.request.urlopen(url)
Sometimes, crawling a website too frequently gets your IP blocked by its server. In that case, a proxy IP can be set with the method above.
First, find a usable IP on an online proxy-IP site and build a ProxyHandler() object, passing in 'http' and the proxy IP as a dictionary to set the proxy server information. Then build the opener object, passing in the proxy handler and the HTTPHandler class. Finally, install_opener() sets the opener globally, so that when urlopen() sends a request, it uses the proxy information set earlier.
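The handler setup can be inspected without contacting the (made-up) proxy address; note that an 'https' entry is also needed if the target URL uses https:

```python
import urllib.request

proxy_ip = "180.106.16.132:8118"  # a made-up address for illustration
proxy = urllib.request.ProxyHandler({'http': proxy_ip, 'https': proxy_ip})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
print(proxy.proxies)  # the per-scheme mapping the handler will consult
# urllib.request.install_opener(opener)  # would make it global; or use opener.open(url)
```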
Exception handling
import urllib.request
import urllib.error
url = "http://www.balabalabala.org"
try:
    r = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)
The URLError class can be used to handle URL-related exceptions. After importing urllib.error and catching URLError, note that the status code e.code exists only for HTTPError (a subclass of URLError), which is why the code first checks whether the exception has a code attribute.
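Since HTTPError is a subclass of URLError, catching the subclass first is often cleaner than the hasattr() check. A throwaway local server that always answers 404 makes the behaviour reproducible offline:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A local server that always answers 404, so the HTTPError branch is reachable.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_error(404, "Not Found")

    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    urllib.request.urlopen("http://127.0.0.1:{}/missing".format(server.server_port))
except urllib.error.HTTPError as e:   # subclass first: it has .code
    status, reason = e.code, e.reason
except urllib.error.URLError as e:    # e.g. DNS failure: only .reason
    status, reason = None, e.reason

server.shutdown()
print(status, reason)  # 404 Not Found
```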
The use of Cookie
import urllib.request
import http.cookiejar
url = "http://www.balabalabala.org/"
cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
urllib.request.install_opener(opener)
r = urllib.request.urlopen(url)
Cookies maintain state between sessions when a site is accessed over the stateless HTTP protocol. For example, some websites require a login: you can log in once by submitting a POST form, and then use the Cookie to stay logged in while crawling other pages on the site, instead of submitting the form every time.
First, build a CookieJar() object cjar, wrap it in an HTTPCookieProcessor() handler, and pass that handler to build_opener() to build the opener object. Set the opener globally with install_opener(), then send the request with urlopen().
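The cookie round trip can be shown end to end with a throwaway local server that sets a session cookie (the cookie name and value here are invented):

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A local server that sets a session cookie, standing in for a login page.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
opener.open("http://127.0.0.1:{}/".format(server.server_port))
server.shutdown()

for cookie in cjar:
    print(cookie.name, cookie.value)  # the jar now holds the session cookie
```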
Conclusion
The examples above cover the most common uses of urllib in Python 3: sending requests, passing ordinary and Chinese parameters, customizing request headers, POST forms, downloading files, setting proxies, handling exceptions, and using cookies.