The basics of using urllib for a Python crawler

  • 2021-09-05 00:36:06
  • OfStack

1. Relationship between urllib and urllib2

In Python 2, urllib and urllib2 were used together. Python 3 restructured them and split the functionality into several sub-modules: urllib.request, urllib.parse, urllib.error and urllib.robotparser. This architecture is more reasonable in logic and structure. The urllib library does not need to be installed separately; it ships with Python 3, which merged the urllib and urllib2 libraries into the single urllib package.

urllib2.urlopen() becomes urllib.request.urlopen()
urllib2.Request() becomes urllib.request.Request()
cookielib in Python 2 is replaced by http.cookiejar
import http.cookiejar instead of import cookielib
urljoin now corresponds to urllib.parse.urljoin
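
A minimal sketch of the corresponding imports and calls in Python 3 (the target URL is only a placeholder):


# Python 3 equivalents of the old Python 2 names
from urllib import request, parse
import http.cookiejar

# urllib2.urlopen() -> urllib.request.urlopen()
response = request.urlopen('http://www.baidu.com/')

# urllib2.Request() -> urllib.request.Request()
req = request.Request('http://www.baidu.com/')

# urlparse.urljoin() -> urllib.parse.urljoin()
full_url = parse.urljoin('http://www.baidu.com/', '/s')

# cookielib.CookieJar() -> http.cookiejar.CookieJar()
cookie = http.cookiejar.CookieJar()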

2. urllib Library under python3

request is the most basic HTTP request module. We can use it to simulate sending a request: just pass the URL and any extra parameters to its methods.
error is the exception handling module. If a request fails, we can catch the exception and then retry or take other action so that the program does not terminate unexpectedly.
parse is a tool module that provides many URL processing methods, such as splitting, parsing and joining.
robotparser is mainly used to parse a website's robots.txt file and decide which pages may be crawled and which may not; in practice it is rarely used.
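
A short sketch of the error and parse sub-modules in action (the URLs are only placeholders; the exact error you get depends on the site):


from urllib import request, parse, error

# parse: split, join and encode URLs
parts = parse.urlsplit('http://www.baidu.com/s?wd=python')
print(parts.netloc, parts.path, parts.query)
print(parse.urljoin('http://www.baidu.com/a/b', 'c'))
print(parse.urlencode({'wd': 'python'}))

# error: catch request failures so the crawler does not crash
try:
    response = request.urlopen('http://www.baidu.com/some-missing-page')
except error.HTTPError as e:      # HTTP status errors such as 404 or 500
    print('HTTP error:', e.code, e.reason)
except error.URLError as e:       # lower-level failures, e.g. DNS or refused connection
    print('URL error:', e.reason)
else:
    print(response.getcode())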

3. Basic classes for request

(1) request.urlopen

The main parameter of the urlopen method is the URL of the target site, which can be a str or a Request object.

A GET request looks like this:


from urllib import request, parse

response = request.urlopen('http://www.baidu.com/')

For a POST request, add the data parameter (built from a dictionary). It must be in byte-stream (bytes) format; a dictionary can be encoded with urllib.parse.urlencode() and converted with bytes(). If the data parameter is omitted, the request defaults to GET.


from urllib import request, parse

url = "http://www.baidu.com/"
wd = {'wd': 'Whoa, haha, ha'}
data = bytes(parse.urlencode(wd), 'utf-8')
response = request.urlopen(url, data=data)

(2) request.Request

Because urlopen() alone cannot add header information such as User-Agent or Cookie, we need to build a Request object. Constructing this object separates the request into its own data structure and also makes the configurable parameters richer and more flexible. The main parameters are:

url: the request URL. It is required; all the other parameters are optional.
data: if passed, must be of bytes (byte-stream) type. A dictionary can be encoded first with urlencode() from the urllib.parse module.
headers: a dictionary containing the Request Headers. It can be passed directly when constructing the Request, or headers can be added later by calling add_header() on the Request instance. The most common use is to disguise the crawler as a browser by changing User-Agent; the default User-Agent is Python-urllib.
origin_req_host: the host name or IP address of the requester.
unverifiable: whether the request is unverifiable; the default is False. It means the user does not have sufficient permission to choose to receive the result of this request. For example, if we request an image embedded in an HTML document but have no permission to fetch it automatically, unverifiable is True.
method: a string indicating the HTTP method of the request, such as GET, POST or PUT.
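
A minimal sketch that builds a Request with these parameters (the URL, headers and form data are placeholders):


from urllib import request, parse

url = 'http://www.baidu.com/'
headers = {'User-Agent': 'Mozilla/5.0 (compatible; example)'}
form = {'wd': 'python'}
data = bytes(parse.urlencode(form), 'utf-8')

# url is required; data, headers and method are optional
req = request.Request(url=url, data=data, headers=headers, method='POST')
# headers can also be added after construction
req.add_header('Referer', 'http://www.baidu.com/')

response = request.urlopen(req)
print(response.getcode())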

Select a User-Agent at random with random.choice():


import random

UA_LIST = [
  'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
  'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)',
  'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)',
  'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
  'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; Acoo Browser; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; Avant Browser)',
  'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
  'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; Maxthon; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
  'Mozilla/4.0 (compatible; Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729); Windows NT 5.1; Trident/4.0)',
  'Mozilla/4.0 (compatible; Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6; Acoo Browser; .NET CLR 1.1.4322; .NET CLR 2.0.50727); Windows NT 5.1; Trident/4.0; Maxthon; .NET CLR 2.0.50727; .NET CLR 1.1.4322; InfoPath.2)',
  'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser; GTB6; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)'
]

# Randomly pick a User-Agent
user_agent = random.choice(UA_LIST)

Method 1 for adding headers: pass them when constructing the Request:


url = 'http://www.baidu.com/'
user_agent = random.choice(UA_LIST)
headers = {
  'User-Agent': user_agent
}
req = request.Request(url=url,headers=headers)
response = request.urlopen(req)

Method 2 for adding headers: call add_header() on the Request instance:


url='http://www.baidu.com'
headers = {
  'User-Agent': user_agent
}
# Method 2 of adding a User-Agent
req = request.Request(url)
# Add the User-Agent to the request
req.add_header("User-Agent",user_agent)
# When reading the header back from the Request object, the 'a' in "User-agent" must be lowercase
print(req.get_header("User-agent"))
response = request.urlopen(req)
print(response.read().decode('utf-8'))

4. Advanced classes for request

The BaseHandler class in the urllib.request module is the parent class of all other handlers. Handlers process things such as login authentication, cookies, proxy settings and redirects. BaseHandler provides the following attributes and methods, used directly or by derived classes:

add_parent(director): adds the director as the parent
close(): removes its parents
parent: an OpenerDirector that can be used to open with a different protocol or to handle errors
default_open(req): catches all URLs; it is called before any protocol-specific open

Subclasses of Handler include:

HTTPDefaultErrorHandler: handles HTTP response errors by raising an HTTPError exception
HTTPRedirectHandler: handles redirects
HTTPCookieProcessor: handles cookies
ProxyHandler: sets a proxy; by default no proxy is used
HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords
HTTPBasicAuthHandler: manages authentication; if authentication is required when a link is opened, it can be used to provide the credentials
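
Handlers are combined into an opener with request.build_opener(); install_opener() makes that opener the default for urlopen(). A minimal sketch, assuming a placeholder proxy address:


import http.cookiejar
from urllib import request

# Combine several handlers into one opener
cookie_handler = request.HTTPCookieProcessor(http.cookiejar.CookieJar())
proxy_handler = request.ProxyHandler({"http": "191.96.42.80:3128"})  # placeholder proxy

opener = request.build_opener(cookie_handler, proxy_handler)

# Use the opener directly ...
response = opener.open("http://www.baidu.com/")

# ... or install it globally so request.urlopen() uses it too
request.install_opener(opener)
response = request.urlopen("http://www.baidu.com/")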

(1) ProxyHandler

If a crawler needs to fetch a lot of data from a website, it should use a proxy to avoid being blocked. Build an opener object with the request.build_opener() method; a proxy is added as follows:


from urllib import request

# Proxy switch: whether the proxy is enabled
proxyswitch = True

# Build a handler object; the parameter is a dict mapping the proxy type to the proxy server IP:PORT
proxyhandler = request.ProxyHandler({"http":"191.96.42.80:3128"})
# If it is a proxy with user name and password, the format is {"http":"username:passwd@191.96.42.80:3128"}

# Handler object without a proxy
nullproxyhandler = request.ProxyHandler()

if proxyswitch:
  opener = request.build_opener(proxyhandler)
else:
  opener = request.build_opener(nullproxyhandler)

req = request.Request("http://www.baidu.com/")

response = opener.open(req)

print(response.read().decode("utf-8"))

(2) ProxyBasicAuthHandler

Use a password manager to authenticate against a proxy server:


from urllib import request
# Password management for the proxy; it can also manage server account passwords

# Account password 
user = "username"
passwd = "passwd"

# Proxy server 
proxyserver = "1.1.1.1:9999"

# Build a password management object to save the user name and password to be processed 
passmgr = request.HTTPPasswordMgrWithDefaultRealm()

# Add the account information; the first parameter, realm, is the domain information of the remote server
passmgr.add_password(None,proxyserver,user,passwd)

# Build the ProxyBasicAuthHandler object
proxyauth_handler = request.ProxyBasicAuthHandler(passmgr)

opener = request.build_opener(proxyauth_handler)

req = request.Request("http://www.baidu.com/")

response = opener.open(req)

(3) HTTPBasicAuthHandler

Use a password manager for web (HTTP basic) authentication login:


# Web authentication
from urllib import request

test = "test"
passwd = "123456"

webserver = "1.1.1.1"

# Build a password manager
passwdmgr = request.HTTPPasswordMgrWithDefaultRealm()
# Add password information 
passwdmgr.add_password(None,webserver,test,passwd)

# HTTP basic authentication handler
http_authhandler = request.HTTPBasicAuthHandler(passwdmgr)

opener = request.build_opener(http_authhandler)

req = request.Request("http://"+webserver)

response = opener.open(req)

5. Cookie processing

A cookie handler object is built from request.HTTPCookieProcessor and a CookieJar from http.cookiejar to process cookie information:


import http.cookiejar
from urllib import request,parse
# Simulate a login: first POST the account and password,
# then keep the cookie that is generated

# Build a CookieJar object with the CookieJar class; it is used to store the cookie values
cookie = http.cookiejar.CookieJar()

# Build a cookie handler object to process the cookies
cookie_handler = request.HTTPCookieProcessor(cookie)

# Build a custom opener
opener = request.build_opener(cookie_handler)

# HTTP headers can be added through the custom opener's addheaders attribute
opener.addheaders = [("User-Agent","Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)"),]

# Login URL
url = 'http://www.renren.com/PLogin.do'

# Account and password used to log in
data = {
  "email": "your renren account",
  "password": "your password"
}
# Encode the form data
data = bytes(parse.urlencode(data),'utf-8')
# The first request is a POST with the account and password, to obtain the cookie
req = request.Request(url,data=data)
# Send the first POST request; this generates the logged-in cookie
response = opener.open(req)

print(response.read().decode("utf-8"))

# At this point the opener already carries the cookie for this site; with this opener
# we can visit other pages on the site directly without logging in again
opener.open("http://www.renren.com/PLogin.doxxxxxxxxxxxxx")
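
If the cookie should survive across runs, http.cookiejar also provides MozillaCookieJar, which can save cookies to a file and load them back. A minimal sketch, assuming a placeholder file name cookie.txt:


import http.cookiejar
from urllib import request

# MozillaCookieJar persists cookies in the Mozilla cookies.txt format
cookie = http.cookiejar.MozillaCookieJar('cookie.txt')
opener = request.build_opener(request.HTTPCookieProcessor(cookie))

response = opener.open('http://www.baidu.com/')
# Keep session cookies and expired cookies as well
cookie.save(ignore_discard=True, ignore_expires=True)

# Later, load the saved cookies back into a new jar
cookie2 = http.cookiejar.MozillaCookieJar()
cookie2.load('cookie.txt', ignore_discard=True, ignore_expires=True)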

The above covers the basics of using urllib for a Python crawler. For more information about Python crawlers and urllib, please see the other related articles on this site!

