Do you know Python's urllib library?

  • 2021-12-12 05:01:33
  • OfStack

Contents: Function of the urllib library; Basic use of several modules in the urllib library (1. urllib.request module: 1. Functions, 2. Common methods, parameter description); Summary

Function of the urllib library

urllib is the HTTP request library built into Python's standard library. Its high-level interface makes reading data over HTTP and FTP feel much like reading a local file. It is often the first tool reached for when crawling web pages.

Basic use of several modules in the urllib library

1. urllib.request module

1. Functions

The urllib.request module provides the basic means of constructing HTTP requests (and requests for other protocols such as FTP), so it can simulate the request a browser would send. It fetches URLs over several protocols, and some of its interfaces handle basic authentication, HTTP redirection, browser cookies, and so on. These interfaces are provided by handler and opener objects.
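As a rough illustration of how handlers and openers fit together, here is a minimal sketch that builds an opener which keeps cookies between requests. The function name make_cookie_opener is made up for this example; build_opener, HTTPCookieProcessor and CookieJar are standard-library names. No request is sent here.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Hypothetical helper: return an opener that stores cookies in a jar."""
    jar = http.cookiejar.CookieJar()
    # build_opener() adds the default handlers (redirects, HTTP, and so on)
    # and chains in the cookie handler passed here.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_opener()
# The opener is used like urlopen(): opener.open(url) returns the response.
print(type(opener).__name__, len(jar))
```

Calling opener.open(url) instead of urlopen(url) would then send any cookies collected in jar back to the server on later requests.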

2. Common methods

2.1 urlopen() method

Syntax format:


urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Parameter description:

url: the address to open, given as a string or a Request object.

data: the data to submit with a POST request; None by default. When data is not None, urlopen() sends a POST request instead of a GET. The data must be bytes.

timeout: the timeout for accessing the site, in seconds.
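The effect of data and timeout can be tried without touching a real website. The sketch below is only an illustration: it starts a throwaway server on 127.0.0.1 (EchoHandler and demo are names invented for this example) and calls urlopen() once without data and once with it.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that reports the HTTP method it received."""

    def _reply(self, text):
        body = text.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        self._reply("GET")

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = self.rfile.read(length).decode("utf-8")
        self._reply("POST " + form)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def demo():
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    try:
        # No data: urlopen() sends a GET request; give up after 5 seconds.
        with urllib.request.urlopen(url, timeout=5) as resp:
            get_result = resp.read().decode("utf-8")
        # data must be bytes; passing it switches the request to POST.
        payload = urllib.parse.urlencode({"q": "python"}).encode("ascii")
        with urllib.request.urlopen(url, data=payload, timeout=5) as resp:
            post_result = resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    return get_result, post_result


if __name__ == "__main__":
    print(demo())
```

urllib.parse.urlencode() builds the form string, and the encode() call turns it into the bytes that data requires.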

Use case:


import urllib.request  # "from urllib import request" also works; calls are then written request.urlopen(...)

response = urllib.request.urlopen('https://www.baidu.com')
print("Type of the response object:", type(response))
page = response.read()  # the body is returned as bytes
print(page.decode('utf-8'))

Calling urlopen() from the urllib.request module fetches the page directly. The page data is of type bytes, so decode() is used to turn it into a string. The printed type shows that urlopen() returns an HTTPResponse object.
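Not every page is encoded as UTF-8, so a hard-coded decode('utf-8') can fail. One possible approach, sketched here under assumptions, is to read the charset from the Content-Type header and fall back to UTF-8 when none is given. decode_body and make_headers are hypothetical helpers; make_headers only fakes the header object that a real response carries in response.headers, so the sketch runs without a network.

```python
import email
import http.client


def decode_body(raw, headers, fallback="utf-8"):
    """Decode bytes using the charset named in the Content-Type header."""
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


def make_headers(content_type):
    """Hypothetical helper: build an HTTPMessage like response.headers."""
    return email.message_from_string(
        "Content-Type: %s\n\n" % content_type, _class=http.client.HTTPMessage
    )


gbk_page = "\u4f60\u597d".encode("gbk")  # bytes as a GBK-encoded site would send them
print(decode_body(gbk_page, make_headers("text/html; charset=gbk")))
```

With a real response the call would be decode_body(response.read(), response.headers).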

urlopen() returns a file-like object that provides the following methods:

read(), readline(), readlines(), fileno(), close(): these are used in exactly the same way as on file objects.

info(): returns an http.client.HTTPMessage object holding the header information returned by the remote server. The list of HTTP headers can be looked up in a quick reference to HTTP headers.

getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully and 404 means the address was not found. By default urlopen() raises urllib.error.HTTPError for error statuses such as 404, and the code is then available on the exception.

geturl(): returns the actual URL of the page that was fetched. This helps when urlopen() (or an opener object) may have followed a redirect, since the URL of the retrieved page is not necessarily the URL that was requested.
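The behaviour of getcode(), geturl() and info() after a redirect, and what happens on a 404, can be checked against a small local server. This is a sketch under assumptions: RedirectHandler, inspect_responses and the /old and /new paths are invented for the example.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: /old redirects to /new, other paths are 404."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/new":
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def inspect_responses():
    server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    try:
        # The redirect is followed automatically; geturl() shows where we ended up.
        with urllib.request.urlopen(base + "/old", timeout=5) as resp:
            found = (resp.getcode(),
                     resp.geturl() == base + "/new",
                     resp.info().get_content_type())
        # An error status raises HTTPError instead of returning a response.
        try:
            urllib.request.urlopen(base + "/missing", timeout=5)
            missing = None
        except urllib.error.HTTPError as err:
            missing = err.code
    finally:
        server.shutdown()
        server.server_close()
    return found, missing


if __name__ == "__main__":
    print(inspect_responses())
```

The request goes to /old, but geturl() reports /new, which is why the retrieved URL should not be assumed to equal the requested one.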

Example:


import urllib.request

response = urllib.request.urlopen('https://python.org/')
print("Type of the response:", type(response))
print("Response object:", response)
print("Header information 1 (HTTP headers):\n", response.info())
print("Header information 2 (HTTP headers):\n", response.getheaders())
print("Value of the Server header:", response.getheader("Server"))
print("Response status 1 (HTTP status):\n", response.status)
print("Response status 2 (HTTP status):\n", response.getcode())
print("URL of the response:\n", response.geturl())
page = response.read()
print("Page source:", page.decode('utf-8'))

2.2 Request() class

Use Request() to wrap the request, then pass it to urlopen() to fetch the page.

Syntax format:


urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Example:


from urllib import request  # the code below refers to request.Request and request.urlopen

url = "https://www.lagou.com/zhaopin/Python/?labelWords=label"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36',
    'Referer': 'https://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
req = request.Request(url, headers=headers)  # wrap the URL and headers
page = request.urlopen(req).read()           # send the request and read the body as bytes
page = page.decode('utf-8')
print(page)

Parameter description:

User-Agent: this header can carry the browser name and version, the operating system name and version, and the default language. The value can be copied from a request shown in the browser's developer tools (press F12 to open them). It is used to make the request look like it comes from a browser.

Referer: servers can use it to prevent hotlinking. Some sites serve their images only when the request shows that it came from their own pages, such as https://***.com, and they identify the source by checking Referer.

Connection: states whether the connection should stay open after the request; keep-alive asks for a persistent connection.

origin_req_host: the host name or IP address of the requester.

unverifiable: indicates whether the request is unverifiable; defaults to False. An unverifiable request is one that the user had no option to approve. For example, when an image in an HTML document is fetched automatically and the user had no chance to approve that fetch, unverifiable should be True.

method: the HTTP method the request uses, such as GET, POST, or PUT.
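A Request object can be inspected before it is sent, which is a convenient way to see what these parameters do. The sketch below sends nothing over the network; the URL, header values and the build_request helper are placeholders made up for the example.

```python
import urllib.request

URL = "https://example.com/api/items"  # placeholder URL, not from the article


def build_request():
    """Hypothetical helper: a PUT request with browser-like headers."""
    headers = {
        "User-Agent": "Mozilla/5.0 (demo)",
        "Referer": "https://example.com/",
    }
    return urllib.request.Request(
        URL, data=b'{"name": "demo"}', headers=headers, method="PUT"
    )


req = build_request()
print(req.get_method())              # the explicit method overrides the POST default
print(req.get_header("User-agent"))  # names are stored via str.capitalize()
print(req.origin_req_host)           # defaults to the host taken from the URL
print(req.unverifiable)              # False unless set explicitly

# Without method=, the presence of data decides between GET and POST.
print(urllib.request.Request(URL).get_method())
print(urllib.request.Request(URL, data=b"x").get_method())
```

Note that Request normalises header names with str.capitalize(), so the header passed as User-Agent is looked up as User-agent.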

Reference: https://www.ofstack.com/article/209542.htm

Summary

That is all for this article. I hope it helps, and I hope you will keep following the site for more content.

