Python crawler tutorial: the elegant HTTP library requests (II)

  • 2020-06-01 10:19:19
  • OfStack

Preface

urllib, urllib2, urllib3, httplib, and httplib2 are all Python modules related to HTTP. Their names are anything but intuitive, and worse, these modules behave very differently between Python 2 and Python 3.

Fortunately, there is an amazing HTTP library called requests, one of the most popular Python projects on GitHub. requests was written by Kenneth Reitz.

requests implements most features of the HTTP protocol, including Keep-Alive, connection pooling, Cookie persistence, automatic content decoding, HTTP proxies, SSL verification, connection timeouts, Sessions, and more. Most importantly, it is compatible with both Python 2 and Python 3. Installation is a single pip command: pip install requests

Sending a request


>>> import requests
# GET request
>>> response = requests.get('https://foofish.net')

Response content

The value returned by a request is a Response object, which encapsulates the response data the server returns to the browser under the HTTP protocol. The main elements of a response are: status code, reason phrase, response headers, and response body, and all of these are exposed as attributes of the Response object.


# Status code
>>> response.status_code
200

# Reason phrase
>>> response.reason
'OK'

# Response headers
>>> for name, value in response.headers.items():
...     print("%s:%s" % (name, value))
...
Content-Encoding:gzip
Server:nginx/1.10.2
Date:Thu, 06 Apr 2017 16:28:01 GMT

# Response body
>>> response.content
'<html><body>...10,000 characters omitted...</body></html>'

In addition to GET requests, requests supports all the other methods in the HTTP specification, including POST, PUT, DELETE, HEAD, and OPTIONS.


>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

Query parameters

Many URLs carry a long string of parameters, known as query parameters. They are appended to the URL after a "?", with multiple parameters separated by "&", for example: http://fav.foofish.net/?p=4&s=20. With requests you can build the query parameters from a dictionary:


>>> args = {"p": 4, "s": 20}
>>> response = requests.get("http://fav.foofish.net", params = args)
>>> response.url
'http://fav.foofish.net/?p=4&s=20'

Request headers

requests makes it easy to set request header fields. For example, you sometimes want to specify a User-Agent to disguise the request as coming from a browser and fool the server. Simply pass a dictionary to the headers keyword argument.


>>> r = requests.get('https://foofish.net', headers={'user-agent': 'Mozilla/5.0'})

Request body

requests is very flexible about building the data for a POST request. If the server expects form data, specify the data keyword argument; to send a JSON-formatted string, use the json keyword argument. In both cases the value can be passed as a dictionary.

Sent to the server as form data:


>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post("http://httpbin.org/post", data=payload)

Sent to the server as a JSON string:


>>> url = 'http://httpbin.org/post'
>>> payload = {'some': 'data'}
# requests serializes the dictionary itself, no json module needed
>>> r = requests.post(url, json=payload)

Response body

An important part of the HTTP response message is the response body, which requests handles very flexibly. The related Response attributes are: content, text, and json().

content is of type bytes and is suitable for saving content directly to the file system or sending it over the network.


>>> r = requests.get("https://pic1.zhimg.com/v2-2e92ebadb4a967829dcd7d05908ccab0_b.jpg")
>>> type(r.content)
<class 'bytes'>
#  Save as  test.jpg
>>> with open("test.jpg", "wb") as f:
...  f.write(r.content)

text is of type str. For an ordinary HTML page, use text when the content needs further analysis.


>>> r = requests.get("https://foofish.net/understand-http.html")
>>> type(r.text)
<class 'str'>
>>> re.compile('xxx').findall(r.text)

If you crawl data from a third-party open platform or an API and the content returned is in JSON format, you can call the json() method directly; it returns the object produced by json.loads().


>>> r = requests.get('https://www.v2ex.com/api/topics/hot.json')
>>> r.json()
[{'id': 352833, 'title': 'In Changsha, living with my parents ...

Proxy settings

When a crawler sends frequent requests to a server, it can easily get blocked, so if you want to keep crawling smoothly, using a proxy is a wise choice. A proxy can likewise solve the problem of crawling data from outside the firewall. requests supports proxies well. Here I use a local ShadowSocks proxy (SOCKS proxy support is installed with pip install requests[socks]).


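A minimal sketch of routing requests through the proxies keyword argument, assuming a local ShadowSocks SOCKS5 proxy listening on 127.0.0.1:1080 (a common default; adjust the host and port to your own setup):

import requests

# Assumed local ShadowSocks SOCKS5 proxy; change host/port to match your setup
proxies = {
    'http': 'socks5://127.0.0.1:1080',
    'https': 'socks5://127.0.0.1:1080',
}
r = requests.get('https://foofish.net', proxies=proxies)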

Timeout

When requests sends a request, the thread blocks by default and will not continue with any further logic until a response comes back. If the server stops responding, the problem escalates: the whole application hangs, unable to process other requests.


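To see the problem, httpbin.org's /delay endpoint (which waits the given number of seconds before answering) makes a convenient stand-in for an unresponsive server; a minimal sketch:

>>> import requests
# The server waits 10 seconds before responding. With no timeout set,
# the call blocks for the entire wait -- or forever, if the server
# never answers at all.
>>> r = requests.get('http://httpbin.org/delay/10')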

The correct approach is to explicitly specify a timeout for each request.


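A minimal sketch, again using httpbin's /delay endpoint: pass the timeout keyword argument (in seconds), and requests raises requests.exceptions.Timeout instead of blocking once the limit is exceeded:

>>> import requests
>>> try:
...     r = requests.get('http://httpbin.org/delay/10', timeout=2)  # give up after 2 seconds
... except requests.exceptions.Timeout:
...     print('request timed out')
...
request timed out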

Session

HTTP is a stateless protocol. To maintain the state of communication between client and server across requests, Cookie technology is used.

Some pages can only be crawled after logging in. In a browser, the principle is: the first time you log in with a username and password, the server sends the client a random Cookie; the next time the browser requests another page, it sends that cookie along with the request, so the server knows the user is already logged in.


import requests

# Build a session
session = requests.Session()
# Log in first (login_url, username and password are placeholders)
session.post(login_url, data={'username': username, 'password': password})
# URLs that require login can now be accessed
r = session.get(home_url)
session.close()

After a session is created, the client's first request logs in to the account, and the server automatically saves the cookie information in the session object. On the second request, requests automatically sends the session's cookie to the server, keeping the communication state.
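
You can watch this cookie persistence in action against httpbin.org, whose /cookies/set endpoint sets a cookie and whose /cookies endpoint echoes back the cookies it receives:

>>> import requests
>>> s = requests.Session()
# The first request sets a cookie, which the Session stores
>>> s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
# The second request automatically carries that cookie back to the server
>>> r = s.get('http://httpbin.org/cookies')
>>> r.json()
{'cookies': {'sessioncookie': '123456789'}}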

A practical project

Finally, a hands-on project: in the next article I will explain how to use requests to log in to Zhihu automatically and send private messages to users.
