Quickly increase blog views with a Python crawler and proxy IPs


Foreword

The goal is not really what the title says; it is mainly to understand websites' anti-crawling mechanisms in more detail. If you genuinely want to increase your blog's readership, quality content is essential.

Understanding websites' anti-crawling mechanisms

In general, websites defend against crawlers in the following ways:

1. Headers-based anti-crawling

Checking the request Headers sent by the user is the most common anti-crawler strategy. Many websites check the User-Agent in the Headers, and some also check the Referer (the anti-hotlinking on some resource sites works by checking the Referer).

If you run into this kind of anti-crawler mechanism, you can add Headers to the crawler directly: copy the browser's User-Agent into the crawler's Headers, or set the Referer to the domain name of the target site. For Headers-based detection, modifying or adding Headers in the crawler is usually enough to bypass it.
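
As a minimal sketch (the URL below is a placeholder, and the User-Agent string is just one example; copy a real one from your own browser's developer tools), adding such Headers with urllib looks like this:

from urllib import request

url = 'http://www.example.com/'  # placeholder target page
headers = {
    # A User-Agent copied from a real browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
    # Pretend the request came from the target site itself
    'Referer': 'http://www.example.com/',
}
req = request.Request(url, headers=headers)
html = request.urlopen(req).read().decode('utf-8')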

2. Anti-crawling based on user behavior

Other websites detect user behavior, such as the same IP visiting the same page many times within a short period, or the same account performing the same operation many times within a short period.

Most sites fall into the first case, which can be handled with proxy IPs. We could check proxy IPs once and save them to a file, but that approach is not advisable because proxy IPs go stale quickly, so it is better to scrape them in real time from a site that specializes in publishing proxy IPs.

For the second case, you can wait a random interval of a few seconds after each request before making the next one. On some sites with logic flaws, you can get around the restriction that the same account cannot make the same request repeatedly within a short period by requesting a few times, logging out, logging back in, and continuing.

There are also cookies: checking cookies to verify that the visitor is a valid user is a technique often used by sites that require a login. Going a step further, some websites' login verification is dynamically updated. For example, when logging in to twitcool, an authenticity_token used for login verification is randomly generated, and it must be sent back to the server together with the username and password the user submits.
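
As a rough sketch of the cookie side of this (the login URL and form fields below are hypothetical, not from the article), urllib can keep cookies across requests with an HTTPCookieProcessor:

from http import cookiejar
from urllib import request, parse

cookie_jar = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookie_jar))

# The login form typically also needs a server-issued token (such as the
# authenticity_token mentioned above) scraped from the login page first.
login_data = parse.urlencode({
    'username': 'user',             # placeholder
    'password': 'pass',             # placeholder
    'authenticity_token': 'token',  # placeholder, scraped from the login page
}).encode('utf-8')

opener.open('http://www.example.com/login', data=login_data)   # hypothetical login URL
html = opener.open('http://www.example.com/protected').read()  # cookies are sent automatically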

3. Anti-crawling based on dynamic pages

Sometimes you crawl the target page only to find that the key content is blank and only the framework code is there. That is because the site's content is returned dynamically through XHR requests made by the page. The solution is to analyze the site's traffic with developer tools (such as FireBug), find the request that returns the actual content (for example, JSON), crawl that request instead, and extract the required content from it.
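
A minimal sketch of that approach (the endpoint URL and the 'results' field below are hypothetical; the real ones come from the Network panel of your developer tools):

import json
from urllib import request

# Endpoint discovered in the developer tools (hypothetical)
api_url = 'http://www.example.com/api/list?page=1'
req = request.Request(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(request.urlopen(req).read().decode('utf-8'))

for item in data.get('results', []):  # 'results' is a hypothetical field name
    print(item)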

A more complex variant is when the dynamic requests are encrypted and the parameters cannot be worked out or captured. In that case you can use Mechanize or Selenium RC to drive a real browser kernel, just as if you were really browsing the site. This maximizes the crawl's success rate, but efficiency takes a hit: in the author's test, crawling 30 pages of job listings with urllib took far less time than the two to three minutes needed when simulating a browser kernel.
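
As a rough sketch of the browser-kernel approach, here is Selenium WebDriver (the modern successor to the Selenium RC mentioned above; this is an assumption, not the author's code, and it needs the selenium package plus a browser driver installed):

from selenium import webdriver

driver = webdriver.Firefox()                   # drives a real browser kernel
try:
    driver.get('http://www.example.com/jobs')  # JavaScript runs as in a normal visit
    html = driver.page_source                  # fully rendered page, ready to parse
finally:
    driver.quit()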

4. Restricting access from certain IPs

Free proxy IPs can be obtained from many websites. Since crawlers can use these proxy IPs to crawl a site, the site can turn the same lists against them: by crawling those proxy IPs itself and storing them on its servers, it can block or restrict requests coming from them.

Getting down to business

OK, now let's put this into practice and write a crawler that accesses the site through proxy IPs.

First, scrape some proxy IPs to use for crawling.


def Get_proxy_ip():
    headers = {
        'Host': 'www.xicidaili.com',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://www.xicidaili.com/',
    }
    # Site that publishes free proxy IPs
    req = request.Request(r'http://www.xicidaili.com/nn/', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    proxy_list = []
    ip_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
    port_list = re.findall(r'<td>\d+</td>', html)
    for i in range(len(ip_list)):
        ip = ip_list[i]
        port = re.sub(r'<td>|</td>', '', port_list[i])
        proxy = '%s:%s' % (ip, port)
        proxy_list.append(proxy)
    return proxy_list

By the way, some websites restrict crawlers by checking whether the real IP behind a proxy is exposed. So here is a little background on proxy IPs.

What do "transparent", "anonymous" and "high anonymity" proxy IPs mean?

A transparent proxy forwards requests for the client, but it still transmits the client's real IP to the target server. With a transparent proxy you cannot get around limits on how many times an IP may access a site within a given period.

An ordinary anonymous proxy hides the client's real IP, but it modifies the request, so the server can tell a proxy is being used. The site you visit cannot see your real IP address, yet it knows you are behind a proxy, and sites that block proxies will still ban such IPs.

A high-anonymity proxy does not alter the client's request, so to the server it looks like a real browser on a real client is visiting, and the client's real IP is hidden; the website cannot tell that a proxy is in use at all.

To sum up, the best proxy IPs for a crawler are high-anonymity ones.
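
As a quick way to check what a given proxy reveals about you, you can request a header-echo service through it (httpbin.org is used here as an assumption, not something from the article; any service that echoes back the request headers will do):

from urllib import request

proxy_ip = '1.2.3.4:8080'  # placeholder proxy address
opener = request.build_opener(request.ProxyHandler({'http': proxy_ip}))
echoed = opener.open('http://httpbin.org/headers', timeout=10).read().decode('utf-8')

# A transparent proxy leaks your real address (for example in X-Forwarded-For),
# an ordinary anonymous proxy adds headers such as Via, while a high-anonymity
# proxy shows neither.
print(echoed)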

user_agent_list contains the request-header User-Agent strings of current mainstream browsers; by picking from it we can simulate requests coming from a variety of browsers.


user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

By waiting a random amount of time between visits, you can circumvent the restrictions some sites place on request intervals.


def Proxy_read(proxy_list, user_agent_list, i):
    proxy_ip = proxy_list[i]
    print('Current proxy ip: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current user_agent: %s' % user_agent)
    sleep_time = random.randint(1, 3)
    print('Waiting time: %s s' % sleep_time)
    time.sleep(sleep_time)  # Random wait before each request
    print('Starting request')
    headers = {
        'Host': 's9-im-notify.csdn.net',
        'Origin': 'http://blog.csdn.net',
        'User-Agent': user_agent,
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://blog.csdn.net/u010620031/article/details/51068703',
    }
    # Route subsequent urlopen calls through the proxy
    proxy_support = request.ProxyHandler({'http': proxy_ip})
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)
    req = request.Request(r'http://blog.csdn.net/u010620031/article/details/51068703', headers=headers)
    try:
        html = request.urlopen(req).read().decode('utf-8')
    except Exception as e:
        print('****** Open failed! ******')
    else:
        global count
        count += 1
        print('OK! %s successful requests so far!' % count)

That covers the basics of using proxies in a crawler. It is still quite simple, but it can handle most scenarios.

The full source code is attached below.


#! /usr/bin/env python3
from urllib import request
import random
import time
import re

user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

count = 0

def Get_proxy_ip():
    """Scrape a list of 'ip:port' proxies from a free-proxy listing page."""
    headers = {
        'Host': 'www.xicidaili.com',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://www.xicidaili.com/',
    }
    req = request.Request(r'http://www.xicidaili.com/nn/', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    proxy_list = []
    ip_list = re.findall(r'\d+\.\d+\.\d+\.\d+', html)
    port_list = re.findall(r'<td>\d+</td>', html)
    for i in range(len(ip_list)):
        ip = ip_list[i]
        port = re.sub(r'<td>|</td>', '', port_list[i])
        proxy = '%s:%s' % (ip, port)
        proxy_list.append(proxy)
    return proxy_list

def Proxy_read(proxy_list, user_agent_list, i):
    """Request the target page once through the i-th proxy with a random User-Agent."""
    proxy_ip = proxy_list[i]
    print('Current proxy ip: %s' % proxy_ip)
    user_agent = random.choice(user_agent_list)
    print('Current user_agent: %s' % user_agent)
    sleep_time = random.randint(1, 3)
    print('Waiting time: %s s' % sleep_time)
    time.sleep(sleep_time)  # Random wait before each request
    print('Starting request')
    headers = {
        'Host': 's9-im-notify.csdn.net',
        'Origin': 'http://blog.csdn.net',
        'User-Agent': user_agent,
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Referer': r'http://blog.csdn.net/u010620031/article/details/51068703',
    }
    proxy_support = request.ProxyHandler({'http': proxy_ip})
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)
    req = request.Request(r'http://blog.csdn.net/u010620031/article/details/51068703', headers=headers)
    try:
        html = request.urlopen(req).read().decode('utf-8')
    except Exception as e:
        print('****** Open failed! ******')
    else:
        global count
        count += 1
        print('OK! %s successful requests so far!' % count)

if __name__ == '__main__':
    proxy_list = Get_proxy_ip()
    # Don't assume the listing page returned at least 100 proxies
    for i in range(min(100, len(proxy_list))):
        Proxy_read(proxy_list, user_agent_list, i)
