A simple way to set a proxy IP for a Python crawler

  • 2021-11-24 01:54:47
  • OfStack

How to set a proxy IP in a Python crawler:

1. Add a piece of code to set the proxy, and switch to a different proxy every so often.

By default, urllib2 uses the environment variable http_proxy to set its HTTP proxy. Some sites count the number of visits from a given IP over a period of time, and if there are too many, they block further access. You can get around this by routing requests through proxy servers and switching proxies every once in a while, so the site cannot tell who is actually crawling it. The following code illustrates how to set a proxy.


import urllib2

enable_proxy = True
# Handler that routes HTTP requests through the given proxy server
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
# Handler with no proxy, used when the proxy is disabled
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
# Install the opener globally so later urllib2.urlopen calls use it
urllib2.install_opener(opener)
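
The snippet above installs a single fixed proxy. To actually change the proxy every so often, as described above, one minimal sketch is to keep a small pool of proxy addresses and pick one at random for each request; the proxy URLs below are placeholders, not real servers.


import random
import time
import urllib2

# Placeholder proxy addresses; replace them with proxies you actually have access to
proxy_pool = [
    'http://proxy-a.example.com:8080',
    'http://proxy-b.example.com:8080',
    'http://proxy-c.example.com:8080',
]

def open_with_random_proxy(url):
    # Build a fresh opener around a randomly chosen proxy for this request
    proxy = random.choice(proxy_pool)
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
    return opener.open(url, timeout=10)

for _ in range(3):
    response = open_with_random_proxy('http://www.baidu.com')
    print(response.getcode())
    time.sleep(5)  # wait a while before switching to the next proxy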

2. Setting a timeout solves the problem of some websites responding too slowly.

As mentioned earlier, the third parameter of the urlopen method is the timeout setting. You can specify how long to wait before giving up, which limits the impact of sites that respond too slowly. For example, in the code below, if the second parameter data is not supplied, timeout has to be passed as a keyword argument; if data has already been passed in, timeout can simply be given positionally.


import urllib2

# data is omitted, so timeout must be passed as a keyword argument
response = urllib2.urlopen('http://www.baidu.com', timeout=10)

import urllib2

# data (e.g. url-encoded POST data) is assumed to have been defined earlier,
# so the timeout can be passed positionally as the third argument
response = urllib2.urlopen('http://www.baidu.com', data, 10)
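
If the server does not answer within the timeout, urlopen raises an exception rather than hanging forever; a minimal sketch of handling that case looks like this.


import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.baidu.com', timeout=10)
    print(response.getcode())
except urllib2.URLError as e:
    # Connection errors and connect timeouts surface as URLError
    print('request failed: %s' % e.reason)
except socket.timeout:
    # A timeout while reading the response surfaces as socket.timeout
    print('request timed out while reading the response')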

This is how to set up proxies in a Python crawler. At the end we also covered the timeout setting, so that slow networks and sluggish sites have a proper solution as well.

Proxies, however, are the part you will use most often, so they deserve the most attention. If you need proxy IPs, you can try providers such as Sun HTTP, whose products target crawler collection, marketing promotion, studios and similar uses: they cover more than 200 city lines in China, do not limit API call frequency, and refresh the IP pool around the clock.

Extended knowledge points:

Code example:


from bs4 import BeautifulSoup
import requests
import random

def get_ip_list(url, headers):
    # Fetch the proxy listing page and parse every table row into an "ip:port" string
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):  # start at 1 to skip the table header row
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

def get_random_ip(ip_list):
    # Pick one proxy at random and wrap it in the dict format that requests expects
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies

if __name__ == '__main__':
    url = 'http://www.xicidaili.com/nn/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'
    }
    ip_list = get_ip_list(url, headers=headers)
    proxies = get_random_ip(ip_list)
    print(proxies)
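
To actually route a request through the randomly chosen proxy, the returned dict can be passed straight to requests; a minimal usage sketch, with the target URL as a placeholder, looks like this.


import requests

# proxies is the dict returned by get_random_ip(ip_list) above
response = requests.get('http://www.baidu.com', proxies=proxies, timeout=10)
print(response.status_code)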
