Python: a summary of website anti-crawler strategies and how to deal with them

  • 2020-05-12 02:52:24
  • OfStack

This article describes website anti-crawler strategies in detail. Here I summarize the various anti-crawler strategies I have encountered and the methods I use to cope with them.

In terms of function, a crawler can be divided into three parts: data collection, processing, and storage. Here we only discuss the data collection part.

In general, websites defend against crawlers from three angles: the request Headers, user behavior, and the site's directory structure and data loading. The first two are the ones you encounter most often, and most websites fight crawlers from those angles. The third is used by some sites built on ajax, and it makes crawling harder (the page content is loaded dynamically with ajax to defeat static crawlers).

1. Anti-crawling based on the request Headers is the most common anti-crawler strategy.

Disguise the headers. Many sites check the User-Agent in the Headers, and some also check the Referer (anti-hotlinking on some resource sites checks the Referer). If you run into this kind of mechanism, you can add Headers to the crawler directly, copying the browser's User-Agent into the crawler's Headers, or set the Referer to the target site's domain [note: this is often overlooked; capture a real request to identify the Referer, and add it to the request headers your program sends]. For Headers-based detection, modifying or adding Headers in the crawler is a good way to bypass it.
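As a concrete illustration, here is a minimal sketch of disguising the headers with the requests library (which is mentioned later in this article). The URL is a placeholder, and the User-Agent and Referer values should be replaced with what you capture from a real browser request.


import requests

# Placeholder target URL, for illustration only.
url = 'http://www.example.com/page'

headers = {
    # Copy a real browser's User-Agent string here.
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36'),
    # Set the Referer to the target site's domain (confirm it by capturing a real request).
    'Referer': 'http://www.example.com/',
}

response = requests.get(url, headers=headers)
print(response.status_code)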

2. Anti-crawling based on user behavior

Other websites detect user behavior instead, such as the same IP visiting the same page many times within a short period, or the same account performing the same operation many times within a short period. [To deal with this kind of anti-crawling, you need enough IPs.]

(1) Most websites fall into the first case, which can be solved with IP proxies. You can write a crawler that scrapes proxy IPs published online, checks them, and saves the ones that work. With a large pool of proxy IPs you can switch to a new IP every few requests, which is easy to do with requests or urllib, so the first kind of anti-crawler is easily bypassed.

Using a proxy in the crawler (with urllib):

Steps:

1. Pass ProxyHandler a dictionary of the form {'type': 'proxy ip:port'}
proxy_support = urllib.request.ProxyHandler({})
2. Build a custom opener
opener = urllib.request.build_opener(proxy_support)
3a. Install the opener globally
urllib.request.install_opener(opener)
3b. Or call the opener directly
opener.open(url)

Request the target site through a large pool of randomly chosen proxies to counter this kind of anti-crawler:


#! /usr/bin/env python3.4
# -*- coding:utf-8 -*-
# __author__ == "tyomcat"


import urllib.request
import random
import re

# Page that echoes back the visitor's IP, used to verify the proxy is in effect.
url = 'http://www.whatismyip.com.tw'

# Pool of proxy IPs (ip:port) collected in advance; they may no longer be alive.
iplist = ['121.193.143.249:80', '112.126.65.193:80', '122.96.59.104:82', '115.29.98.139:9999', '117.131.216.214:80', '116.226.243.166:8118', '101.81.22.21:8118', '122.96.59.107:843']

# Pick a random proxy from the pool for this request.
proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
opener = urllib.request.build_opener(proxy_support)
# Disguise the User-Agent as a normal browser.
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36')]
urllib.request.install_opener(opener)

response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')

# Extract the IP address reported by the page to confirm the proxy was used.
pattern = re.compile('<h1>(.*?)</h1>.*?<h2>(.*?)</h2>')
items = re.findall(pattern, html)
for item in items:
    print(item[0] + ":" + item[1])
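
The text above mentions checking scraped proxies and saving only the ones that work. Here is a minimal sketch of one way to validate a proxy pool before using it; the test URL, timeout, and the check_proxy helper are assumptions for illustration, not part of the original code.


import urllib.request

def check_proxy(proxy, test_url='http://www.whatismyip.com.tw', timeout=5):
    # Return True if the proxy (ip:port) answers the test URL within the timeout.
    proxy_support = urllib.request.ProxyHandler({'http': proxy})
    opener = urllib.request.build_opener(proxy_support)
    try:
        opener.open(test_url, timeout=timeout)
        return True
    except Exception:
        return False

# e.g. a couple of entries from the pool scraped earlier
iplist = ['121.193.143.249:80', '112.126.65.193:80']

# Keep only the proxies that respond.
live_proxies = [p for p in iplist if check_proxy(p)]
print(live_proxies)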

(2) For the second case, wait a random few seconds after each request before making the next one. On some sites with logic flaws, you can bypass the restriction that the same account cannot make the same request more than a few times in a short period by requesting a few times, logging out, logging back in, and continuing to request. [Note: account-based restrictions are hard to deal with; even random delays of a few seconds may still get blocked. If you can get multiple accounts, it is better to rotate among them.] A minimal sketch of the random delay follows below.
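The sketch below shows the random wait between requests using the requests library; the list of URLs is hypothetical and only there to make the loop runnable.


import random
import time
import requests

# Hypothetical list of pages to fetch, for illustration only.
urls = ['http://www.example.com/page/%d' % i for i in range(1, 6)]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Wait a random few seconds before the next request to mimic human browsing.
    time.sleep(random.uniform(2, 6))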

3. Anti-crawling on dynamic pages

Most of the situations above occur on static pages. On some sites, however, the data we need to crawl is fetched by ajax requests or generated by JavaScript.

Solution: Selenium+PhantomJS

Selenium: an automated web testing tool that fully simulates a real browser environment and virtually all user actions

PhantomJS: a headless browser (no graphical interface)

Fetch the personal details page of a Taobao model (Taobao MM):


#! /usr/bin/env python3
# -*- coding:utf-8 -*-
# __author__ == "tyomcat"

from selenium import webdriver
import time
import re

# Launch the headless PhantomJS browser (path points to the local PhantomJS binary).
driver = webdriver.PhantomJS(executable_path='phantomjs-2.1.1-linux-x86_64/bin/phantomjs')
driver.get('https://mm.taobao.com/self/model_info.htm?user_id=189942305&is_coment=false')

# Give the ajax requests time to finish so the data is present in the page source.
time.sleep(5)

# Extract each label/value pair from the rendered profile section.
pattern = re.compile(r'<div.*?mm-p-domain-info">.*?class="mm-p-info-cell clearfix">.*?<li>.*?<label>(.*?)</label><span>(.*?)</span>', re.S)
html = driver.page_source
items = re.findall(pattern, html)
for item in items:
    print(item[0], 'http:' + item[1])
driver.close()

Thank you for reading. I hope this helps, and thank you for your support of this site!

