Summary of knowledge points: how Python crawlers break through anti-crawler mechanisms

  • 2021-12-12 09:07:10
  • OfStack

1. Build a reasonable HTTP request header.

The HTTP request header is a set of attributes and configuration values sent along with every request to a web server. Because a browser and a bare Python crawler send different request headers, a crawler that keeps the default headers is likely to be detected by anti-crawler checks.
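For example, a minimal sketch with the requests library (the URL and the header values below are illustrative assumptions, not a definitive browser fingerprint):

import requests

url = "https://example.com/"  # placeholder target

headers = {
    # pretend to be a mainstream desktop browser instead of python-requests/x.y
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/96.0.4664.110 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

resp = requests.get(url, headers=headers, timeout=10)
print(resp.status_code)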

2. Learn to set and handle cookies.

Cookies are a double-edged sword: you can hardly get by without them, yet they can also give you away. A site tracks your visits through cookies, and if it detects crawler-like behavior, such as filling out forms too quickly or browsing a large number of pages in a short time, it will immediately cut off your access. On the other hand, handling cookies correctly avoids many collection problems, so while collecting a site it is worth checking which cookies it generates and then deciding which of them the crawler needs to handle.
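A minimal sketch with requests.Session, which persists cookies across requests automatically (the URLs are placeholders):

import requests

session = requests.Session()  # cookies set by the site are kept on the session

# the first request lets the site set its cookies
session.get("https://example.com/login-page", timeout=10)
print(session.cookies.get_dict())  # inspect which cookies the site generated

# later requests send those cookies back, just as a normal browser would
resp = session.get("https://example.com/data-page", timeout=10)
print(resp.status_code)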

3. Keep a normal access rhythm.

A Python crawler should not chase collection speed at all costs: add a short pause between page accesses wherever possible, which effectively helps you avoid anti-crawler checks.
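A minimal sketch that adds a short, randomized pause between page requests (the 1-3 second range and the URLs are assumptions):

import random
import time
import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]  # placeholder URLs

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    time.sleep(random.uniform(1, 3))  # pause so the access pattern looks more human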

4. Use proxy IPs. For distributed crawlers that have already run into anti-crawler measures, proxy IPs are the natural first choice.
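A minimal sketch with the requests library; the proxy address is a placeholder that you would replace with an address from your own proxy source:

import requests

proxies = {
    "http": "http://123.45.67.89:8080",   # placeholder proxy
    "https": "http://123.45.67.89:8080",
}

resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.text)  # should show the proxy's IP rather than your own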

The history of Python crawlers is, frankly, a long-running battle with anti-crawler measures. Wherever there are web crawlers on the Internet, anti-crawlers are never far behind. The premise of a website blocking crawlers is correctly distinguishing humans from bots; once a suspicious target is found, measures such as restricting its IP address are taken to stop it from continuing to visit.

Extension of knowledge points:

python3 crawlers -- coping mechanisms against anti-crawlers

Foreword:

Anti-crawling is essentially an offensive and defensive battle, and web crawlers generally come in two forms: page crawlers and interface (API) crawlers. To build coping mechanisms against a website's anti-crawler measures, the following aspects generally need to be considered:

① Client (access terminal) restrictions: these can be countered by forging a dynamic UA;

② Access frequency limits: websites generally identify you via cookies/IP, which can be countered by disabling cookies or by using a cookie pool / IP pool;

③ Access timing checks: cope by delaying requests;

④ Hotlink-protection (referer chain) problems: generally, requests for a page can be traced back. On Zhihu, for example, a normal user must open the question page before reaching an answer detail page, so there is a strict request order; if the intermediate page is skipped, the request may be judged to come from a crawler. This can be handled by forging the request headers; a combined sketch covering ① and ④ follows this list.
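A minimal sketch of forging a dynamic UA and a plausible Referer with requests; the UA strings, URLs and Referer value are assumptions:

import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
]

headers = {
    "User-Agent": random.choice(user_agents),          # dynamic UA (point ①)
    "Referer": "https://example.com/question/12345",   # pretend we came from the question page (point ④)
}

resp = requests.get("https://example.com/question/12345/answer/67890", headers=headers, timeout=10)
print(resp.status_code)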

Specific anti-crawler strategies:

① Verification code

Response: simple captchas can be recognized by machine learning, with accuracy reportedly up to 50-60%; complex ones can be handed to a dedicated human captcha-solving platform (depending on the captcha's complexity, solvers charge on average 1-2 cents per captcha)
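A minimal OCR sketch for very simple image captchas, assuming the Pillow and pytesseract packages plus a locally installed Tesseract binary; "captcha.png" is a placeholder for an image you have already downloaded:

from PIL import Image
import pytesseract

img = Image.open("captcha.png").convert("L")      # grayscale to reduce noise
text = pytesseract.image_to_string(img).strip()   # OCR the characters
print("recognized captcha:", text)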

② IP bans (prone to false positives)

Response: obtain IPs through an IP proxy pool or VPS dial-up; hundreds of thousands of IPs can be obtained at low cost
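A minimal sketch of rotating through a proxy pool; the addresses are placeholders, and real pools are usually fetched from a proxy provider's API:

import random
import requests

proxy_pool = [
    "http://11.22.33.44:8080",
    "http://55.66.77.88:3128",
    "http://99.88.77.66:8000",
]

def fetch(url, retries=3):
    """Try the request with a randomly chosen proxy, switching on failure."""
    for _ in range(retries):
        proxy = random.choice(proxy_pool)
        try:
            return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        except requests.RequestException:
            continue  # this proxy is dead or blocked, try another one
    return None

resp = fetch("https://httpbin.org/ip")
if resp is not None:
    print(resp.text)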

③ Sliding captchas: conventional captchas are relatively easy for machine learning to recognize, so sliding verification has a certain advantage over them

Response: simulate the sliding action to pass verification
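A minimal sketch of simulating the slide with Selenium; the URL, the "#slider" selector and the 260 px distance are hypothetical, and real sliding captchas also inspect the movement track, which is why the drag below is broken into small jittered steps:

import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("https://example.com/slider-captcha")

slider = driver.find_element(By.CSS_SELECTOR, "#slider")

actions = ActionChains(driver)
actions.click_and_hold(slider)
moved = 0
while moved < 260:                       # drag towards the assumed gap position
    step = random.randint(10, 30)
    actions.move_by_offset(step, random.randint(-2, 2))
    actions.pause(random.uniform(0.01, 0.05))
    moved += step
actions.release()
actions.perform()                        # execute the whole drag as one sequence

driver.quit()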

④ Context association / hotlink protection: the state kept in a token/cookie is used to associate a request with its context, and whether the request has gone through the complete flow is used to judge whether it comes from a crawler (both Zhihu and Toutiao have this mechanism)

Response: analyze the protocol and fully simulate the normal flow
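A minimal sketch of full-flow simulation with a requests Session; the question/answer URLs are placeholders standing in for a Zhihu-style browsing order:

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})

# step 1: visit the question page first, as a normal user would, so that any
# tokens/cookies tied to the flow are set on the session
question_url = "https://example.com/question/12345"
session.get(question_url, timeout=10)

# step 2: only then request the answer detail page, carrying the question page
# as Referer so the request order looks complete
answer_url = "https://example.com/question/12345/answer/67890"
resp = session.get(answer_url, headers={"Referer": question_url}, timeout=10)
print(resp.status_code)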

⑤ JavaScript taking part in the computation: exploiting the fact that a simple crawler cannot execute JS, intermediate results are parsed or computed in JS

Response: automated parsing can be done by embedding a JS engine module or by directly using headless browsers such as PhantomJS
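A minimal sketch using headless Chrome via Selenium as a stand-in for the PhantomJS approach mentioned above (PhantomJS itself is no longer maintained); the URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # run without a visible browser window

driver = webdriver.Chrome(options=options)
driver.get("https://example.com/js-heavy-page")

html = driver.page_source            # HTML after the page's JavaScript has run
print(len(html))
driver.quit()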

⑥ Session bans: a session whose request count exceeds a threshold gets banned (prone to false positives)

⑦ UA bans: a UA whose request count exceeds a threshold gets banned (prone to false positives)
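A minimal sketch of staying under a per-session threshold by rotating sessions and User-Agents; the threshold of 50 requests, the UA strings and the URLs are assumptions, not values from the article:

import itertools
import random
import requests

USER_AGENTS = ["UA string 1 (placeholder)", "UA string 2 (placeholder)", "UA string 3 (placeholder)"]
MAX_REQUESTS_PER_SESSION = 50   # assumed safety margin

def new_session():
    s = requests.Session()
    s.headers["User-Agent"] = random.choice(USER_AGENTS)
    return s

session, used = new_session(), 0
for i in itertools.count(1):
    if used >= MAX_REQUESTS_PER_SESSION:
        session, used = new_session(), 0   # fresh cookies + fresh UA
    resp = session.get(f"https://example.com/page/{i}", timeout=10)
    used += 1
    if i >= 5:   # stop early in this sketch
        break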

⑧ Web-font anti-crawling: the page source does not contain the visible text directly; instead a character set is provided, defined on the page via @font-face, and the text is rendered through custom Unicode code points
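A minimal sketch of inspecting such a web font, assuming the fontTools package and a "site-font.woff" file already downloaded from the page's @font-face rule; recovering the real characters usually needs a further mapping step (e.g. comparing glyphs against a known reference font):

from fontTools.ttLib import TTFont

font = TTFont("site-font.woff")
cmap = font["cmap"].getBestCmap()   # unicode codepoint -> glyph name

for codepoint, glyph_name in list(cmap.items())[:10]:
    print(hex(codepoint), "->", glyph_name)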

⑨ Others: for example code obfuscation, dynamic encryption schemes, fake data, and so on

