Nginx anti-crawler strategy: blocking crawlers by User-Agent (UA)

  • 2021-09-05 01:22:56
  • OfStack

Create a new anti-crawler policy file:


vim /usr/www/server/nginx/conf/anti_spider.conf

File contents:


# Block scraping by tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) { 
   return 403; 
} 
# Block specific UAs, as well as requests with an empty UA
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|^$" ) { 
   return 403; 
} 
# Block request methods other than GET|HEAD|POST
if ($request_method !~ ^(GET|HEAD|POST)$) { 
  return 403; 
}
# Block a single IP:
#deny 123.45.6.7;
# Block the whole range 123.0.0.1 to 123.255.255.254:
#deny 123.0.0.0/8;
# Block the range 123.45.0.1 to 123.45.255.254:
#deny 123.45.0.0/16;
# Block the range 123.45.6.1 to 123.45.6.254:
#deny 123.45.6.0/24;
# The following IP range is known to be abusive:
#deny 58.95.66.0/24;
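
As a variation, nginx's non-standard status 444 closes the connection without sending any response at all, which wastes less bandwidth on abusive clients than a 403 page. A minimal sketch, with the UA list shortened for illustration:

# 444 makes nginx drop the connection without sending a response
if ($http_user_agent ~* (Scrapy|ApacheBench|ZmEu)) {
    return 444;
}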

Applying the configuration

Include the file in the server block of your site:


# Anti-crawler
include /usr/www/server/nginx/conf/anti_spider.conf;
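
For context, a minimal server block with the include in place might look like this; the domain and web root here are placeholders, not values from the original article:

server {
    listen 80;
    server_name www.example.com;

    # Anti-crawler rules
    include /usr/www/server/nginx/conf/anti_spider.conf;

    location / {
        root /usr/www/html;
        index index.html;
    }
}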

Finally, reload nginx so the new rules take effect.
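
A minimal sketch, assuming the nginx binary is on your PATH (a service manager, e.g. systemctl restart nginx, works just as well):

# Check the configuration for syntax errors, then reload
nginx -t
nginx -s reload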

Verifying that it works

Simulating YYSpider


λ curl -X GET -I -A 'YYSpider' https://www.myong.top
HTTP/1.1 200 Connection established
HTTP/2 403
server: marco/2.11
date: Fri, 20 Mar 2020 08:48:50 GMT
content-type: text/html
content-length: 146
x-source: C/403
x-request-id: 3ed800d296a12ebcddc4d61c57500aa2
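
The empty-UA rule (the ^$ alternative in the regex) can be checked the same way; a request sent without a User-Agent header should likewise come back 403:

λ curl -I -A '' https://www.myong.top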

Simulating Baidu's Baiduspider


λ curl -X GET -I -A 'BaiduSpider' https://www.myong.top
HTTP/1.1 200 Connection established
HTTP/2 200
server: marco/2.11
date: Fri, 20 Mar 2020 08:49:47 GMT
content-type: text/html
vary: Accept-Encoding
x-source: C/200
last-modified: Wed, 18 Mar 2020 13:16:50 GMT
etag: "5e721f42-150ce"
x-request-id: e82999a78b7d7ea2e9ff18b6f1f4cc84
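
Keep in mind that anyone can spoof the Baiduspider UA, so this test only shows that the UA is allowed through, not that the visitor is really Baidu. A common extra check, not part of the original config, is to reverse-resolve the client IP from the access log; genuine Baiduspider IPs resolve to hostnames under baidu.com or baidu.jp. A sketch with an illustrative IP:

λ host 220.181.108.75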

Common crawler User-Agents


FeedDemon               content scraping
BOT/0.1 (BOT for JCE)   SQL injection
CrawlDaddy              SQL injection
Java                    content scraping
Jullo                   content scraping
Feedly                  content scraping
UniversalFeedParser     content scraping
ApacheBench             CC attack tool
Swiftbot                useless crawler
YandexBot               useless crawler
AhrefsBot               useless crawler
YisouSpider             useless crawler (acquired by UC's Shenma Search; this spider can be let through!)
jikeSpider              useless crawler
MJ12bot                 useless crawler
ZmEu                    phpMyAdmin vulnerability scanning
WinHttp                 scraping / CC attacks
EasouSpider             useless crawler
HttpClient              TCP attacks
Microsoft URL Control   scanning
YYSpider                useless crawler
jaunty                  WordPress brute-force scanner
oBot                    useless crawler
Python-urllib           content scraping
Indy Library            scanning
FlightDeckReports Bot   useless crawler
Linguee Bot             useless crawler
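
Before adding a UA to the blocklist, it is worth checking what actually hits the server. A quick sketch against nginx's default combined log format; the log path is an assumption:

# The User-Agent is the 6th field when splitting on double quotes
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20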

That covers the details of the Nginx anti-crawler strategy for blocking crawler UAs. For more on Nginx anti-crawler techniques, see the other related articles on this site!

