Fixing a 403 Forbidden access error in a Python crawler

  • 2020-05-27 05:56:40
  • OfStack

This article shows how to resolve the 403 Forbidden error that Python crawlers often run into.

When writing a crawler in Python, html.getcode() may report a 403 Forbidden response, which means the site is blocking automated crawlers. To get around this, you need Python's urllib2 module.
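To see the problem in isolation: with the default Python User-Agent, a site that blocks crawlers answers 403, which urllib2 raises as an HTTPError. A minimal sketch (the URL is just the example used below; any page that blocks crawlers behaves the same way):


# -*- coding: utf-8 -*-
import urllib2

url = "http://www.ofstack.com/qysh123"  # example URL; any crawler-blocking page will do

try:
    html = urllib2.urlopen(url)
    print html.getcode()  # 200 when access is allowed
except urllib2.HTTPError as e:
    print e.code          # 403 when the site rejects the default Python client
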

urllib2 is a higher-level module for crawling web pages and provides many useful methods. For example, requesting the URL //www.ofstack.com/qysh123 may well come back as 403 Forbidden.

Solving this problem requires the following steps:


req = urllib2.Request(url)
# Pretend to be a normal desktop browser instead of the default Python client
req.add_header("User-Agent","Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36")
req.add_header("GET",url)
req.add_header("Host","blog.csdn.net")
req.add_header("Referer","//www.ofstack.com/")

User-Agent is a request header that identifies the browser; you can see your own browser's value in its developer tools when inspecting a request.

Then fetch the page:


html=urllib2.urlopen(req)


print html.read()

This downloads the full HTML of the page without hitting the 403 error.

The approach above can be wrapped in a function so it is easy to reuse in later calls. The full code is:


#-*-coding:utf-8-*-

import urllib2
import random

url = "//www.ofstack.com/article/1.htm"

my_headers = [
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
]

def get_content(url, headers):
    '''
    Fetch a page that is otherwise blocked with 403 Forbidden.
    '''
    # Pick a random User-Agent so the request looks like a real browser
    random_header = random.choice(headers)

    req = urllib2.Request(url)
    req.add_header("User-Agent", random_header)
    req.add_header("Host", "blog.csdn.net")
    req.add_header("Referer", "//www.ofstack.com/")
    req.add_header("GET", url)

    content = urllib2.urlopen(req).read()
    return content

print get_content(url, my_headers)

Here, random.choice() automatically picks one of the predefined browser User-Agent strings. Inside the function you also set your own Host, Referer and GET information. With these headers in place the page can be fetched normally and the 403 error no longer appears.
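For reference, the same idea works on Python 3, where urllib2 has been merged into urllib.request. A minimal sketch under that assumption, reusing the header values from above (the http:// scheme is spelled out so urlopen accepts the example URL):


# -*- coding: utf-8 -*-
# Python 3 sketch of the same technique; urllib2 became urllib.request.
import random
import urllib.request

my_headers = [
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
]

def get_content(url, headers):
    req = urllib.request.Request(url)
    req.add_header("User-Agent", random.choice(headers))
    req.add_header("Referer", "//www.ofstack.com/")
    return urllib.request.urlopen(req).read()

print(get_content("http://www.ofstack.com/article/1.htm", my_headers))
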

Of course, if you request pages too quickly, some sites will still block you. Getting around that requires the proxy IP approach... work that out on your own.
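As a rough illustration of that proxy approach, urllib2 can route requests through a proxy with ProxyHandler. This is only a sketch: the proxy address 127.0.0.1:8080 is a placeholder, not a working proxy, and you would substitute a real proxy IP and port:


import urllib2

# Placeholder proxy address: replace 127.0.0.1:8080 with a real proxy IP and port
proxy = urllib2.ProxyHandler({"http": "http://127.0.0.1:8080"})
opener = urllib2.build_opener(proxy)
# Still spoof a browser User-Agent on top of the proxy
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0")]

content = opener.open("http://www.ofstack.com/article/1.htm").read()
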

Thank you for reading. I hope this helps, and thank you for supporting this site!

