A detailed explanation of a common JS anti-crawling technique, with a simple Python crawler

  • 2021-07-13 05:33:51
  • OfStack

Preface

When writing crawlers, the obstacle we run into most often is JS-based anti-crawling. Today I will share a common JS anti-crawling scheme that I have seen on many websites.

I divide JS anti-crawling into two categories: request parameters generated by JS encryption, and cookies generated by JS to verify that a real browser is visiting. Today I am covering the second case.

Target site

List page url: http://www.hnrexian.com/archives/category/jk

Normally, when we request a url the site returns the page's data and content. Let's see what this site returns instead.

The js code in the response is formatted below for easy viewing.


<script type="text/javascript">
function stringToHex(str) {
  var val = "";
  for (var i = 0; i < str.length; i++) {
    if (val == "") val = str.charCodeAt(i).toString(16);
    else val += str.charCodeAt(i).toString(16);
  }
  return val;
}
function YunSuoAutoJump() {
  var width = screen.width;
  var height = screen.height;
  var screendate = width + "," + height;
  var curlocation = window.location.href;
  if (-1 == curlocation.indexOf("security_verify_")) {
    document.cookie = "srcurl=" + stringToHex(window.location.href) + ";path=/;";
  }
  self.location = "/archives/category/jk?security_verify_data=" + stringToHex(screendate);
}
</script>
<script>setTimeout("YunSuoAutoJump()", 50);</script>

So much for returning the page's source code; what is this thing?!

Ideas for cracking the js

There are two general approaches to cracking js. One is to rewrite the js logic directly in Python, simulating what the js does; this works for relatively simple js. The other is to evaluate the js with a third-party Python library such as pyv8 or execjs (I find execjs easier to use); this is generally used for complex js.

Analyzing the returned js, it has two parts. The first part defines two functions, stringToHex and YunSuoAutoJump. The second part executes YunSuoAutoJump after 50 milliseconds.

The function YunSuoAutoJump adds a cookie and then requests a constructed url, as you can see from document.cookie and self.location. The helper stringToHex simply converts a string to its hex representation. For more on js itself, see https://www.runoob.com/js/js-tutorial.html.
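To make the encoding concrete, here is a quick illustration (a sketch, not part of the site's code) of what stringToHex produces for the screen-size string the js builds: each character becomes its char code in hex, concatenated with no separator.

```python
# Mimic the js stringToHex on a sample "width,height" string.
sample = "1920,1080"
hex_string = "".join(format(ord(c), "x") for c in sample)
print(hex_string)  # 313932302c31303830
```

So the constructed url ends up looking like /archives/category/jk?security_verify_data=313932302c31303830.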

Rewriting the js in python

The next step is to rewrite the js in python. The rewritten code is as follows.


def stringToHex(string):
  hex_string = ""
  for ch in string:
    hex_string += hex(ord(ch))[2:]
  return hex_string

def get_cookie(url):
  # In the js, "path=/" is a cookie attribute, not a cookie itself,
  # so only srcurl needs to be sent back.
  hex_string = stringToHex(url)
  return {"srcurl": hex_string}

These are the two functions: one converts a string to hex, the other builds the cookie.
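As a quick sanity check that the rewrite matches the js, we can encode the target url and decode it back (the helper is restated here so the snippet is self-contained; for ASCII input every char code is exactly two hex digits, so bytes.fromhex can reverse it):

```python
def stringToHex(string):
    # Same logic as the js: concatenate each char code in hex.
    return "".join(hex(ord(ch))[2:] for ch in string)

url = "http://www.hnrexian.com/archives/category/jk"
encoded = stringToHex(url)
# Round-trip back to the original url to confirm the encoding is faithful.
decoded = bytes.fromhex(encoded).decode("ascii")
assert decoded == url
print(encoded[:20])  # 687474703a2f2f777777 ("http://www")
```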

Finally get the result

Next we simulate the browser's behavior, which has three steps. First we request the target url and get back the js. Second, following the js, we add the cookie and request the constructed url. Third, we request the original target url again and get the final data.

We use requests.Session to keep the connection alive and carry the cookie across the three requests above.


import re

import requests
from urllib.parse import urljoin

url = "http://www.hnrexian.com/archives/category/jk"
s = requests.Session()
# Step 1: the first request returns the anti-crawl js.
r = s.get(url)
# Extract the path assigned to self.location in the js.
url_2 = re.compile(r'self\.location\s*=\s*"(.*?)"').findall(r.text)[0]
# Step 2: append a hex-encoded screen size, set the cookie, and follow the redirect.
screen_date = "1920,1080"
url_2 = urljoin(url, url_2 + stringToHex(screen_date))
cookie = get_cookie(url)
s.cookies.update(cookie)
r2 = s.get(url_2)
# Step 3: the second response redirects back to the real page.
url3 = re.compile(r'self\.location\s*=\s*"(.*?)"').findall(r2.text)[0]
r3 = s.get(url3)
r3.encoding = "gbk"
print(r3.text)
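The requests above depend on the live site, but the redirect-url extraction can be checked offline against the js snippet shown earlier. Below is a small sketch (extract_redirect_url is a name I introduce for illustration) of that step as a testable helper:

```python
import re

def extract_redirect_url(html):
    """Pull the path assigned to self.location out of the anti-crawl js."""
    match = re.search(r'self\.location\s*=\s*"(.*?)"', html)
    return match.group(1) if match else None

# The relevant line from the js returned by the site.
sample_js = ('self.location = "/archives/category/jk?security_verify_data=" '
             '+ stringToHex(screendate);')
print(extract_redirect_url(sample_js))
# /archives/category/jk?security_verify_data=
```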

Here we get the final content perfectly.

