Python Advanced Crawler: A Practical Analysis

  • 2020-11-20 06:09:37
  • OfStack

I would like to say a few words about this article first, and to apologize to everyone. When I was learning the material below I genuinely thought it was quite good, but I later realized it was the most basic of basics and not even complete. Rereading my own article, I honestly wanted to die of embarrassment, so I decided to scrap it and rewrite the whole thing. My sincere apologies to the 710+ visitors who clicked on this page before.

After working through all of this crawler material, I feel just about qualified to point newcomers to crawling in the right direction. So, without further ado:

1. Get the content

To talk about crawlers, you first have to talk about how to fetch pages. Python has many libraries that support crawling. One is urllib and its successors; it produces quite a few intermediate objects when used for crawling, more than I can remember, and in my view it feels somewhat dated. I recommend using the requests library instead when crawling (ps: requests, not request).

Using urllib:


import urllib.request
html = urllib.request.urlopen(url).read()

Using requests:


import requests
r = requests.get(url)

The retrieved content can then be handled in one of the following ways:

1. Use regular expression matching.

2. Use BeautifulSoup to turn the crawled content into tag objects.

3. Use XPath to select elements by constructing a node tree.

The first method is the most direct and efficient, and it does not require installing a third-party library. The second is much easier than the first: once the tags are turned into objects there is no need to write complicated regular expressions, and extracting tags becomes more convenient. The third is the most flexible, which also means there is a bit more syntax to learn; if you are not familiar with it, consult the XPath syntax documentation.

Use regular expression matching:


import re

pattern_content = '<div class="rich_media_content " id="js_content">(.*?)</div>'
content1 = re.findall(pattern_content, html, re.S)

Turning the crawled content into tag objects using BeautifulSoup:


import bs4

soup = bs4.BeautifulSoup(html, 'lxml')
imgs = soup.find_all('img')

As for installing BeautifulSoup, you can search Baidu yourself; if I remember correctly, pip works (pip install beautifulsoup4, plus lxml if you use that parser).

Using XPath to select elements by constructing a node tree:


from lxml import etree

selector = etree.HTML(html)
content = selector.xpath('//div[@id="content"]/ul[@id="ul"]/li/text()')

That covers the basics of getting content. These are the simplest possible examples; if you want to dig deeper into any of these methods, consult the more detailed technical documentation.

The following is the previous content, with some modifications.

2. Forging the request headers

The data on many websites is easy to crawl: you just request the URL directly, and many small sites work like this. For such sites it takes only a few minutes and a few casual lines of code to grab the data we want.

But it is not so easy to crawl slightly larger sites. Their servers analyze each incoming request to decide whether it comes from a real user. These techniques are collectively called anti-crawling. The common anti-crawling techniques I know of are the request analysis just mentioned and captchas. In both cases we have to work a little harder to build the crawler.

Let's start with the first one. The first thing to know is that the request format may vary a bit from site to site, so we can use the browser's developer tools to capture the request format of a given website, as shown below:

[Screenshot: the request headers captured in the browser's developer tools]

This is the request information captured using Chrome.

We can see the format of the request headers, so when visiting a site like this we must not forget to send forged headers along with the request.


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/32.0',
           'Referer': 'Address'}

Here the value of the Referer key is the URL being visited.
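A minimal sketch of passing these headers to requests (the url variable is a placeholder):

r = requests.get(url, headers=headers)  # the server sees the forged User-Agent and Referer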

Some sites also require cookies for user authentication; these can be obtained from requests.Session().cookies.
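A minimal sketch of inspecting the cookies a server sets; the login URL below is a placeholder:

session = requests.Session()
session.get('https://example.com/login', headers=headers)  # placeholder URL
print(session.cookies.get_dict())  # cookies the server set for this session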

If the crawler needs to submit information, the POST data also has to be constructed, like this:


postData = {
    'username': ul[i][0],
    'password': ul[i][1],
    'lt': b.group(1),
    'execution': 'e1s1',
    '_eventId': 'submit',
    'submit': '%B5%C7%C2%BC',
}
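A minimal sketch of submitting this form through the session created above; login_url is a placeholder, and ul and b come from surrounding code the original snippet omits:

response = session.post(login_url, data=postData, headers=headers)
print(response.status_code)  # 200 usually means the form was accepted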

3. Crawling multiple web pages

If the URLs follow a regular pattern, just write a function that constructs them and loop over the variable part. Otherwise the usual approach is to crawl the root page first, collect the URLs of all the pages you want to crawl, put them in a URL pool (for a small project a one-dimensional list is enough; a bigger project may need a higher-dimensional structure), and then crawl them in a loop.
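A minimal sketch of the regular-URL case, assuming a hypothetical page-numbered address pattern:

base_url = 'https://example.com/articles?page={}'  # hypothetical pattern
url_pool = [base_url.format(page) for page in range(1, 11)]
for page_url in url_pool:
    r = requests.get(page_url, headers=headers)
    # ... parse r.text here and collect the links you actually want ...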

A more efficient way is to use multithreading. The demo is a bit long, so only the key parts are shown.


import re
import queue
import threading

import requests

# url, header and header2 are defined elsewhere in the full demo
a = queue.Queue()  # shared URL pool


class Geturl(threading.Thread):
    """Producer: crawls the listing page and puts article URLs into the pool."""
    def run(self):
        res = requests.get(url, headers=header)
        html = res.text
        pattern_href = '<a target="_blank" href="(.*?)" rel="external nofollow" id'
        hrefs = re.findall(pattern_href, html, re.S)
        for href in hrefs:
            href = href.replace('amp;', '')  # undo the HTML-escaped '&' in the URL
            a.put(href)


class Spider(threading.Thread):
    """Consumer: takes a URL from the pool and extracts its title and text."""
    def run(self):
        href = a.get()
        res = requests.get(href, headers=header2)
        html = res.text
        pattern_title = '<title>(.*?)</title>'
        title = re.findall(pattern_title, html, re.S)
        pattern_content = '<div class="rich_media_content " id="js_content">(.*?)</div>'
        content1 = re.findall(pattern_content, html, re.S)
        print(title)
        # keep only the text between tags, then drop empty strings and &nbsp;
        pattern_content2 = '>(.*?)<'
        content2 = re.findall(pattern_content2, content1[0], re.S)
        while '' in content2:
            content2.remove('')
        content = ''.join(content2).replace('&nbsp;', '')
        print(content)
        a.task_done()  # mark this URL as processed

Two threads are started: one puts URLs into the URL pool, the other takes URLs from the pool and crawls their content. Another thread monitors the two; once they both finish running, the main thread ends.
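A minimal sketch of what the omitted main section might look like. The original mentions a separate monitoring thread; for simplicity this sketch just has the main thread join the workers, and the thread count is my own assumption:

if __name__ == '__main__':
    producer = Geturl()
    producer.start()
    producer.join()  # wait until the URL pool has been filled

    spiders = [Spider() for _ in range(5)]  # each Spider handles one URL
    for s in spiders:
        s.start()
    for s in spiders:
        s.join()  # the main thread ends once every spider has finished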

Python's multithreading is not implemented especially well under the hood; I won't go into the reasons here. There is also a lot more to the specifics of multithreading that I won't expand on.

4. About using proxy IPs

Many websites have an IP detection mechanism, which is usually triggered when the same IP visits the site more times, and faster, than a human could.

A proxy IP pool is usually built by crawling the free IPs published on open proxy-list websites; a quick Baidu search for "proxy IP" will turn up plenty of them. The pool can then be maintained with Flask + Redis. This part is fairly long in its own right, so I won't go into the details, and it is not strictly necessary for a crawler. In addition, those who have their own servers can also use SSR as a tunnelling tool, but I have not done that myself, so I won't elaborate.
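A minimal sketch of routing a request through a proxy with requests; the proxy address below is a placeholder, not a working proxy:

proxies = {
    'http': 'http://123.45.67.89:8080',  # placeholder proxy address
    'https': 'http://123.45.67.89:8080',
}
r = requests.get(url, headers=headers, proxies=proxies, timeout=5)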

5. Selenium mimics browser actions

The main points about Selenium:

1. Selenium is a complete web application testing system, which includes test recording (Selenium IDE), writing and running tests (Selenium Remote Control), and running tests in parallel (Selenium Grid).

2. Selenium Core, the heart of Selenium, is based on JsUnit and written entirely in JavaScript, so it can run on any browser that supports JavaScript.

3. Selenium can drive real browsers; it is an automated testing tool that supports a variety of browsers. In crawling, it is mainly used to solve JavaScript rendering problems.

4. When writing crawlers in Python, we mainly use WebDriver.

That is what the documentation says; read it if it makes sense to you, otherwise here is my simplified version:

Selenium is mainly used to mimic browser actions, such as filling in text boxes and clicking buttons. Paired with an efficient browser it is also a reasonable way to build a crawler. The advantage is that because it simulates real browser operations it is not easily caught by anti-crawling measures; the disadvantage is low efficiency, so it is unsuitable for large crawlers and is best kept for small hobby projects.
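A minimal sketch using the Selenium 4 WebDriver API; the URL and element IDs are placeholders, and a matching browser driver is assumed to be available:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a Chrome driver is available
driver.get('https://example.com/login')  # placeholder URL
driver.find_element(By.ID, 'username').send_keys('user')
driver.find_element(By.ID, 'password').send_keys('pass')
driver.find_element(By.ID, 'login-button').click()
html = driver.page_source  # the rendered HTML, after JavaScript has run
driver.quit()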

6. Scrapy Framework

This is another very, very large topic, and many things get more complicated once a framework is involved, although a framework of course makes big projects much easier. The point is that a framework is generally aimed at larger systems, so managing and operating it is harder, and some of its internal mechanisms are not explained very well. If I have time in the future I will write a separate article on it in detail; from the underlying principles to setup, configuration, and usage, there is simply too much to cover here.

7. About captcha processing

For captchas, the simple approach at the moment is to preprocess the image with PIL (Pillow) and then recognize it directly with Tesseract. I have written a separate article on this method for your reference.
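A minimal sketch of that approach, assuming the pytesseract wrapper and the Tesseract binary are installed; the threshold value is an arbitrary choice for illustration:

from PIL import Image
import pytesseract

img = Image.open('captcha.png').convert('L')  # convert to grayscale
img = img.point(lambda px: 255 if px > 140 else 0)  # crude binarization
print(pytesseract.image_to_string(img))  # the recognized text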

In addition, if you have studied machine learning and neural networks, you can also use a convolutional neural network. I am still learning in this direction myself, so I will just point out the road and not elaborate.

That is everything about crawlers I can think of; if there are mistakes, I hope you will point them out.

