An example of implementing a simple web crawler in Node.js

  • 2020-03-30 04:29:32
  • OfStack

Web crawling is a well-known technique these days, but it still involves plenty of complexity: simple crawlers cannot cope with Ajax, XMLHttpRequest, WebSockets, Flash sockets, and the other techniques that modern websites are built on.

Our basic requirements came from the Hubdoc project, where we pull bill amounts, due dates, account numbers and, most importantly, PDFs of the most recent bills from the websites of banks, utilities, and credit card companies. For this project I started with a very simple approach (rather than the expensive commercial product we were evaluating at the time), based on a simple crawler I had previously built in Perl at MessageLabs/Symantec. But things did not go smoothly: spammers build far simpler websites than banks and utilities do.

So how did we solve this? We relied mainly on Mikeal's request library (https://github.com/mikeal/request). Make the request in the browser, look in the Network panel to see which headers were sent, and copy those headers into the code. The process is simple: trace the flow from logging in to downloading the PDF, then replay all of the requests in that flow. To make this kind of work easier, and to let web developers write crawlers in a more natural way, I expose the HTML results in jQuery form (using the lightweight cheerio library, https://github.com/MatthewMueller/cheerio), which keeps the work simple and makes it straightforward to pick out page elements with CSS selectors. The whole process is wrapped in a framework that also does extra work, such as picking up credentials from a database, loading the individual robots, and communicating with the UI via socket.io.
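To make the request-plus-cheerio approach concrete, here is a minimal sketch. The URL, the headers, and the CSS selectors are placeholders, not values from the Hubdoc robots.

// Sketch only: fetch a page with request(), load it into cheerio,
// and pull data out with jQuery-style CSS selectors.
const request = require('request');
const cheerio = require('cheerio');

// Headers copied from the browser's Network panel (placeholder values).
const headers = {
  'User-Agent': 'Mozilla/5.0 (compatible; example-crawler)',
  'Accept': 'text/html'
};

request({ url: 'https://example.com/account/bills', headers: headers }, function (err, res, body) {
  if (err) throw err;
  const $ = cheerio.load(body);
  // Hypothetical selectors for a bill listing page.
  $('.bill-row').each(function (i, el) {
    const dueDate = $(el).find('.due-date').text().trim();
    const amount = $(el).find('.amount').text().trim();
    console.log(dueDate, amount);
  });
});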

This worked for some websites, but what I was fighting was the JavaScript these companies put on their own sites, not my Node.js code. They had layered the remaining problems into such complexity that it was hard to figure out what to do just to obtain the login details. For some sites I spent days trying to get through with the request() library alone, but in vain.

Close to giving up, I found phantomjs-node (https://github.com/sgentle/phantomjs-node), a library that lets me control the headless browser PhantomJS (http://phantomjs.org/) from Node (translator's note: "headless" here means the browser renders pages in the background, without a display device). It looked like a simple solution, but there are some issues PhantomJS cannot avoid:

1. PhantomJS can only tell you when a page has finished loading; it cannot tell you whether a redirect via JavaScript or a meta tag is still on the way. This is especially true when the JavaScript uses setTimeout() to defer the call.

2. PhantomJS does give you a page-load-started (pageLoadStarted) hook that lets you deal with the problem above: keep a count of the number of page loads you expect, decrement it each time a load finishes, and set up a timeout as well (because these loads do not always happen), so that when the counter hits zero you invoke your callback. It works, but it always feels a bit like a hack (a sketch of this counter pattern follows the list).

3. PhantomJS needs a completely separate process for every page it crawls, because otherwise the cookies of different pages cannot be kept apart. If you use the same PhantomJS process, the session from a logged-in page gets sent along to other pages.

4. PhantomJS cannot be used to download resources; it can only save a page as a PNG or a PDF. That is useful, but it means we have to fall back on request() to download the PDFs.

5. Because of the above, I had to find a way to hand the cookies from the PhantomJS session over to request()'s session. Just pass the document.cookie string over, parse it, and inject it into request()'s cookie jar (see the cookie hand-off sketch after this list).

6. I also needed to limit the maximum number of concurrent browser sessions to make sure we did not crash the server. Even so, that limit is far higher than what the expensive commercial solutions offer (see the concurrency sketch after this list).
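For item 2, here is a rough sketch of the counter-plus-timeout pattern, assuming the callback-style API of the older sgentle/phantomjs-node releases (page.set() used to register PhantomJS's onLoadFinished callback); the expected load count, URL, and timeout are made-up values.

// Sketch only: wait for a known number of page loads (e.g. the login page
// plus the post-login redirect), with a timeout as a safety net because
// the expected loads do not always happen.
const phantom = require('phantom');

phantom.create(function (ph) {
  ph.createPage(function (page) {
    let remaining = 2;   // pages we expect to load in this flow
    let done = false;

    function finish(reason) {
      if (done) return;
      done = true;
      console.log('navigation settled:', reason);
      ph.exit();
    }

    // Decrement the counter every time a load completes.
    page.set('onLoadFinished', function () {
      remaining--;
      if (remaining <= 0) finish('all expected loads finished');
    });

    // Safety net in case a redirect never fires.
    setTimeout(function () { finish('timeout'); }, 30000);

    page.open('https://example.com/login', function (status) {
      console.log('initial open:', status);
    });
  });
});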
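For the cookie hand-off in item 5, a minimal sketch: evaluate document.cookie inside the PhantomJS page, split the string, and feed each cookie into a request() cookie jar. The helper name transferCookies and the URL are my own placeholders.

// Sketch only: copy cookies from a PhantomJS page into request()'s cookie jar.
const request = require('request');

function transferCookies(page, url, callback) {
  // Read the raw document.cookie string inside the headless browser.
  page.evaluate(function () { return document.cookie; }, function (cookieString) {
    const jar = request.jar();
    // document.cookie looks like "name1=value1; name2=value2; ..."
    cookieString.split(';').forEach(function (pair) {
      const cookie = request.cookie(pair.trim());
      if (cookie) jar.setCookie(cookie, url);
    });
    callback(jar);
  });
}

// Usage: transferCookies(page, 'https://example.com/', function (jar) { ... });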
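For the concurrency cap in item 6, one common approach (not necessarily the one used in the original project) is a worker queue from the async library; crawlSite and the limit of 5 are invented for illustration.

// Sketch only: cap the number of headless-browser crawls running at once.
const async = require('async');

const MAX_BROWSERS = 5; // illustrative limit, not the project's actual number

// Hypothetical crawl function: stands in for a full PhantomJS + request() run.
function crawlSite(url, done) {
  console.log('crawling', url);
  setTimeout(done, 1000); // pretend the crawl takes a second
}

// At most MAX_BROWSERS tasks run in parallel; the rest wait in the queue.
const queue = async.queue(function (task, done) {
  crawlSite(task.url, done);
}, MAX_BROWSERS);

queue.push({ url: 'https://example.com/bank-login' }, function () {
  console.log('bank crawl finished');
});
queue.push({ url: 'https://example.com/utility-login' }, function () {
  console.log('utility crawl finished');
});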

With all of that in place, I had a reasonably good crawler solution built on PhantomJS plus request(): log in with PhantomJS first, then hand over to request(), which uses the cookies set in PhantomJS to authenticate the logged-in session. This is a big win, because we can use request()'s stream to download the PDF files.
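To illustrate that last step, a minimal sketch of streaming a PDF to disk with request() once the cookie jar has been populated (for example by the transferCookies sketch above); the URL and file name are placeholders.

// Sketch only: download a PDF over the authenticated session by streaming
// the request() response straight to a file.
const fs = require('fs');
const request = require('request');

function downloadPdf(jar, url, destination, callback) {
  request({ url: url, jar: jar })          // jar carries the login cookies
    .on('error', callback)
    .pipe(fs.createWriteStream(destination))
    .on('finish', function () { callback(null, destination); });
}

// Usage (placeholder values):
// downloadPdf(jar, 'https://example.com/bills/latest.pdf', 'latest-bill.pdf', console.log);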

The whole idea is to make it relatively easy for web developers who understand jQuery and CSS selectors to build crawlers for different websites.
