Node.js crawler framework node-crawler: a first experience

  • 2021-09-11 19:14:07
  • OfStack

Search Baidu for the word "crawler", and nearly everything that comes up is related to Python.

Python has plenty of crawler frameworks, such as Scrapy, Portia, Crawley, and so on.

I personally used to prefer writing crawlers in C#.

As I became familiar with Node.js, I found this kind of work is better done in a scripting language: at the very least you are spared writing so many entity classes, and scripts are generally simpler to work with.

Searching for node+spider on GitHub, the top result is node-crawler.

GitHub: https://github.com/bda-research/node-crawler

Basic usage

Install with npm:


npm install crawler

Create a new Crawler object:


var Crawler = require("crawler");

var c = new Crawler({
    // This callback is invoked after each request has been processed
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            // $ defaults to the Cheerio parser, a lean server-side
            // implementation of core jQuery: DOM elements can be
            // extracted quickly using jQuery selector syntax
            console.log($("title").text());
        }
        // Signal that this request slot can be reused
        done();
    }
});

Then add URLs to the crawler's queue:


// Add a single URL to the request queue, using the default callback
c.queue('http://www.amazon.com');

// Add multiple URLs to the request queue
c.queue(['http://www.google.com/','http://www.yahoo.com']);
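
Besides plain URL strings, queue() also accepts option objects, so individual requests can override the global settings. A minimal sketch (the URL here is just a placeholder):


// Add a URL with per-request options that override the constructor's settings
c.queue({
    uri: 'http://www.example.com/page', // placeholder URL
    // This callback replaces the default one for this request only
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            console.log('Fetched ' + res.options.uri);
        }
        done();
    }
});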

Control concurrency and speed

A crawler framework generally fetches multiple pages at the same time, but crawling too fast can trigger the target website's anti-crawler mechanisms and also degrade the site's performance.

Control the maximum number of concurrent connections


var c = new Crawler({
    // Maximum number of concurrent connections; defaults to 10
    maxConnections: 1,

    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            console.log($("title").text());
        }
        done();
    }
});

Use slow mode

Enable slow mode with the rateLimit parameter: rateLimit is the idle time in milliseconds between two requests, and once it is set, maxConnections is forced to 1.


var c = new Crawler({
    // `maxConnections` will be forced to 1 when rateLimit is set
    maxConnections: 10,

    // Wait 1000 ms between requests
    rateLimit: 1000,

    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            console.log($("title").text());
        }
        done();
    }
});
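
With concurrency and rate limits in place, it is often useful to know when the queue has been fully processed. node-crawler emits a drain event once the queue is empty; a minimal sketch:


// Fires when the request queue has been emptied
c.on('drain', function () {
    console.log('All requests have been processed.');
});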

Download static files such as images


var fs = require("fs");

var c = new Crawler({
    // Keep the raw body as a Buffer instead of decoding it to a string
    encoding: null,
    // Disable Cheerio parsing; set to false to suppress the warning message
    jQuery: false,
    callback: function (err, res, done) {
        if (err) {
            console.error(err.stack);
        } else {
            // `filename` is the custom property attached in queue() below
            fs.createWriteStream(res.options.filename).write(res.body);
        }
        done();
    }
});

c.queue({
    uri: "https://nodejs.org/static/images/logos/nodejs-1920x1200.png",
    filename: "nodejs-1920x1200.png"
});
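
Putting the pieces together, here is a minimal sketch of a small crawl: fetch a page, extract links with Cheerio's selector syntax, and feed them back into the queue. The start URL is just a placeholder, and the depth and same-host checks a real crawler would need are omitted:


var Crawler = require('crawler');

var seen = {}; // crude de-duplication so each URL is fetched only once

var c = new Crawler({
    maxConnections: 2,
    rateLimit: 1000,
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            console.log(res.options.uri + ' -> ' + $('title').text());
            // Extract absolute links and re-queue any we have not seen yet
            $('a[href^="http"]').each(function () {
                var href = $(this).attr('href');
                if (!seen[href]) {
                    seen[href] = true;
                    c.queue(href);
                }
            });
        }
        done();
    }
});

c.queue('http://www.example.com/'); // placeholder start URL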

That concludes this first experience with the Node.js crawler framework node-crawler. For more information about node-crawler, please pay attention to the other related articles on this site!

