Node.js crawler framework node-crawler: first experience
- 2021-09-11 19:14:07
- OfStack
Search Baidu for the word "crawler" and almost everything that comes up is related to Python.
Python does have many crawler frameworks, such as Scrapy, Portia, Crawley and so on.
Previously, I personally preferred to write crawlers in C#.
After becoming familiar with Node.js, I found it better to do this kind of work in a scripting language; at the very least you don't have to write so many entity classes, and scripts are generally simpler to work with.
Searching GitHub for node + spider, the top result is node-crawler.
github:https://github.com/bda-research/node-crawler
Basic usage
Install with npm:
npm install crawler
Create a new Crawler object:
var Crawler = require("crawler");

var c = new Crawler({
    // This callback is invoked after each request has been processed
    callback : function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            var $ = res.$;
            // $ defaults to the Cheerio parser.
            // Cheerio is a lean implementation of core jQuery, so DOM elements
            // can be extracted quickly with jQuery selector syntax.
            console.log($("title").text());
        }
        done();
    }
});
Then add URLs to the crawler's queue:
// Add a single URL to the request queue, using the default callback
c.queue('http://www.amazon.com');
// Add multiple URLs to the request queue
c.queue(['http://www.google.com/','http://www.yahoo.com']);
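A queued item can also be an object carrying its own options, including a callback that overrides the global one for just that request. A minimal sketch based on node-crawler's per-task options (the URL is a placeholder):
// Queue a single URL with per-request options; this callback
// replaces the global one for this request only
c.queue([{
    uri: 'http://www.example.com/page1',
    // skip Cheerio parsing for this request
    jQuery: false,
    callback: function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            console.log('Fetched ' + res.options.uri + ' with status ' + res.statusCode);
        }
        done();
    }
}]);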
Controlling concurrency and speed
Crawler frameworks generally fetch multiple pages at the same time, but crawling too fast will trigger the target website's anti-crawler mechanisms and also puts unnecessary load on that site.
Limit the maximum number of concurrent connections
var c = new Crawler({
    // Maximum number of concurrent connections; defaults to 10
    maxConnections : 1,
    callback : function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            var $ = res.$;
            console.log($("title").text());
        }
        done();
    }
});
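If you need to know when every queued request has finished, for example before exiting the script or starting the next stage, node-crawler emits a drain event once the queue is empty. A short sketch, assuming the event name from the project's README:
// Fired when the request queue has been drained and all requests are done
c.on('drain', function(){
    console.log('All queued pages have been crawled.');
});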
Using slow mode
Slow mode is enabled with the rateLimit parameter. rateLimit is the number of milliseconds to stay idle between requests; when it is set, maxConnections is forced to 1.
var c = new Crawler({
    // `maxConnections` will be forced to 1 because rateLimit is set
    maxConnections : 10,
    // Stay idle for 1000 ms between requests
    rateLimit: 1000,
    callback : function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            var $ = res.$;
            console.log($("title").text());
        }
        done();
    }
});
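With rateLimit set, queued URLs are fetched one after another with the configured idle time in between. A small usage sketch with placeholder URLs:
// With rateLimit: 1000 these pages are requested roughly one second apart
c.queue([
    'http://www.example.com/page/1',
    'http://www.example.com/page/2',
    'http://www.example.com/page/3'
]);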
Downloading static files such as images
var Crawler = require("crawler");
var fs = require("fs");

var c = new Crawler({
    // Return the raw response body as a Buffer instead of decoding it
    encoding: null,
    // Disable Cheerio parsing; set to false to suppress the warning message
    jQuery: false,
    callback: function (err, res, done) {
        if(err){
            console.error(err.stack);
        }else{
            // `res.options` carries the options passed to `queue()`, including `filename`
            fs.createWriteStream(res.options.filename).write(res.body);
        }
        done();
    }
});
c.queue({
uri:"https://nodejs.org/static/images/logos/nodejs-1920x1200.png",
filename:"nodejs-1920x1200.png"
});
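To download a whole list of images, the local filename can be derived from each URL before queueing. A small sketch; the second URL is just a hypothetical example:
var path = require('path');

var imageUrls = [
    'https://nodejs.org/static/images/logos/nodejs-1920x1200.png',
    'https://nodejs.org/static/images/logo.svg' // hypothetical second file
];

imageUrls.forEach(function (uri) {
    c.queue({
        uri: uri,
        // use the last path segment of the URL as the local filename
        filename: path.basename(uri)
    });
});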
That wraps up this first look at the Node.js crawler framework node-crawler. For more about node-crawler, please check out the other related articles on this site!