A Complete Example of a Scheduled Crawler Implemented in Node.js

  • 2021-11-01 02:13:24
  • OfStack

Contents
Background
Implementing scheduled tasks with Node Schedule
1. Install node-schedule
2. Basic usage
3. Advanced usage
4. Cancel a task
Summary

Background

Two days ago I had to help a friend audit the captain group for a Bilibili (B站) live room, which meant looking people up in the captain list one by one. Naturally, that is not a programmer's first choice. Hand the task to the computer and let it do the work; slacking off is the right way. With the theory settled, I started coding.

Because the API for the captain list is known, the crawler can simply call the interface directly with Axios.

So I spent quite a few minutes writing this crawler, which I call bilibili-live-captain-tools 1.0:


const axios = require('axios')

const roomid = "146088"
const ruid = "642922"
const url = `https://api.live.bilibili.com/xlive/app-room/v2/guardTab/topList?roomid=${roomid}&ruid=${ruid}&page_size=30`

// Maps guard_level to a readable title
const Captin = {
  1: 'Governor',
  2: 'Prefect',
  3: 'Captain'
}

const reqPromise = url => axios.get(url);

let CaptinList = []
let UserList = []

// Fetch one page; the top3 entries are only taken from page 1
async function crawler(URL, pageNow) {
  const res = await reqPromise(URL);
  if (pageNow == 1) {
    CaptinList = CaptinList.concat(res.data.data.top3);
  }
  CaptinList = CaptinList.concat(res.data.data.list);
}

// Read the total number of pages from a response
function getMaxPage(res) {
  const Info = res.data.data.info
  const { page: maxPage } = Info
  return maxPage
}

// Keep only the fields we need for each entry
function getUserList(res) {
  for (let item of res) {
    const { uid, username, guard_level } = item
    UserList.push({ uid, username, Captin: Captin[guard_level] })
  }
}

async function main(UID) {
  // Reset the module-level lists so repeated calls do not pile up duplicates
  CaptinList = []
  UserList = []
  const maxPage = await reqPromise(`${url}&page=1`).then(getMaxPage)
  for (let pageNow = 1; pageNow <= maxPage; pageNow++) {
    const URL = `${url}&page=${pageNow}`;
    await crawler(URL, pageNow);
  }
  getUserList(CaptinList)
  const result = search(UID, UserList)
  console.log(result)
  return result
}

// Return the matching user, or 0 if the uid is not in the list
function search(uid, UserList) {
  for (let i = 0; i < UserList.length; i++) {
    if (UserList[i].uid === uid) {
      return UserList[i];
    }
  }
  return 0
}

module.exports = {
  main
}

Obviously, this crawler can only be triggered manually, and running it requires a command line and a Node environment. So I set up a small web service with Koa2 and wrote an extremely bare-bones page:


const Koa = require('koa');
const app = new Koa();
const path = require('path')
const fs = require('fs');
const router = require('koa-router')();
const index = require('./index')
const views = require('koa-views')

// Render .ejs templates from the current directory
app.use(views(path.join(__dirname, './'), {
  extension: 'ejs'
}))
app.use(router.routes());

// Serve the static search page
router.get('/', async ctx => {
  ctx.response.type = 'html';
  ctx.response.body = fs.createReadStream('./index.html');
})

// Crawl the captain list and render the result for the given uid
router.get('/api/captin', async (ctx) => {
  const UID = ctx.request.query.uid
  console.log(UID)
  const Info = await index.main(parseInt(UID, 10))
  await ctx.render('index', {
    Info,
  })
});

app.listen(3000);

Because the page has no throttling or debouncing, and this version can only crawl in real time, every request means a long wait. Frequent refreshing naturally triggered Bilibili's anti-crawler mechanism, so the server's IP is currently under risk control.

So bilibili-live-captain-tools 2.0 was born


function throttle(fn, delay) {
  var timer;
  return function () {
    var _this = this;
    var args = arguments;
    // A call is already pending, so ignore this one
    if (timer) {
      return;
    }
    timer = setTimeout(function () {
      fn.apply(_this, args);
      // fn has run after `delay` ms; clear the timer so the next call can start a new one
      timer = null;
    }, delay)
  }
}

Version 2.0 adds throttling and debouncing, and switches to a pseudo-real-time crawler (crawling once a minute through a scheduled task).
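
The throttle function shown earlier lets at most one call through per delay window. The debouncing counterpart is not shown in the original; a minimal sketch, assuming the usual trailing-edge behavior where only the last call of a burst runs, might look like this. The name `debounce` and the 20 ms delay in the usage lines are illustrative choices.

```javascript
// Minimal trailing-edge debounce (illustrative; not taken from the original project).
// Each call resets the timer, so fn runs once, `delay` ms after the last call.
function debounce(fn, delay) {
  let timer = null;
  return function (...args) {
    if (timer) {
      clearTimeout(timer); // a newer call supersedes the pending one
    }
    timer = setTimeout(() => {
      timer = null;
      fn.apply(this, args);
    }, delay);
  };
}

// Usage: three rapid calls collapse into a single invocation with the last argument.
const calls = [];
const record = debounce((uid) => calls.push(uid), 20);
record(1);
record(2);
record(3);
```

Throttling suits events that should still fire at a steady rate, while debouncing suits a search box or button where only the final input matters.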

For the pseudo-real-time approach, the crawler script has to run on a schedule. My first thought was the schedule feature of Egg, but that would be overkill for a single crawler. So I went with the following plan.

Implementing scheduled tasks with Node Schedule

Node Schedule is a flexible cron-like and not-cron-like job scheduler for Node.js. It allows you to schedule jobs (arbitrary functions) for execution at specific dates, with optional recurrence rules. It only uses a single timer at any given time (rather than reevaluating upcoming jobs every second/minute).

1. Install node-schedule


npm install node-schedule
# or
yarn add node-schedule

2. Basic usage

In general, start with the example from the official documentation:


const schedule = require('node-schedule');

const job = schedule.scheduleJob('42 * * * *', function(){
 console.log('The answer to life, the universe, and everything!');
});

The first argument of schedule.scheduleJob must follow the rule format below.

Node Schedule's cron-style rules are laid out as follows:

*    *    *    *    *    *
┬    ┬    ┬    ┬    ┬    ┬
│    │    │    │    │    │
│    │    │    │    │    └ day of week, value: 0-7, where 0 and 7 both mean Sunday
│    │    │    │    └───── month, value: 1-12
│    │    │    └────────── day of month, value: 1-31
│    │    └─────────────── hour, value: 0-23
│    └──────────────────── minute, value: 0-59
└───────────────────────── second, value: 0-59 (optional)

You can also specify an exact time by passing a Date object, such as: const date = new Date()
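
To make the field order concrete, here is a small sketch that labels the fields of a cron-style string in the order listed above. It is only an illustration: `labelCronFields` is a hypothetical helper, not part of node-schedule, and it does not validate the field values.

```javascript
// Hypothetical helper (not a node-schedule API): label each field of a
// cron-style string. With 5 fields the optional seconds field is absent.
const FIELD_NAMES = ['second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek'];

function labelCronFields(spec) {
  const parts = spec.trim().split(/\s+/);
  if (parts.length !== 5 && parts.length !== 6) {
    throw new Error(`expected 5 or 6 fields, got ${parts.length}`);
  }
  // A 5-field spec starts at "minute", so skip the "second" label.
  const names = parts.length === 6 ? FIELD_NAMES : FIELD_NAMES.slice(1);
  const labeled = {};
  names.forEach((name, i) => {
    labeled[name] = parts[i];
  });
  return labeled;
}

// minute is '42': the job runs at minute 42 of every hour
console.log(labelCronFields('42 * * * *'));
// second is '0' and minute is '*/5': second 0 of every fifth minute
console.log(labelCronFields('0 */5 * * * *'));
```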

Now that we understand the rules, let's write one ourselves:


const schedule = require('node-schedule');

// Define a time (months are zero-based, so 2 means March)
let date = new Date(2021, 2, 10, 12, 0, 0);

// Define a task
let job = schedule.scheduleJob(date, () => {
  console.log("Current time:", new Date());
});

The example above reports the time at 12:00 on March 10, 2021.

3. Advanced usage

Beyond the basic usage, there are more flexible ways to define scheduled tasks.

3.1. Run once every minute


const schedule = require('node-schedule');

// Define the rule
let rule = new schedule.RecurrenceRule();
rule.second = 0
// Run once at second 0 of every minute

// Start the task
let job = schedule.scheduleJob(rule, () => {
  console.log(new Date());
});
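
Tying this back to the crawler, the once-a-minute job can refresh an in-memory snapshot that the page reads from, so page refreshes never hit the Bilibili API directly. The following is a sketch under assumptions, not the author's actual 2.0 code: `createStore`, `startRefreshJob` and `crawlAll` are hypothetical names, `crawlAll` stands for any async function that resolves to an array of `{ uid, username, Captin }` objects, and the scheduler is passed in as a parameter so the sketch does not depend on how node-schedule is loaded.

```javascript
// In-memory snapshot of the captain list, keyed by uid.
function createStore() {
  let users = new Map();
  let updatedAt = null;
  return {
    // Swap in a fresh list in one step so readers never see a half-built one.
    replace(list, now = new Date()) {
      users = new Map(list.map((user) => [user.uid, user]));
      updatedAt = now;
    },
    find(uid) {
      return users.get(uid) || null;
    },
    lastUpdated() {
      return updatedAt;
    },
  };
}

// `schedule` is expected to be the node-schedule module;
// `crawlAll` is a hypothetical async function returning the full user list.
function startRefreshJob(schedule, crawlAll, store) {
  const rule = new schedule.RecurrenceRule();
  rule.second = 0; // second 0 of every minute
  return schedule.scheduleJob(rule, async () => {
    try {
      store.replace(await crawlAll());
    } catch (err) {
      // Keep serving the previous snapshot if one crawl fails.
      console.error('crawl failed:', err.message);
    }
  });
}
```

With node-schedule installed, this could be wired up as `startRefreshJob(require('node-schedule'), crawlAll, createStore())`, and the `/api/captin` route would then read from the store instead of crawling on every request.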

The values supported by rule are second, minute, hour, date, dayOfWeek, month, year, etc.

Some common rules are as follows:

Run every second
rule.second = new schedule.Range(0, 59);
Run at second 0 of every minute
rule.second = 0;
Run at minute 30 of every hour
rule.minute = 30;
rule.second = 0;
Run at 0:00 every day
rule.hour = 0;
rule.minute = 0;
rule.second = 0;
Run at 10:00 on the 1st of each month
rule.date = 1;
rule.hour = 10;
rule.minute = 0;
rule.second = 0;
Run at 0:00 and 12:00 every Monday, Wednesday and Friday
rule.dayOfWeek = [1,3,5];
rule.hour = [0,12];
rule.minute = 0;
rule.second = 0;
4. Cancel a task

You can use cancel() to stop a scheduled task. When a task starts misbehaving, cancel it promptly.


job.cancel();
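
One way to cancel promptly when a task misbehaves is to count consecutive failures and cancel once a limit is reached. The sketch below assumes only that the job object has a `cancel()` method, as node-schedule jobs do; `cancelAfterFailures` and the default limit of 3 are illustrative choices, not something the original project defines.

```javascript
// Wrap a task so that `getJob().cancel()` is called after `maxFailures`
// consecutive errors. `getJob` is a function because the job object only
// exists after scheduleJob() has returned.
function cancelAfterFailures(task, getJob, maxFailures = 3) {
  let failures = 0;
  return async function guarded(...args) {
    try {
      await task(...args);
      failures = 0; // a success resets the streak
    } catch (err) {
      failures += 1;
      console.error(`task failed (${failures}/${maxFailures}):`, err.message);
      if (failures >= maxFailures) {
        getJob().cancel();
      }
    }
  };
}
```

With node-schedule this might be used as `const job = schedule.scheduleJob(rule, cancelAfterFailures(crawl, () => job))`, where `crawl` is whatever async function the job runs.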

Summary

node-schedule is a scheduled-task (crontab-like) module for Node.js. We can use scheduled tasks to maintain a server system and have it perform necessary operations at fixed times, and also to send mail, crawl data, and so on.

