A Complete Example of a Scheduled Crawler Implemented in Node.js
- 2021-11-01 02:13:24
- OfStack
Implementing scheduled tasks with Node Schedule
1. Install node-schedule
2. Basic usage
3. Advanced usage
4. Cancel the task
Summary
Background
Two days ago I had to help a friend audit the captain group of a Bilibili live room, which meant looking people up in the captain list one by one. Naturally, that is not a programmer's first choice. Hand the task to the computer, let it do the work, and slack off in the meantime: that is the right way. With the idea settled, coding began.
Because the API behind the captain list is known, the crawler simply calls that interface directly with Axios.
So I spent a "few" minutes writing this crawler, which I call bilibili-live-captain-tools 1.0
const axios = require('axios')

const roomid = "146088"
const ruid = "642922"
const url = `https://api.live.bilibili.com/xlive/app-room/v2/guardTab/topList?roomid=${roomid}&ruid=${ruid}&page_size=30`

// guard_level -> guard title
const Captin = {
  1: 'Governor',
  2: 'Prefect',
  3: 'Captain'
}

const reqPromise = url => axios.get(url);

let CaptinList = []
let UserList = []

// Fetch one page of the guard list and append it to CaptinList
async function crawler(URL, pageNow) {
  const res = await reqPromise(URL);
  // top3 is merged in once, when the first page is fetched
  if (pageNow === 1) {
    CaptinList = CaptinList.concat(res.data.data.top3);
  }
  CaptinList = CaptinList.concat(res.data.data.list);
}

// Read the total number of pages from a response
function getMaxPage(res) {
  const { page: maxPage } = res.data.data.info
  return maxPage
}

// Keep only the fields we need for each user
function getUserList(res) {
  for (const { uid, username, guard_level } of res) {
    UserList.push({ uid, username, Captin: Captin[guard_level] })
  }
}

// Crawl every page, then look up the given UID (a number)
async function main(UID) {
  // Reset module-level state so repeated calls do not accumulate duplicates
  CaptinList = []
  UserList = []
  const maxPage = await reqPromise(`${url}&page=1`).then(getMaxPage)
  for (let pageNow = 1; pageNow <= maxPage; pageNow++) {
    const URL = `${url}&page=${pageNow}`;
    await crawler(URL, pageNow);
  }
  getUserList(CaptinList)
  const result = search(UID, UserList)
  console.log(result)
  return result
}

// Linear search by uid; returns 0 when no user matches
function search(uid, UserList) {
  for (let i = 0; i < UserList.length; i++) {
    if (UserList[i].uid === uid) {
      return UserList[i];
    }
  }
  return 0
}

module.exports = {
  main
}
Obviously, this crawler can only be triggered manually, and it needs a command line and a Node environment to run. So I set up a small web service with Koa2 and wrote an extremely bare-bones page
const Koa = require('koa');
const app = new Koa();
const path = require('path')
const fs = require('fs');
const router = require('koa-router')();
const index = require('./index')
const views = require('koa-views')

app.use(views(path.join(__dirname, './'), {
  extension: 'ejs'
}))
app.use(router.routes());

router.get('/', async ctx => {
  ctx.response.type = 'html';
  ctx.response.body = fs.createReadStream('./index.html');
})

router.get('/api/captin', async (ctx) => {
  const UID = ctx.request.query.uid
  console.log(UID)
  const Info = await index.main(parseInt(UID))
  await ctx.render('index', {
    Info,
  })
});

app.listen(3000);
The page has no throttling or debouncing, and this version crawls in real time on every request, so the wait is long. Frequent refreshing naturally triggered Bilibili's anti-crawler mechanism, and the server's IP is currently under risk control.
So bilibili-live-captain-tools 2.0 was born
function throttle(fn, delay) {
  var timer;
  return function () {
    var _this = this;
    var args = arguments;
    // A call is already pending: ignore this one
    if (timer) {
      return;
    }
    timer = setTimeout(function () {
      fn.apply(_this, args);
      timer = null; // Clear the timer once fn has run after the delay, so the next call can start a new timer
    }, delay)
  }
}
Version 2.0 adds throttling and debouncing, and switches to a pseudo-real-time crawler (a scheduled task crawls once a minute)
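The throttle helper above covers only one half of "throttling and debouncing". As a minimal sketch, a debounce counterpart could look like the following. The name debounce and the 200 ms delay are illustrative assumptions, not code from the original project. It restarts the timer on every call, so only the last call in a burst actually runs.

```javascript
// Hypothetical debounce helper (not from the original project).
// Each call cancels the pending timer, so fn runs only once the
// calls have stopped for `delay` milliseconds.
function debounce(fn, delay) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer);
    timer = setTimeout(() => {
      timer = null;
      fn.apply(this, args);
    }, delay);
  };
}

// Usage: three rapid calls result in a single lookup, for the last uid.
const lookup = debounce((uid) => console.log('query uid', uid), 200);
lookup(1);
lookup(2);
lookup(3);
```

Throttle limits how often a function may run, while debounce waits until the calls stop. For a search box that fires a crawl, debounce is usually the one that prevents a burst of keystrokes from turning into a burst of requests.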
This means the crawler script has to run on a schedule. My first thought was the schedule feature of egg, but I did not want to bring in something that heavy for one small crawler. So I settled on the following approach
Implementing scheduled tasks with Node Schedule
Node Schedule is a flexible cron-like and not-cron-like job scheduler for Node.js. It allows you to schedule jobs (arbitrary functions) for execution at specific dates, with optional recurrence rules. It uses only a single timer at any given time (rather than reevaluating upcoming jobs every second/minute).
1. Install node-schedule
npm install node-schedule
# Or
yarn add node-schedule
2. Basic usage
In general, start with the example from the official documentation
const schedule = require('node-schedule');
const job = schedule.scheduleJob('42 * * * *', function(){
  console.log('The answer to life, the universe, and everything!');
});
The first argument of schedule.scheduleJob is a cron-style rule string. In this example the job runs whenever the minute is 42 (19:42, 20:42, and so on).
The fields of a Node Schedule rule are laid out as follows
*    *    *    *    *    *
┬    ┬    ┬    ┬    ┬    ┬
│    │    │    │    │    │
│    │    │    │    │    └ day of week, value: 0-7, where 0 and 7 both mean Sunday
│    │    │    │    └───── month, value: 1-12
│    │    │    └────────── day of month, value: 1-31
│    │    └─────────────── hour, value: 0-23
│    └──────────────────── minute, value: 0-59
└───────────────────────── second, value: 0-59 (optional)
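To make the field order concrete, here is a small sketch that labels the fields of a cron-style string. describeCron is a hypothetical helper written for illustration, not part of node-schedule. It only splits on whitespace and relies on the fact that the seconds field is the optional leading one.

```javascript
// Hypothetical helper (not a node-schedule API): name each cron field.
const FIELD_NAMES = ['second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek'];

function describeCron(expression) {
  const parts = expression.trim().split(/\s+/);
  if (parts.length !== 5 && parts.length !== 6) {
    throw new Error(`expected 5 or 6 fields, got ${parts.length}`);
  }
  // With only 5 fields the optional seconds field is missing, so skip its name.
  const names = parts.length === 6 ? FIELD_NAMES : FIELD_NAMES.slice(1);
  const result = {};
  names.forEach((name, i) => {
    result[name] = parts[i];
  });
  return result;
}

// Five fields: the first one is the minute.
console.log(describeCron('42 * * * *'));
// Six fields: the first one is the second.
console.log(describeCron('0 * * * * *'));
```

Reading a rule this way makes it easier to spot the common mistake of writing a five-field expression while thinking in six fields, or the other way round.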
You can also pass a specific point in time as a Date object, such as: const date = new Date()
Now that we understand the rules, let's write one ourselves
const schedule = require('node-schedule');
// Define a time: 12:00:00 on March 10, 2021 (months are zero-based, so 2 is March)
let date = new Date(2021, 2, 10, 12, 0, 0);
// Define a task
let job = schedule.scheduleJob(date, () => {
  console.log("Current time:", new Date());
});
The example above prints the current time once, at 12:00 on March 10, 2021
3. Advanced usage
Beyond the basic usage, there is a more flexible way to define scheduled tasks.
3.1. Execute once every minute
const schedule = require('node-schedule');
// Define the rule: run once per minute, at second 0
let rule = new schedule.RecurrenceRule();
rule.second = 0
// Start the task
let job = schedule.scheduleJob(rule, () => {
  console.log(new Date());
});
The fields supported by rule include second, minute, hour, date, dayOfWeek, month and year.
Some common rules are as follows
Execute every second
rule.second = Array.from({ length: 60 }, (_, i) => i); // [0, 1, 2, ..., 59]
Execute at second 0 of every minute
rule.second = 0;
Execute at minute 30 of every hour
rule.minute = 30;
rule.second = 0;
Execute at 0:00 every day
rule.hour = 0;
rule.minute = 0;
rule.second = 0;
Execute at 10:00 on the 1st of each month
rule.date = 1;
rule.hour = 10;
rule.minute = 0;
rule.second = 0;
Execute at 0:00 and 12:00 on Monday, Wednesday and Friday every week
rule.dayOfWeek = [1,3,5];
rule.hour = [0,12];
rule.minute = 0;
rule.second = 0;
4. Cancel the task
You can use cancel() to stop a scheduled task. When a task starts misbehaving, cancel it promptly
job.cancel();
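Putting the pieces together, the pseudo-real-time crawler described earlier could be wired up roughly as follows. This is a sketch under assumptions, not the original 2.0 code: createCache, startCrawlJob, the crawl callback and the limit of 3 consecutive failures are all hypothetical names and values. The job refreshes an in-memory list once a minute, page requests read from that list instead of calling the Bilibili API, and the job cancels itself if crawling keeps failing.

```javascript
// Hypothetical in-memory cache: the scheduled job writes, the web handler reads.
function createCache(maxFailures = 3) {
  let users = [];
  let failures = 0;
  return {
    // Store a fresh list and reset the failure counter.
    update(list) {
      users = list;
      failures = 0;
    },
    // Record a failed crawl; returns true once the job should be cancelled.
    fail() {
      failures += 1;
      return failures >= maxFailures;
    },
    // Look up a user by uid without touching the network.
    find(uid) {
      return users.find((user) => user.uid === uid) || null;
    },
  };
}

// `crawl` is assumed to be an async function resolving to [{ uid, username, ... }].
function startCrawlJob(crawl, cache) {
  // Required here so the cache helpers above work without the dependency.
  const schedule = require('node-schedule');
  // Six-field rule: run at second 0 of every minute.
  const job = schedule.scheduleJob('0 * * * * *', async () => {
    try {
      cache.update(await crawl());
    } catch (err) {
      console.error('crawl failed:', err.message);
      if (cache.fail()) {
        job.cancel(); // stop calling the API after repeated failures
      }
    }
  });
  return job;
}
```

With something like this in place, the Koa route could call cache.find(parseInt(UID, 10)) instead of crawling on every request, so refreshing the page no longer sends anything to Bilibili.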
Summary
node-schedule is a scheduled-task (crontab-style) module for Node.js. We can use scheduled tasks to maintain a server, have it perform necessary operations at fixed times, and also to send mail, crawl data and so on.