url-extract: a NodeJS URL screenshot module, with usage examples

  • 2020-03-29 23:48:55
  • OfStack

Last time, we introduced how to take screenshots with NodeJS + PhantomJS. Because that approach started a new PhantomJS process for every screenshot, its efficiency became a concern as concurrency grew, so we rewrote all the code and packaged it as a separate module that is easy to call.
How was it improved? It controls the number of threads and the number of urls handled by a single thread. It communicates over standard output and WebSocket. It adds a caching mechanism, currently implemented with a JavaScript object. It provides a simple interface.
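
To make the caching idea concrete, here is a minimal sketch of a URL-keyed cache backed by a plain JavaScript object with a size limit. The names createCache, get, set, and size are hypothetical and chosen for illustration; this is not the module's actual cache code, and the oldest-first eviction is an assumption.

```javascript
'use strict';

// Hypothetical sketch: a URL-keyed cache backed by a plain object.
// createCache, get, set, and size are illustrative names, not url-extract's API.
function createCache(maxEntries) {
  var store = Object.create(null); // no inherited keys such as "constructor"
  var keys = [];                   // insertion order, oldest first

  return {
    get: function (url) {
      return store[url]; // undefined on a miss
    },
    set: function (url, result) {
      if (!(url in store)) {
        keys.push(url);
        // Evict the oldest entry once the limit is exceeded (assumed policy).
        if (keys.length > maxEntries) {
          delete store[keys.shift()];
        }
      }
      store[url] = result;
    },
    size: function () {
      return keys.length;
    }
  };
}

var cache = createCache(2);
cache.set('http://a.example/', { image: 'a.png', status: true });
cache.set('http://b.example/', { image: 'b.png', status: true });
cache.set('http://c.example/', { image: 'c.png', status: true });
console.log(cache.get('http://a.example/')); // undefined, the oldest entry was evicted
console.log(cache.size()); // 2
```

A plain object keeps lookups cheap and needs no extra dependency, at the cost of losing the cache whenever the process restarts.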

Design

<img alt="" border="0" src="//files.jb51.net/file_images/article/201311/201311180920192.jpg">

 

Dependencies & installation

$ phantomjs -v

If this returns version 1.9.x, you can proceed. If the version is too old or an error occurs, go to (link: http://phantomjs.org/) to download the latest version.

If you already have Git installed or have a Git Shell, type:
$ npm install url-extract

to install the module.

A simple example

For example, to take a screenshot of the Baidu home page, we can write:

 module.exports = (function () {
   "use strict";
   var urlExtract = require('url-extract');
   // Take a snapshot of the page; the callback receives the job object.
   urlExtract.snapshot('http://www.baidu.com', function (job) {
     console.log('This is a snapshot example.');
     console.log(job);
     process.exit();
   });
 })();

This prints:

<img alt="" border="0" src="//files.jb51.net/file_images/article/201311/201311180920193.png">

The image attribute is the path of the screenshot relative to the working directory. We can use the job's getData interface to get cleaner data, for example:

 module.exports = (function () {
   "use strict";
   var urlExtract = require('url-extract');
   urlExtract.snapshot('http://www.baidu.com', function (job) {
     console.log('This is a snapshot example.');
     // getData() returns only the relevant data of the job.
     console.log(job.getData());
     process.exit();
   });
 })();

This prints:

<img alt="" border="0" src="//files.jb51.net/file_images/article/201311/201311180920194.png">

image is the path of the screenshot relative to the working directory. status indicates whether the job is in a normal state: true means normal, and false means the screenshot failed.
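
As a small illustration of checking these two fields, the sketch below turns a data object into a one-line summary. The helper name summarize is hypothetical, and the shape of the data object (an image path plus a boolean status) is assumed from the description above rather than taken from the module's source.

```javascript
'use strict';

// Turns the data of a finished job into a one-line summary.
// `data` is assumed to look like the getData() output described above:
// an object with an `image` path and a boolean `status`.
function summarize(data) {
  if (data && data.status === true) {
    return 'snapshot saved to ' + data.image;
  }
  return 'snapshot failed';
}

console.log(summarize({ image: 'snapshot/example.png', status: true }));
console.log(summarize({ status: false }));

// With url-extract installed, this could be used inside the callback:
//   urlExtract.snapshot('http://www.baidu.com', function (job) {
//     console.log(summarize(job.getData()));
//   });
```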

For more examples, see: (link: https://github.com/miniflycn/url-extract/tree/master/examples)

 

Main API

.snapshot

URL snapshot

.snapshot(url, [callback])
.snapshot(urls, [callback])
.snapshot(url, [option])
.snapshot(urls, [option])

url {String} the address to snapshot

urls {Array} the array of addresses to snapshot

callback {Function} the callback function

option {Object} optional parameters

┝ id {String} custom id for the url; ignored if the first parameter is urls

┝ image {String} custom save path for the screenshot; ignored if the first parameter is urls

┝ groupId {String} groupId for a set of urls, used on return to identify which group a url belongs to

┝ ignoreCache {Boolean} whether to ignore the cache

┗ callback {Function} the callback function
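
The option form can be sketched as follows: a hypothetical helper, snapshotGroup, passes an array of urls together with a groupId and collects the results. It assumes the callback fires once per url with that url's job, which the text here does not state explicitly, and it uses a stand-in extractor (with made-up result fields) so the sketch runs without PhantomJS.

```javascript
'use strict';

// Hypothetical helper: snapshots a list of urls under one groupId and
// collects the results. `extractor` is anything with a
// snapshot(urls, option) method, e.g. require('url-extract').
// Assumption: the callback fires once per url with that url's job.
function snapshotGroup(extractor, urls, groupId, done) {
  var results = [];
  extractor.snapshot(urls, {
    groupId: groupId,
    ignoreCache: false,
    callback: function (job) {
      results.push(job.getData());
      if (results.length === urls.length) {
        done(results);
      }
    }
  });
}

// Stand-in extractor so the sketch runs without PhantomJS.
// The fields it returns are illustrative, not the module's real output.
var fakeExtractor = {
  snapshot: function (urls, option) {
    urls.forEach(function (url, i) {
      option.callback({
        getData: function () {
          return {
            url: url,
            groupId: option.groupId,
            image: 'snapshot/' + i + '.png',
            status: true
          };
        }
      });
    });
  }
};

var collected;
snapshotGroup(
  fakeExtractor,
  ['http://a.example/', 'http://b.example/'],
  'demo',
  function (results) {
    collected = results;
    console.log(results.length + ' snapshots in group ' + results[0].groupId);
  }
);
```

Passing the extractor in as a parameter keeps the helper easy to try out; in real use, require('url-extract') would take the place of fakeExtractor.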

.extract

Grab the url information and take a snapshot

.extract(url, [callback])
.extract(urls, [callback])
.extract(url, [option])
.extract(urls, [option])

url {String} the address to extract

urls {Array} the array of addresses to extract

callback {Function} the callback function

option {Object} optional parameters

┝ id {String} custom id for the url; ignored if the first parameter is urls

┝ image {String} custom save path for the screenshot; ignored if the first parameter is urls

┝ groupId {String} groupId for a set of urls, used on return to identify which group a url belongs to

┝ ignoreCache {Boolean} whether to ignore the cache

┗ callback {Function} the callback function

Job (class)

Each url corresponds to a job object, which stores the information related to that url.

Fields

url {String} the link address

content {Boolean} whether to grab the page title and description

id {String} the job id

groupId {String} the job group id

cache {Boolean} whether the cache is enabled

callback {Function} the callback function

image {String} the image address

status {Boolean} whether the job is currently normal

Prototype

getData() gets the relevant data for the job

 

Global configuration

Global settings live in the config file in the url-extract root directory. The defaults are:

 module.exports = {
   wsPort: 3001,
   maxJob: 100,
   maxQueueJob: 400,
   cache: 'object',
   maxCache: 10000,
   workerNum: 0
 };

wsPort {Number} the port used by the websocket

maxJob {Number} the number of jobs each PhantomJS thread can run concurrently

maxQueueJob {Number} the maximum number of waiting jobs; 0 means no limit. Once the limit is exceeded, any further job directly returns failure (status = false)

cache {String} the cache implementation; currently only object is implemented

maxCache {Number} the maximum number of cached links

workerNum {Number} the number of PhantomJS threads
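
One way to read maxJob and maxQueueJob is as an admission rule: run a job if a slot is free, otherwise queue it, otherwise fail it. The sketch below illustrates that reading for a single worker. createScheduler, submit, and finish are hypothetical names, and this is not url-extract's actual scheduler.

```javascript
'use strict';

// Hypothetical sketch of the admission rule described by maxJob and
// maxQueueJob; this is not url-extract's actual scheduler.
function createScheduler(config) {
  var running = 0;
  var waiting = [];

  return {
    // Returns 'running', 'queued', or 'rejected' for a new job.
    submit: function (job) {
      if (running < config.maxJob) {
        running += 1;
        return 'running';
      }
      // maxQueueJob of 0 means the waiting queue is unlimited.
      if (config.maxQueueJob === 0 || waiting.length < config.maxQueueJob) {
        waiting.push(job);
        return 'queued';
      }
      job.status = false; // a rejected job reports failure
      return 'rejected';
    },
    // Marks one running job as finished and starts the next waiting one.
    finish: function () {
      running -= 1;
      var next = waiting.shift();
      if (next) {
        running += 1;
      }
      return next;
    }
  };
}

var scheduler = createScheduler({ maxJob: 1, maxQueueJob: 1 });
var outcomes = ['a', 'b', 'c'].map(function (id) {
  return scheduler.submit({ id: id, status: true });
});
console.log(outcomes); // [ 'running', 'queued', 'rejected' ]
```

With the default values above, the same rule would let 100 jobs run per thread and hold up to 400 more in the queue before rejecting new ones.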

 

A simple service example

(link: https://github.com/miniflycn/url-extract-server-example)

Note that connect and url-extract need to be installed:

$ npm install

If you downloaded the files from the network disk instead, install connect yourself:

$ npm install connect

Then type:

$ node bin/server

Open:

(link: http://localhost:3000)

to see the result.

 
