How to write a crawler gracefully in C

  • 2020-05-07 20:04:30
  • OfStack

Sooner or later, most of us need to write a web crawler. Python is usually the first choice for the job, and Java and other languages are good options as well. These languages are popular not only because they have excellent network request and string processing libraries, but also because many mature crawler frameworks have been built on top of them. A good crawler framework ensures the stability of the crawler and makes the program convenient to write. The mission of the cspider crawler library is to let us write crawlers gracefully in C as well.

1. Features of cspider

  • Easy configuration. One-line setup functions define the user agent, cookie, timeout, proxy, and the maximum number of fetch and parse threads.
  • Independent program logic. Users define the crawler's parse function and its data persistence function separately, and newly resolved URLs can be added to the task queue with the addUrl function provided by cspider (see the sketch after this list).
  • Convenient string handling. cspider provides simple regular expression functions based on PCRE, XPath parsing functions based on libxml2, and the cJSON library for parsing JSON.
  • Efficient scraping. cspider schedules its fetch and parse threads with libuv and uses curl as the network request library.
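
As a quick illustration of the addUrl feature, a parse function can feed newly discovered links back into the task queue. The following is a minimal sketch, assuming the xpath, addUrl, and saveString helpers behave as in the full example in section 3; the //a/@href expression is only an illustrative placeholder for a real link pattern.

#include <cspider/spider.h>

/* Hypothetical parse function: enqueue every link found on the page. */
void parse_links(cspider_t *cspider, char *d, void *user_data) {
 char *links[100];
 // extract link targets; this XPath expression is a placeholder
 int size = xpath(d, "//a/@href", links, 100);

 int i;
 for (i = 0; i < size; i++) {
  // hand each discovered URL back to cspider's task queue
  addUrl(cspider, links[i]);
 }
}

Because the queue is shared with the fetch threads, the crawl expands automatically as new pages are discovered.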
2. Steps to use cspider

  • Obtain a cspider_t with init_cspider.
  • Customize the user agent, cookie, timeout, proxy, and the maximum number of fetch and parse threads.
  • Add the URLs to be fetched initially to the task queue.
  • Write the parse function and the data persistence function.
  • Start the crawler (see the skeleton after this list).
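
Mapped onto code, these steps correspond one-to-one to the calls used by the full example in the next section. The skeleton below is only a condensed outline of that example; the URL, the user agent string, and the stub bodies are placeholders.

#include <cspider/spider.h>

void p(cspider_t *cspider, char *d, void *user_data) { /* parse rules go here */ }
void s(void *str, void *user_data) { /* persistence goes here */ }

int main() {
 cspider_t *spider = init_cspider();       // step 1: obtain cspider_t
 cs_setopt_useragent(spider, "...");       // step 2: customize settings
 cs_setopt_threadnum(spider, DOWNLOAD, 2); // step 2: thread counts
 cs_setopt_threadnum(spider, SAVE, 2);
 cs_setopt_url(spider, "example.com");     // step 3: seed the task queue
 cs_setopt_process(spider, p, NULL);       // step 4: parse function
 cs_setopt_save(spider, s, stdout);        // step 4: persistence function
 return cs_run(spider);                    // step 5: start the crawler
}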

3. Example

Let's take a look at a simple crawler example, which will be explained in more detail later.


#include <cspider/spider.h>

/*
   Custom parse function: d holds the fetched HTML page as a string.
*/
void p(cspider_t *cspider, char *d, void *user_data) {

 char *get[100];
 // parse the html with xpath
 int size = xpath(d, "//body/div[@class='wrap']/div[@class='sort-column area']/div[@class='column-bd cfix']/ul[@class='st-list cfix']/li/strong/a", get, 100);

 int i;
 for (i = 0; i < size; i++) {
  // persist each movie name that was extracted
  saveString(cspider, get[i]);
 }

}
/*
   Data persistence function: receives the data passed in by saveString() in the parse function above and performs the actual saving.
*/
void s(void *str, void *user_data) {
 char *get = (char *)str;
 FILE *file = (FILE*)user_data;
 fprintf(file, "%s\n", get);
 return;
}

int main() {
 // Initialize the spider
 cspider_t *spider = init_cspider();
 char *agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:42.0) Gecko/20100101 Firefox/42.0";
 //char *cookie = "bid=s3/yuH5Jd/I; ll=108288; viewed=1130500_24708145_6433169_4843567_1767120_5318823_1899158_1271597; __utma=30149280.927537245.1446813674.1446983217.1449139583.4; __utmz=30149280.1449139583.4.4.utmcsr=accounts.douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/login; ps=y; ue=965166527@qq.com; dbcl2=58742090:QgZ2PSLiDLQ; ck=T9Wn; push_noty_num=0; push_doumail_num=7; ap=1; __utmb=30149280.0.10.1449139583; __utmc=30149280";

 // Set the URL of the page to crawl
 cs_setopt_url(spider, "so.tv.sohu.com/list_p1100_p20_p3_u5185_u5730_p40_p5_p6_p77_p80_p9_2d1_p101_p11.html");
 // Set up the user agent
 cs_setopt_useragent(spider, agent);
 //cs_setopt_cookie(spider, cookie);
 // Pass in pointers to the parse function and the data persistence function
 cs_setopt_process(spider, p, NULL);
 // The user_data pointer for the s function; here it points to stdout
 cs_setopt_save(spider, s, stdout);
 // Set the number of threads 
 cs_setopt_threadnum(spider, DOWNLOAD, 2);
 cs_setopt_threadnum(spider, SAVE, 2);
 //FILE *fp = fopen("log", "wb+");
 //cs_setopt_logfile(spider, fp);
 // Start the crawler 
 return cs_run(spider);
}


cspider_t *spider = init_cspider(); obtains the initial cspider, and the cs_setopt_xxx functions are then used to configure it. One thing to note: cs_setopt_process(spider, p, NULL); and cs_setopt_save(spider, s, stdout); set the parse function p and the data persistence function s respectively. Both must be implemented by the user, and each receives a user-defined user_data pointer for carrying context information. In the parse function, the user defines the parsing rules; a parsed string can be persisted by calling saveString, and new URLs can be added to the task queue by calling addUrl. Strings passed to saveString are then handled in the user-defined data persistence function, where the user can choose to write them to a file, a database, and so on.
Finally, cs_run(spider) is called to start the crawler.
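
To persist to a file instead of stdout, the commented-out fopen lines in the example hint at the pattern: user_data can point at any FILE* of the user's choosing. Below is a minimal sketch, with "result.txt" as a placeholder file name.

#include <cspider/spider.h>
#include <stdio.h>

/* Same persistence function as before: user_data is simply a FILE*. */
void s(void *str, void *user_data) {
 fprintf((FILE *)user_data, "%s\n", (char *)str);
}

int main() {
 cspider_t *spider = init_cspider();
 // open the output file ("result.txt" is a placeholder name)
 FILE *fp = fopen("result.txt", "w");
 if (fp == NULL)
  return 1;
 // the save threads will receive fp as user_data
 cs_setopt_save(spider, s, fp);
 /* ... set the url, user agent, parse function, and thread counts as in the example ... */
 int ret = cs_run(spider);
 fclose(fp);
 return ret;
}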

Use the cspider crawler framework to write your crawlers gracefully!

