Using PHP to list the 404 paths crawled by search engines in the nginx access log

  • 2021-07-06 10:33:17
  • OfStack

I rotate the nginx logs on my server every day, so each day's log records the 404 pages hit by the major search engine crawlers. Until now I only analyzed these logs occasionally, and for anyone with a large volume of log data, screening it by hand is not practical. I spent a little time working on this myself and wrote a small script, test.php, that collects the 404 requests made by Google, Baidu, Soso, 360 Search, Easou, Sogou and Bing into a txt file.


<?php
// Visit test.php?s=google
$domain='https://www.ofstack.com';
// Map of the ?s= parameter value to the crawler's User-Agent token
$spiders=array(
    'baidu'=>'Baiduspider',
    '360'=>'360Spider',
    'google'=>'Googlebot',
    'soso'=>'Sosospider',
    'sogou'=>'Sogou web spider',
    'easou'=>'EasouSpider',
    'bing'=>'bingbot',
);
 
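// Yesterday's log. strtotime() also handles the 1st of the month, where
// date('d')-1 would give 0; 'j' keeps the day without a leading zero.
$yesterday=strtotime('-1 day');
$path='/home/nginx/logs/'.date('Y/m/j',$yesterday).'/access_www.txt';

// Accept only a plain string parameter; anything else fails the lookup below
$s=(isset($_GET['s'])&&is_string($_GET['s']))?$_GET['s']:'';

if(!array_key_exists($s,$spiders)) die();
$spider=$spiders[$s];

$file=$s.'_'.date('ymj',$yesterday).'.txt';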
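if(!file_exists($file)){
    $in=file_get_contents($path);
    // Stop if the log is unreadable, so an empty result file is not cached
    if($in===false) die();
    // \S+ captures only the request path; preg_quote() escapes the spider name
    $pattern='/GET (\S+) HTTP\/1\.1" 404.*'.preg_quote($spider,'/').'/';
    preg_match_all($pattern,$in,$matches);
    $out='';
    foreach($matches[1] as $v){
        $out.=$domain.$v."\r\n";
    }
    file_put_contents($file,$out);
}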
 
$url=$domain.'/silian/'.$file;
echo $url;
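
The matching step can also be tried on its own, without touching the real log. The sketch below is not part of the original script: the function name extract404Paths and the sample log lines are made up for illustration, and it assumes nginx's default "combined" log format, where the status code follows the quoted request and the User-Agent comes last. It uses a slightly stricter variant of the pattern and de-duplicates repeated paths.

```php
<?php
// Hypothetical helper: returns the distinct paths that the given crawler
// requested with GET and that the server answered with 404.
function extract404Paths(iterable $lines, string $spider): array
{
    // Assumed line layout (nginx "combined" format):
    // ... "GET /path HTTP/1.1" 404 <bytes> "<referer>" "<user agent>"
    $pattern = '/"GET (\S+) HTTP\/\d\.\d" 404 .*' . preg_quote($spider, '/') . '/';
    $paths = array();
    foreach ($lines as $line) {
        if (preg_match($pattern, $line, $m)) {
            $paths[$m[1]] = true; // array keys de-duplicate repeated hits
        }
    }
    return array_keys($paths);
}

// Invented sample lines, using documentation-only IP addresses.
$sample = array(
    '192.0.2.10 - - [05/Jul/2021:10:00:00 +0800] "GET /old-page.html HTTP/1.1" 404 162 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [05/Jul/2021:10:00:05 +0800] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [05/Jul/2021:10:01:00 +0800] "GET /missing.html HTTP/1.1" 404 162 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '192.0.2.10 - - [05/Jul/2021:11:30:00 +0800] "GET /old-page.html HTTP/1.1" 404 162 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
);

$paths = extract404Paths($sample, 'Googlebot');
echo implode("\n", $paths), "\n"; // prints: /old-page.html
```

Because the function accepts any iterable, a large log could be passed as new SplFileObject($path), which yields one line at a time instead of loading the whole file with file_get_contents.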

That's all there is to it. Nothing sophisticated here, just a small script written by hand.

