Python regex analysis of Nginx access logs

  • 2020-05-19 05:11:36
  • OfStack

preface

The script in this article analyzes the Nginx access log, mainly to count visits to each URI on the site; the results are provided to developers for reference. Because the analysis relies on regular expressions, readers who have never used regex should not be put off: this article does not try to teach regex itself, since that topic is far too large to cover in one or two articles.

Before we start, let's look at the structure of the log we are going to analyze:


127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; SE 2.X MetaSr 1.0)"
127.0.0.1 - - [19/Jun/2012:09:16:25 +0100] "GET /Zyb.gif HTTP/1.1" 499 0 "http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; QQDownload 711; SV1; .NET4.0C; .NET4.0E; 360SE)"

These log lines have been modified, with sensitive information removed or replaced, but that does not affect the analysis. The exact format is not critical: the Nginx access log format is configurable and every site may differ slightly, so the key is to understand the script and adapt it to your own logs. The format shown here is only a reference, and the logs on your server almost certainly will not match it exactly. With the log format in front of us, we can start writing the script.
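As one concrete example of such an adjustment, note that the pattern in the script below only matches GET requests; if your log also records other methods, that part of the regex could be loosened. This is just an illustrative sketch of the idea, not part of the original script:


# Hypothetical tweak: the script below matches only GET requests;
# allowing other common request methods could look like this
request_part = r'"(?:GET|POST|HEAD)\s(.+)\s\w+/.+"\s'  # requested file, any of these methods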

I'll post the code first and explain later:


import re
from operator import itemgetter

def parser_logfile(logfile):
    # Each group below corresponds to one field of the access log line
    pattern = (r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s'  # IP address
               r'\[(.+)\]\s'                    # datetime
               r'"GET\s(.+)\s\w+/.+"\s'         # requested file
               r'(\d+)\s'                       # status
               r'(\d+)\s'                       # bandwidth
               r'"(.+)"\s'                      # referrer
               r'"(.+)"')                       # user agent
    url_list = []
    with open(logfile, 'r') as fi:
        for line in fi:
            url_list.append(re.findall(pattern, line))
    return url_list

def parser_urllist(url_list):
    # Pull out field 5 (the referrer URL) of every matched line
    urls = []
    for url in url_list:
        for r in url:
            urls.append(r[5])
    return urls

def get_urldict(urls):
    # Count how many times each URL appears
    d = {}
    for url in urls:
        d[url] = d.get(url, 0) + 1
    return d

def url_count(logfile):
    url_list = parser_logfile(logfile)
    urls = parser_urllist(url_list)
    totals = get_urldict(urls)
    return totals

if __name__ == '__main__':
    urls_with_counts = url_count('example.log')
    sorted_by_count = sorted(urls_with_counts.items(), key=itemgetter(1), reverse=True)
    print(sorted_by_count)
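To see what the parser actually returns, here is a quick illustrative check (not part of the original script) of the pattern against the first sample line; the tuple fields follow the order of the commented groups, so index 5 is the referrer URL:


import re

# Same pattern as in parser_logfile(), applied to one sample log line
pattern = (r'(\d+\.\d+\.\d+\.\d+)\s-\s-\s\[(.+)\]\s'
           r'"GET\s(.+)\s\w+/.+"\s(\d+)\s(\d+)\s"(.+)"\s"(.+)"')

line = ('127.0.0.1 - - [19/Jun/2012:09:16:22 +0100] "GET /GO.jpg HTTP/1.1" 499 0 '
        '"http://domain.com/htm_data/7/1206/758536.html" "Mozilla/4.0 (compatible; MSIE 7.0)"')

fields = re.findall(pattern, line)[0]
print(fields[0])  # 127.0.0.1                                      (IP address)
print(fields[2])  # /GO.jpg                                        (requested file)
print(fields[5])  # http://domain.com/htm_data/7/1206/758536.html  (referrer)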

Script explanation: parser_logfile() parses the log file and returns a list of matches, one per line; I won't explain the regex itself, but the comments show what each group matches. parser_urllist() extracts the URL field from each match (index 5, the referrer). get_urldict() returns a dictionary keyed by URL: every time a URL is seen its value is incremented by 1, so the returned dictionary maps each URL to its total number of visits. url_count() simply calls the functions defined above. In the main part of the script we use itemgetter, which lets us sort by a chosen element, for example:


>>> from operator import itemgetter
>>> a=[('b',2),('a',1),('c',0)] 
>>> s=sorted(a,key=itemgetter(1))
>>> s
[('c', 0), ('a', 1), ('b', 2)]
>>> s=sorted(a,key=itemgetter(0))
>>> s
[('a', 1), ('b', 2), ('c', 0)]

The parameter reverse=True asks for a descending sort, that is, from largest to smallest. Running the script gives:


[('http://domain.com/htm_data/7/1206/758536.html', 141), ('http://domain.com/?q=node&page=12', 3), ('http://website.net/htm_data/7/1206/758536.html', 1)]
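For reference, the counting and sorting steps could also be done in one go with the standard library's collections.Counter; this is just an alternative sketch, not part of the original script:


from collections import Counter

def url_count_with_counter(logfile):
    # Relies on parser_logfile() and parser_urllist() from the script above.
    # Counter tallies the URLs and most_common() returns (url, count) pairs
    # already sorted from largest to smallest count.
    urls = parser_urllist(parser_logfile(logfile))
    return Counter(urls).most_common()

# print(url_count_with_counter('example.log'))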

conclusion

That is the whole script: it gives a quick per-URL count of the visits recorded in the Nginx access log, and with small changes to the regular expression it can be adapted to other log formats.