Analyze the cdn logs using the pandas library in Python

  • 2020-05-27 05:57:35
  • OfStack

preface

Recently, I met a requirement to filter some data according to the CDN log, such as traffic, status code statistics, TOP IP, URL, UA, Referer and so on. Previously, bash shell was used. However, when the log volume was large, the number of log files was G, and the number of lines was tens of millions of times, the processing time through shell was too long. So I studied the use of Python pandas, a data processing library. 10 million line logs, processed around 40s.

code


#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'Loya Chen'
import sys
import pandas as pd
from collections import OrderedDict
"""
Description: This script is used to analyse qiniu cdn log.
================================================================================
 Log format 
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
 The log sample 
 [0] [1][2]  [3]  [4]   [5]
101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -" 
[6] [7] [8]    [9]
200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""
if len(sys.argv) != 2:
 print('Usage:', sys.argv[0], 'file_of_log')
 exit() 
else:
 log_file = sys.argv[1] 
#  The log location corresponding to the required statistics field  
ip  = 0
url  = 5
status_code = 6
size = 7
referer = 8
ua  = 9
#  Read in the log DataFrame
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)
loop = True
chunkSize = 10000000
chunks = []
while loop:
 try:
 chunk = reader.get_chunk(chunkSize)
 chunks.append(chunk)
 except StopIteration:
 #Iteration is stopped.
 loop = False
df = pd.concat(chunks, ignore_index=True)
byte_sum = df[size].sum()        # Traffic statistics 
top_status_code = pd.DataFrame(df[6].value_counts())      # Status code statistics 
top_ip  = df[ip].value_counts().head(10)      #TOP IP
top_referer = df[referer].value_counts().head(10)      #TOP Referer
top_ua  = df[ua].value_counts().head(10)      #TOP User-Agent
top_status_code['persent'] = pd.DataFrame(top_status_code/top_status_code.sum()*100)
top_url  = df[url].value_counts().head(10)      #TOP URL
top_url_byte = df[[url,size]].groupby(url).sum().apply(lambda x:x.astype(float)/1024/1024) \
   .round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) # The one with the highest traffic requests URL
top_ip_byte = df[[ip,size]].groupby(ip).sum().apply(lambda x:x.astype(float)/1024/1024) \
   .round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) # The one with the most traffic requests IP
#  Store the results in the dictionary in order 
result = OrderedDict([(" Traffic in total [ unit :GB]:"   , byte_sum/1024/1024/1024),
   (" Status code statistics [ The number of | The percentage ]:"  , top_status_code),
   ("IP TOP 10:"    , top_ip),
   ("Referer TOP 10:"   , top_referer),
   ("UA TOP 10:"    , top_ua),
   ("URL TOP 10:"   , top_url),
   (" The one with the highest traffic requests URL TOP 10[ unit :MB]:" , top_url_byte), 
   (" The one with the highest traffic requests IP TOP 10[ unit :MB]:" , top_ip_byte)
])
#  The output 
for k,v in result.items():
 print(k)
 print(v)
 print('='*80)

pandas study notes

There are two basic data structures in Pandas, Series and Dataframe. Series is a kind of object similar to a 1-dimensional array, consisting of a set of data and an index. Dataframe is a tabular data structure with both row and column indexes.


from pandas import Series, DataFrame
import pandas as pd

Series


In [1]: obj = Series([4, 7, -5, 3])
In [2]: obj
Out[2]: 
0 4
1 7
2 -5
3 3

The string representation of Series is: index on the left and value on the right. When no index is specified, an integer index from 0 to N-1 (N is the length of the data) is automatically created. The array representation and index object can be obtained through Series's values and index properties:


In [3]: obj.values
Out[3]: array([ 4, 7, -5, 3])
In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)

Indexes are usually specified when Series is created:


In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [6]: obj2
Out[6]: 
d 4
b 7
a -5
c 3

Get a single or 1 group of values in Series by index:


In [7]: obj2['a']
Out[7]: -5
In [8]: obj2[['c','d']]
Out[8]: 
c 3
d 4

The sorting


In [9]: obj2.sort_index()
Out[9]: 
a -5
b 7
c 3
d 4
In [10]: obj2.sort_values()
Out[10]: 
a -5
c 3
d 4
b 7

Filter operation


In [11]: obj2[obj2 > 0]
Out[11]: 
d 4
b 7
c 3
In [12]: obj2 * 2
Out[12]: 
d 8
b 14
a -10
c 6

Members of the


In [13]: 'b' in obj2
Out[13]: True
In [14]: 'e' in obj2
Out[14]: False

Create Series from the dictionary


In [15]: sdata = {'Shanghai':35000, 'Beijing':40000, 'Nanjing':26000, 'Hangzhou':30000}
In [16]: obj3 = Series(sdata)
In [17]: obj3
Out[17]: 
Beijing 40000
Hangzhou 30000
Nanjing 26000
Shanghai 35000

If only one dictionary is passed in, the resulting index in Series is the key of the original dictionary (ordered order)


from pandas import Series, DataFrame
import pandas as pd
0

When index is specified, the three values in sdata that match the states index are found and placed in the response position, but since the sdata value corresponding to 'Suzhou' cannot be found, the result is NaN(not a number), which is used in pandas to indicate missing or NA values

pandas's isnull and notnull functions can be used to detect missing data:


In [21]: pd.isnull(obj4)
Out[21]: 
Beijing False
Hangzhou False
Shanghai False
Suzhou True
In [22]: pd.notnull(obj4)
Out[22]: 
Beijing True
Hangzhou True
Shanghai True
Suzhou False

Series has a similar instance method


In [23]: obj4.isnull()
Out[23]: 
Beijing False
Hangzhou False
Shanghai False
Suzhou True

An important feature of Series is that it automatically aligns data of different indexes in data operations


from pandas import Series, DataFrame
import pandas as pd
3

The index of Series can be modified in place by copying


from pandas import Series, DataFrame
import pandas as pd
4

DataFrame

pandas reads files


from pandas import Series, DataFrame
import pandas as pd
5

DataFrame column selection


from pandas import Series, DataFrame
import pandas as pd
6

In [31]: df['name']
Out[31]: 
0 Bob
1 Loya
2 Denny
3 Mars
Name: name, dtype: object

DataFrame line selection


df.iloc[0,:] # The first 1 The argument is line number, line number 2 Three parameters are the columns. This refers to the first 0 Rows all columns 
df.iloc[:,0] # All lines, no 0 column 

from pandas import Series, DataFrame
import pandas as pd
9

Get 1 element by iloc, or faster by iat


In [34]: df.iloc[1,1]
Out[34]: 22
In [35]: df.iat[1,1]
Out[35]: 22

DataFrame block selection


In [36]: df.loc[1:2,['name','age']]
Out[36]: 
 name age
1 Loya 22
2 Denny 20

Filter the rows according to the conditions

Filter the rows by adding a judgment condition in square brackets that must return either True or False


In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]: 
 name age city
1 Loya 22 Shanghai
2 Denny 20 Hangzhou
3 Mars 25 Nanjing
In [38]: df[df['age'] > 22]
Out[38]: 
 name age city
0 Bob 26 Beijing
3 Mars 25 Nanjing

Increase the column


In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']
In [40]: df
Out[40]: 
 name age city
0 Bob 26 Beijing
1 Loya 22 Shanghai
2 Denny 20 Hangzhou
3 Mars 25 Nanjing

The sorting

Sort by the specified column


In [41]: df.sort_values(by='age')
Out[41]: 
 name age city
2 Denny 20 Hangzhou
1 Loya 22 Shanghai
3 Mars 25 Nanjing
0 Bob 26 Beijing

#  The introduction of numpy  build  DataFrame
import numpy as np

In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
In [43]: df
Out[43]: 
 d a b c
three 0 1 2 3
one 4 5 6 7

#  Sort by index 
In [44]: df.sort_index()
Out[44]: 
 d a b c
one 4 5 6 7
three 0 1 2 3
In [45]: df.sort_index(axis=1)
Out[45]: 
 a b c d
three 1 2 3 0
one 5 6 7 4
#  Descending order 
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]: 
 d c b a
three 0 3 2 1
one 4 7 6 5

To view


#  Check the header 5 line  
df.head(5)
#  Look at the end of the table 5 line 
df.tail(5) 
#  View the name of the column 
In [47]: df.columns
Out[47]: Index(['name', 'age', 'city'], dtype='object')
#  View the current value of the table 
In [48]: df.values
Out[48]: 
array([['Bob', 26, 'Beijing'],
 ['Loya', 22, 'Shanghai'],
 ['Denny', 20, 'Hangzhou'],
 ['Mars', 25, 'Nanjing']], dtype=object)

transpose


df.T
Out[49]: 
  0  1  2 3
name Bob Loya Denny Mars
age 26 22 20 25
city Beijing Shanghai Hangzhou Nanjing

Using isin


In [50]: df2 = df.copy()
In [51]: df2[df2['city'].isin(['Shanghai','Nanjing'])]
Out[52]: 
 name age city
1 Loya 22 Shanghai
3 Mars 25 Nanjing

Operation:


In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], 
 ...:    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
In [54]: df
Out[54]: 
 one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3

# According to the column sum 
In [55]: df.sum()
Out[55]: 
one 9.25
two -5.80
#  According to the row sum 
In [56]: df.sum(axis=1)
Out[56]: 
a 1.40
b 2.60
c NaN
d -0.55

group

group refers to the following steps:

Splitting the data into groups based on some criteria Applying a function to each group independently Combining the results into a data structure

See the Grouping section


In [57]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
 ....:    'foo', 'bar', 'foo', 'foo'],
 ....:   'B' : ['one', 'one', 'two', 'three',
 ....:    'two', 'two', 'one', 'three'],
 ....:   'C' : np.random.randn(8),
 ....:   'D' : np.random.randn(8)})
 ....: 
In [58]: df
Out[58]: 
 A B  C  D
0 foo one -1.202872 -0.055224
1 bar one -1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three -0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two -0.392670 -0.136473
6 foo one 0.007207 -0.561757
7 foo three 1.928123 -1.623033

group1, and then sum


In [59]: df.groupby('A').sum()
Out[59]: 
  C D
A   
bar -2.802588 2.42611
foo 3.146492 -0.63958
In [60]: df.groupby(['A','B']).sum()
Out[60]: 
   C  D
A B   
bar one -1.814470 2.395985
 three -0.595447 0.166599
 two -0.392670 -0.136473
foo one -1.195665 -0.616981
 three 1.928123 -1.623033
 two 2.414034 1.600434

conclusion


Related articles: