Analyze the cdn logs using the pandas library in Python


preface

Recently, I met a requirement to filter some data according to the CDN log, such as traffic, status code statistics, TOP IP, URL, UA, Referer and so on. Previously, bash shell was used. However, when the log volume was large, the number of log files was G, and the number of lines was tens of millions of times, the processing time through shell was too long. So I studied the use of Python pandas, a data processing library. 10 million line logs, processed around 40s.

code

#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'Loya Chen'
import sys
import pandas as pd
from collections import OrderedDict
"""
Description: This script is used to analyse qiniu cdn log.
================================================================================
 Log format
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
 The log sample
 [0] [1][2]  [3]  [4]   [5]
101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -"
[6] [7] [8]    [9]
200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""
if len(sys.argv) != 2:
 print('Usage:', sys.argv[0], 'file_of_log')
 exit()
else:
 log_file = sys.argv[1]
#  The log location corresponding to the required statistics field
ip  = 0
url  = 5
status_code = 6
size = 7
referer = 8
ua  = 9
#  Read in the log DataFrame
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)
loop = True
chunkSize = 10000000
chunks = []
while loop:
 try:
 chunk = reader.get_chunk(chunkSize)
 chunks.append(chunk)
 except StopIteration:
 #Iteration is stopped.
 loop = False
df = pd.concat(chunks, ignore_index=True)
byte_sum = df[size].sum()        # Traffic statistics
top_status_code = pd.DataFrame(df[6].value_counts())      # Status code statistics
top_ip  = df[ip].value_counts().head(10)      #TOP IP
top_referer = df[referer].value_counts().head(10)      #TOP Referer
top_ua  = df[ua].value_counts().head(10)      #TOP User-Agent
top_status_code['persent'] = pd.DataFrame(top_status_code/top_status_code.sum()*100)
top_url  = df[url].value_counts().head(10)      #TOP URL
top_url_byte = df[[url,size]].groupby(url).sum().apply(lambda x:x.astype(float)/1024/1024) \
   .round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) # The one with the highest traffic requests URL
top_ip_byte = df[[ip,size]].groupby(ip).sum().apply(lambda x:x.astype(float)/1024/1024) \
   .round(decimals = 3).sort_values(by=[size], ascending=False)[size].head(10) # The one with the most traffic requests IP
#  Store the results in the dictionary in order
result = OrderedDict([(" Traffic in total [ unit :GB]:"   , byte_sum/1024/1024/1024),
   (" Status code statistics [ The number of | The percentage ]:"  , top_status_code),
   ("IP TOP 10:"    , top_ip),
   ("Referer TOP 10:"   , top_referer),
   ("UA TOP 10:"    , top_ua),
   ("URL TOP 10:"   , top_url),
   (" The one with the highest traffic requests URL TOP 10[ unit :MB]:" , top_url_byte),
   (" The one with the highest traffic requests IP TOP 10[ unit :MB]:" , top_ip_byte)
])
#  The output
for k,v in result.items():
 print(k)
 print(v)
 print('='*80)

pandas study notes

There are two basic data structures in Pandas, Series and Dataframe. Series is a kind of object similar to a 1-dimensional array, consisting of a set of data and an index. Dataframe is a tabular data structure with both row and column indexes.

from pandas import Series, DataFrame
import pandas as pd

Series

In [1]: obj = Series([4, 7, -5, 3])
In [2]: obj
Out[2]:
0 4
1 7
2 -5
3 3

The string representation of Series is: index on the left and value on the right. When no index is specified, an integer index from 0 to N-1 (N is the length of the data) is automatically created. The array representation and index object can be obtained through Series’s values and index properties:

In [3]: obj.values
Out[3]: array([ 4, 7, -5, 3])
In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)

Indexes are usually specified when Series is created:

In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [6]: obj2
Out[6]:
d 4
b 7
a -5
c 3

Get a single or 1 group of values in Series by index:

In [7]: obj2['a']
Out[7]: -5
In [8]: obj2[['c','d']]
Out[8]:
c 3
d 4

The sorting

In [9]: obj2.sort_index()
Out[9]:
a -5
b 7
c 3
d 4
In [10]: obj2.sort_values()
Out[10]:
a -5
c 3
d 4
b 7

Filter operation

In [11]: obj2[obj2 > 0]
Out[11]:
d 4
b 7
c 3
In [12]: obj2 * 2
Out[12]:
d 8
b 14
a -10
c 6

Members of the

In [13]: 'b' in obj2
Out[13]: True
In [14]: 'e' in obj2
Out[14]: False

Create Series from the dictionary

In [15]: sdata = {'Shanghai':35000, 'Beijing':40000, 'Nanjing':26000, 'Hangzhou':30000}
In [16]: obj3 = Series(sdata)
In [17]: obj3
Out[17]:
Beijing 40000
Hangzhou 30000
Nanjing 26000
Shanghai 35000

If only one dictionary is passed in, the resulting index in Series is the key of the original dictionary (ordered order)

from pandas import Series, DataFrame
import pandas as pd

0

When index is specified, the three values in sdata that match the states index are found and placed in the response position, but since the sdata value corresponding to ‘Suzhou’ cannot be found, the result is NaN(not a number), which is used in pandas to indicate missing or NA values

pandas’s isnull and notnull functions can be used to detect missing data:

In [21]: pd.isnull(obj4)
Out[21]:
Beijing False
Hangzhou False
Shanghai False
Suzhou True
In [22]: pd.notnull(obj4)
Out[22]:
Beijing True
Hangzhou True
Shanghai True
Suzhou False

Series has a similar instance method

In [23]: obj4.isnull()
Out[23]:
Beijing False
Hangzhou False
Shanghai False
Suzhou True

An important feature of Series is that it automatically aligns data of different indexes in data operations

from pandas import Series, DataFrame
import pandas as pd

3

The index of Series can be modified in place by copying

from pandas import Series, DataFrame
import pandas as pd

4

DataFrame

pandas reads files

from pandas import Series, DataFrame
import pandas as pd

5

DataFrame column selection

from pandas import Series, DataFrame
import pandas as pd

6

In [31]: df['name']
Out[31]:
0 Bob
1 Loya
2 Denny
3 Mars
Name: name, dtype: object

DataFrame line selection

df.iloc[0,:] # The first 1 The argument is line number, line number 2 Three parameters are the columns. This refers to the first 0 Rows all columns
df.iloc[:,0] # All lines, no 0 column
from pandas import Series, DataFrame
import pandas as pd

9

Get 1 element by iloc, or faster by iat

In [34]: df.iloc[1,1]
Out[34]: 22
In [35]: df.iat[1,1]
Out[35]: 22

DataFrame block selection

In [36]: df.loc[1:2,['name','age']]
Out[36]:
 name age
1 Loya 22
2 Denny 20

Filter the rows according to the conditions

Filter the rows by adding a judgment condition in square brackets that must return either True or False

In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]:
 name age city
1 Loya 22 Shanghai
2 Denny 20 Hangzhou
3 Mars 25 Nanjing
In [38]: df[df['age'] > 22]
Out[38]:
 name age city
0 Bob 26 Beijing
3 Mars 25 Nanjing

Increase the column

In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']
In [40]: df
Out[40]:
 name age city
0 Bob 26 Beijing
1 Loya 22 Shanghai
2 Denny 20 Hangzhou
3 Mars 25 Nanjing

The sorting

Sort by the specified column

In [41]: df.sort_values(by='age')
Out[41]:
 name age city
2 Denny 20 Hangzhou
1 Loya 22 Shanghai
3 Mars 25 Nanjing
0 Bob 26 Beijing
#  The introduction of numpy  build  DataFrame
import numpy as np
In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
In [43]: df
Out[43]:
 d a b c
three 0 1 2 3
one 4 5 6 7
#  Sort by index
In [44]: df.sort_index()
Out[44]:
 d a b c
one 4 5 6 7
three 0 1 2 3
In [45]: df.sort_index(axis=1)
Out[45]:
 a b c d
three 1 2 3 0
one 5 6 7 4
#  Descending order
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]:
 d c b a
three 0 3 2 1
one 4 7 6 5

To view

#  Check the header 5 line
df.head(5)
#  Look at the end of the table 5 line
df.tail(5)
#  View the name of the column
In [47]: df.columns
Out[47]: Index(['name', 'age', 'city'], dtype='object')
#  View the current value of the table
In [48]: df.values
Out[48]:
array([['Bob', 26, 'Beijing'],
 ['Loya', 22, 'Shanghai'],
 ['Denny', 20, 'Hangzhou'],
 ['Mars', 25, 'Nanjing']], dtype=object)

transpose

df.T
Out[49]:
  0  1  2 3
name Bob Loya Denny Mars
age 26 22 20 25
city Beijing Shanghai Hangzhou Nanjing

Using isin

In [50]: df2 = df.copy()
In [51]: df2[df2['city'].isin(['Shanghai','Nanjing'])]
Out[52]:
 name age city
1 Loya 22 Shanghai
3 Mars 25 Nanjing

Operation:

In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
 ...:    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
In [54]: df
Out[54]:
 one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
# According to the column sum
In [55]: df.sum()
Out[55]:
one 9.25
two -5.80
#  According to the row sum
In [56]: df.sum(axis=1)
Out[56]:
a 1.40
b 2.60
c NaN
d -0.55

group

group refers to the following steps:

Splitting the data into groups based on some criteria Applying a function to each group independently Combining the results into a data structure

See the Grouping section

In [57]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
 ....:    'foo', 'bar', 'foo', 'foo'],
 ....:   'B' : ['one', 'one', 'two', 'three',
 ....:    'two', 'two', 'one', 'three'],
 ....:   'C' : np.random.randn(8),
 ....:   'D' : np.random.randn(8)})
 ....:
In [58]: df
Out[58]:
 A B  C  D
0 foo one -1.202872 -0.055224
1 bar one -1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three -0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two -0.392670 -0.136473
6 foo one 0.007207 -0.561757
7 foo three 1.928123 -1.623033

group1, and then sum

In [59]: df.groupby('A').sum()
Out[59]:
  C D
A
bar -2.802588 2.42611
foo 3.146492 -0.63958
In [60]: df.groupby(['A','B']).sum()
Out[60]:
   C  D
A B
bar one -1.814470 2.395985
 three -0.595447 0.166599
 two -0.392670 -0.136473
foo one -1.195665 -0.616981
 three 1.928123 -1.623033
 two 2.414034 1.600434

conclusion