Analyze the cdn logs using the pandas library in Python
- 2020-05-27 05:57:35
preface
Recently I needed to pull statistics out of CDN logs: traffic, status codes, top IPs, URLs, UAs, Referers, and so on. I had been doing this with bash shell, but once the logs grew to gigabytes in size and tens of millions of lines, processing them with shell took far too long. So I looked into pandas, a Python data-processing library. It handles a 10-million-line log in about 40 seconds.
code
#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'Loya Chen'
import sys
import pandas as pd
from collections import OrderedDict
"""
Description: This script is used to analyze Qiniu CDN logs.
================================================================================
Log format
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
The log sample
[0] [1][2] [3] [4] [5]
101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -"
[6] [7] [8] [9]
200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""
if len(sys.argv) != 2:
    print('Usage:', sys.argv[0], 'file_of_log')
    exit()
else:
    log_file = sys.argv[1]
# Column index of each field needed for the statistics
ip = 0
url = 5
status_code = 6
size = 7
referer = 8
ua = 9
# Read the log into a DataFrame in chunks
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)
loop = True
chunkSize = 10000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        # Iteration is stopped
        loop = False
df = pd.concat(chunks, ignore_index=True)
byte_sum = df[size].sum()                              # Traffic statistics
top_status_code = pd.DataFrame(df[status_code].value_counts())  # Status code statistics
top_ip = df[ip].value_counts().head(10)                # TOP IP
top_referer = df[referer].value_counts().head(10)      # TOP Referer
top_ua = df[ua].value_counts().head(10)                # TOP User-Agent
top_status_code['percent'] = pd.DataFrame(top_status_code/top_status_code.sum()*100)
top_url = df[url].value_counts().head(10)              # TOP URL
top_url_byte = df[[url, size]].groupby(url).sum().apply(lambda x: x.astype(float)/1024/1024) \
            .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)  # URLs with the most traffic
top_ip_byte = df[[ip, size]].groupby(ip).sum().apply(lambda x: x.astype(float)/1024/1024) \
            .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)  # IPs with the most traffic
# Store the results in an ordered dictionary
result = OrderedDict([("Total traffic [unit: GB]:", byte_sum/1024/1024/1024),
          ("Status code statistics [count | percent]:", top_status_code),
          ("IP TOP 10:", top_ip),
          ("Referer TOP 10:", top_referer),
          ("UA TOP 10:", top_ua),
          ("URL TOP 10:", top_url),
          ("Top 10 URLs by traffic [unit: MB]:", top_url_byte),
          ("Top 10 IPs by traffic [unit: MB]:", top_ip_byte)
          ])
# The output
for k,v in result.items():
print(k)
print(v)
print('='*80)
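As an aside, the chunked-read loop above can also be written with read_csv's chunksize parameter, which yields the chunks directly as an iterator. A minimal sketch, using an in-memory stand-in for the log file:

```python
import pandas as pd
from io import StringIO

# In-memory stand-in for a log file; in practice pass the log path instead
data = StringIO("a b c\nd e f\ng h i\n")

# chunksize makes read_csv return an iterator of DataFrames
chunks = [chunk for chunk in pd.read_csv(data, sep=' ', names=list(range(3)), chunksize=2)]
df = pd.concat(chunks, ignore_index=True)
print(df.shape)  # three rows, read in two chunks
```

This avoids managing the loop flag and StopIteration by hand.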
pandas study notes
There are two basic data structures in pandas: Series and DataFrame. A Series is an object similar to a one-dimensional array, consisting of a set of data and an associated index. A DataFrame is a tabular data structure with both row and column indexes.
from pandas import Series, DataFrame
import pandas as pd
Series
In [1]: obj = Series([4, 7, -5, 3])
In [2]: obj
Out[2]:
0 4
1 7
2 -5
3 3
The string representation of a Series shows the index on the left and the values on the right. When no index is specified, an integer index from 0 to N-1 (where N is the length of the data) is created automatically. The underlying array and the index object can be obtained through the Series's values and index attributes:
In [3]: obj.values
Out[3]: array([ 4, 7, -5, 3])
In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)
An index is usually specified when the Series is created:
In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [6]: obj2
Out[6]:
d 4
b 7
a -5
c 3
Get a single value or a group of values from a Series by index:
In [7]: obj2['a']
Out[7]: -5
In [8]: obj2[['c','d']]
Out[8]:
c 3
d 4
Sorting
In [9]: obj2.sort_index()
Out[9]:
a -5
b 7
c 3
d 4
In [10]: obj2.sort_values()
Out[10]:
a -5
c 3
d 4
b 7
Filtering and arithmetic
In [11]: obj2[obj2 > 0]
Out[11]:
d 4
b 7
c 3
In [12]: obj2 * 2
Out[12]:
d 8
b 14
a -10
c 6
Membership test
In [13]: 'b' in obj2
Out[13]: True
In [14]: 'e' in obj2
Out[14]: False
Create Series from the dictionary
In [15]: sdata = {'Shanghai':35000, 'Beijing':40000, 'Nanjing':26000, 'Hangzhou':30000}
In [16]: obj3 = Series(sdata)
In [17]: obj3
Out[17]:
Beijing 40000
Hangzhou 30000
Nanjing 26000
Shanghai 35000
If only a dictionary is passed in, the index of the resulting Series is the keys of the original dictionary (in sorted order).
In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']
In [19]: obj4 = Series(sdata, index=states)
In [20]: obj4
Out[20]:
Beijing 40000.0
Hangzhou 30000.0
Shanghai 35000.0
Suzhou NaN
When an index is specified, the three values in sdata that match the states index are found and placed in the corresponding positions. Since no value for 'Suzhou' can be found, its result is NaN (not a number), which pandas uses to indicate missing or NA values.
pandas's isnull and notnull functions can be used to detect missing data:
In [21]: pd.isnull(obj4)
Out[21]:
Beijing False
Hangzhou False
Shanghai False
Suzhou True
In [22]: pd.notnull(obj4)
Out[22]:
Beijing True
Hangzhou True
Shanghai True
Suzhou False
Series also has these as instance methods:
In [23]: obj4.isnull()
Out[23]:
Beijing False
Hangzhou False
Shanghai False
Suzhou True
An important feature of Series is that it automatically aligns data on index labels in arithmetic operations:
In [24]: obj3 + obj4
Out[24]:
Beijing 80000.0
Hangzhou 60000.0
Nanjing NaN
Shanghai 70000.0
Suzhou NaN
The index of a Series can be modified in place by assignment.
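A minimal sketch of in-place index assignment (the names here are made up for illustration):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
# Assigning a new list to .index replaces the index in place
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj['Steve'])  # the value 7 is now labeled 'Steve'
```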
DataFrame
Reading files with pandas
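The original snippet here did not survive; as a sketch, a table like the one used in the following examples could be read with read_csv. The CSV content below is a made-up stand-in for a file on disk (with a real file, pass its path to pd.read_csv instead):

```python
import pandas as pd
from io import StringIO

# Made-up CSV data standing in for a file on disk
csv_data = StringIO("name,age\nBob,26\nLoya,22\nDenny,20\nMars,25\n")
df = pd.read_csv(csv_data)
print(df)
```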
DataFrame column selection
from pandas import Series, DataFrame
import pandas as pd
6
In [31]: df['name']
Out[31]:
0 Bob
1 Loya
2 Denny
3 Mars
Name: name, dtype: object
DataFrame row selection
df.iloc[0, :] # The first argument selects rows, the second selects columns; here row 0, all columns
df.iloc[:, 0] # All rows, column 0
Get a single element with iloc, or faster with iat:
In [34]: df.iloc[1,1]
Out[34]: 22
In [35]: df.iat[1,1]
Out[35]: 22
DataFrame block selection
In [36]: df.loc[1:2,['name','age']]
Out[36]:
name age
1 Loya 22
2 Denny 20
Filtering rows by condition
Filter rows by placing a condition in square brackets; the expression must evaluate to True or False for each row:
In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]:
name age city
1 Loya 22 Shanghai
2 Denny 20 Hangzhou
3 Mars 25 Nanjing
In [38]: df[df['age'] > 22]
Out[38]:
name age city
0 Bob 26 Beijing
3 Mars 25 Nanjing
Adding a column
In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']
In [40]: df
Out[40]:
name age city
0 Bob 26 Beijing
1 Loya 22 Shanghai
2 Denny 20 Hangzhou
3 Mars 25 Nanjing
Sorting
Sort by a specified column:
In [41]: df.sort_values(by='age')
Out[41]:
name age city
2 Denny 20 Hangzhou
1 Loya 22 Shanghai
3 Mars 25 Nanjing
0 Bob 26 Beijing
# Build a DataFrame using numpy
import numpy as np
In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
In [43]: df
Out[43]:
d a b c
three 0 1 2 3
one 4 5 6 7
# Sort by index
In [44]: df.sort_index()
Out[44]:
d a b c
one 4 5 6 7
three 0 1 2 3
In [45]: df.sort_index(axis=1)
Out[45]:
a b c d
three 1 2 3 0
one 5 6 7 4
# Descending order
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]:
d c b a
three 0 3 2 1
one 4 7 6 5
Viewing
# View the first 5 rows
df.head(5)
# View the last 5 rows
df.tail(5)
# View the column names
In [47]: df.columns
Out[47]: Index(['name', 'age', 'city'], dtype='object')
# View the underlying values of the table
In [48]: df.values
Out[48]:
array([['Bob', 26, 'Beijing'],
['Loya', 22, 'Shanghai'],
['Denny', 20, 'Hangzhou'],
['Mars', 25, 'Nanjing']], dtype=object)
transpose
df.T
Out[49]:
0 1 2 3
name Bob Loya Denny Mars
age 26 22 20 25
city Beijing Shanghai Hangzhou Nanjing
Using isin
In [50]: df2 = df.copy()
In [51]: df2[df2['city'].isin(['Shanghai','Nanjing'])]
Out[51]:
name age city
1 Loya 22 Shanghai
3 Mars 25 Nanjing
Operations:
In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
...: index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
In [54]: df
Out[54]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
# Sum by column
In [55]: df.sum()
Out[55]:
one 9.25
two -5.80
# Sum by row
In [56]: df.sum(axis=1)
Out[56]:
a 1.40
b 2.60
c NaN
d -0.55
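Note that sum() skips NaN values by default; the skipna flag controls this behavior:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
total = s.sum()                      # NaN entries are skipped by default
total_strict = s.sum(skipna=False)   # propagate NaN instead
print(total, total_strict)
```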
Grouping
Grouping refers to one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
In [57]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
....: 'foo', 'bar', 'foo', 'foo'],
....: 'B' : ['one', 'one', 'two', 'three',
....: 'two', 'two', 'one', 'three'],
....: 'C' : np.random.randn(8),
....: 'D' : np.random.randn(8)})
....:
In [58]: df
Out[58]:
A B C D
0 foo one -1.202872 -0.055224
1 bar one -1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three -0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two -0.392670 -0.136473
6 foo one 0.007207 -0.561757
7 foo three 1.928123 -1.623033
Group, then sum:
In [59]: df.groupby('A').sum()
Out[59]:
C D
A
bar -2.802588 2.42611
foo 3.146492 -0.63958
In [60]: df.groupby(['A','B']).sum()
Out[60]:
C D
A B
bar one -1.814470 2.395985
three -0.595447 0.166599
two -0.392670 -0.136473
foo one -1.195665 -0.616981
three 1.928123 -1.623033
two 2.414034 1.600434
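Beyond sum, a groupby supports other aggregations as well; a small sketch (the data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})
# agg applies several aggregation functions per group at once
stats = df.groupby('A')['C'].agg(['sum', 'mean'])
print(stats)
```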
conclusion