Python pandas Grouping and Aggregation in Detail
- 2021-11-29 07:26:46
- OfStack
1. Environment
python3.9 win10 64bit pandas==1.2.1
groupby is the grouping method in pandas. Calling groupby on a DataFrame returns a DataFrameGroupBy object, and an aggregation is normally performed on that object after grouping.
2. Grouping
import pandas as pd
import numpy as np
pd.set_option('display.notebook_repr_html',False)
# Data preparation
df = pd.DataFrame({'A': [1, 1, 2, 2],'B': [1, 2, 3, 4],'C':[6,8,1,9]})
df
   A  B  C
0  1  1  6
1  1  2  8
2  2  3  1
3  2  4  9
Group the DataFrame by column A to produce a grouped object. The grouped object is iterable: each element yielded in a loop is a tuple whose first item is the group key and whose second item is the DataFrame of rows belonging to that group.
# Group by column A
g_df=df.groupby('A')
# Type of the grouped object
type(g_df)
pandas.core.groupby.generic.DataFrameGroupBy
# Iterate over the groups
for i in g_df:
    print(i,type(i),end='\n\n')
(1,    A  B  C
0  1  1  6
1  1  2  8) <class 'tuple'>

(2,    A  B  C
2  2  3  1
3  2  4  9) <class 'tuple'>
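Because each element is a (key, sub-frame) tuple, it can be unpacked directly in the loop header. The following is a minimal sketch of that pattern; the names group_sizes and group_2 are illustrative only, and get_group is shown as one way to fetch a single group without looping.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4], 'C': [6, 8, 1, 9]})
g_df = df.groupby('A')

# Unpack each (key, sub-frame) tuple in the loop header
group_sizes = {}
for key, group in g_df:
    group_sizes[int(key)] = len(group)

# Fetch the rows of a single group by its key
group_2 = g_df.get_group(2)

print(group_sizes)
print(group_2)
```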
The aggregation method agg can be called directly on the grouped object. It computes the given statistic for every column within each group.
# Group summation
df.groupby('A').agg('sum')
   B   C
A
1  3  14
2  7  10
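For common statistics, the grouped object also exposes the function as a method, so agg('sum') and sum() should give the same result. A small sketch to check this on the example data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4], 'C': [6, 8, 1, 9]})

# Aggregate by passing the function name as a string
via_agg = df.groupby('A').agg('sum')
# Aggregate by calling the shortcut method
via_method = df.groupby('A').sum()

# Compare the two results
same = via_agg.equals(via_method)
print(same)
```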
3. Sequence grouping
A DataFrame can also be grouped by a sequence defined outside it. Note that the sequence must have the same length as the number of rows in the DataFrame.
# Defining a grouping list
label=['a','a','b','b']
# Group summation
df.groupby(label).agg('sum')
   A  B   C
a  2  3  14
b  4  7  10
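The external sequence does not have to be a hand-written list. As a sketch, the labels below are derived from a condition on column C with numpy; the threshold 5 and the label names 'high' and 'low' are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4], 'C': [6, 8, 1, 9]})

# Build one label per row; the array has the same length as the DataFrame
labels = np.where(df['C'] > 5, 'high', 'low')

# Group by the external label array and sum
by_label = df.groupby(labels).agg('sum')
print(by_label)
```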
4. Multi-column grouping
A DataFrame can be grouped by several of its columns at once.
# Data preparation
df = pd.DataFrame({'A': [1, 1, 2, 2],'B': [3, 4, 3, 3],'C':[6,8,1,9]})
df
   A  B  C
0  1  3  6
1  1  4  8
2  2  3  1
3  2  3  9
Group by columns A and B, then sum.
# Group by multiple columns and sum
df.groupby(['A','B']).agg('sum')
      C
A B
1 3   6
  4   8
2 3  10
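Grouping by several columns puts the keys into a multi-level row index. If ordinary columns are preferred, one option is reset_index, and another is as_index=False in groupby. The sketch below shows both; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3], 'C': [6, 8, 1, 9]})

# Group keys end up in a two-level row index
nested = df.groupby(['A', 'B']).agg('sum')

# Option 1: move the index levels back into columns afterwards
flat = nested.reset_index()

# Option 2: ask groupby to keep the keys as columns from the start
flat_direct = df.groupby(['A', 'B'], as_index=False).agg('sum')

print(flat)
```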
5. Index grouping
A DataFrame can be grouped by its index; this requires setting the level parameter.
When the DataFrame has a single-level index, set level=0.
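A minimal sketch of grouping on a single-level index follows. The index labels 'x' and 'y' are hypothetical, chosen here only so that the index contains repeated values to group on.

```python
import pandas as pd

# Hypothetical frame with a repeated, single-level index
df = pd.DataFrame(
    {'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3], 'C': [6, 8, 1, 9]},
    index=['x', 'x', 'y', 'y'],
)

# level=0 refers to the only index level
by_index = df.groupby(level=0).agg('sum')
print(by_index)
```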
When the DataFrame index has multiple levels, the level parameter can likewise be set as needed to group and aggregate.
         value
id1 id2
1   3        4
    4        7
2   3        2
    3        9
To group by the first index level, id1, set the level parameter to either level=0 or level='id1'.
# Group by the first index level (by position) and sum
df.groupby(level=0).agg('sum')
     value
id1
1       11
2       11
# Group by the first index level (by name) and sum
df.groupby(level='id1').agg('sum')
     value
id1
1       11
2       11
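For readers who want to reproduce the two-level example, here is one possible way to build a frame with the same values and index names. The construction with MultiIndex.from_arrays is an assumption, not necessarily how the original frame was created. The sketch also groups by both levels at once.

```python
import pandas as pd

# Assumed construction of the two-level index (id1, id2)
idx = pd.MultiIndex.from_arrays(
    [[1, 1, 2, 2], [3, 4, 3, 3]],
    names=['id1', 'id2'],
)
df = pd.DataFrame({'value': [4, 7, 2, 9]}, index=idx)

# Group by the first level, by position and by name
by_position = df.groupby(level=0).agg('sum')
by_name = df.groupby(level='id1').agg('sum')

# Group by both levels; the duplicated (2, 3) rows are combined
by_both = df.groupby(level=['id1', 'id2']).agg('sum')

print(by_position)
print(by_both)
```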
7. Aggregation
Grouping is normally followed by an aggregation, which is performed with the agg method.
# Data preparation
df = pd.DataFrame({'A': [1, 1, 2, 2],'B': [3, 4, 3, 3],'C':[6,8,1,9],'D':[2,5,4,8]})
df
   A  B  C  D
0  1  3  6  2
1  1  4  8  5
2  2  3  1  4
3  2  3  9  8
8. A single function on multiple columns
When the grouped object is aggregated with a single function, the function is evaluated on each column and the results are combined into one DataFrame. The aggregate function is passed as a string.
# Sum all columns in groups
df.groupby('A').agg('sum')
   B   C   D
A
1  7  14   7
2  6  10  12
Aggregation can also be limited to specific columns of the grouped object. Note that the column subset must be wrapped in [].
# Groups and sums the specified columns
df.groupby('A')[['B','C']].agg('sum')
   B   C
A
1  7  14
2  6  10
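The extra pair of brackets matters even for a single column. In the sketch below, selecting 'B' with single brackets yields a Series, while wrapping it in a list yields a one-column DataFrame. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3],
                   'C': [6, 8, 1, 9], 'D': [2, 5, 4, 8]})

# Single brackets: the result is a Series
as_series = df.groupby('A')['B'].agg('sum')

# List of column names: the result is a DataFrame
as_frame = df.groupby('A')[['B']].agg('sum')

print(type(as_series).__name__, type(as_frame).__name__)
```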
A custom anonymous function can also be passed as the aggregate function.
# Sum each group with an anonymous function
df.groupby('A').agg(lambda x:sum(x))
   B   C   D
A
1  7  14   7
2  6  10  12
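Custom functions are most useful for statistics that have no built-in name. As an illustration, the sketch below computes the spread (maximum minus minimum) of each column per group, once with a named function and once with a lambda. The function name value_range is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3],
                   'C': [6, 8, 1, 9], 'D': [2, 5, 4, 8]})

def value_range(col):
    """Return the difference between the largest and smallest value."""
    return col.max() - col.min()

# Pass the named function
spread = df.groupby('A').agg(value_range)

# Equivalent anonymous function
spread_lambda = df.groupby('A').agg(lambda x: x.max() - x.min())

print(spread)
```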
9. Multiple functions on multiple columns
Several aggregate functions can be applied at once. Each function is evaluated on every column, and the results are combined and returned. The functions are passed as a list.
# Apply several aggregate functions to all columns
df.groupby('A').agg(['sum','mean'])
    B        C        D
  sum mean sum mean sum mean
A
1   7  3.5  14  7.0   7  3.5
2   6  3.0  10  5.0  12  6.0
The columns returned by the aggregation form a two-level index: the first level is the name of the aggregated column, and the second level is the name of the aggregate function. To rename the second level, pass tuples instead of plain function names, where the first element is the new name and the second element is the aggregate function.
# Rename the aggregate functions
df.groupby('A').agg([('SUM','sum'),('MEAN','mean')])
    B        C        D
  SUM MEAN SUM MEAN SUM MEAN
A
1   7  3.5  14  7.0   7  3.5
2   6  3.0  10  5.0  12  6.0
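A two-level column index can be awkward to work with afterwards. The sketch below shows one way to select from it and one way to flatten it into single-level names. Joining the levels with an underscore is just a convention chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3],
                   'C': [6, 8, 1, 9], 'D': [2, 5, 4, 8]})

res = df.groupby('A').agg([('SUM', 'sum'), ('MEAN', 'mean')])

# Select every statistic computed for column B
b_part = res['B']

# Select one statistic with a (column, function) tuple
c_mean = res[('C', 'MEAN')]

# Flatten the two levels into names such as 'B_SUM'
flat = res.copy()
flat.columns = ['_'.join(pair) for pair in flat.columns]

print(flat)
```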
Anonymous functions can be passed in the same way.
# Pass an anonymous function and give it a name
df.groupby('A').agg([('SUM','sum'),('MAX',lambda x:max(x))])
    B       C       D
  SUM MAX SUM MAX SUM MAX
A
1   7   4  14   8   7   5
2   6   3  10   9  12   8
To apply different aggregations to different columns, pass a dictionary that maps each column name to its function or list of functions.
# Different aggregate functions for different columns
df.groupby('A').agg({'B':['sum','mean'],'C':'mean'})
    B         C
  sum mean mean
A
1   7  3.5  7.0
2   6  3.0  5.0
The aggregated columns can also be renamed directly. Note that this only works when a single aggregate function is passed for each column.
# Rename the columns after aggregation
df.groupby('A').agg(B_sum=('B','sum'),C_mean=('C','mean'))
   B_sum  C_mean
A
1      7     7.0
2      6     5.0
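The keyword form above can also be written with pd.NamedAgg, which spells out the column and aggfunc fields explicitly. The sketch below compares the two spellings and adds a custom function. The output name D_range is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3],
                   'C': [6, 8, 1, 9], 'D': [2, 5, 4, 8]})

# Tuple form: output_name=(column, function)
res_tuple = df.groupby('A').agg(B_sum=('B', 'sum'), C_mean=('C', 'mean'))

# Equivalent NamedAgg form, plus a custom function on column D
res_named = df.groupby('A').agg(
    B_sum=pd.NamedAgg(column='B', aggfunc='sum'),
    C_mean=pd.NamedAgg(column='C', aggfunc='mean'),
    D_range=pd.NamedAgg(column='D', aggfunc=lambda x: x.max() - x.min()),
)

print(res_named)
```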