Python pandas Grouping and Aggregation in Detail

2021-11-29 07:26:46
OfStack

Contents: Python pandas grouping and aggregation. 1. Environment, 2. Grouping, 3. Sequence grouping, 4. Multi-column grouping, 5. Index grouping, 7. Aggregation, 8. Single function on multiple columns, 9. Multiple functions on multiple columns

Python pandas grouping and aggregation

1. Environment

Python 3.9, Windows 10 64-bit, pandas==1.2.1

groupby is the grouping method in pandas. Calling groupby on a DataFrame returns a DataFrameGroupBy object, and an aggregation is normally performed on that object after grouping.

2. Grouping


import pandas as pd
import numpy as np
pd.set_option('display.notebook_repr_html',False)
#  Data preparation 
df = pd.DataFrame({'A': [1, 1, 2, 2],'B': [1, 2, 3, 4],'C':[6,8,1,9]})
df

Grouping the DataFrame by column A produces a grouped object. The grouped object is iterable, and the loop below shows that each element it yields is a tuple: the first element is the group key, and the second is the sub-DataFrame containing the rows of that group.


#  Grouping 
g_df=df.groupby('A')
#  Type of the grouped object 
type(g_df)


pandas.core.groupby.generic.DataFrameGroupBy


#  Iterate over the groups 
for i in g_df:
    print(i,type(i),end='\n\n')


(1,    A  B  C
0  1  1  6
1  1  2  8) <class 'tuple'>


(2,    A  B  C
2  2  3  1
3  2  4  9) <class 'tuple'>
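Because each element is a (key, sub-DataFrame) tuple, the loop can unpack it directly. The sketch below repeats the grouping from above; the get_group call is an addition not covered in the text, shown as one way to fetch a single group without looping.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4], 'C': [6, 8, 1, 9]})
g_df = df.groupby('A')

# Unpack each tuple into the group key and the rows belonging to that key.
group_sizes = {}
for key, group in g_df:
    group_sizes[int(key)] = len(group)
    print(key, group.shape)

# Fetch the rows of one group by its key (here, the rows where A == 1).
first_group = g_df.get_group(1)
print(first_group)
```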

You can call the aggregation method agg directly on the grouped object. The statistic is then computed for every column within each group.


#  Group summation 
df.groupby('A').agg('sum')
   B   C
A       
1  3  14
2  7  10

3. Sequence grouping

A DataFrame can also be grouped by a sequence defined outside the DataFrame. Note that the length of the sequence must equal the number of rows in the DataFrame.


#  Defining a grouping list 
label=['a','a','b','b']
#  Group summation 
df.groupby(label).agg('sum')
   A  B   C
a  2  3  14
b  4  7  10
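The external sequence does not have to be a Python list. The sketch below uses a NumPy array with the same labels, checks the length requirement explicitly, and should produce the same sums. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4], 'C': [6, 8, 1, 9]})

label_list = ['a', 'a', 'b', 'b']
label_array = np.array(label_list)

# One label per row: the sequence length must match the number of rows.
assert len(label_list) == len(df)

by_list = df.groupby(label_list).agg('sum')
by_array = df.groupby(label_array).agg('sum')
print(by_array)
```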

4. Multi-column grouping

A DataFrame can be grouped by several of its columns at once.


#  Data preparation 
df = pd.DataFrame({'A': [1, 1, 2, 2],'B': [3, 4, 3, 3],'C':[6,8,1,9]})
df

Group by columns A and B, then sum.


#  Group by multiple columns and sum 
df.groupby(['A','B']).agg('sum')
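The output of this call is not shown, so here is a sketch of what to expect: the two grouping columns become a two-level row index, and only C is summed. The as_index=False variant is an addition not covered in the text; it keeps A and B as ordinary columns instead.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3], 'C': [6, 8, 1, 9]})

# Group keys A and B form a two-level row index in the result.
summed = df.groupby(['A', 'B']).agg('sum')
print(summed)

# Optional: keep the group keys as regular columns.
flat = df.groupby(['A', 'B'], as_index=False).agg('sum')
print(flat)
```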

5. Index grouping

A DataFrame can be grouped by its index by setting the level parameter.

When the DataFrame has a single index level, set level=0.

When the index has multiple levels, set level to the level you want to group by to complete the grouped aggregation.
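The code that builds the two-level frame printed next is not included in the text. One way to reproduce it, assuming the names id1, id2 and value seen in the printed output and assuming the index was created with set_index, is sketched here:

```python
import pandas as pd

# Values copied from the printed frame; the construction itself is an assumption.
df = pd.DataFrame({
    'id1': [1, 1, 2, 2],
    'id2': [3, 4, 3, 3],
    'value': [4, 7, 2, 9],
}).set_index(['id1', 'id2'])  # move id1 and id2 into a two-level row index

print(df)
```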


         value
id1 id2       
1   3        4
    4        7
2   3        2
    3        9

To group by the first index level, id1, set either level=0 (by position) or level='id1' (by name) to complete the grouped aggregation.


#  Group by the first index level (by position) and sum 
df.groupby(level=0).agg('sum')


#  Group by the first index level (by name) and sum 
df.groupby(level='id1').agg('sum')
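The two calls above should return the same result. The sketch below rebuilds the frame (its construction is an assumption based on the printed values), compares both forms, and also groups by the second level with level=1 for contrast.

```python
import pandas as pd

# Assumed construction, matching the printed two-level frame.
df = pd.DataFrame({
    'id1': [1, 1, 2, 2],
    'id2': [3, 4, 3, 3],
    'value': [4, 7, 2, 9],
}).set_index(['id1', 'id2'])

by_position = df.groupby(level=0).agg('sum')   # first level, by position
by_name = df.groupby(level='id1').agg('sum')   # first level, by name
by_second = df.groupby(level=1).agg('sum')     # second level (id2)

print(by_position)
print(by_second)
```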

7. Aggregation

After grouping, an aggregation is usually performed, using the agg method.


#  Data preparation 
df = pd.DataFrame({'A': [1, 1, 2, 2],'B': [3, 4, 3, 3],'C':[6,8,1,9],'D':[2,5,4,8]})
df


   A  B  C  D
0  1  3  6  2
1  1  4  8  5
2  2  3  1  4
3  2  3  9  8

8. Single function on multiple columns

When a single function is used, it is evaluated on each column of every group and the results are combined into one DataFrame. The aggregate function is passed in as a string.


#  Sum all columns in groups 
df.groupby('A').agg('sum')


   B   C   D
A           
1  7  14   7
2  6  10  12

You can also aggregate only selected columns of the grouped data. Note that the selected columns must be wrapped in [], that is, passed as a list.


#  Group and sum the specified columns 
df.groupby('A')[['B','C']].agg('sum')
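The brackets matter for the shape of the result. As a sketch on the same data: selecting with a list (double brackets) returns a DataFrame, while selecting a single column name (single brackets) returns a Series.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3],
                   'C': [6, 8, 1, 9], 'D': [2, 5, 4, 8]})

# List of columns: the result is a DataFrame with columns B and C.
frame_result = df.groupby('A')[['B', 'C']].agg('sum')

# Single column name: the result is a Series.
series_result = df.groupby('A')['B'].agg('sum')

print(type(frame_result).__name__, type(series_result).__name__)
```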

A custom anonymous function can also be passed in as the aggregate function.


#  Group sum using an anonymous function 
df.groupby('A').agg(lambda x:sum(x))


   B   C   D
A           
1  7  14   7
2  6  10  12
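A named function works the same way as a lambda and is easier to reuse. The sketch below uses a hypothetical helper, value_range, which is not part of pandas or of the original text; it returns the spread (maximum minus minimum) of each column within a group.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3],
                   'C': [6, 8, 1, 9], 'D': [2, 5, 4, 8]})

def value_range(column):
    """Hypothetical helper: spread of one column within one group."""
    return column.max() - column.min()

# The function is evaluated on every column of every group.
spread = df.groupby('A').agg(value_range)
print(spread)
```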

9. Multiple functions on multiple columns

More than one aggregate function can be applied at once. Each function is evaluated on every column, and the results are combined and returned together. The aggregate functions are passed in as a list.


#  Apply multiple functions to all columns 
df.groupby('A').agg(['sum','mean'])


    B        C        D     
  sum mean sum mean sum mean
A                           
1   7  3.5  14    7   7  3.5
2   6  3.0  10    5  12  6.0

The column labels of the result have two levels: the first level is the name of the aggregated column, and the second level is the name of the aggregate function used. To rename the function level in the result, pass tuples instead of plain strings, where the first element is the name to display and the second element is the aggregate function.


#  Aggregate function renaming 
df.groupby('A').agg([('SUM','sum'),('MEAN','mean')])


    B        C        D     
  SUM MEAN SUM MEAN SUM MEAN
A                           
1   7  3.5  14    7   7  3.5
2   6  3.0  10    5  12  6.0
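Whether or not the function names are renamed, the result carries two-level column labels. The sketch below shows one way to select a single column with a (column, function) tuple and, as an optional step not covered in the text, to flatten the labels into plain strings.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3],
                   'C': [6, 8, 1, 9], 'D': [2, 5, 4, 8]})

result = df.groupby('A').agg(['sum', 'mean'])

# Select one column by its two-level label.
b_sum = result[('B', 'sum')]

# Optional: join the two levels into single names such as 'B_sum'.
flat = result.copy()
flat.columns = ['_'.join(label) for label in result.columns]
print(flat)
```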

Similarly, anonymous functions can be passed in.


#  Pass an anonymous function and give it a name 
df.groupby('A').agg([('SUM','sum'),('MAX',lambda x:max(x))])


    B       C       D    
  SUM MAX SUM MAX SUM MAX
A                        
1   7   4  14   8   7   5
2   6   3  10   9  12   8

To apply different aggregations to different columns, pass a dictionary that maps each column name to its function or list of functions.


#  Different aggregate functions for different columns 
df.groupby('A').agg({'B':['sum','mean'],'C':'mean'})


    B         C
  sum mean mean
A              
1   7  3.5    7
2   6  3.0    5
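The shape of the column labels depends on the dictionary values. As a sketch: when any value is a list, the result has two-level column labels as shown above; when every value is a single function name, the column labels stay flat.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3],
                   'C': [6, 8, 1, 9], 'D': [2, 5, 4, 8]})

# A list among the values gives two-level column labels.
nested = df.groupby('A').agg({'B': ['sum', 'mean'], 'C': 'mean'})

# Only single function names gives flat column labels.
simple = df.groupby('A').agg({'B': 'sum', 'C': 'mean'})

print(nested.columns.tolist())
print(simple.columns.tolist())
```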

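You can also name the aggregated columns directly with keyword arguments. Note that this works only when exactly one aggregate function is given per output column: each keyword maps to a single (column, function) tuple.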


#  Rename column names after aggregation 
df.groupby('A').agg(B_sum=('B','sum'),C_mean=('C','mean'))


   B_sum  C_mean
A               
1      7       7
2      6       5
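The same column can still feed several outputs, as long as each keyword carries one (column, function) pair. The sketch below also uses pd.NamedAgg, the explicit form of such a pair; the output names B_sum, B_max and C_mean are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 4, 3, 3],
                   'C': [6, 8, 1, 9], 'D': [2, 5, 4, 8]})

result = df.groupby('A').agg(
    B_sum=pd.NamedAgg(column='B', aggfunc='sum'),  # explicit form
    B_max=('B', 'max'),                            # tuple shorthand, same column
    C_mean=('C', 'mean'),
)
print(result)
```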