Use of grouping function groupby and grouping operation function agg in python

2021-12-11 08:31:25
OfStack

Directory groupby: agg:

Today, I will introduce a very useful function groupby in pandas. In fact, the effect of groupby in hive is the same, and the difference lies in the writing of the two languages. groupby is useful in grouping statistics in Python ~

groupby:

Create the data first:


import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'], 
                       'B': [2, 7, 1, 3, 3, 2, 4, 8], 
                       'C': [100, 87, 96, 130, 105, 87, 96, 155]})
df
Out[2]: 
   A  B    C
0  a  2  100
1  b  7   87
2  a  1   96
3  c  3  130
4  a  3  105
5  c  2   87
6  b  4   96

Basic operation of groupby in pandas:

1. Grouping according to A column, and calculating the average value of B and C columns:


df.groupby('A').mean()
Out[6]: 
          B           C
A                      
a  2.000000  100.333333
b  5.500000   91.500000
c  4.333333  124.000000

Of course, you can also group according to multiple columns to obtain the average value of other columns:


df.groupby(['A','B']).mean()
Out[7]: 
       C
A B     
a 1   96
  2  100
  3  105
b 4   96
  7   87
c 2   87
  3  130
  8  155

2. After grouping, select columns for calculation:


data=df.groupby('A')
data['B'].std()
Out[11]: 
A
a    1.00000
b    2.12132
c    3.21455
Name: B, dtype: float64
 
# Select B , C Two columns 
data['B','C'].mean()
Out[12]: 
          B           C
A                      
a  2.000000  100.333333
b  5.500000   91.500000
c  4.333333  124.000000

3. After grouping by A, you can use different aggregation methods for different columns (ps: this point is very similar to hive)


data.agg({'B':'mean','C':'sum'})    #B Column mean, C Column summary 
Out[14]: 
     C         B
A               
a  301  2.000000
b  183  5.500000
c  372  4.333333

4. If we use the same aggregation method for multiple columns after grouping according to A, we can use apply function:


df.groupby('A').apply(np.mean)
Out[25]: 
          B           C
A                      
a  2.000000  100.333333
b  5.500000   91.500000
c  4.333333  124.000000

5. Divide a column of data into different range segments according to data values for grouping operation

To create a dataset:


np.random.seed(0)
df = pd.DataFrame({'Age': np.random.randint(20, 70, 100), 
                        'Sex': np.random.choice(['Male', 'Female'], 100), 
                        'number_of_foo': np.random.randint(1, 20, 100)})
Out[38]: 
   Age     Sex  number_of_foo
0   64  Female             14
1   67  Female             14
2   20  Female             12
3   23    Male             17
4   23  Female             15

Goal: The age fields are divided into three groups, which are achieved in the following two ways:


# No. 1 1 Methods: 
1 , bins=4
pd.cut(df['Age'], bins=4)
0       (56.75, 69.0]
1       (56.75, 69.0]
2     (19.951, 32.25]
3     (19.951, 32.25]
4     (19.951, 32.25]...
 
# No. 1 2 Methods 
2 , bins=[19, 40, 65, np.inf]
pd.cut(df['Age'], bins=[19,40,65,np.inf])
Out[40]: 
0     (40.0, 65.0]
1      (65.0, inf]
2     (19.0, 40.0]
3     (19.0, 40.0]
4     (19.0, 40.0]
 
# The grouping range results are as follows: 
age_groups = pd.cut(df['Age'], bins=[19,40,65,np.inf])
df.groupby(age_groups).mean()
Out[43]: 
                    Age  number_of_foo
Age                                   
(19.0, 40.0]  29.840000       9.880000
(40.0, 65.0]  52.833333       9.452381
(65.0, inf]   67.375000       9.250000
 
# Press' Age' Cluster scope and gender ( sex ) Make a crosstab 
 
pd.crosstab(age_groups, df['Sex'])
Out[44]: 
Sex           Female  Male
Age                       
(19.0, 40.0]      22    28
(40.0, 65.0]      18    24
(65.0, inf]        3     5

agg:

1. After grouping by one column (A) using groupby, you need to adopt a different aggregation method for the other column:


df.groupby('A')['B'].agg({'mean':np.mean, 'std': np.std})
 
Out[16]: 
       std      mean
A                   
a  1.00000  2.000000
b  2.12132  5.500000
c  3.21455  4.333333

2. After grouping according to a column, adopt different aggregation methods for different columns:


df.groupby('A').agg({'B':[np.mean,'sum'],'C':['count',np.std]})  #[] There are two methods corresponding to the 
 
Out[17]: 
      C                    B    
  count        std      mean sum
A                               
a     3   4.509250  2.000000   6
b     2   6.363961  5.500000  11
c     3  34.394767  4.333333  13

transform:

The results obtained by the first two methods are indexed by A column values. What should I do if I use index without groupby grouping? At this point, the transform function will be used. The transform (func, args, * kwargs) method simplifies this process by applying the func parameter to all groupings and then placing the result on the index of the original array:


df.groupby('A').mean()
Out[6]: 
          B           C
A                      
a  2.000000  100.333333
b  5.500000   91.500000
c  4.333333  124.000000

It can be seen that when grouping according to A column and counting B and C columns, the indexes with a as a have [0, 2, 4], so the values of [0, 2, 4] indexes in the result column are all 3, which is equivalent to broadcasting. For the C column, the same is true.