Detailed explanation of GroupBy usage in Pandas


Contents

Brief introduction
Split data
Multiple indexes
get_group
dropna
The groups attribute
Index hierarchy
Iterating over groups
Aggregation operation
Common aggregation methods
Using multiple aggregation methods at the same time
NamedAgg
Specifying different aggregation methods for different columns
Transformation operation
Filtering operation
Apply operation

Brief introduction

The DataFrame type in pandas supports groupby operations much like a database table does. Generally speaking, a groupby operation can be divided into three parts: splitting the data, applying a function, and combining the results.

This article will explain the groupby operation in Pandas in detail.
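As a minimal illustration of the three steps, here is a sketch that groups a small made-up frame by a key column and sums each group (the frame and its name are invented for the example):


import pandas as pd

tiny = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
# split by "key", apply sum to each group, combine the results into a new frame
tiny.groupby("key").sum()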

Split data

The purpose of splitting is to divide the DataFrame into groups. To perform a groupby operation, we first need a DataFrame with labels to group on:


df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )
   ...:

df
Out[61]: 
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380 -0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744

By default, groupby operates along axis 0 (the row index). You can group by a single column or by multiple columns:


In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

Multiple indexes

Since version 0.24, if we have more than one index level, we can select specific levels to group by:


In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]: 
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938

get_group

get_group retrieves the data of a single group:


In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]: 
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]: 
   X  Y
1  B  4
3  B  2

dropna

By default, rows whose group key is NaN are excluded from groupby; setting dropna=False keeps them:


In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]: 
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2
# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]: 
     a  c
b        
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]: 
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

The groups attribute

The groupby object has a groups attribute, which is a dictionary: the keys are the group labels, and the values are the row labels that belong to each group.


In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6

Index hierarchy

For objects with a multi-level index, groupby can group by a specified level of the index:


In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....: 

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

Group by the first level:


In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

Group by the second level:


In [46]: s.groupby(level="second").sum()
Out[46]: 
second
one    0.980950
two    1.991575
dtype: float64

Iterating over groups

With the grouped object, we can iterate over the groups with a for statement:


In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....: 
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

When grouping by multiple columns, the name of each group is a tuple.


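A minimal sketch reusing the df defined above (output omitted):


# when grouping by both A and B, each name is a tuple such as ('bar', 'one')
for name, group in df.groupby(["A", "B"]):
    print(name)
    print(group)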

Aggregation operation

After grouping, you can aggregate:


In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

When grouping by multiple columns, the result has a MultiIndex by default. If you want a plain numeric index instead, pass as_index=False:


In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False).sum()
Out[73]: 
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

The above is equivalent to aggregating first and then calling reset_index.


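A sketch of the equivalent reset_index form:


# aggregate with the default MultiIndex, then flatten it back into ordinary columns
df.groupby(["A", "B"]).sum().reset_index()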

grouped.describe() summarizes the information for each group.


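A minimal sketch:


grouped = df.groupby("A")
grouped.describe()  # count, mean, std, min, quartiles and max for each group and column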

Common aggregation methods

The following aggregation methods are commonly used; a short usage sketch follows the table:

Function     Description
mean()       Mean of the values in each group
sum()        Sum of the values in each group
size()       Size of each group
count()      Count of the values in each group
std()        Standard deviation
var()        Variance
sem()        Standard error of the mean
describe()   Descriptive statistics
first()      First value in each group
last()       Last value in each group
nth()        n-th value in each group
min()        Minimum value
max()        Maximum value
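
A brief usage sketch with the df defined earlier:


grouped = df.groupby("A")
grouped.size()       # number of rows in each group
grouped["C"].mean()  # mean of column C per group
grouped["D"].max()   # maximum of column D per group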

Using multiple aggregation methods at the same time

You can specify multiple aggregation methods at the same time.


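A sketch passing a list of functions to agg (assuming numpy is imported as np, as elsewhere in this article):


grouped = df.groupby("A")
# one result column per aggregation function
grouped["C"].agg([np.sum, np.mean, np.std])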

You can also rename the resulting columns.


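A sketch of renaming the aggregated columns (the new names are just examples):


grouped["C"].agg([np.sum, np.mean, np.std]).rename(
    columns={"sum": "total", "mean": "average", "std": "stdev"}
)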

NamedAgg

NamedAgg gives a more precise definition of an aggregation; it has two fields, column and aggfunc.


In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
   ....:

In [89]: animals
Out[89]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
   ....: 
Out[90]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Or simply use plain tuples:


In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
   ....: 
Out[91]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

Specifying different aggregation methods for different columns

By passing a dictionary to the agg method, you can specify a different aggregation method for each column.


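A sketch passing a dict that maps columns to aggregation functions:


grouped = df.groupby("A")
# sum column C, but take the sample standard deviation of column D
grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)})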

Transformation operation

A transformation is an operation that returns an object of the same size as the one being grouped. Such conversions are often needed during data analysis.

transform can be combined with a lambda; assuming ts is a Series with a DatetimeIndex, the following replaces each value with the max-minus-min spread of its year:


In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

It can also fill NaN values, here with the mean of each group (grouped is assumed to be a groupby over data that contains NaN):


In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
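
A self-contained sketch using the df_dropna frame defined earlier:


grouped = df_dropna.groupby("a")
# replace the NaN in column b with the mean of the b values in its group
transformed = grouped["b"].transform(lambda x: x.fillna(x.mean()))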

Filtering operation

The filter method drops groups we don't need, using a lambda expression that returns True or False for each group:


In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]: 
3    3
4    3
5    3
dtype: int64
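
The same idea works on a DataFrame; here is a small sketch reusing the df from above that keeps only the groups with more than three rows:


# only the "foo" group has more than three rows, so only its rows are kept
df.groupby("A").filter(lambda x: len(x) > 3)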

Apply operation

Some operations may not fit the aggregation or transformation pattern. Pandas provides the apply method for more flexible operations.


In [156]: df
Out[156]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]: 
A         
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...   
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64

You can also apply your own function:


In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....:

In [161]: grouped.apply(f)
Out[161]: 
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211

