Pandas advanced tutorial: GroupBy operations in Pandas


Contents: brief introduction · split data · multiple index · get_group · dropna · the groups attribute · hierarchy of index · traversal of group · aggregation operation · general aggregation methods · specifying multiple aggregation methods at the same time · NamedAgg · different aggregation methods for different columns · transformation operation · filtering operation · apply operation

Brief introduction

The DF (DataFrame) type in pandas supports groupby operations, much like GROUP BY on a database table. Generally speaking, a groupby operation can be divided into three parts: splitting the data, applying a function, and combining the results.
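
As a minimal sketch of this split-apply-combine idea (using a small made-up DataFrame, separate from the one used later in this article):

import pandas as pd

# A tiny made-up frame with two groups of values
small = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Split by "key", apply sum to each group, combine the results into a Series
small.groupby("key")["val"].sum()
# key
# a    4
# b    6
# Name: val, dtype: int64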

This article will explain the groupby operation in Pandas in detail.

Split data

The purpose of splitting the data is to divide the DF into groups. To perform a groupby operation, we first create a DF with the labels we want to group by:


df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )
   ...:

df
Out[61]: 
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380 -0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744

By default, groupby operates along axis 0, i.e. it groups the rows. You can group by one column or by multiple columns:


In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

Multiple index

Since version 0.24, if we have more than one index level, we can choose which levels to group by:


In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]: 
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938
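
Equivalently, the level to group by can be given by name. A small sketch using the df2 defined above:

# df2 is indexed by A and B; group directly on the "A" level
df2.groupby(level="A").sum()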

get_group

get_group retrieves the data belonging to a single group:


In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]: 
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]: 
   X  Y
1  B  4
3  B  2
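
When grouping by several columns, the key passed to get_group is a tuple. A small sketch using the df from earlier (the exact C and D values depend on the random data):

grouped = df.groupby(["A", "B"])

# the group key is a tuple of the grouping values
grouped.get_group(("bar", "one"))   # rows where A == "bar" and B == "one"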

dropna

By default, rows whose group key is NaN are excluded from groupby; setting dropna=False keeps them:


In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]: 
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]: 
     a  c
b        
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]: 
     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

The groups attribute

The groupby object has a groups attribute, which is a dictionary: its keys are the computed group labels, and its values are the row indexes belonging to each group.


In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6

Hierarchy of index

For objects with a multi-level index, groupby can group by a specified level of the index:


In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....: 

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

Group by the first level:


In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

Group by the second level:


In [46]: s.groupby(level="second").sum()
Out[46]: 
second
one    0.980950
two    1.991575
dtype: float64

Traversal of group

Once we have the groupby object, we can iterate over the groups with a for statement:


In [62]: grouped = df.groupby("A")

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....: 

When grouping by multiple fields, the group name is a tuple:


In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....: 
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652

Aggregation operation

After grouping, you can aggregate:


In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

For a groupby over multiple columns, the result has a multi-level index by default. If you want a plain integer index instead, add as_index=False:


In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

The above is equivalent to calling reset_index on the result:


In [73]: df.groupby(["A", "B"]).sum().reset_index()

grouped.size() calculates the size of each group:


In [74]: grouped = df.groupby(["A", "B"])

In [75]: grouped.size()
Out[75]: 
A    B    
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64

grouped.describe() outputs descriptive statistics for each group:


In [76]: grouped.describe()

General aggregation methods

The following aggregation methods are commonly used:

Function     Description
mean()       the mean of each group
sum()        the sum of each group
size()       the size of each group
count()      the count of values in each group
std()        the standard deviation of each group
var()        the variance of each group
sem()        the standard error of the mean of each group
describe()   descriptive statistics for each group
first()      the first of the group values
last()       the last of the group values
nth()        the nth group value
min()        the minimum of each group
max()        the maximum of each group
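
A quick sketch of a few of these methods applied to the grouped object from above (outputs omitted, since they depend on the random data):

grouped = df.groupby("A")

grouped["C"].mean()    # mean of column C per group
grouped.size()         # number of rows in each group
grouped["D"].first()   # first value of D in each group
grouped["C"].nth(1)    # the second row of C in each group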

You can specify multiple aggregation methods at the same time:


In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

You can rename the resulting columns:


In [84]: (
   ....:     grouped["C"]
   ....:     .agg([np.sum, np.mean, np.std])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....: 
Out[84]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

NamedAgg

NamedAgg allows you to define an aggregation more precisely; it has two fields, column and aggfunc. The example below uses the small animals DataFrame from the pandas documentation:


In [88]: animals = pd.DataFrame({"kind": ["cat", "dog", "cat", "dog"],
   ....:                         "height": [9.1, 6.0, 9.5, 34.0],
   ....:                         "weight": [7.9, 7.5, 9.9, 198.0]})

In [89]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )

Or simply pass a plain tuple:


In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
   ....: 
Out[91]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
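
When a single column is selected, the column part is not needed and the keyword names map directly to aggregation functions. A small sketch:

# named aggregation on a single column: just name=aggfunc
animals.groupby("kind")["height"].agg(min_height="min", max_height="max")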

Different aggregation methods for different columns

By passing a dictionary to the agg method, you can apply a different aggregation to each column:


In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785
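
The dictionary values can also be custom functions. A small sketch mixing a built-in aggregation with a lambda:

grouped.agg({"C": "sum", "D": lambda x: x.max() - x.min()})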

Transformation operation

A transformation is an operation that returns an object of the same size as the one being grouped. Data transformation is often needed in the course of data analysis.

You can pass a lambda to transform (here ts is a Series with a datetime index, as in the pandas documentation):


In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

Fill NA values with the group mean:


In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
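
The ts and grouped objects above come from the pandas documentation and are not defined in this article; a self-contained sketch of the same pattern on a small made-up frame looks like this:

import numpy as np
import pandas as pd

df_t = pd.DataFrame({"key": ["a", "a", "b", "b"],
                     "val": [1.0, 3.0, 10.0, np.nan]})

g = df_t.groupby("key")["val"]

# transform returns an object with the same shape as the input
demeaned = g.transform(lambda x: x - x.mean())

# fill missing values with the mean of the group they belong to
filled = g.transform(lambda x: x.fillna(x.mean()))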

Filtering operation

The filter method can filter out data we don't need using a lambda expression:


In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]: 
3    3
4    3
5    3
dtype: int64
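
filter works the same way on a grouped DataFrame, for example keeping only the groups that contain more than one row (a sketch with a small made-up frame):

dff = pd.DataFrame({"A": ["foo", "bar", "foo", "baz"], "B": [1, 2, 3, 4]})

# keep only the rows whose group (grouped by column A) has more than one member
dff.groupby("A").filter(lambda g: len(g) > 1)
#      A  B
# 0  foo  1
# 2  foo  3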

Apply operation

Some data may not be suitable for aggregation or transformation operations, so Pandas provides an apply method for more flexible operations.


In [156]: df
Out[156]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]: 
A         
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...   
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64

You can also pass a custom function to apply:


In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....: 

In [161]: grouped.apply(f)
Out[161]: 
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211

This article has been included in http://www.flydean.com/11-python-pandas-groupby/


