Pandas advanced tutorial: the GroupBy operation
- 2021-11-14 06:25:54
- OfStack
Brief introduction
The DataFrame type in pandas supports groupby operations, much like tables in a database. Broadly speaking, a groupby operation has three parts: splitting the data, applying a transformation, and combining the results.
This article will explain the groupby operation in Pandas in detail.
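Before diving in, the split-apply-combine idea can be sketched in a few lines. This is a minimal illustration with made-up data (the `team`/`score` names are not from the examples below): split the rows by the `team` column, apply `sum()` to each group's `score` values, and combine the per-group results into one Series.

```python
import pandas as pd

# Split rows by "team", apply sum() per group, combine into a Series.
df = pd.DataFrame({"team": ["a", "b", "a", "b"],
                   "score": [1, 2, 3, 4]})
result = df.groupby("team")["score"].sum()
print(result)  # team "a" -> 1 + 3 = 4, team "b" -> 2 + 4 = 6
```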
Split data
Splitting the data divides the DataFrame into groups. Since groupby groups by the labels of one or more columns, first create a DataFrame with the columns to group by:
df = pd.DataFrame(
...: {
...: "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
...: "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
...: "C": np.random.randn(8),
...: "D": np.random.randn(8),
...: }
...: )
...:
df
Out[61]:
A B C D
0 foo one -0.490565 -0.233106
1 bar one 0.430089 1.040789
2 foo two 0.653449 -1.155530
3 bar three -0.610380 -0.447735
4 foo two -0.934961 0.256358
5 bar two -0.256263 -0.661954
6 foo one -1.132186 -0.304330
7 foo three 2.129757 0.445744
By default, groupby works along axis 0, i.e. it groups the rows. You can group by a single column or by multiple columns:
In [8]: grouped = df.groupby("A")
In [9]: grouped = df.groupby(["A", "B"])
Multiple indexes
Since version 0.24, if there is more than one index level, we can select specific levels to group by:
In [10]: df2 = df.set_index(["A", "B"])
In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))
In [12]: grouped.sum()
Out[12]:
C D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
get_group
get_group retrieves the rows belonging to a single group:
In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})
In [25]: df3.groupby(["X"]).get_group("A")
Out[25]:
X Y
0 A 1
2 A 3
In [26]: df3.groupby(["X"]).get_group("B")
Out[26]:
X Y
1 B 4
3 B 2
dropna
By default, rows whose group keys are NaN are excluded from groupby; setting dropna=False keeps them:
In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
In [29]: df_dropna
Out[29]:
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]:
a c
b
1.0 2 3
2.0 2 5
# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]:
a c
b
1.0 2 3
2.0 2 5
NaN 1 4
The groups attribute
The groupby object has a groups attribute, which is a dictionary: each key is a group key, and each value holds the row labels belonging to that group.
In [34]: grouped = df.groupby(["A", "B"])
In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
In [36]: len(grouped)
Out[36]: 6
Index hierarchy
For objects with a multilevel index, groupby can group by a specific level of the index:
In [40]: arrays = [
....: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
....: ["one", "two", "one", "two", "one", "two", "one", "two"],
....: ]
....:
In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
In [42]: s = pd.Series(np.random.randn(8), index=index)
In [43]: s
Out[43]:
first second
bar one -0.919854
two -0.042379
baz one 1.247642
two -0.009920
foo one 0.290213
two 0.495767
qux one 0.362949
two 1.548106
dtype: float64
Group by the first level:
In [44]: grouped = s.groupby(level=0)
In [45]: grouped.sum()
Out[45]:
first
bar -0.962232
baz 1.237723
foo 0.785980
qux 1.911055
dtype: float64
Group by the second level:
In [46]: s.groupby(level="second").sum()
Out[46]:
second
one 0.980950
two 1.991575
dtype: float64
Traversal of group
Once we have a groupby object, we can iterate over the groups with a for statement:
In [62]: grouped = df.groupby("A")
In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
When grouping by multiple columns, the group name is a tuple:
In [64]: for name, group in df.groupby(['A', 'B']):
....: print(name)
....: print(group)
....:
('bar', 'one')
A B C D
1 bar one 0.254161 1.511763
('bar', 'three')
A B C D
3 bar three 0.215897 -0.990582
('bar', 'two')
A B C D
5 bar two -0.077118 1.211526
('foo', 'one')
A B C D
0 foo one -0.575247 1.346061
6 foo one -0.408530 0.268520
('foo', 'three')
A B C D
7 foo three -0.862495 0.024580
('foo', 'two')
A B C D
2 foo two -1.143704 1.627081
4 foo two 1.193555 -0.441652
Aggregation operation
After grouping, you can aggregate:
In [67]: grouped = df.groupby("A")
In [68]: grouped.aggregate(np.sum)
Out[68]:
C D
A
bar 0.392940 1.732707
foo -1.796421 2.824590
In [69]: grouped = df.groupby(["A", "B"])
In [70]: grouped.aggregate(np.sum)
Out[70]:
C D
A B
bar one 0.254161 1.511763
three 0.215897 -0.990582
two -0.077118 1.211526
foo one -0.983776 1.614581
three -0.862495 0.024580
two 0.049851 1.185429
When grouping by multiple columns, the result carries a MultiIndex by default. To get a flat, renumbered index instead, pass as_index=False:
In [71]: grouped = df.groupby(["A", "B"], as_index=False)
In [72]: grouped.aggregate(np.sum)
Out[72]:
A B C D
0 bar one 0.254161 1.511763
1 bar three 0.215897 -0.990582
2 bar two -0.077118 1.211526
3 foo one -0.983776 1.614581
4 foo three -0.862495 0.024580
5 foo two 0.049851 1.185429
The same effect can be achieved with reset_index:
In [73]: df.groupby(["A", "B"]).sum().reset_index()
grouped.size() computes the size of each group:
In [75]: grouped = df.groupby(["A", "B"])
In [76]: grouped.size()
Out[76]:
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
dtype: int64
grouped.describe() produces summary statistics for each group:
In [77]: grouped.describe()
Common aggregation methods
The following aggregation methods are commonly used:
Function | Description |
---|---|
mean() | mean of group values |
sum() | sum of group values |
size() | group sizes |
count() | count of group values |
std() | standard deviation of group values |
var() | variance of group values |
sem() | standard error of the mean |
describe() | summary statistics |
first() | first of group values |
last() | last of group values |
nth() | nth group value |
min() | minimum of group values |
max() | maximum of group values |
You can specify multiple aggregation methods at the same time:
In [81]: grouped = df.groupby("A")
In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]:
sum mean std
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
You can rename the resulting columns:
In [84]: (
....: grouped["C"]
....: .agg([np.sum, np.mean, np.std])
....: .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
....: )
....:
Out[84]:
foo bar baz
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
NamedAgg
NamedAgg allows you to define the aggregation more precisely: it takes two fields, column and aggfunc.
In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
In [89]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
Out[89]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
Or simply use tuples:
In [91]: animals.groupby("kind").agg(
....: min_height=("height", "min"),
....: max_height=("height", "max"),
....: average_weight=("weight", np.mean),
....: )
....:
Out[91]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
Different columns specify different aggregation methods
By passing a dictionary to the agg method, you can apply a different aggregation to each column:
In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]:
C D
A
bar 0.392940 1.366330
foo -1.796421 0.884785
Transformation operation
A transformation is an operation that returns an object the same size as the group being transformed. Such conversions are frequently needed during data analysis.
transform accepts a lambda; for example, for a time series ts, compute the yearly range (max minus min):
In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
Or fill NaN values with the group mean:
In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
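The two snippets above assume a pre-built `ts` and `grouped`; a small self-contained version of the fillna pattern, with made-up `key`/`val` data, might look like this. Each group's missing values are replaced by that group's own mean rather than the global mean.

```python
import numpy as np
import pandas as pd

# Made-up data: group "x" contains a NaN, group "y" is complete.
df = pd.DataFrame({"key": ["x", "x", "x", "y", "y"],
                   "val": [1.0, np.nan, 3.0, 10.0, 20.0]})
grouped = df.groupby("key")["val"]

# Fill each group's NaN with that group's mean.
filled = grouped.transform(lambda x: x.fillna(x.mean()))
# The NaN in group "x" becomes (1.0 + 3.0) / 2 = 2.0; other values are unchanged.
```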
Filtering operation
The filter method removes unwanted groups via a lambda expression; only the groups for which the lambda returns True are kept:
In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])
In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]:
3 3
4 3
5 3
dtype: int64
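filter works the same way on DataFrames, where the lambda receives each group as a sub-DataFrame. A brief sketch with illustrative column names (not the df from earlier sections):

```python
import pandas as pd

# Keep only the groups that contain more than one row.
df = pd.DataFrame({"A": ["foo", "bar", "foo"], "B": [1, 2, 3]})
kept = df.groupby("A").filter(lambda g: len(g) > 1)
# Only the "foo" rows survive; "bar" has a single row and is dropped.
```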
Apply operation
Some operations fit neither aggregation nor transformation; for these, Pandas provides the more flexible apply method.
In [156]: df
Out[156]:
A B C D
0 foo one -0.575247 1.346061
1 bar one 0.254161 1.511763
2 foo two -1.143704 1.627081
3 bar three 0.215897 -0.990582
4 foo two 1.193555 -0.441652
5 bar two -0.077118 1.211526
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
In [157]: grouped = df.groupby("A")
# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]:
A
bar count 3.000000
mean 0.130980
std 0.181231
min -0.077118
25% 0.069390
...
foo min -1.143704
25% -0.862495
50% -0.575247
75% -0.408530
max 1.193555
Name: C, Length: 16, dtype: float64
You can also pass a named function:
In [159]: grouped = df.groupby('A')['C']
In [160]: def f(group):
.....: return pd.DataFrame({'original': group,
.....: 'demeaned': group - group.mean()})
.....:
In [161]: grouped.apply(f)
Out[161]:
original demeaned
0 -0.575247 -0.215962
1 0.254161 0.123181
2 -1.143704 -0.784420
3 0.215897 0.084917
4 1.193555 1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
This article has been included in http://www.flydean.com/11-python-pandas-groupby/