An Example of pandas groupby Grouping and agg Aggregation
- 2021-10-15 11:05:12
- OfStack
As follows:
import pandas as pd

# Sample data with one text column and two numeric columns
df = pd.DataFrame({'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
                   'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
                   'Age': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})
The constructed data is as follows:
Age Country Income
0 5000 China 10000
1 4321 China 10000
2 1234 India 5000
3 4010 India 5002
4 250 America 40000
5 250 Japan 50000
6 4500 China 8000
7 4321 India 5000
Grouping
Single column grouping
df_gb = df.groupby('Country')
# Each iteration yields the group key and the rows that belong to it
for index, data in df_gb:
    print(index)
    print(data)
Output
America
Age Country Income
4 250 America 40000
China
Age Country Income
0 5000 China 10000
1 4321 China 10000
6 4500 China 8000
India
Age Country Income
2 1234 India 5000
3 4010 India 5002
7 4321 India 5000
Japan
Age Country Income
5 250 Japan 50000
Multi-column grouping
df_gb = df.groupby(['Country', 'Income'])
# With two grouping columns the group key is a (Country, Income) tuple
for (index1, index2), data in df_gb:
    print((index1, index2))
    print(data)
Output
('America', 40000)
Age Country Income
4 250 America 40000
('China', 8000)
Age Country Income
6 4500 China 8000
('China', 10000)
Age Country Income
0 5000 China 10000
1 4321 China 10000
('India', 5000)
Age Country Income
2 1234 India 5000
7 4321 India 5000
('India', 5002)
Age Country Income
3 4010 India 5002
('Japan', 50000)
Age Country Income
5 250 Japan 50000
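A single group can also be looked up by its key instead of looping over all of them. The following is a minimal sketch on the same data; the variable name `china_10k` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
    'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
    'Age': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321],
})

df_gb = df.groupby(['Country', 'Income'])

# ngroups counts the distinct (Country, Income) pairs
print(df_gb.ngroups)

# get_group takes the same tuple that iteration yields as the group key
china_10k = df_gb.get_group(('China', 10000))
print(china_10k)
```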
Aggregation
Aggregate the grouped data
By default, all other columns are aggregated after grouping
df_agg = df.groupby('Country').agg(['min', 'mean', 'max'])
print(df_agg)
Output
Age Income
min mean max min mean max
Country
America 250 250.000000 250 40000 40000.000000 40000
China 4321 4607.000000 5000 8000 9333.333333 10000
India 1234 3188.333333 4321 5000 5000.666667 5002
Japan 250 250.000000 250 50000 50000.000000 50000
Aggregate some columns after grouping
In some cases only some columns need to be aggregated, or different columns need different functions. This can be specified with a dictionary that maps column names to functions
num_agg = {'Age':['min', 'mean', 'max']}
print(df.groupby('Country').agg(num_agg))
Output
Age
min mean max
Country
America 250 250.000000 250
China 4321 4607.000000 5000
India 1234 3188.333333 4321
Japan 250 250.000000 250
num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
print(df.groupby('Country').agg(num_agg))
Output
Age Income
min mean max min max
Country
America 250 250.000000 250 40000 40000
China 4321 4607.000000 5000 8000 10000
India 1234 3188.333333 4321 5000 5002
Japan 250 250.000000 250 50000 50000
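The dictionary form returns columns with two levels (column name, function name). A minimal sketch of flattening them into single names, assuming the same data as above; the name `flat` and the underscore separator are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
    'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
    'Age': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321],
})

num_agg = {'Age': ['min', 'mean', 'max'], 'Income': ['min', 'max']}
result = df.groupby('Country').agg(num_agg)

# Each column label is a (column, function) pair; join the pair into one string
flat = result.copy()
flat.columns = ['_'.join(pair) for pair in result.columns]
print(flat)
```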
Supplement: a thorough look at pandas groupby and agg for grouping and summarizing tabular data
This write-up on groupby is not well organized; it is too complicated. In practice only a few patterns are used often, and they are easier to understand and apply when collected in one place. I may write another one in the future.
What groupby does: grouping
groupby + agg (aggregate functions): after grouping, apply a function to each group, such as 'sum', 'mean', 'max', 'min'…
By default groupby groups rows (vertically), i.e. axis=0
DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
'key2':['one', 'two', 'one', 'two', 'one'],
'data1':np.random.randn(5),
'data2':np.random.randn(5)})
print(df)
Grouping and iterating over the grouping
Calling list() on the groupby object gives: [(group1), (group2), …]
Each data slice (group) is a (name, group) tuple
1. Grouping by key1 (one column): the group names are the distinct values of key1
The groupby object supports iteration and produces a sequence of 2-tuples: (group name, data block), (group name, data block) …
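A minimal sketch of iterating over a single-key grouping. It assumes the fixed values from the df printed in this article in place of np.random.randn, so that the result is repeatable; the name `pairs` is illustrative.

```python
import pandas as pd

# Fixed values taken from the df printed in this article
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# list() materializes the (group name, data block) pairs
pairs = list(df.groupby('key1'))
print(len(pairs))

# Iterating yields the same pairs one at a time
for name, group in df.groupby('key1'):
    print(name)
    print(group)
```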
2. Grouping by [key1, key2] (multiple columns)
With multiple keys, iteration produces a sequence of 2-tuples: ((k1, k2), data block), ((k1, k2), data block) …
The first element of each is a tuple of the key values
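A minimal sketch of the multi-key case, again assuming the fixed values from the df printed in this article; the name `keys` is illustrative.

```python
import pandas as pd

# Fixed values taken from the df printed in this article
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# The group name unpacks into one value per grouping column
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

# Collect only the key tuples
keys = [key for key, _ in df.groupby(['key1', 'key2'])]
print(keys)
```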
3. Grouping by Function
4. Grouping by Dictionary
5. Grouping by Index Level
6. Functions can be mixed with arrays, lists, dictionaries and Series as grouping keys without problems, because everything is eventually converted to arrays
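Items 3 to 6 are listed without code, so here is a minimal sketch of each. The frame `people`, its index labels, the `mapping` dictionary and the `key_list` labels are all hypothetical and made up for illustration.

```python
import pandas as pd

# Hypothetical frame; labels and values are made up for illustration
people = pd.DataFrame(
    {'score': [1, 2, 3, 4, 5]},
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)

# 3. Function: called once per index label; its return value is the group key
by_len = people.groupby(len)['score'].sum()

# 4. Dictionary: maps index labels to group names
mapping = {'Joe': 'x', 'Steve': 'y', 'Wes': 'x', 'Jim': 'y', 'Travis': 'x'}
by_map = people.groupby(mapping)['score'].sum()

# 5. Index level: group on one level of a MultiIndex, selected by name
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2)], names=['outer', 'inner'])
s = pd.Series([10, 20, 30, 40], index=idx)
by_level = s.groupby(level='outer').sum()

# 6. Mixed keys: a function together with a list of labels
key_list = ['one', 'one', 'one', 'two', 'two']
mixed = people.groupby([len, key_list])['score'].sum()

print(by_len, by_map, by_level, mixed, sep='\n')
```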
Turn these data slices into a dictionary
dict(list(df.groupby('key1')))  # dict(list()); a string key keeps the group names as plain values such as 'a'
{'a': data1 data2 key1 key2
0 -0.410122 0.247895 a one
1 -0.627470 -0.989268 a two
4 -0.297191 0.954447 a one, 'b': data1 data2 key1 key2
2 0.179488 -0.054570 b one
3 -0.299878 -1.640494 b two}
After grouping, compute statistics, calculations, etc.
1. size() returns a Series with the size of each group
Grouped by key1
df.groupby(['key1']).size()
key1
a 3
b 2
dtype: int64
dict(['a1','x2','e3'])
{'a': '1', 'e': '3', 'x': '2'}
Grouped by [key1, key2]
df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
2. Group data1 by key1 and calculate the mean of the data1 column
key1
a -0.444928
b -0.060195
Name: data1, dtype: float64
Notes:
groupby by itself does not compute anything; it only splits the data into groups.
The data (a Series) is aggregated according to the grouping key, producing a new Series whose index holds the unique values of the key1 column.
Indexing a groupby object with column names returns a grouped DataFrame if a list or array of names is passed in, and a grouped Series if a single name is passed.
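The points above can be sketched as follows. This assumes the fixed values from the df printed in this article, which reproduce the means shown above; the variable names are illustrative.

```python
import pandas as pd

# Fixed values taken from the df printed in this article
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# Grouping a Series by another Series; nothing is computed at this point
grouped = df['data1'].groupby(df['key1'])

# The aggregation runs only when a method such as mean() is called
means = grouped.mean()
print(means)

# Shorthand: group the frame, then select the column
means_sugar = df.groupby('key1')['data1'].mean()    # single name: Series
means_frame = df.groupby('key1')[['data1']].mean()  # list of names: DataFrame
```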
3. Group data1 by [key1, key2] and calculate the average value of data1
The data is grouped by two keys, so the resulting Series has a hierarchical index made up of the unique key pairs. Pivoting the inner level into columns (for example with unstack()) gives:
key2 | one | two |
---|---|---|
key1 | ||
a | -0.353657 | -0.627470 |
b | 0.179488 | -0.299878 |
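A minimal sketch that produces a table of this shape, assuming the fixed values from the df printed in this article; `means` and `wide` are illustrative names.

```python
import pandas as pd

# Fixed values taken from the df printed in this article
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# Two keys give a Series indexed by (key1, key2) pairs
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(means)

# unstack() moves the inner level (key2) into the columns
wide = means.unstack()
print(wide)
```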
In the examples above the grouping keys are Series. In fact, a grouping key can be any array of the right length, which makes groupby very flexible.
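A minimal sketch of grouping by arrays that are not columns of the frame. The `states` and `years` arrays are hypothetical labels, one per row, made up for illustration; the data values are the ones from the df printed in this article.

```python
import numpy as np
import pandas as pd

# Fixed values taken from the df printed in this article
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# Hypothetical external keys: any arrays with one entry per row will do
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

by_arrays = df['data1'].groupby([states, years]).mean()
print(by_arrays)
```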
Grouping along columns (horizontally)
By column data type (df.dtypes)
df has two data types, float64 and object, so the columns are split into two groups: (dtype('float64'), data slice) and (dtype('O'), data slice)
[(dtype('float64'), data1 data2
0 -0.410122 0.247895
1 -0.627470 -0.989268
2 0.179488 -0.054570
3 -0.299878 -1.640494
4 -0.297191 0.954447), (dtype('O'), key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one)]
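Output of this shape comes from a column-wise grouping such as df.groupby(df.dtypes, axis=1); that call is an assumption here, since the code itself is not shown. Recent pandas releases deprecate the axis=1 argument of groupby, so the sketch below takes a different route: it groups the column names by dtype and then selects those columns. The names `cols_by_dtype` and `pieces` are illustrative.

```python
import pandas as pd

# Fixed values taken from the df printed in this article
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# Group the column names (not the rows) by the dtype name of each column
cols_by_dtype = df.columns.to_series().groupby(df.dtypes.astype(str))

# Build one sub-frame per dtype name
pieces = {name: df[list(cols)] for name, cols in cols_by_dtype}
for name, piece in pieces.items():
    print(name)
    print(piece)
```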
Using agg
groupby + agg can apply several functions to the groupby result at once
Parameters of the agg() method of SeriesGroupBy:
Returns: the aggregated Series
| | min | max |
|---|---|---|
| 1 | 10 | 20 |
| 2 | 30 | 40 |
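A minimal sketch that yields a table like the one above. The Series values and the group labels are assumptions chosen to match that table, since the code that produced it is not shown.

```python
import pandas as pd

# Assumed input: four values, grouped by the labels 1, 1, 2, 2
s = pd.Series([10, 20, 30, 40])
grouped = s.groupby([1, 1, 2, 2])

# A single function name returns a Series, same as calling grouped.min()
print(grouped.agg('min'))

# A list of function names returns a DataFrame with one column per function
result = grouped.agg(['min', 'max'])
print(result)
```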
Often used like this:
df
| | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | -0.410122 | 0.247895 | a | one |
| 1 | -0.627470 | -0.989268 | a | two |
| 2 | 0.179488 | -0.054570 | b | one |
| 3 | -0.299878 | -1.640494 | b | two |
| 4 | -0.297191 | 0.954447 | a | one |
You can see the usefulness of agg by comparing the following:
key1 | min |
---|---|
a | -0.627470 |
b | -0.299878 |
key1 | data1 |
---|---|
a | -0.627470 |
b | -0.299878 |
key1 | max | min |
---|---|---|
a | -0.297191 | -0.627470 |
b | 0.179488 | -0.299878 |
key1 | data1 min | data1 max |
---|---|---|
a | -0.627470 | -0.297191 |
b | -0.299878 | 0.179488 |
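The tables in this comparison differ only in their column labels. The sketch below shows agg calls that yield each shape; these calls are assumptions inferred from the tables, not the article's own code, and the data is the df printed in this article.

```python
import pandas as pd

# Fixed values taken from the df printed in this article
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

g = df.groupby('key1')

plain = g['data1'].min()                    # Series named data1
one_func = g['data1'].agg(['min'])          # DataFrame, column 'min'
by_column = g.agg({'data1': 'min'})         # DataFrame, column 'data1'
two_funcs = g['data1'].agg(['max', 'min'])  # DataFrame, columns 'max' and 'min'
nested = g.agg({'data1': ['min', 'max']})   # two-level columns under 'data1'

for out in (plain, one_func, by_column, two_funcs, nested):
    print(out)
```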
Column names can be changed as part of the groupby aggregation (this is not recommended; it is better to rename the columns separately afterwards)
key1 | a | b |
---|---|---|
a | -0.627470 | -0.297191 |
b | -0.299878 | 0.179488 |
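One way to get columns named a and b in current pandas is named aggregation, where each keyword becomes an output column. This is a sketch of that technique, not necessarily what the article used, and it reads a as the minimum and b as the maximum from the table above.

```python
import pandas as pd

# Fixed values taken from the df printed in this article
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# Keyword name = output column, keyword value = function to apply
renamed = df.groupby('key1')['data1'].agg(a='min', b='max')
print(renamed)
```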
Important tip: chaining reset_index() directly after the groupby aggregation gives a DataFrame without a multi-level index
The columns can then be renamed with df.rename(columns={'old_col1': 'new_col1', 'old_col2': 'new_col2', ...})
e.g.:
df1 = df.groupby(['date'])['price'].agg(['sum', 'count']).reset_index()
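A minimal runnable sketch of this tip. The `sales` frame, its values and the new column names are hypothetical, since the article does not show data with date and price columns.

```python
import pandas as pd

# Hypothetical data with the 'date' and 'price' columns the tip refers to
sales = pd.DataFrame({
    'date': ['2021-10-01', '2021-10-01', '2021-10-02'],
    'price': [10.0, 20.0, 5.0],
})

# reset_index() turns the group key back into an ordinary column
df1 = sales.groupby('date')['price'].agg(['sum', 'count']).reset_index()

# rename needs columns=...; a bare mapping would rename index labels instead
df1 = df1.rename(columns={'sum': 'total_price', 'count': 'n_sales'})
print(df1)
```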