An Example of pandas groupby Grouping and agg Aggregation

2021-10-15 11:05:12
OfStack

As follows:


import pandas as pd
 
df = pd.DataFrame({'Country':['China','China', 'India', 'India', 'America', 'Japan', 'China', 'India'], 
     'Income':[10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
     'Age':[5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})

The constructed data is as follows:


    Age  Country  Income
0  5000    China   10000
1  4321    China   10000
2  1234    India    5000
3  4010    India    5002
4   250  America   40000
5   250    Japan   50000
6  4500    China    8000
7  4321    India    5000

Grouping

Single column grouping


df_gb = df.groupby('Country')
for index, data in df_gb:
 print(index)
 print(data)

Output


America
   Age  Country  Income
4  250  America   40000
China
    Age Country  Income
0  5000   China   10000
1  4321   China   10000
6  4500   China    8000
India
    Age Country  Income
2  1234   India    5000
3  4010   India    5002
7  4321   India    5000
Japan
   Age Country  Income
5  250   Japan   50000

Multi-column grouping


df_gb = df.groupby(['Country', 'Income'])
for (index1, index2), data in df_gb:
 print((index1, index2))
 print(data)

Output


('America', 40000)
   Age  Country  Income
4  250  America   40000
('China', 8000)
    Age Country  Income
6  4500   China    8000
('China', 10000)
    Age Country  Income
0  5000   China   10000
1  4321   China   10000
('India', 5000)
    Age Country  Income
2  1234   India    5000
7  4321   India    5000
('India', 5002)
    Age Country  Income
3  4010   India    5002
('Japan', 50000)
   Age Country  Income
5  250   Japan   50000

Aggregation

Aggregate the grouped data

Aggregate all other columns after grouping (the default)


df_agg = df.groupby('Country').agg(['min', 'mean', 'max'])
print(df_agg)

Output


          Age                     Income
          min         mean   max     min          mean    max
Country
America   250   250.000000   250   40000  40000.000000  40000
China    4321  4607.000000  5000    8000   9333.333333  10000
India    1234  3188.333333  4321    5000   5000.666667   5002
Japan     250   250.000000   250   50000  50000.000000  50000

Aggregate some columns after grouping

Sometimes only certain columns need to be aggregated, or different columns need different aggregations. In that case, pass agg a dictionary that maps each column name to its aggregation function(s)


num_agg = {'Age':['min', 'mean', 'max']}
print(df.groupby('Country').agg(num_agg))

Output


          Age
          min         mean   max
Country
America   250   250.000000   250
China    4321  4607.000000  5000
India    1234  3188.333333  4321
Japan     250   250.000000   250

num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
print(df.groupby('Country').agg(num_agg))

Output


          Age                    Income
          min         mean   max    min    max
Country
America   250   250.000000   250  40000  40000
China    4321  4607.000000  5000   8000  10000
India    1234  3188.333333  4321   5000   5002
Japan     250   250.000000   250  50000  50000

Supplement: a fairly complete guide to pandas groupby and agg for grouping and summarizing tabular data

This write-up on groupby is not my best; the topic is complicated. In practice only a handful of patterns come up often, and the commonly used ones are easiest to learn and apply when they are collected in one place. I may write a follow-up later.

What groupby does: grouping

groupby + agg (aggregate functions): after grouping, apply a function to each group, such as 'sum', 'mean', 'max', 'min', …

By default, groupby groups rows (vertically), i.e. axis=0


Build a sample DataFrame:

import pandas as pd
import numpy as np


df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
     'key2':['one', 'two', 'one', 'two', 'one'],
     'data1':np.random.randn(5),
     'data2':np.random.randn(5)})
print(df)

The values are random and differ between runs; the run used for the outputs in this article was:

      data1     data2 key1 key2
0 -0.410122  0.247895    a  one
1 -0.627470 -0.989268    a  two
2  0.179488 -0.054570    b  one
3 -0.299878 -1.640494    b  two
4 -0.297191  0.954447    a  one

Grouping and iterating over the grouping



Wrapping the groupby object in list() gives: [(group1), (group2), …]

Each data slice (group) has the form of a (name, group) tuple
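As a minimal sketch of the list() form described above: the DataFrame below reuses the values printed in this article instead of random numbers (an assumption made so the result is reproducible), and the names groups and names are only illustrative.

```python
import pandas as pd

# Same layout as the article's df, with the printed values hard-coded.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# list() materializes the groupby object into [(name, group), ...].
# A plain string key is used so that each name is a scalar.
groups = list(df.groupby('key1'))

names = [name for name, _ in groups]
print(names)         # the group names
print(groups[0][1])  # the data slice that belongs to the first group
```

Each element of groups is a 2-tuple whose second item is an ordinary DataFrame holding the rows of that group.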

1. Group by key1 (one column); the groups are determined by the values in key1

The groupby object supports iteration and yields a sequence of 2-tuples: (group name, data block), (group name, data block), …
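Iterating over a single-key grouping might look like the following sketch. The DataFrame hard-codes the values printed in this article (an assumption for reproducibility), and sizes is a hypothetical helper dictionary used only to record what the loop saw.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

sizes = {}
# Each iteration yields (group name, data block).
for name, group in df.groupby('key1'):
    print(name)
    print(group)
    sizes[name] = len(group)  # number of rows in this group
```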



2. Group by [key1, key2] (multiple columns)

With multiple keys, iteration again yields a sequence of 2-tuples: ((k1, k2), data block), ((k1, k2), data block), …

The first element of each tuple is itself a tuple of the key values
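A short sketch of iterating over a two-key grouping, again with the article's printed values hard-coded (an assumption); seen is a hypothetical list that records each key pair and the size of its data block.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

seen = []
# The group name is now a tuple of key values, unpacked here as (k1, k2).
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)
    seen.append((k1, k2, len(group)))
```

Only key pairs that actually occur in the data produce a group.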



3. Grouping by Function

4. Group by dictionary

5. Grouping by Index Level

6. Functions can be freely mixed with arrays, lists, dictionaries, and Series, because everything is eventually converted to arrays
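The article gives no code for items 3 to 6, so the following is only an illustrative sketch. The people and leveled frames, the team mapping, and the index names are all made up for this example.

```python
import pandas as pd

# Hypothetical data: the index labels are what functions and dicts operate on.
people = pd.DataFrame(
    {'score': [1, 2, 3, 4, 5]},
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)

# 3. Function: called once per index label; the return value is the group key.
by_len = people.groupby(len)['score'].sum()

# 4. Dictionary: maps index labels to group names.
team = {'Joe': 'red', 'Steve': 'red', 'Wes': 'blue', 'Jim': 'blue', 'Travis': 'red'}
by_team = people.groupby(team)['score'].sum()

# 5. Index level: group by one level of a MultiIndex.
idx = pd.MultiIndex.from_arrays(
    [['x', 'x', 'y', 'y', 'y'], [1, 2, 1, 2, 3]], names=['outer', 'inner']
)
leveled = pd.DataFrame({'score': [1, 2, 3, 4, 5]}, index=idx)
by_level = leveled.groupby(level='outer')['score'].sum()

# 6. Mixing a function with a plain list of keys.
mixed = people.groupby([len, ['k1', 'k1', 'k2', 'k2', 'k2']])['score'].sum()

print(by_len, by_team, by_level, mixed, sep='\n\n')
```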

Turn these data slices into a dictionary


dict(list(df.groupby('key1')))  # dict(list(...)) maps each group name to its data slice


{'a':  data1  data2 key1 key2
 0 -0.410122 0.247895 a one
 1 -0.627470 -0.989268 a two
 4 -0.297191 0.954447 a one, 'b':  data1  data2 key1 key2
 2 0.179488 -0.054570 b one
 3 -0.299878 -1.640494 b two}

After grouping, make some statistics, calculations, etc.

1. After grouping, return a Series containing the size of each group

Grouped by key1


df.groupby(['key1']).size()


key1
a 3
b 2
dtype: int64


dict(['a1','x2','e3'])



{'a': '1', 'e': '3', 'x': '2'}

Grouped by [key1, key2]


df.groupby(['key1','key2']).size()


key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

2. Group data1 by key1 and calculate the mean of the data1 column
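The original snippet for this step did not survive, so the following is a reconstruction under assumptions: it hard-codes the values printed in this article and uses the key1 column (a Series) as the grouping key. The name mean_by_key1 is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# Select the data1 Series, group it by the key1 Series, then average each group.
grouped = df['data1'].groupby(df['key1'])
mean_by_key1 = grouped.mean()
print(mean_by_key1)
```

The result is a Series named data1 with one entry per distinct value of key1.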




key1
a -0.444928
b -0.060195
Name: data1, dtype: float64

Notes:

groupby itself performs no calculation; it only splits the data into groups.

The data (a Series) is aggregated according to the grouping key, producing a new Series whose index consists of the unique values in the key1 column.

Selecting columns from a groupby object by indexing returns either a grouped DataFrame (if a list or array of column names is passed in) or a grouped Series (if a single column name is passed in)
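The difference between the two forms of column selection can be sketched as follows, using the article's printed values (hard-coded here as an assumption); as_series and as_frame are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# A single column name selects a grouped Series, so the result is a Series.
as_series = df.groupby('key1')['data1'].mean()

# A list of column names selects a grouped DataFrame, so the result is a DataFrame.
as_frame = df.groupby('key1')[['data1']].mean()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```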



3. Group data1 by [key1, key2] and calculate the mean of data1



The data is grouped by two keys, so the resulting Series has a hierarchical index made up of the unique key pairs that occur:
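The code for this step is missing from the article, so this is a reconstruction under assumptions: the values are the ones printed in this article, and unstack() is assumed to be how the key1 by key2 table shown in the article was obtained. The names means and wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# Group data1 by two key Series; the result has a two-level index (key1, key2).
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(means)

# unstack() pivots the inner level (key2) into columns.
wide = means.unstack()
print(wide)
```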



key2	one	two
key1
a	-0.353657	-0.627470
b	0.179488	-0.299878

In the examples above, the grouping keys are Series. In fact, a grouping key can be any array of the appropriate length, which makes groupby very flexible.
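To illustrate grouping by arbitrary arrays, here is a small sketch. The states and years arrays are invented for this example and are not columns of df; they only need to have one entry per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# Hypothetical keys: plain NumPy arrays with the same length as df.
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

# The arrays are matched to the rows of data1 by position.
result = df['data1'].groupby([states, years]).mean()
print(result)
```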

Grouping horizontally (along the columns)

By column data type (df.dtypes)

df has two data types, float64 and object, so it is split into two groups: (dtype('float64'), data slice) and (dtype('O'), data slice)
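The original call is missing here. The listed result has the shape produced by grouping the columns on df.dtypes with axis=1, but that option appears to be deprecated in recent pandas releases. The sketch below therefore builds the same kind of (dtype, data slice) pairs in plain Python; col_groups and slices are illustrative names, and the string columns may report a dtype other than object depending on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# Collect the column names that share each dtype.
col_groups = {}
for col, dtype in df.dtypes.items():
    col_groups.setdefault(dtype, []).append(col)

# One (dtype, column slice) pair per dtype, mirroring [(name, group), ...].
slices = [(dtype, df[cols]) for dtype, cols in col_groups.items()]

for dtype, part in slices:
    print(dtype)
    print(part)
```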




[(dtype('float64'),  data1  data2
 0 -0.410122 0.247895
 1 -0.627470 -0.989268
 2 0.179488 -0.054570
 3 -0.299878 -1.640494
 4 -0.297191 0.954447), (dtype('O'), key1 key2
 0 a one
 1 a two
 2 b one
 3 b two
 4 a one)]

Using agg

groupby + agg can apply several functions to the result of groupby at the same time

Parameters of the agg() method of SeriesGroupBy:



Returns: the aggregated Series
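The examples that belonged here were lost. The small min/max table in this article is consistent with a sketch like the following, where the Series s and the group labels [1, 1, 2, 2] are assumptions chosen to reproduce that table.

```python
import pandas as pd

# Hypothetical input: four values, grouped pairwise by the labels 1 and 2.
s = pd.Series([10, 20, 30, 40])
grouped = s.groupby([1, 1, 2, 2])

# A single function name returns a Series with one value per group.
single = grouped.agg('min')
print(single)

# A list of function names returns a DataFrame with one column per function.
multi = grouped.agg(['min', 'max'])
print(multi)
```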



	min	max
1	10	20
2	30	40

Often used like this:

df

	data1	data2	key1	key2
0	-0.410122	0.247895	a	one
1	-0.627470	-0.989268	a	two
2	0.179488	-0.054570	b	one
3	-0.299878	-1.640494	b	two
4	-0.297191	0.954447	a	one

You can see the usefulness of agg by comparing the following:
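The calls being compared are missing from the article. The four result tables that follow are consistent with calls like the ones sketched here, which is a reconstruction and an assumption rather than the original code. The values are the ones printed for df in this article, and r1 to r4 are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

g = df.groupby('key1')

# One function on one column; the result column is named after the function.
r1 = g['data1'].agg(['min'])

# Dict form; the result column keeps the source column name.
r2 = g.agg({'data1': 'min'})

# Several functions on one column; one result column per function.
r3 = g['data1'].agg(['max', 'min'])

# Dict with a list; the result has two-level columns (column, function).
r4 = g.agg({'data1': ['min', 'max']})

print(r1, r2, r3, r4, sep='\n\n')
```

A list is used instead of a set for the function names so that the column order is predictable.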



	min
key1
a	-0.627470
b	-0.299878



	data1
key1
a	-0.627470
b	-0.299878



	max	min
key1
a	-0.297191	-0.627470
b	0.179488	-0.299878



	data1
	min	max
key1
a	-0.627470	-0.297191
b	-0.299878	0.179488

The result columns can also be renamed inside agg (not recommended; it is usually clearer to rename the columns separately afterwards)
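The a/b table in this article looks like the result of passing a renaming dictionary to agg on a single column, a form that newer pandas versions reject as far as I know. A sketch of the same result using keyword ("named") aggregation, which should be available from pandas 0.25 onward, with the article's printed values hard-coded as an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.410122, -0.627470, 0.179488, -0.299878, -0.297191],
    'data2': [0.247895, -0.989268, -0.054570, -1.640494, 0.954447],
})

# Each keyword becomes an output column name; its value is the function to apply.
renamed = df.groupby('key1')['data1'].agg(a='min', b='max')
print(renamed)
```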



	a	b
key1
a	-0.627470	-0.297191
b	-0.299878	0.179488

Important tip: calling reset_index() directly on the result of groupby(...).agg(...) turns the group keys back into ordinary columns, giving a DataFrame without a multi-level index

The columns can then be renamed with df.rename(columns={'old_col1': 'new_col1', 'old_col2': 'new_col2', ...})

e.g.:


df1 = df.groupby('date')['price'].agg(['sum', 'count']).reset_index()
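The date and price columns in that line do not exist in the earlier df, so the sketch below uses a small made-up sales table to show the whole pattern. The data and the new column names total_price and n_orders are hypothetical.

```python
import pandas as pd

# Hypothetical data with the two columns the one-liner expects.
sales = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01', '2021-01-02'],
    'price': [10.0, 20.0, 5.0],
})

# Aggregate, then move the group key from the index back into a column.
df1 = sales.groupby('date')['price'].agg(['sum', 'count']).reset_index()

# Rename the aggregated columns in a separate, explicit step.
df1 = df1.rename(columns={'sum': 'total_price', 'count': 'n_orders'})
print(df1)
```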