Use of grouping function groupby and grouping operation function agg in python
- 2021-12-11 08:31:25
- OfStack
Today, I will introduce a very useful function groupby in pandas. In fact, the effect of groupby in hive is the same, and the difference lies in the writing of the two languages. groupby is useful in grouping statistics in Python ~
groupby:
Create the data first:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
'B': [2, 7, 1, 3, 3, 2, 4, 8],
'C': [100, 87, 96, 130, 105, 87, 96, 155]})
df
Out[2]:
A B C
0 a 2 100
1 b 7 87
2 a 1 96
3 c 3 130
4 a 3 105
5 c 2 87
6 b 4 96
Basic operation of groupby in pandas:
1. Grouping according to A column, and calculating the average value of B and C columns:
df.groupby('A').mean()
Out[6]:
B C
A
a 2.000000 100.333333
b 5.500000 91.500000
c 4.333333 124.000000
Of course, you can also group according to multiple columns to obtain the average value of other columns:
df.groupby(['A','B']).mean()
Out[7]:
C
A B
a 1 96
2 100
3 105
b 4 96
7 87
c 2 87
3 130
8 155
2. After grouping, select columns for calculation:
data=df.groupby('A')
data['B'].std()
Out[11]:
A
a 1.00000
b 2.12132
c 3.21455
Name: B, dtype: float64
# Select B , C Two columns
data['B','C'].mean()
Out[12]:
B C
A
a 2.000000 100.333333
b 5.500000 91.500000
c 4.333333 124.000000
3. After grouping by A, you can use different aggregation methods for different columns (ps: this point is very similar to hive)
data.agg({'B':'mean','C':'sum'}) #B Column mean, C Column summary
Out[14]:
C B
A
a 301 2.000000
b 183 5.500000
c 372 4.333333
4. If we use the same aggregation method for multiple columns after grouping according to A, we can use apply function:
df.groupby('A').apply(np.mean)
Out[25]:
B C
A
a 2.000000 100.333333
b 5.500000 91.500000
c 4.333333 124.000000
5. Divide a column of data into different range segments according to data values for grouping operation
To create a dataset:
np.random.seed(0)
df = pd.DataFrame({'Age': np.random.randint(20, 70, 100),
'Sex': np.random.choice(['Male', 'Female'], 100),
'number_of_foo': np.random.randint(1, 20, 100)})
Out[38]:
Age Sex number_of_foo
0 64 Female 14
1 67 Female 14
2 20 Female 12
3 23 Male 17
4 23 Female 15
Goal: The age fields are divided into three groups, which are achieved in the following two ways:
# No. 1 1 Methods:
1 , bins=4
pd.cut(df['Age'], bins=4)
0 (56.75, 69.0]
1 (56.75, 69.0]
2 (19.951, 32.25]
3 (19.951, 32.25]
4 (19.951, 32.25]...
# No. 1 2 Methods
2 , bins=[19, 40, 65, np.inf]
pd.cut(df['Age'], bins=[19,40,65,np.inf])
Out[40]:
0 (40.0, 65.0]
1 (65.0, inf]
2 (19.0, 40.0]
3 (19.0, 40.0]
4 (19.0, 40.0]
# The grouping range results are as follows:
age_groups = pd.cut(df['Age'], bins=[19,40,65,np.inf])
df.groupby(age_groups).mean()
Out[43]:
Age number_of_foo
Age
(19.0, 40.0] 29.840000 9.880000
(40.0, 65.0] 52.833333 9.452381
(65.0, inf] 67.375000 9.250000
# Press' Age' Cluster scope and gender ( sex ) Make a crosstab
pd.crosstab(age_groups, df['Sex'])
Out[44]:
Sex Female Male
Age
(19.0, 40.0] 22 28
(40.0, 65.0] 18 24
(65.0, inf] 3 5
agg:
1. After grouping by one column (A) using groupby, you need to adopt a different aggregation method for the other column:
df.groupby('A')['B'].agg({'mean':np.mean, 'std': np.std})
Out[16]:
std mean
A
a 1.00000 2.000000
b 2.12132 5.500000
c 3.21455 4.333333
2. After grouping according to a column, adopt different aggregation methods for different columns:
df.groupby('A').agg({'B':[np.mean,'sum'],'C':['count',np.std]}) #[] There are two methods corresponding to the
Out[17]:
C B
count std mean sum
A
a 3 4.509250 2.000000 6
b 2 6.363961 5.500000 11
c 3 34.394767 4.333333 13
transform:
The results obtained by the first two methods are indexed by A column values. What should I do if I use index without groupby grouping? At this point, the transform function will be used. The transform (func, args, * kwargs) method simplifies this process by applying the func parameter to all groupings and then placing the result on the index of the original array:
df.groupby('A').mean()
Out[6]:
B C
A
a 2.000000 100.333333
b 5.500000 91.500000
c 4.333333 124.000000
0
It can be seen that when grouping according to A column and counting B and C columns, the indexes with a as a have [0, 2, 4], so the values of [0, 2, 4] indexes in the result column are all 3, which is equivalent to broadcasting. For the C column, the same is true.