Sorting Within Groups for pandas groupby Objects
- 2021-10-15 11:06:07
- OfStack
Question:
Group the data by one column, then select the top-K rows of each group ranked by the value of another column
Analysis:
The idea is straightforward: group the data with groupby, sort each group by the chosen column, and keep the top-K rows of the sorted result
Case:
For the DataFrame below, take the row with the highest value of C within each group of column B
import pandas as pd
df = pd.DataFrame({"A": [2, 3, 5, 4], "B": ['a', 'b', 'b', 'a'], "C": [200801, 200902, 200704, 201003]})
Introduction of Basic Functions of Groupby
groupby returns a DataFrameGroupBy object. Nothing has been computed at this point; the object is only a temporary container that holds the grouping:
[In]df.groupby('B')
[Out]<pandas.core.groupby.DataFrameGroupBy object at 0x11800f588>
Selecting columns from a groupby result also returns a DataFrameGroupBy or SeriesGroupBy object, which cannot be displayed directly
[In]df.groupby('B')['A'] # Return SeriesGroupBy Object
[Out]<pandas.core.groupby.SeriesGroupBy object at 0x117f6b630>
[In]df.groupby('B')[['A','C']] # Return DataFrameGroupBy Object (multiple columns must be passed as a list)
[Out]<pandas.core.groupby.DataFrameGroupBy object at 0x117fb84e0>
A DataFrameGroupBy object only produces displayable data after a computation such as counting, statistics, agg aggregation, apply mapping, or transform (only count and size are shown below; the other operations are not covered)
[In] df.groupby('B', as_index=False)['A'].count() # Number of values in each group
[Out]
   B  A
0  a  2
1  b  2
[In] df.groupby('B')['A'].size().reset_index(name='Size') # Number of rows in each group. A notable difference between size and count: count ignores NaN, size includes NaN
[Out]
   B  Size
0  a     2
1  b     2
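The difference between count and size only shows up when a group contains missing values. The sketch below illustrates it; the frame `df_nan` and its values are made up for this illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'a' has one missing value in column A
df_nan = pd.DataFrame({"A": [2.0, np.nan, 5.0, 4.0],
                       "B": ["a", "a", "b", "b"]})

grouped = df_nan.groupby("B")["A"]
counts = grouped.count()  # non-NaN values per group
sizes = grouped.size()    # all rows per group, NaN included

print(counts.to_dict())
print(sizes.to_dict())
```

Group 'a' has two rows but only one non-missing value, so the two results differ for that group and agree for group 'b'.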
Solution 1:
On the DataFrameGroupBy object, use apply to call sort_values on a column within each group, then select the row that holds the maximum value
# The return value is a DataFrame with a MultiIndex, in which level=0 is the groupby "by" column and level=1 is the original index
[In] df.groupby('B').apply(lambda x: x.sort_values('C', ascending=False))
[Out]
     A  B       C
B
a 3  4  a  201003
  0  2  a  200801
b 1  3  b  200902
  2  5  b  200704
# Set group_keys=False to avoid the MultiIndex
[In] df.groupby('B', group_keys=False).apply(lambda x: x.sort_values('C', ascending=False))
[Out]
   A  B       C
3  4  a  201003
0  2  a  200801
1  3  b  200902
2  5  b  200704
# Call groupby again and use the built-in first() method to take the maximum value
[In] df.groupby('B', group_keys=False).apply(lambda x: x.sort_values('C', ascending=False)).groupby('B').first().reset_index()
[Out]
   B  A       C
0  a  4  201003
1  b  3  200902
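Newer pandas releases have changed how apply treats the grouping column (as far as I know, passing it to the applied function is deprecated from pandas 2.2), so the chain above may warn or fail depending on the installed version. A variant that should avoid the issue is sketched below: it selects the non-grouping columns explicitly and takes the top row inside apply. This is an adaptation for illustration, not the original author's code.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

# Select the non-grouping columns explicitly so the function never sees column B
per_group_top = (
    df.groupby("B", group_keys=True)[["A", "C"]]
      .apply(lambda g: g.sort_values("C", ascending=False).head(1))
)

# The index is (B, original row label); drop the row label and restore B as a column
result = per_group_top.droplevel(1).reset_index()
print(result)
```

Replacing head(1) with head(k) inside the lambda would keep the top k rows of each group instead.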
Solution 2:
First sort the whole DataFrame by C with sort_values, then take the first row of each group with groupby
[In] df.sort_values('C', ascending=False).groupby('B').first().reset_index()
[Out]
   B  A       C
0  a  4  201003
1  b  3  200902
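One caveat: first() returns the first non-null value of each column separately, so if a group contains NaN in some column the result can combine values from different rows. When intact rows are needed, one alternative (a different technique from the one above, shown only as a sketch) is to look up the index of the maximum with idxmax and select those rows with loc.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

# Index label of the row holding the largest C within each group of B
idx = df.groupby("B")["C"].idxmax()

# Pull those rows out whole, so values are never mixed across rows
top1 = df.loc[idx].reset_index(drop=True)
print(top1)
```

This only covers Top-1; it does not generalize to Top-K the way head() does.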
Problem expansion:
The above only solves Top-1. What about Top-K?
Answer: replace the first() function with the head() function
[In] df.sort_values('C', ascending=False).groupby('B').head(2)
[Out]
   A  B       C
3  4  a  201003
1  3  b  200902
0  2  a  200801
2  5  b  200704
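The sort-then-head pattern can be wrapped in a small helper. The function name `top_k_per_group` and its parameters are hypothetical, introduced here only for illustration.

```python
import pandas as pd


def top_k_per_group(frame, group_col, sort_col, k):
    """Return the k rows with the largest sort_col within each group_col group."""
    ordered = frame.sort_values(sort_col, ascending=False)
    # head(k) keeps the first k rows of every group, in the sorted order
    return ordered.groupby(group_col).head(k)


df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

top1 = top_k_per_group(df, "B", "C", 1)
top2 = top_k_per_group(df, "B", "C", 2)
print(top1)
print(top2)
```

Note that head() keeps the original row labels and the global sort order, so rows from different groups are interleaved; sort by the group column afterwards if the groups should appear together.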
Summary:
1. Solution 2 (sort first, then take the value with groupby) is more convenient
2. pandas has many APIs, so it helps to be clear about the return type at each step; this makes them easier to remember and to relate to one another
Supplement: pandas grouping (groupby and agg), sorting (sort_values), and joining (concat and join)
Joining with concat and join
Horizontal concatenation
pd.concat([df6,df7],axis=1)
df6.join(df7)
# df6 comes first; to put df7 first, swap the two
Note:
1. concat can do both horizontal and vertical concatenation, controlled by the axis parameter; axis=1 means horizontal. Put the objects to be concatenated in a list
2. join can also perform a horizontal concatenation
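The article does not show the contents of df6 and df7, so the sketch below uses made-up frames with the same index and different column names to compare the two calls.

```python
import pandas as pd

# Hypothetical frames; the original article does not define df6 and df7
df6 = pd.DataFrame({"name": ["x", "y"]}, index=[0, 1])
df7 = pd.DataFrame({"score": [90, 80]}, index=[0, 1])

wide_concat = pd.concat([df6, df7], axis=1)  # align on index, columns side by side
wide_join = df6.join(df7)                    # left join on index; df6 columns come first

print(wide_concat)
print(wide_join.equals(wide_concat))
```

With matching indexes the two results are identical. join raises an error when the frames share a column name unless a suffix is supplied, whereas concat simply keeps both columns.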
Vertical concatenation
pd.concat([df8,df9],ignore_index=True)
Note:
1. The frames to be stacked vertically must be passed together in a list []
2. ignore_index discards the original row index and renumbers the rows
3. drop_duplicates() deletes duplicate rows
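A short sketch of the three notes above, again with made-up data because df8 and df9 are not defined in the article:

```python
import pandas as pd

# Hypothetical frames sharing the same columns; the row ("y", 80) appears in both
df8 = pd.DataFrame({"name": ["x", "y"], "score": [90, 80]})
df9 = pd.DataFrame({"name": ["y", "z"], "score": [80, 70]})

# Stack the rows and renumber the index 0..3
stacked = pd.concat([df8, df9], ignore_index=True)

# Remove the repeated row, then renumber again
deduped = stacked.drop_duplicates().reset_index(drop=True)

print(stacked)
print(deduped)
```

Without ignore_index=True the stacked frame would keep the labels 0, 1, 0, 1 from the two inputs.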
Sort
# Sort by score; the default order is ascending, from small to large
df10.sort_values('score')
# Sort by several columns in descending order, with missing values placed first
df10.sort_values(['score','group'],ascending=False,na_position='first')
# Parameters of sort_values
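Parameter | Description |
---|---|
by | A string or a list: use a string for a single sort column and a list for multiple columns |
ascending | True sorts in ascending order, which is the default |
na_position | Where missing values are placed: 'last' (the default) puts them at the end, 'first' at the beginning |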
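To see the parameters in action, the sketch below builds a small df10 with one missing score. The data is invented for illustration, since the article does not show df10.

```python
import numpy as np
import pandas as pd

# Hypothetical data; df10 is not defined in the original article
df10 = pd.DataFrame({"group": ["g1", "g2", "g1", "g2"],
                     "score": [70.0, np.nan, 90.0, 80.0]})

# Ascending by default; NaN goes last by default
asc = df10.sort_values("score")

# Descending on both columns, with NaN rows moved to the top
desc_nan_first = df10.sort_values(["score", "group"],
                                  ascending=False, na_position="first")

print(asc)
print(desc_nan_first)
```

na_position is independent of ascending: the missing value stays where na_position puts it regardless of the sort direction.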
Grouping
[In]df.groupby('B')
[Out]<pandas.core.groupby.DataFrameGroupBy object at 0x11800f588>
Note:
1. To group by one column, pass its name to groupby; to group by multiple columns, pass them in a list []
2. groupby returns an object, so its contents cannot be viewed directly. You can iterate over it with a for loop
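Both notes can be sketched with the df from the first part of the article. Iterating over a groupby object yields pairs of group key and sub-DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

# One grouping column: each key is a scalar, each value is that group's rows
groups = {key: sub for key, sub in df.groupby("B")}
for key, sub in groups.items():
    print(key)
    print(sub)

# Several grouping columns are passed as a list; each key is then a tuple
multi_keys = [key for key, _ in df.groupby(["B", "A"])]
print(multi_keys)
```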
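Selecting columns after grouping
Selecting a single column with [[column name]] keeps the result DataFrame-shaped: aggregating it returns a DataFrame, whereas [column name] leads to a Series
To select more than one column, use [[...]]; older pandas versions also accepted a bare [col1, col2], but pandas 2.0 and later require the list form
In short: [[column 1, column 2, ...]]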
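The bracket forms can be compared on the df from the first part of the article; max() is used here only as an example aggregation.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

g = df.groupby("B")

one_col_series = g["A"].max()    # single brackets: SeriesGroupBy, result is a Series
one_col_frame = g[["A"]].max()   # double brackets: DataFrameGroupBy, result is a DataFrame
two_cols = g[["A", "C"]].max()   # several columns are passed as a list

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(two_cols)
```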
Using groupby together with the aggregate function agg
Function | Description |
---|---|
mean | Mean |
max | Maximum |
min | Minimum |
median | Median |
std | Standard deviation |
count | Count |
skew | Skewness |
quantile | Specified quantile |
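The functions in the table are passed to agg by name. The sketch below applies a few of them to the df from the first part of the article; the choice of columns and functions is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

# One function on one column: returns a Series indexed by group
c_max = df.groupby("B")["C"].agg("max")

# Several functions at once: one result column per function
a_stats = df.groupby("B")["A"].agg(["mean", "max", "min", "count"])

# Different functions per column, given as a dict
mixed = df.groupby("B").agg({"A": "mean", "C": "max"})

print(c_max)
print(a_stats)
print(mixed)
```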