Sorting Within Groups of a pandas groupby Object

  • 2021-10-15 11:06:07
  • OfStack

Question:

After grouping the data by one column, select the rows with the top-K values of another column within each group.

Analysis:

The idea is straightforward: group the data with groupby, sort each group by the chosen column, and then select the top-K rows from the sorted result.

Case:

For the DataFrame below, take the row with the highest value of C within each group of column B


import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4], "B": ['a', 'b', 'b', 'a'], "C": [200801, 200902, 200704, 201003]})

Basic groupby functionality

groupby returns a DataFrameGroupBy object, which has not actually been evaluated yet; it is only a temporary container:


[In]df.groupby('B')
[Out]<pandas.core.groupby.DataFrameGroupBy object at 0x11800f588>

Selecting columns from a groupby result also returns a DataFrameGroupBy or SeriesGroupBy object, which likewise cannot be displayed directly:


[In]df.groupby('B')['A']   #  Returns a SeriesGroupBy object
[Out]<pandas.core.groupby.SeriesGroupBy object at 0x117f6b630>

[In]df.groupby('B')[['A','C']]   #  Returns a DataFrameGroupBy object (multiple columns must be passed as a list)
[Out]<pandas.core.groupby.DataFrameGroupBy object at 0x117fb84e0>
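Although a groupby object prints only as a memory address, its contents can still be inspected. The following is a minimal sketch, reusing the article's df, that looks inside the object with `groups` and `get_group` and checks the types returned by column selection. The variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

grouped = df.groupby("B")

# Map each group label to the row labels that belong to it
group_rows = {key: list(idx) for key, idx in grouped.groups.items()}
print(group_rows)

# Materialize a single group as an ordinary DataFrame
group_a = grouped.get_group("a")
print(group_a)

# One column name gives a SeriesGroupBy; a list of names gives a DataFrameGroupBy
single = grouped["A"]
multiple = grouped[["A", "C"]]
print(type(single).__name__, type(multiple).__name__)
```

Nothing is aggregated in this sketch; the data only becomes a regular Series or DataFrame once a method such as count, agg, or get_group is called.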

A DataFrameGroupBy object must go through counting, statistics, agg aggregation, apply, or transform before it produces data that can be displayed (only count and size are shown below; the other operations are not covered here)


[In] df.groupby('B', as_index=False)['A'].count()  #  Count of non-null values in each group
[Out] B A
 0 a 2
 1 b 2

[In] df.groupby('B')['A'].size().reset_index(name='Size') #  Number of rows in each group; the notable difference between size and count is that count ignores NaN while size includes it
[Out] B Size
 0  a 2
 1  b 2
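The article's df contains no missing values, so count and size give the same numbers above. A small sketch with a hypothetical frame `df_nan` (not from the original article) makes the difference visible:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value of A in group "b" is missing
df_nan = pd.DataFrame({"A": [2.0, np.nan, 5.0, 4.0],
                       "B": ["a", "b", "b", "a"]})

counts = df_nan.groupby("B")["A"].count()  # non-null values per group
sizes = df_nan.groupby("B")["A"].size()    # all rows per group, NaN included

print(counts)
print(sizes)
```

Here group b has two rows but only one non-null value of A, so count and size disagree for that group.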

Solution 1:

Use apply on the DataFrameGroupBy object to sort each group by a column with sort_values, then select the row that holds the maximum value


#  The return value is a DataFrame with a MultiIndex, where level=0 is the groupby key column and level=1 is the original index
[In] df.groupby('B').apply(lambda x: x.sort_values('C', ascending=False))
[Out] A B C
B  
a 3 4 a 201003
 0 2 a 200801
b 1 3 b 200902
 2 5 b 200704

#  Set the group_keys parameter to False to avoid the MultiIndex
[In] df.groupby('B', group_keys=False).apply(lambda x: x.sort_values('C', ascending=False))
[Out] A B  C
 3 4 a 201003
 0 2 a 200801
 1 3 b 200902
 2 5 b 200704

#  Group again and call the built-in first() method to take the row with the maximum value
[In] df.groupby('B', group_keys=False).apply(lambda x: x.sort_values('C', ascending=False)).groupby('B').first().reset_index()
[Out]  B A  C
 0 a 4 201003
 1 b 3 200902

Solution 2:

First sort the whole DataFrame by C with sort_values, then group by B and take the first row of each group


[In] df.sort_values('C', ascending=False).groupby('B').first().reset_index()
[Out]  B A  C
 0 a 4 201003
 1 b 3 200902
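As a cross-check, the sort-then-first approach can be compared with a different technique, `idxmax`, which is not used in the original article. The sketch below assumes the same df and shows that both pick the same rows:

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

# Approach from the text: sort by C descending, then take the first row per group
top1_sorted = (df.sort_values("C", ascending=False)
                 .groupby("B")
                 .first()
                 .reset_index())

# Alternative (not from the article): find the row label of each group's
# maximum C, then select those rows from the original frame
top1_idxmax = df.loc[df.groupby("B")["C"].idxmax()]

print(top1_sorted)
print(top1_idxmax)
```

One design point to keep in mind: first() returns the first non-null value of each column, so if the data contains NaN the result can combine values from different rows. The idxmax variant always returns whole rows and keeps the original index.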

Problem expansion:

The above only solves the Top-1 problem. What about Top-K?

Answer: replace the first() function with the head() function


[In] df.sort_values('C', ascending=False).groupby('B').head(2)
[Out] A B C
 3 4 a 201003
 1 3 b 200902
 0 2 a 200801
 2 5 b 200704
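Because each group in the article's df has only two rows, head(2) returns every row. The sketch below uses a hypothetical larger frame `df_big` (an assumption for illustration) so that Top-K actually drops rows, and also shows `nlargest` as an alternative that is not part of the original article:

```python
import pandas as pd

# Hypothetical data with three rows per group
df_big = pd.DataFrame({"B": ["a", "a", "a", "b", "b", "b"],
                       "C": [10, 30, 20, 5, 15, 25]})
k = 2

# Approach from the text: sort globally, then keep the first k rows of each group
topk = df_big.sort_values("C", ascending=False).groupby("B").head(k)

# head() keeps the order of the sorted frame, so groups are interleaved;
# sort again by group and value for a grouped layout
topk_grouped = topk.sort_values(["B", "C"], ascending=[True, False])

# Alternative (not from the article): the k largest values of C per group
topk_values = df_big.groupby("B")["C"].nlargest(k)

print(topk_grouped)
print(topk_values)
```

The head() variant returns complete rows, while nlargest on a single column returns only the values of that column, indexed by group and original row label.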

Summary:

1. Solution 2, sorting first and then taking values from the groupby, is more convenient

2. pandas has many APIs, so it helps to be clear about the return type at each step; this makes them easier to remember and to relate to one another

Supplement: grouping with groupby and agg, sorting with sort_values, and joining with concat and join in pandas

Joining with concat and join

Horizontal concatenation


pd.concat([df6,df7],axis=1)
df6.join(df7)

# df6 comes first in the result; to put df7 first, swap the two

Note:

1. concat can do both horizontal and vertical concatenation, controlled by the axis argument; axis=1 means horizontal. Put the objects to be concatenated in a list

2. join can also do horizontal concatenation
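The article does not show the contents of df6 and df7, so the sketch below uses small hypothetical frames that share the same index. Under that assumption, concat with axis=1 and join give the same result:

```python
import pandas as pd

# Hypothetical frames sharing the same row index
df6 = pd.DataFrame({"name": ["x", "y"]}, index=[0, 1])
df7 = pd.DataFrame({"score": [90, 80]}, index=[0, 1])

wide_concat = pd.concat([df6, df7], axis=1)  # columns of df6, then df7
wide_join = df6.join(df7)                    # same result, aligned on the index
swapped = df7.join(df6)                      # df7's columns come first

print(wide_concat)
print(swapped)
```

Both calls align rows on the index. Note that join raises an error when the two frames have overlapping column names unless a suffix is supplied.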

Vertical concatenation


pd.concat([df8,df9],ignore_index=True)

Note:

1. The objects to be concatenated vertically must be collected in a list with []

2. ignore_index discards the original row index and renumbers the rows

3. drop_duplicates() removes duplicate rows
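The three notes above can be sketched together. The frames df8 and df9 are hypothetical here, since the article does not show their contents:

```python
import pandas as pd

# Hypothetical frames with the same columns and one repeated row ("y", 80)
df8 = pd.DataFrame({"name": ["x", "y"], "score": [90, 80]})
df9 = pd.DataFrame({"name": ["y", "z"], "score": [80, 70]})

# Stack the rows; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([df8, df9], ignore_index=True)

# Drop the repeated row, then renumber again
deduped = stacked.drop_duplicates().reset_index(drop=True)

print(stacked)
print(deduped)
```

Without ignore_index=True the stacked frame would keep the original labels 0, 1, 0, 1, which makes label-based selection ambiguous.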

Sort


# Sort by score
df10.sort_values('score')
# Ascending by default (smallest to largest)

df10.sort_values(['score','group'],ascending=False,na_position='first')
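Parameters of sort_values:

Parameter    Description
by           A string or a list: use a string for a single sort column and a list for multiple columns
ascending    True sorts in ascending order, which is the default
na_position  Where null values are placed: 'last' (the default) or 'first' (at the start)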
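To see these parameters in action, here is a small sketch with a hypothetical df10 (the article does not show its contents) that contains one missing score:

```python
import numpy as np
import pandas as pd

# Hypothetical data; one score is missing
df10 = pd.DataFrame({"group": ["g1", "g2", "g1", "g2"],
                     "score": [88.0, np.nan, 95.0, 70.0]})

# Ascending by default; NaN goes last by default
asc = df10.sort_values("score")

# Descending on both columns, with NaN moved to the top
desc_nan_first = df10.sort_values(["score", "group"],
                                  ascending=False,
                                  na_position="first")

print(asc)
print(desc_nan_first)
```

sort_values returns a new DataFrame and leaves df10 unchanged, so the result has to be assigned if it is needed later.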

Grouping


[In]df.groupby('B')
[Out]<pandas.core.groupby.DataFrameGroupBy object at 0x11800f588>

Note:

1. To group by a single column, pass its name to groupby; to group by multiple columns, pass them as a list in []

2. groupby returns an object, so its contents cannot be viewed directly. You can iterate over it with a for loop
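Both notes can be illustrated with the article's df. This is a minimal sketch; the variable names are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

# Iterating yields (group key, sub-DataFrame) pairs
keys = []
for key, group in df.groupby("B"):
    keys.append(key)
    print(key)
    print(group)

# Grouping by several columns: pass a list of column names
pair_sizes = df.groupby(["B", "A"]).size()
print(pair_sizes)
```

Each `group` in the loop is an ordinary DataFrame containing only the rows of that group, so any DataFrame method can be applied to it.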

Selecting columns after grouping

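Selecting one column with [[column name]] gives a DataFrame-based result, whereas [column name] gives a Series-based one

To select more than one column, use [[]]. Older pandas versions also accepted several names inside a single [], but recent versions require a list

In short: [[column 1, column 2, ...]]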
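The effect of single versus double brackets is easiest to see after an aggregation. A short sketch using the article's df:

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

as_series = df.groupby("B")["A"].sum()          # single brackets: Series
as_frame = df.groupby("B")[["A"]].sum()         # double brackets: DataFrame
two_cols = df.groupby("B")[["A", "C"]].max()    # several columns: DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
print(two_cols)
```

Choosing double brackets even for a single column is convenient when later steps expect a DataFrame, for example before a merge.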

Using groupby together with the aggregate function agg


Function  Description
mean      Mean
max       Maximum
min       Minimum
median    Median
std       Standard deviation
count     Count
skew      Skewness
quantile  Specified quantile
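The functions listed above can be passed to agg by name. The original agg example is not available here, so the following is only a sketch on the article's df; the output column names `a_total` and `c_max` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 5, 4],
                   "B": ["a", "b", "b", "a"],
                   "C": [200801, 200902, 200704, 201003]})

# Several aggregate functions applied to one column
stats = df.groupby("B")["C"].agg(["mean", "max", "min", "count"])

# Named aggregation: a different function for each column
summary = df.groupby("B").agg(a_total=("A", "sum"), c_max=("C", "max"))

# quantile takes the quantile level as an argument
q50 = df.groupby("B")["C"].quantile(0.5)

print(stats)
print(summary)
print(q50)
```

Passing a list of function names produces one output column per function, while named aggregation gives explicit control over the output column names.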
