How does pandas packet sort get the second largest data

  • 2021-10-15 11:04:10
  • OfStack

Python used to do data analysis is very convenient, a lot of online data to find the second largest method, but most of them are about SQL, so I challenge 1 to use Python to do this (mainly SQL is not good > _ < ) on the code.

1. I made up the data myself

In the actual work should import data from the database, how to export data from the database, I will add later.


import pandas as pd
df = pd.DataFrame([
    {"class": 1, "name": "aa", "english": 120},
    {"class": 1, "name": "bb", "english": 110},
    {"class": 1, "name": "cc", "english": 110},
    {"class": 1, "name": "dd", "english": 110},
    {"class": 2, "name": "ee", "english": 120},
    {"class": 2, "name": "ff", "english": 140},
    {"class": 2, "name": "gg", "english": 130},
    {"class": 2, "name": "hh", "english": 130},
    {"class": 3, "name": "tt", "english": 130},
    {"class": 4, "name": "xx", "english": 130},
    {"class": 4, "name": "yy", "english": 130},
    {"class": 5, "name": "zz", "english": None},
  ])

2. Take the second largest data in groups


def fun(df):
  # english Data deduplication 
  sort_set = set(df["english"].values.tolist())
  if len(sort_set)<=1:
    #  The amount of data is less than or equal to 1 , unable to get the first 2 Big data 
    return None
  else:
    #  Take english Middle grade 2 A large value 
    sort_value = sorted(sort_set,reverse=True)[1]
    temp_df = df[df["english"]==sort_value]
    return temp_df
 
df = df.groupby(by=["class"]).apply(fun).reset_index(drop=True)
print(df)

The results are as follows:


  class name english
0 1 bb 110.0
1 1 cc 110.0
2 1 dd 110.0
3 2 gg 130.0
4 2 hh 130.0

3. Finished, it's as simple as that

Of course, this can also be changed to take the largest, take the smallest, take the third largest, and so on......

Supplement: pandas is sorted by one column A, grouped by two columns B and C, and the row with the largest value of A column after grouping is selected

pandas is sorted by one column A, grouped by two columns B and C, and the row with the largest value of A column after grouping is selected

1. Requirements

Sort by column updateTime in reverse order, group by columns B and C, select the row of the last update time after grouping, and add the new index to the result.

2. Code


import pandas as pd
data = pd.read_csv('test.csv')
df = pd.DataFrame(data)
df = df.sort_values('updateTime', ascending=False).groupby(['B','C']).first().reset_index()

The first () function represents selecting the first row, and if you want to select more than one row, you can use the head () function: head (5) represents selecting the first five rows.

The following example:


import pandas as pd 
data = pd.read_csv('test.csv',header = 0)
df = pd.DataFrame(data) 
#  Grouped by date, after grouping, pair  value  Columns are sorted from large to small, taking the front of each group 10 Row 
df = df.groupby('date', group_keys=False).apply(lambda x: x.sort_values('value', ascending=False)).groupby('date').head(10).reset_index()

Use the apply () function when the sort_values () function cannot be used directly.


Related articles: