How does pandas packet sort get the second largest data
- 2021-10-15 11:04:10
- OfStack
Python used to do data analysis is very convenient, a lot of online data to find the second largest method, but most of them are about SQL, so I challenge 1 to use Python to do this (mainly SQL is not good > _ < ) on the code.
1. I made up the data myself
In the actual work should import data from the database, how to export data from the database, I will add later.
import pandas as pd
df = pd.DataFrame([
{"class": 1, "name": "aa", "english": 120},
{"class": 1, "name": "bb", "english": 110},
{"class": 1, "name": "cc", "english": 110},
{"class": 1, "name": "dd", "english": 110},
{"class": 2, "name": "ee", "english": 120},
{"class": 2, "name": "ff", "english": 140},
{"class": 2, "name": "gg", "english": 130},
{"class": 2, "name": "hh", "english": 130},
{"class": 3, "name": "tt", "english": 130},
{"class": 4, "name": "xx", "english": 130},
{"class": 4, "name": "yy", "english": 130},
{"class": 5, "name": "zz", "english": None},
])
2. Take the second largest data in groups
def fun(df):
# english Data deduplication
sort_set = set(df["english"].values.tolist())
if len(sort_set)<=1:
# The amount of data is less than or equal to 1 , unable to get the first 2 Big data
return None
else:
# Take english Middle grade 2 A large value
sort_value = sorted(sort_set,reverse=True)[1]
temp_df = df[df["english"]==sort_value]
return temp_df
df = df.groupby(by=["class"]).apply(fun).reset_index(drop=True)
print(df)
The results are as follows:
class name english
0 1 bb 110.0
1 1 cc 110.0
2 1 dd 110.0
3 2 gg 130.0
4 2 hh 130.0
3. Finished, it's as simple as that
Of course, this can also be changed to take the largest, take the smallest, take the third largest, and so on......
Supplement: pandas is sorted by one column A, grouped by two columns B and C, and the row with the largest value of A column after grouping is selected
pandas is sorted by one column A, grouped by two columns B and C, and the row with the largest value of A column after grouping is selected
1. Requirements
Sort by column updateTime in reverse order, group by columns B and C, select the row of the last update time after grouping, and add the new index to the result.
2. Code
import pandas as pd
data = pd.read_csv('test.csv')
df = pd.DataFrame(data)
df = df.sort_values('updateTime', ascending=False).groupby(['B','C']).first().reset_index()
The first () function represents selecting the first row, and if you want to select more than one row, you can use the head () function: head (5) represents selecting the first five rows.
The following example:
import pandas as pd
data = pd.read_csv('test.csv',header = 0)
df = pd.DataFrame(data)
# Grouped by date, after grouping, pair value Columns are sorted from large to small, taking the front of each group 10 Row
df = df.groupby('date', group_keys=False).apply(lambda x: x.sort_values('value', ascending=False)).groupby('date').head(10).reset_index()
Use the apply () function when the sort_values () function cannot be used directly.