Implementation of Grouping and Aggregation in Python Pandas
- 2021-07-06 11:32:36
- OfStack
In PyCharm, place the cursor on a function and press CTRL+Q to quickly view its documentation, or CTRL+P to see its basic parameters.
apply(), applymap(), and map()
apply() and applymap() are DataFrame methods, while map() is a Series method.
apply() operates on one row or one column of a DataFrame at a time, applymap() operates on every element of a DataFrame, and map() operates on every element of a Series.
apply() processes the contents of a DataFrame in a single call, which is more concise than writing an explicit loop and is often faster. For example, in df.apply(func, axis=0, ...), func is a user-defined function that receives each column when axis=0 and each row when axis=1.
Series.map() behaves much like Python's built-in map(): it calls a function on each element, for example df['one'].map(sqrt).
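import numpy as np
from pandas import Series, DataFrame

frame = DataFrame(np.random.randn(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
print(np.abs(frame))  # NumPy functions work element-wise on a DataFrame
print()

# Range (max - min) of each column, then of each row
f = lambda x: x.max() - x.min()
print(frame.apply(f))
print(frame.apply(f, axis=1))

# The function may also return a Series: one output row per label
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
print(frame.apply(f))
print()

print('applymap and map')
_format = lambda x: '%.2f' % x
# applymap() formats every element of the DataFrame
# (newer pandas versions also offer DataFrame.map() for this)
print(frame.applymap(_format))
# map() formats every element of one column, which is a Series
print(frame['e'].map(_format))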
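Because the example above uses random numbers, its output changes on every run. The following is a minimal sketch on a small fixed frame, so the difference between axis=0, axis=1 and element-wise mapping can be checked by hand. The frame, the column names and the helper spread are made up for illustration. It avoids applymap() and maps each column instead, on the assumption that this works across pandas versions.

```python
import pandas as pd

# Small fixed frame (illustrative data, not from the article)
df = pd.DataFrame({'one': [1.0, 4.0, 9.0], 'two': [10.0, 20.0, 30.0]},
                  index=['x', 'y', 'z'])

def spread(values):
    """Hypothetical helper: range of a row or column."""
    return values.max() - values.min()

# axis=0 (default): spread() receives each column in turn
col_spread = df.apply(spread)
# axis=1: spread() receives each row in turn
row_spread = df.apply(spread, axis=1)

# Series.map(): the function is called on each element of one column
roots = df['one'].map(lambda v: v ** 0.5)

# Element-wise formatting of the whole frame, done by mapping every column
formatted = df.apply(lambda col: col.map(lambda v: '%.1f' % v))

print(col_spread)
print(row_spread)
print(roots)
print(formatted)
```

Here col_spread has one value per column and row_spread one value per row, while roots and formatted keep the shape of their input.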
Groupby
groupby() is the most commonly used and most effective grouping method in pandas; the grouped object it returns provides sum(), count(), mean() and other statistical functions.
The DataFrameGroupBy object returned by groupby() does not actually contain any computed data; it only records intermediate information about the grouping key, here df['key1']. When you apply a function or another aggregation to the grouped data, pandas uses the information recorded in the groupby object to perform fast block operations on df and returns the result.
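df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
grouped = df.groupby(df['key1'])
# numeric_only=True skips the string column key2, which cannot be averaged
print(grouped.mean(numeric_only=True))
# Grouping by a function: it is called on each index label
print(df.groupby(lambda x: 'even' if x % 2 == 0 else 'odd').mean(numeric_only=True))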
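To make the point about the groupby object more concrete, here is a minimal sketch that inspects it before any aggregation is requested. The fixed data values are made up for illustration; groups and iteration are the standard ways to look inside a GroupBy object.

```python
import pandas as pd

# Fixed illustrative data (the article uses random numbers)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

grouped = df.groupby('key1')

# .groups maps each key to the row labels that belong to it
group_labels = {key: list(labels) for key, labels in grouped.groups.items()}

# Iterating yields (key, sub-DataFrame) pairs, one per group
sizes = {key: len(part) for key, part in grouped}

# Only here is an aggregation actually computed
means = grouped['data1'].mean()

print(group_labels)
print(sizes)
print(means)
```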
Aggregation with agg()
For one or more columns (or rows, depending on axis=0/1) of the grouped data, agg(func) applies the function func to each group. For example, grouped['data1'].agg('mean') averages the 'data1' column within each group. You can also work on multiple columns (rows) and use multiple functions at the same time.
df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
grouped = df.groupby('key1')
# Select the numeric columns; the string column key2 cannot be averaged
print(grouped[['data1', 'data2']].agg('mean'))

         data1     data2
key1
a     0.749117  0.220249
b    -0.567971 -0.126922
apply() and agg() are similar in function. apply() is commonly used for tasks such as filling in missing data within each group or computing the top N values per group; the latter produces a hierarchical index.
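The two uses of apply() mentioned here can be sketched as follows. The column names key and value and the data are hypothetical, and filling with the group mean is only one possible fill strategy. The sketch works on a single selected column so that the grouping column itself is never passed to the function.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each group
df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'value': [1.0, np.nan, 3.0, 10.0, 20.0, np.nan]})

# Fill missing values with the mean of their own group.
# group_keys=False keeps the original row labels on the result.
filled = (df.groupby('key', group_keys=False)['value']
            .apply(lambda s: s.fillna(s.mean())))

# Top 2 values per group; the result is indexed by (key, original label)
top2 = df.groupby('key')['value'].apply(lambda s: s.nlargest(2))

print(filled)
print(top2)
```

The top-N result carries a two-level index, which is the hierarchical index the text refers to.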
agg() can take multiple functions at once, and they can act on different columns.
df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
grouped = df.groupby('key1')
print(grouped[['data1', 'data2']].agg(['sum', 'mean']))
# apply() also works here, but it takes only one function at a time; otherwise
# the two are largely interchangeable. Summing the string columns concatenates
# them, and newer pandas versions may leave the grouping column key1 out.
print(grouped.apply(lambda g: g.sum()))

         data1               data2
           sum      mean       sum      mean
key1
a     2.780273  0.926758 -1.561696 -0.520565
b    -0.308320 -0.154160 -1.382162 -0.691081

         data1     data2 key1       key2
key1
a     2.780273 -1.561696  aaa  onetwoone
b    -0.308320 -1.382162   bb     onetwo
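The example above applies the same list of functions to every column. As a minimal sketch of functions acting on different columns, agg() also accepts a dict that maps column names to functions, or keyword arguments that name each output column. The fixed data and the output names total and largest are made up for illustration.

```python
import pandas as pd

# Fixed illustrative data (the article uses random numbers)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})
grouped = df.groupby('key1')

# Dict form: a different function (or list of functions) per column
per_column = grouped.agg({'data1': 'sum', 'data2': ['min', 'max']})

# Named aggregation: output_name=(input_column, function)
named = grouped.agg(total=('data1', 'sum'), largest=('data2', 'max'))

print(per_column)
print(named)
```

The dict form produces two-level column labels such as ('data2', 'max'), while named aggregation gives flat column names.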
apply() and agg() do broadly similar jobs, but agg() is more convenient when you want to use several functions at once.
apply() itself is very flexible. When the operation you perform after grouping is not an aggregation, apply() is the tool to use.
print(grouped.apply(lambda x: x.describe()))

               data1     data2
key1
a    count  3.000000  3.000000
     mean  -0.887893 -1.042878
     std    0.777515  1.551220
     min   -1.429440 -2.277311
     25%   -1.333350 -1.913495
     50%   -1.237260 -1.549679
     75%   -0.617119 -0.425661
     max    0.003021  0.698357
b    count  2.000000  2.000000
     mean  -0.078983  0.106752
     std    0.723929  0.064191
     min   -0.590879  0.061362
     25%   -0.334931  0.084057
     50%   -0.078983  0.106752
     75%    0.176964  0.129447
     max    0.432912  0.152142
In addition, apply() can change the dimensions of the returned data.
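As a minimal sketch of what changing the dimensions means in practice, the shape of the result depends on what the applied function returns for each group: a scalar, a Series, or a DataFrame. The fixed data and the labels low and high are made up for illustration, and the numeric columns are selected explicitly so that the grouping column is not passed to the function.

```python
import pandas as pd

# Fixed illustrative data (the article uses random numbers)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})
grouped = df.groupby('key1')[['data1', 'data2']]

# Scalar per group -> a Series indexed by key1
scalar_per_group = grouped.apply(lambda g: g['data1'].max() - g['data1'].min())

# Series per group -> a DataFrame with one row per group
series_per_group = grouped.apply(
    lambda g: pd.Series({'low': g['data1'].min(), 'high': g['data1'].max()}))

# DataFrame per group -> a DataFrame with a two-level row index
frame_per_group = grouped.apply(lambda g: g.nlargest(1, 'data1'))

print(scalar_per_group)
print(series_per_group)
print(frame_per_group)
```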
http://pandas.pydata.org/pandas-docs/stable/groupby.html
pandas also provides pivot tables (pivot_table) and cross-tabulations (crosstab), but I haven't used them.
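For readers who want a starting point, here is a minimal sketch of both functions on the same kind of data used above. The fixed values are made up for illustration, and this only shows the most basic call of each function.

```python
import pandas as pd

# Fixed illustrative data (the article uses random numbers)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# pivot_table: mean of data1 for every (key1, key2) combination
pivot = pd.pivot_table(df, values='data1', index='key1',
                       columns='key2', aggfunc='mean')

# crosstab: how many rows fall into every (key1, key2) combination
counts = pd.crosstab(df['key1'], df['key2'])

print(pivot)
print(counts)
```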