Implementation of Grouping and Aggregation in Python Pandas

  • 2021-07-06 11:32:36
  • OfStack

In PyCharm, hover over a function and press Ctrl+Q to view its documentation quickly, or Ctrl+P to see its basic parameters.

apply(), applymap(), and map()

apply() and applymap() are methods of DataFrame, and map() is a method of Series.

apply() operates on one row or one column of a DataFrame at a time, applymap() operates on every element of a DataFrame, and map() operates on every element of a Series.

apply() processes the contents of a DataFrame in batches, which is usually faster than writing an explicit Python loop. For example, in df.apply(func, axis=0, ...), func is a user-defined function that receives each column when axis=0 and each row when axis=1.

Series.map() works like Python's built-in map(): it calls the function on every element, for example df['one'].map(sqrt).
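As a small illustration of that comparison, the sketch below runs the same function through the built-in map() and through Series.map(). The Series s and its labels are made up for this example.

```python
import math

import pandas as pd

# Hypothetical sample data, chosen so the square roots are exact
s = pd.Series([1.0, 4.0, 9.0], index=['x', 'y', 'z'])

# The built-in map returns a lazy iterator over plain values; the index is lost
builtin_result = list(map(math.sqrt, s))

# Series.map calls the same function on each element and keeps the index
series_result = s.map(math.sqrt)

print(builtin_result)              # [1.0, 2.0, 3.0]
print(series_result.tolist())      # [1.0, 2.0, 3.0]
print(list(series_result.index))   # ['x', 'y', 'z']
```

The practical difference is only that Series.map() hands back a Series aligned with the original labels.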


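import numpy as np
from pandas import Series, DataFrame

frame = DataFrame(np.random.randn(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
print(np.abs(frame))
print()

# Range (max - min) of each column, then of each row
f = lambda x: x.max() - x.min()
print(frame.apply(f))
print(frame.apply(f, axis=1))

# Returning a Series gives one result row per label ('min' and 'max')
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])

print(frame.apply(f))
print()

print('applymap and map')
_format = lambda x: '%.2f' % x
# applymap formats every element of the DataFrame
# (pandas 2.1 and later offer the same operation as DataFrame.map)
print(frame.applymap(_format))
# map formats every element of a single column (a Series)
print(frame['e'].map(_format))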
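Because the example above uses random numbers, its output changes on every run. Here is a minimal sketch with fixed values, so the effect of axis can be checked by hand. The small frame and the helper name value_range are assumptions for this illustration only.

```python
import pandas as pd

# Hypothetical fixed data instead of np.random.randn
frame = pd.DataFrame({'b': [1, 4], 'd': [2, 8], 'e': [3, 6]},
                     index=['Utah', 'Ohio'])

def value_range(x):
    """Return max - min of the row or column passed in by apply."""
    return x.max() - x.min()

col_range = frame.apply(value_range)           # axis=0: one result per column
row_range = frame.apply(value_range, axis=1)   # axis=1: one result per row

print(col_range.tolist())   # [3, 6, 3] for columns b, d, e
print(row_range.tolist())   # [2, 4] for rows Utah, Ohio
```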

Groupby

groupby is the most commonly used and most effective grouping tool in pandas. The grouped object it returns provides sum(), count(), mean(), and other statistical functions.

The DataFrameGroupBy object returned by the groupby method does not compute anything by itself; it only records intermediate information about the grouping key, here df['key1']. When you apply a function or another aggregation to the grouped data, pandas uses the information recorded in the GroupBy object to run fast block operations on df and returns the result.
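To make that behaviour concrete, here is a minimal sketch that inspects a GroupBy object before any aggregation runs. The data values are made up; ngroups and groups are standard attributes of pandas GroupBy objects.

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

grouped = df.groupby('key1')

# Nothing has been aggregated yet; the object only knows which rows belong to which key
print(grouped.ngroups)   # 2
group_rows = {key: idx.tolist() for key, idx in grouped.groups.items()}
print(group_rows)        # {'a': [0, 1, 4], 'b': [2, 3]}

# Iterating yields (key, sub-DataFrame) pairs
for key, part in grouped:
    print(key, part['data1'].tolist())

# The actual computation happens only when an aggregation is requested
means = grouped['data1'].mean()
print(means['b'])        # 3.5
```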


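df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})

# Group by the values of the key1 column
grouped = df.groupby(df['key1'])
# numeric_only=True skips the string columns, which newer pandas
# versions no longer drop silently
print(grouped.mean(numeric_only=True))

# Grouping by a function: it is called on each index label
print(df.groupby(lambda x: 'even' if x % 2 == 0 else 'odd').mean(numeric_only=True))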
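Grouping by a function is easier to follow with fixed numbers. In this sketch the function receives each index label (0 to 4 with the default index) and returns the group name. The data and the helper name parity are assumptions for illustration.

```python
import pandas as pd

# Hypothetical fixed data with the default integer index 0..4
df = pd.DataFrame({'data1': [10, 20, 30, 40, 50]})

def parity(label):
    """Map an index label to the name of its group."""
    return 'even' if label % 2 == 0 else 'odd'

by_parity = df.groupby(parity)['data1'].sum()

print(by_parity['even'])   # 90, from rows 0, 2 and 4
print(by_parity['odd'])    # 60, from rows 1 and 3
```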

Aggregation with agg()

agg(func) applies the function func to one column or to several columns of the grouped data (or to rows, depending on axis=0/1). For example, grouped['data1'].agg('mean') averages the grouped 'data1' column. You can also work on multiple columns and use multiple functions at the same time.


df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
grouped = df.groupby('key1')
# Select the numeric columns; a mean of the string column key2 is undefined
print(grouped[['data1', 'data2']].agg('mean'))

         data1     data2
key1
a     0.749117  0.220249
b    -0.567971 -0.126922
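The two uses described above, one function on one column and several functions on several columns, can be sketched with fixed data as follows. The values are made up so that the results are easy to verify.

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})
grouped = df.groupby('key1')

# One function on one column; equivalent to grouped['data1'].mean()
mean_data1 = grouped['data1'].agg('mean')
print(mean_data1.tolist())   # [3.0, 3.5] for groups a, b

# Several functions on several columns; the result columns form two levels
stats = grouped[['data1', 'data2']].agg(['min', 'max'])
print(stats.loc['a', ('data1', 'max')])   # 6.0
print(stats.loc['b', ('data2', 'min')])   # 30.0
```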

apply() and agg() are similar in function. apply() is commonly used for tasks such as filling in missing data group by group or computing the top N values of each group, and its result often carries a hierarchical index.
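Both uses can be sketched as follows. This is only an illustration with made-up data: the missing values are filled with the mean of their own group, and the top 2 values are taken per group. group_keys=False is passed in the first case so that the result keeps the original flat row index.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data with two missing values in data1
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, np.nan, 4.0, np.nan, 3.0],
                   'data2': [5.0, 9.0, 2.0, 8.0, 7.0]})

# Fill each group's missing data1 values with that group's own mean
filled = (df.groupby('key1', group_keys=False)['data1']
            .apply(lambda s: s.fillna(s.mean())))
print(filled.sort_index().tolist())   # [1.0, 2.0, 4.0, 4.0, 3.0]

# Top 2 data2 values per group; the result is indexed by (key1, original row)
top2 = df.groupby('key1')['data2'].apply(lambda s: s.nlargest(2))
print(top2.index.nlevels)             # 2
print(top2.loc['a'].tolist())         # [9.0, 7.0]
print(top2.loc['b'].tolist())         # [8.0, 2.0]
```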

agg() can take several functions at once and apply them to different columns.


df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
grouped = df.groupby('key1')
# Several functions at once, applied to the numeric columns
print(grouped[['data1', 'data2']].agg(['sum', 'mean']))
# apply also works here, but it accepts only one function; otherwise the
# two are largely interchangeable. Summing string columns concatenates them.
print(grouped.apply(lambda g: g.sum()))

         data1               data2
           sum      mean       sum      mean
key1
a     2.780273  0.926758 -1.561696 -0.520565
b    -0.308320 -0.154160 -1.382162 -0.691081

         data1     data2 key1       key2
key1
a     2.780273 -1.561696  aaa  onetwoone
b    -0.308320 -1.382162   bb     onetwo
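To use different functions on different columns, agg() also accepts a dict that maps column names to a function or a list of functions. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})
grouped = df.groupby('key1')

# Sum of data1, but minimum and mean of data2
result = grouped.agg({'data1': 'sum', 'data2': ['min', 'mean']})

print(result.loc['a', ('data1', 'sum')])    # 9.0
print(result.loc['a', ('data2', 'mean')])   # 30.0
print(result.loc['b', ('data2', 'min')])    # 30.0
```

Only the columns named in the dict appear in the result, so this form also avoids aggregating string columns by accident.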

apply and agg are largely similar in function, but agg is more convenient when there are multiple functions.

apply itself offers a high degree of freedom. It is most useful when the operation you run after grouping is not an aggregation.


print(grouped.apply(lambda x: x.describe()))

               data1     data2
key1
a    count  3.000000  3.000000
     mean  -0.887893 -1.042878
     std    0.777515  1.551220
     min   -1.429440 -2.277311
     25%   -1.333350 -1.913495
     50%   -1.237260 -1.549679
     75%   -0.617119 -0.425661
     max    0.003021  0.698357
b    count  2.000000  2.000000
     mean  -0.078983  0.106752
     std    0.723929  0.064191
     min   -0.590879  0.061362
     25%   -0.334931  0.084057
     50%   -0.078983  0.106752
     75%    0.176964  0.129447
     max    0.432912  0.152142

In addition, apply can also change the dimension of returned data.
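A minimal sketch of that point, with made-up data: when the applied function returns a scalar per group the result is a Series, and when it returns a DataFrame per group the result is a DataFrame with a two-level row index.

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})
grouped = df.groupby('key1')[['data1']]

# One scalar per group -> a Series indexed by key1
spread = grouped.apply(lambda g: g['data1'].max() - g['data1'].min())
print(type(spread).__name__, spread.tolist())   # Series [5.0, 1.0]

# One DataFrame per group -> a DataFrame indexed by (key1, statistic)
summary = grouped.apply(lambda g: g.describe())
print(type(summary).__name__, summary.index.nlevels)   # DataFrame 2
print(summary.loc[('a', 'max'), 'data1'])               # 6.0
```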

http://pandas.pydata.org/pandas-docs/stable/groupby.html

pandas also provides pivot_table for pivot tables and crosstab for cross-tabulations, but I haven't used them.
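For orientation only, here is a hedged sketch of what the two functions do on made-up data shaped like the df used earlier. It is not taken from the original article: pivot_table aggregates a value column over every combination of two keys, and crosstab counts how often each combination occurs.

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Mean of data1 for every key1/key2 combination
pivot = pd.pivot_table(df, values='data1', index='key1',
                       columns='key2', aggfunc='mean')
print(pivot.loc['a', 'one'])    # 3.0, the mean of 1.0 and 5.0

# Number of rows for every key1/key2 combination
counts = pd.crosstab(df['key1'], df['key2'])
print(counts.loc['a', 'one'])   # 2
```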

