A Detailed Look at the pandas apply() Function and Its Usage

2021-12-11 18:21:13
OfStack

Contents: Series.apply(); The apply function accepts functions with arguments; DataFrame.apply(); apply() date subtraction example; Reference

To understand pandas functions such as apply(), it helps to have a basic grasp of functional programming. Functional programming is a large topic in its own right, but for today's introduction to the apply() function you only need one idea: a function is an object, so it can be passed as an argument to other functions and can also be the return value of a function.
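As a minimal sketch of this idea, the snippet below passes one function as an argument and returns another from a function. The names apply_twice and make_adder are hypothetical and chosen only for illustration.

```python
def apply_twice(func, value):
    """Call func on value, then call func again on the result."""
    return func(func(value))


def make_adder(n):
    """Return a new function that adds n to its argument."""
    def adder(x):
        return x + n
    return adder


add_three = make_adder(3)            # a function returned from a function
result = apply_twice(add_three, 10)  # a function passed as an argument
print(result)  # 16
```

The same mechanism is what lets apply() accept a function and call it for you.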

Treating functions as objects can change code style considerably. For example, suppose there is a list containing the numbers 1 to 10, and you need to find all the numbers that are divisible by 3. In the traditional way:


def can_divide_by_three(number):
    if number % 3 == 0:
        return True
    else:
        return False

selected_numbers = []
for number in range(1, 11):
    if can_divide_by_three(number):
        selected_numbers.append(number)

The loop is unavoidable here. Because can_divide_by_three() is used only once, consider replacing it with a lambda expression:


divide_by_three = lambda x : True if x % 3 == 0 else False

selected_numbers = []
for number in range(1, 11):
    if divide_by_three(number):
        selected_numbers.append(number)

The above is the traditional way of thinking about the program, while the functional way is quite different. We can think of it like this: to pick the numbers that satisfy a rule from a list, can we state only the rule and let the language handle the traversal? Of course. When the programmer only cares about the rule (which may be a condition or may be defined by some function), the code becomes shorter and more readable.

Python provides the filter() function with the following syntax:


filter(function, sequence)

filter() calls function(item) on each item in sequence in turn and keeps the items for which the result is true. In Python 2 the result is a list, string, or tuple (depending on the type of sequence); in Python 3 it is a lazy iterator, which can be passed to list() when a list is needed. With this function, the above code can be simplified as:


divide_by_three = lambda x : True if x % 3 == 0 else False
selected_numbers = list(filter(divide_by_three, range(1, 11)))

Put the lambda expression directly in the call, and the code shrinks to a single statement:


selected_numbers = list(filter(lambda x: x % 3 == 0, range(1, 11)))
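The following sketch, assuming Python 3, shows that filter() produces a one-shot iterator and that a list comprehension gives the same result. The variable names are illustrative only.

```python
numbers = range(1, 11)

# filter() is lazy in Python 3: nothing is evaluated until it is consumed.
lazy = filter(lambda x: x % 3 == 0, numbers)
by_filter = list(lazy)

# An equivalent list comprehension states the same rule inline.
by_comprehension = [x for x in numbers if x % 3 == 0]

print(by_filter)  # [3, 6, 9]
```

Once consumed, the iterator is empty, so keep the list if the result is needed more than once.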

Series.apply()

Back to the topic: the pandas apply() function can act on a Series or on a whole DataFrame. Its job is likewise to traverse the Series or DataFrame automatically and run the specified function on each element (for a Series) or on each column or row (for a DataFrame).

For example, there is now a set of data, students' test scores:


   Name Nationality  Score
  Zhang         Han    400
     Li         Hui    450
   Wang         Han    460

If the nationality is not Han, 5 bonus points are added to the test score to give the total score. Now we use pandas to do this calculation by adding a column to the DataFrame. Of course, if the goal is only to get the result, the numpy.where() function is simpler; the point here is mainly to demonstrate the usage of Series.apply().


import pandas as pd

df = pd.read_csv("student-score.csv")
df['ExtraScore'] = df['Nationality'].apply(lambda x: 5 if x != 'Han' else 0)
df['TotalScore'] = df['Score'] + df['ExtraScore']

For the Nationality column, pandas traverses every value, executes the anonymous lambda function on it, and returns the results in a new Series. In a Jupyter notebook the above code produces the following result:

    Name Nationality  Score  ExtraScore  TotalScore
0  Zhang         Han    400           0         400
1     Li         Hui    450           5         455
2   Wang         Han    460           0         460
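For comparison, here is a minimal sketch of the numpy.where() approach mentioned earlier. It assumes the same three rows as the sample data and builds them inline rather than reading the CSV file.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt inline so the snippet does not depend on a CSV file.
df = pd.DataFrame({
    "Name": ["Zhang", "Li", "Wang"],
    "Nationality": ["Han", "Hui", "Han"],
    "Score": [400, 450, 460],
})

# Vectorized condition: 5 where the nationality is not Han, otherwise 0.
df["ExtraScore"] = np.where(df["Nationality"] != "Han", 5, 0)
df["TotalScore"] = df["Score"] + df["ExtraScore"]
print(df)
```

numpy.where() evaluates the condition for the whole column at once, so no Python-level function is called per value.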

Of course, apply() can also run Python built-in functions. For example, to get the number of characters in each value of the Name column with apply():


df['NameLength'] = df['Name'].apply(len)

The apply function accepts functions with arguments

According to the pandas documentation (pandas.Series.apply, pandas 1.3.1 documentation), this function can receive positional arguments or keyword arguments, with the following syntax:


Series.apply(func, convert_dtype=True, args=(), **kwargs)

The first parameter of func is required and receives each value of the Series. Any parameters of func after the first are treated as additional arguments and are supplied through args or keyword arguments. Let's reuse the previous example. Assuming that ethnic groups other than Han receive bonus points, we make the bonus a parameter of the function and first define an add_extra() function:


def add_extra(nationality, extra):
    if nationality != "Han":
        return extra
    else:
        return 0

Add a column to df:


df['ExtraScore'] = df.Nationality.apply(add_extra, args=(5,))

Positional arguments are passed through args, which must be a tuple. The same call can also be written with a keyword argument:


df['ExtraScore'] = df.Nationality.apply(add_extra, extra=5)

The result after running is:

    Name Nationality  Score  ExtraScore
0  Zhang         Han    400           0
1     Li         Hui    450           5
2   Wang         Han    460           0
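The sketch below checks that the positional and keyword forms give the same result. It assumes a small Series built inline in place of the Nationality column.

```python
import pandas as pd


def add_extra(nationality, extra):
    """Return the bonus for non-Han nationalities, otherwise 0."""
    return extra if nationality != "Han" else 0


nationality = pd.Series(["Han", "Hui", "Han"])

# Extra arguments after the first can be passed positionally via args...
by_position = nationality.apply(add_extra, args=(5,))
# ...or by name as keyword arguments.
by_keyword = nationality.apply(add_extra, extra=5)

print(by_position.equals(by_keyword))
```

Note the trailing comma in args=(5,): without it, (5) is just the integer 5 rather than a one-element tuple.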

The same logic written as a lambda function:


df['Extra'] = df.Nationality.apply(lambda n, extra: extra if n != 'Han' else 0, args=(5,))

Let's continue with keyword arguments. Assuming that different nationalities receive different bonus points, define the add_extra2() function, which looks up the bonus by nationality in the keyword arguments:


def add_extra2(nationality, **kwargs):
    # Each keyword argument maps a nationality to its bonus points.
    return kwargs[nationality]

df['Extra'] = df.Nationality.apply(add_extra2, Han=0, Hui=10)

The running result is:

    Name Nationality  Score  Extra
0  Zhang         Han    400      0
1     Li         Hui    450     10
2   Wang         Han    460      0

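Compared against the signature of apply() shown earlier, this is not difficult to understand: the extra keyword arguments are collected into **kwargs and forwarded to the function on every call.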
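As a small variation, the keyword arguments can be unpacked from a dictionary. This is only a sketch: the bonus table and the default of 0 for unlisted nationalities are assumptions made for illustration.

```python
import pandas as pd


def add_extra2(nationality, **kwargs):
    # Look up the bonus for this nationality; fall back to 0 if none is given.
    return kwargs.get(nationality, 0)


bonus = {"Han": 0, "Hui": 10}  # hypothetical bonus table
nationality = pd.Series(["Han", "Hui", "Han"])

# **bonus expands the dict into keyword arguments: Han=0, Hui=10.
extra = nationality.apply(add_extra2, **bonus)
print(extra.tolist())  # [0, 10, 0]
```

Using kwargs.get() avoids a KeyError when a nationality appears in the data but not in the table.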

DataFrame.apply()

DataFrame.apply() passes each column (or, with axis=1, each row) of the DataFrame to the specified function as a Series. For example, start with the following DataFrame:


import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=list('xyz'), index=list('abc'))

Applying np.square() to df squares every element:


df.apply(np.square)

    x   y   z
a   1   4   9
b  16  25  36
c  49  64  81

If you only want apply() to affect a particular row or column, you can test the name attribute of the Series that is passed in. For example, the following squares the x column:


df.apply(lambda x: np.square(x) if x.name == 'x' else x)


    x  y  z
a   1  2  3
b  16  5  6
c  49  8  9

The following example squares the x and y columns:


df.apply(lambda x : np.square(x) if x.name in ['x', 'y'] else x)


    x   y  z
a   1   4  3
b  16  25  6
c  49  64  9

The following example squares the first row (the row labeled a):


df.apply(lambda x : np.square(x) if x.name == 'a' else x, axis=1)

The default, axis=0, applies the function to each column; axis=1 applies it to each row.
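To make the axis parameter concrete, here is a short sketch that sums along each axis. It assumes the same 3x3 DataFrame with columns x, y, z and index a, b, c.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=list("xyz"), index=list("abc"))

# axis=0 (default): the function is called once per column.
col_totals = df.apply(lambda s: s.sum())
# axis=1: the function is called once per row.
row_totals = df.apply(lambda s: s.sum(), axis=1)

print(col_totals.to_dict())  # {'x': 12, 'y': 15, 'z': 18}
print(row_totals.to_dict())  # {'a': 6, 'b': 15, 'c': 24}
```

In both cases the function receives a Series, and its name attribute is the column label (axis=0) or the row label (axis=1).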

apply() date subtraction example

Date calculations come up often, such as finding the interval between two dates. Take the following data about the start and end dates of wbs items:


    wbs   date_from     date_to
  job1  2019-04-01  2019-05-01
  job2  2019-04-07  2019-05-17
  job3  2019-05-16  2019-05-31
  job4  2019-05-20  2019-06-11

Suppose you want to calculate the number of days between the start and end dates. The simpler method is to convert the two columns to datetime and subtract them:


import pandas as pd
import datetime as dt

wbs = {
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16","2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"]
}

df = pd.DataFrame(wbs)
df['elapsed'] = (df['date_to'].apply(pd.to_datetime)
                 - df['date_from'].apply(pd.to_datetime))

The apply() calls convert the date_from and date_to columns to datetime type. Let's print df:


    wbs   date_from     date_to elapsed
0  job1  2019-04-01  2019-05-01 30 days
1  job2  2019-04-07  2019-05-17 40 days
2  job3  2019-05-16  2019-05-31 15 days
3  job4  2019-05-20  2019-06-11 22 days

The date interval has been calculated, but each value carries the unit "days". This is because subtracting two datetime values yields a timedelta64 value. If you only want the number, you also need to convert it using the days attribute of the timedelta:


elapsed = (df['date_to'].apply(pd.to_datetime)
           - df['date_from'].apply(pd.to_datetime))
df['elapsed'] = elapsed.apply(lambda x : x.days)
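If only the day counts are needed, a vectorized variant avoids apply() entirely. This is a sketch under the assumption that both columns hold ISO-formatted date strings, as in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"],
})

# pd.to_datetime converts a whole column at once.
elapsed = pd.to_datetime(df["date_to"]) - pd.to_datetime(df["date_from"])
# The .dt accessor exposes the days component of every timedelta in the Series.
df["elapsed"] = elapsed.dt.days

print(df["elapsed"].tolist())  # [30, 40, 15, 22]
```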

DataFrame.apply() can achieve the same effect. We first define a function get_interval_days(). Its first parameter is a Series, and when apply() runs with axis=1 it receives each row of the DataFrame in turn.


import pandas as pd
import datetime as dt

def get_interval_days(arrLike, start, end):   
    start_date = dt.datetime.strptime(arrLike[start], '%Y-%m-%d')
    end_date = dt.datetime.strptime(arrLike[end], '%Y-%m-%d') 

    return (end_date - start_date).days


wbs = {
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16","2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"]
}

df = pd.DataFrame(wbs)
df['elapsed'] = df.apply(
    get_interval_days, axis=1, args=('date_from', 'date_to'))

Reference

Apply function of Pandas: the best function in Pandas
pandas.Series.apply, pandas 1.3.1 documentation