Understanding and Using the pandas apply() Function
- 2021-12-11 18:21:13
- OfStack
To understand the functions of pandas, it helps to have a clear grasp of functional programming. Functional programming, including the functional way of thinking, is of course a very complicated topic, but for today's introduction to the apply() function you only need to understand one thing: a function is an object, so it can be passed as an argument to other functions and can also be the return value of a function.
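The idea of a function as an object can be shown with a minimal sketch. The names apply_twice and make_adder are hypothetical, chosen only for illustration; they do not come from pandas or from the article.

```python
def apply_twice(func, value):
    """Call func on value, then call func again on the result."""
    return func(func(value))


def make_adder(n):
    """Return a new function that adds n to its argument."""
    def adder(x):
        return x + n
    return adder


add_three = make_adder(3)             # a function returned from a function
result = apply_twice(add_three, 10)   # a function passed as an argument
print(result)
```

Here add_three is an ordinary value that can be stored in a variable and handed to another function, which is the same mechanism apply() relies on.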
Treating functions as objects can change code style considerably. For example, suppose a list holds the numbers 1 to 10 and you need to find all the numbers that are divisible by 3. In the traditional way:
def can_divide_by_three(number):
    if number % 3 == 0:
        return True
    else:
        return False

selected_numbers = []
for number in range(1, 11):
    if can_divide_by_three(number):
        selected_numbers.append(number)
The loop is indispensable here. Because the can_divide_by_three() function is used only once, consider simplifying it with a lambda expression:
divide_by_three = lambda x: True if x % 3 == 0 else False
selected_numbers = []
for number in range(1, 11):
    # use the loop variable "number" (the name "item" was never defined)
    if divide_by_three(number):
        selected_numbers.append(number)
The above is the traditional programming mindset; functional programming thinks quite differently. The task is to pick the numbers that satisfy a rule from a list, so can we focus only on stating the rule and let the programming language handle the traversal? Of course. When the programmer only cares about the rule (which may be a condition or be defined by some function), the code becomes much simpler and more readable.
Python provides the filter() function, with the following syntax:
filter(function, sequence)
filter() calls function(item) on each item in sequence in turn and keeps the items for which the result is true. In Python 2 the kept items are returned as a List/String/Tuple (depending on the type of sequence); in Python 3 filter() returns a lazy iterator, so wrap it in list() when a list is needed. With this function, the code above can be simplified to:
divide_by_three = lambda x: True if x % 3 == 0 else False
selected_numbers = list(filter(divide_by_three, range(1, 11)))
Putting the lambda expression inline reduces the code to a single statement:
selected_numbers = list(filter(lambda x: x % 3 == 0, range(1, 11)))
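As a small sketch of how filter() behaves in Python 3, the snippet below shows that the returned iterator can be consumed only once, and that a list comprehension expresses the same rule. The variable names are illustrative only.

```python
is_multiple_of_three = lambda x: x % 3 == 0

# In Python 3, filter() returns a lazy iterator rather than a list.
lazy = filter(is_multiple_of_three, range(1, 11))

selected_numbers = list(lazy)   # materialize the matching items
second_pass = list(lazy)        # the iterator is now exhausted

# The same rule written as a list comprehension.
comprehension = [n for n in range(1, 11) if n % 3 == 0]

print(selected_numbers, second_pass, comprehension)
```

Which form to use is largely a matter of taste; the comprehension is often preferred in Python when the rule is a short expression.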
Series.apply()
Back to the topic: the pandas apply() function can act on a Series or on a whole DataFrame. It automatically traverses the Series or DataFrame and runs the specified function on every element of a Series, or on every column or row of a DataFrame.
For example, suppose we have the following data, a set of students' test scores:
Name Nationality Score
Zhang Han 400
Lee Hui 450
Wang Han 460
If the nationality is not Han, 5 bonus points are added to the test score. We will use pandas to do this calculation and add a column to the DataFrame. Of course, if the goal is only to get the result, the numpy.where() function is simpler; the point here is mainly to demonstrate the usage of Series.apply().
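import pandas as pd
# The CSV file is expected to contain the columns Name, Nationality and Score
df = pd.read_csv("student-score.csv")
# 5 bonus points for every nationality other than Han
df['ExtraScore'] = df['Nationality'].apply(lambda x: 5 if x != 'Han' else 0)
df['TotalScore'] = df['Score'] + df['ExtraScore']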
For the Nationality column, pandas traverses every value and executes the anonymous lambda function on it, storing the results in a new Series, which it returns. The above code shows the following results in Jupyter Notebook:
Name Nationality Score ExtraScore TotalScore
0 Zhang Han 400 0 400
1 Lee Hui 450 5 455
2 Wang Han 460 0 460
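For comparison, the numpy.where() alternative mentioned above can be sketched as follows. This is a minimal sketch that builds the DataFrame inline from the table instead of reading a CSV file, which is an assumption made to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd

# Same data as the table above, built inline instead of read from a file.
df = pd.DataFrame({
    "Name": ["Zhang", "Lee", "Wang"],
    "Nationality": ["Han", "Hui", "Han"],
    "Score": [400, 450, 460],
})

# np.where(condition, value_if_true, value_if_false) works on the whole column at once.
df["ExtraScore"] = np.where(df["Nationality"] != "Han", 5, 0)
df["TotalScore"] = df["Score"] + df["ExtraScore"]
print(df)
```

Because np.where() operates on the whole column rather than calling a Python function per value, it is usually the faster choice for a simple two-way condition like this one.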
Of course, apply() can also run Python built-in functions. For example, to get the number of characters in each value of the Name column using apply():
df['NameLength'] = df['Name'].apply(len)
The apply function accepts functions with arguments
According to the pandas documentation (pandas.Series.apply, pandas 1.3.1 documentation), this function can receive positional or keyword arguments, with the following syntax:
Series.apply(func, convert_dtype=True, args=(), **kwargs)
The first parameter of func is required and receives each element of the Series; any further parameters of func are treated as additional arguments and are supplied through args or as keyword arguments. Let's reuse the earlier example. Suppose nationalities other than Han receive bonus points, and we make the bonus a parameter of the function. First define an add_extra() function:
def add_extra(nationality, extra):
    if nationality != "Han":
        return extra
    else:
        return 0
Add a column to df:
df['ExtraScore'] = df.Nationality.apply(add_extra, args=(5,))
Positional arguments are passed through args=(), which must be a tuple. The extra argument can also be passed by keyword:
df['ExtraScore'] = df.Nationality.apply(add_extra, extra=5)
The result after running is:
Name Nationality Score ExtraScore
0 Zhang Han 400 0
1 Lee Hui 450 5
2 Wang Han 460 0
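The two calling styles can be checked side by side. The following is a minimal sketch on a small inline Series (an assumption made so the snippet runs without the CSV file); add_extra is written as a one-line conditional equivalent to the definition above.

```python
import pandas as pd


def add_extra(nationality, extra):
    """Return the bonus for every nationality other than Han."""
    return extra if nationality != "Han" else 0


nationality = pd.Series(["Han", "Hui", "Han"])

# Positional: args must be a tuple, so note the trailing comma in (5,).
by_position = nationality.apply(add_extra, args=(5,))

# Keyword: any extra keyword argument is forwarded to add_extra.
by_keyword = nationality.apply(add_extra, extra=5)

print(by_position.tolist(), by_keyword.tolist())
```

Note that (5) without the comma is just the integer 5 in Python, not a one-element tuple.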
Written as a lambda function instead of add_extra:
df['Extra'] = df.Nationality.apply(lambda n, extra: extra if n != 'Han' else 0, args=(5,))
Let's continue with keyword parameters. Suppose different nationalities can receive different bonus points; an add_extra2() function can then take the per-nationality points as keyword arguments.
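A minimal sketch of what such an add_extra2() function might look like is given below. The body of the function, the keyword names Han and Hui, and the default of 0 for unlisted nationalities are assumptions for illustration, chosen so that the output matches the result table that follows.

```python
import pandas as pd


def add_extra2(nationality, **kwargs):
    """Look up the bonus for this nationality among the keyword arguments.

    Nationalities that are not listed get 0 (an assumption of this sketch).
    """
    return kwargs.get(nationality, 0)


df = pd.DataFrame({
    "Name": ["Zhang", "Lee", "Wang"],
    "Nationality": ["Han", "Hui", "Han"],
    "Score": [400, 450, 460],
})

# Each keyword names a nationality and carries its bonus points.
df["Extra"] = df["Nationality"].apply(add_extra2, Han=0, Hui=10)
print(df)
```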
The running result is:
Name Nationality Score Extra
0 Zhang Han 400 0
1 Lee Hui 450 10
2 Wang Han 460 0
Checked against the signature of apply(), this is not hard to follow: the extra keyword arguments are forwarded to the function unchanged.
DataFrame.apply()
The DataFrame.apply() function passes each column (or, with axis=1, each row) of the DataFrame to the specified function as a Series. For example, take the following DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['x', 'y', 'z'], index=['a', 'b', 'c'])
Applying the np.square() function to df squares every element:
df.apply(np.square)
If you only want apply() to act on a specific row or column, you can test the name property of the row or column. For example, the following squares the x column:
df.apply(lambda x: np.square(x) if x.name == 'x' else x)
x y z
a 1 2 3
b 16 5 6
c 49 8 9
The following example squares the x and y columns:
df.apply(lambda x : np.square(x) if x.name in ['x', 'y'] else x)
x y z
a 1 4 3
b 16 25 6
c 49 64 9
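As an alternative to testing the name property, the target columns can be selected first and apply() run only on that selection. The following sketch compares the two approaches on a 3x3 DataFrame with columns x, y, z and index a, b, c; the variable names by_name and by_selection are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=list("xyz"), index=list("abc"))

# Approach 1: apply to every column, but only transform x and y.
by_name = df.apply(lambda col: np.square(col) if col.name in ["x", "y"] else col)

# Approach 2: select x and y first, then apply to the selection only.
by_selection = df.copy()
by_selection[["x", "y"]] = by_selection[["x", "y"]].apply(np.square)

print(by_name)
print(by_selection)
```

The second form avoids calling the lambda on columns that are left unchanged, which may read more clearly when only a few columns are involved.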
The following example squares the first row (the row labeled a):
df.apply(lambda x : np.square(x) if x.name == 'a' else x, axis=1)
By default, axis=0 applies the function to each column; axis=1 applies it to each row.
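The effect of the axis parameter is easiest to see with an aggregating function. The sketch below sums the same kind of 3x3 DataFrame by column and by row; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=list("xyz"), index=list("abc"))

# axis=0 (the default): the function receives each column as a Series.
column_totals = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series.
row_totals = df.apply(lambda row: row.sum(), axis=1)

print(column_totals)
print(row_totals)
```

The result of axis=0 is indexed by the column labels (x, y, z), while the result of axis=1 is indexed by the row labels (a, b, c).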
Example: date subtraction with apply()
Date calculations come up often, such as computing the interval between two dates. Take the following data on the start and end dates of wbs items:
wbs date_from date_to
job1 2019-04-01 2019-05-01
job2 2019-04-07 2019-05-17
job3 2019-05-16 2019-05-31
job4 2019-05-20 2019-06-11
Suppose you want to calculate the number of days between the start and end dates. The simpler method is to convert both columns to datetime type and subtract them:
import pandas as pd

wbs = {
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"]
}
df = pd.DataFrame(wbs)
# Parentheses let the expression continue across two lines
df['elapsed'] = (df['date_to'].apply(pd.to_datetime)
                 - df['date_from'].apply(pd.to_datetime))
Here the apply() function converts the date_from and date_to columns to datetime type. Printing df gives:
wbs date_from date_to elapsed
0 job1 2019-04-01 2019-05-01 30 days
1 job2 2019-04-07 2019-05-17 40 days
2 job3 2019-05-16 2019-05-31 15 days
3 job4 2019-05-20 2019-06-11 22 days
The date interval has been calculated, but each value is followed by the unit days, because subtracting two datetime values produces data of type timedelta64. If you only want the number, you also need to convert it using the days property of timedelta.
elapsed = (df['date_to'].apply(pd.to_datetime)
           - df['date_from'].apply(pd.to_datetime))
# Keep only the integer number of days from each timedelta
df['elapsed'] = elapsed.apply(lambda x: x.days)
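The same result can also be obtained without apply(), by converting whole columns with pd.to_datetime() and reading the days through the .dt accessor. This is offered as a sketch of an alternative, not as the article's method.

```python
import pandas as pd

df = pd.DataFrame({
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"],
})

# pd.to_datetime accepts a whole Series, so no per-element apply is needed.
elapsed = pd.to_datetime(df["date_to"]) - pd.to_datetime(df["date_from"])

# The .dt accessor exposes timedelta components for the whole column.
df["elapsed"] = elapsed.dt.days
print(df)
```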
Using the DataFrame.apply() function can achieve the same effect. We first need to define a function, get_interval_days(). The first parameter of the function is a variable of type Series; when executed, it receives each row of the DataFrame in turn.
import pandas as pd
import datetime as dt

def get_interval_days(arrLike, start, end):
    # arrLike is one row of the DataFrame; start and end are column names
    start_date = dt.datetime.strptime(arrLike[start], '%Y-%m-%d')
    end_date = dt.datetime.strptime(arrLike[end], '%Y-%m-%d')
    return (end_date - start_date).days

wbs = {
    "wbs": ["job1", "job2", "job3", "job4"],
    "date_from": ["2019-04-01", "2019-04-07", "2019-05-16", "2019-05-20"],
    "date_to": ["2019-05-01", "2019-05-17", "2019-05-31", "2019-06-11"]
}
df = pd.DataFrame(wbs)
# axis=1 passes each row to the function; args supplies the two column names
df['elapsed'] = df.apply(
    get_interval_days, axis=1, args=('date_from', 'date_to'))
Reference
Apply function of Pandas: the best function in Pandas
pandas.Series.apply, pandas 1.3.1 documentation