Summary of how to use pandas.cut
- 2021-07-01 07:42:23
- OfStack
Use
pandas.cut divides a set of data into discrete intervals. For example, given a set of age data, you can use pandas.cut to split the ages into age groups and label each group.
Prototype
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise') #0.23.4
Parameter meaning
x: The array-like data to be binned. It must be 1-dimensional (a DataFrame cannot be used);
bins: The cut intervals (also called "buckets", "boxes", or "bins"). It takes one of three forms: an int scalar, a sequence of scalars (array), or a pandas.IntervalIndex.
A scalar of int type
When bins is an int scalar, the range of x is divided into that many equal-width intervals. The range of x is extended by 0.1% on each side to include the minimum and maximum values of x.
Scalar sequence
A sequence of scalars defines the edges of each bin, which allows intervals of unequal width. The range of x is not extended in this case.
pandas.IntervalIndex
Define the exact interval to use.
right: A bool parameter, True by default, indicating whether each interval includes its right edge. For example, with bins=[1, 2, 3] and right=True, the intervals are (1, 2], (2, 3]; with right=False, the intervals are [1, 2), [2, 3).
labels: Labels for the resulting bins. For example, after dividing the ages in x into age bins, you can label the bins as youth, middle age, and so on. The length of labels must equal the number of intervals. For example, bins=[1, 2, 3] produces two intervals, (1, 2] and (2, 3], so labels must have length 2. If you specify labels=False, only the integer index of the bin that each value of x falls in (starting from 0) is returned.
retbins: A bool parameter indicating whether to also return the bin edges. It is useful when bins is an int scalar, because the computed edges can then be retrieved. The default is False.
precision: The number of decimal places used when storing and displaying the bin labels. The default is 3.
include_lowest: A bool parameter indicating whether the first interval is closed on its left side. The default is False, that is, the left edge of the first interval is not included (open).
duplicates: What to do when the bin edges are not unique. There are two options: 'raise' raises a ValueError, and 'drop' removes the repeated edges.
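The edge-related parameters above (right, include_lowest, duplicates) are easiest to see on values that sit exactly on a bin edge. The following is a minimal sketch: the variable names and the tiny input are made up for illustration, and a code of -1 marks a value that fell into no bin (NaN).

```python
import pandas as pd

values = [1, 2, 3]   # illustrative data: every value lies on a bin edge
edges = [1, 2, 3]

# Default (right=True, include_lowest=False): intervals (1, 2] and (2, 3].
# 1 lies on the open left edge of the first interval, so it gets no bin.
default = pd.cut(values, edges)
print(default.codes.tolist())        # [-1, 0, 1]

# include_lowest=True closes the left edge of the first interval only.
with_lowest = pd.cut(values, edges, include_lowest=True)
print(with_lowest.codes.tolist())    # [0, 0, 1]

# right=False closes the left side instead: intervals [1, 2) and [2, 3).
# Now 3 lies on the open right edge of the last interval.
left_closed = pd.cut(values, edges, right=False)
print(left_closed.codes.tolist())    # [0, 1, -1]

# Repeated edges raise a ValueError unless duplicates="drop" removes them.
try:
    pd.cut(values, [1, 2, 2, 3])
    raised = False
except ValueError:
    raised = True
deduplicated = pd.cut(values, [1, 2, 2, 3], duplicates="drop")
print(raised, deduplicated.codes.tolist())   # True [-1, 0, 1]
```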
Return value
out: A pandas.Categorical, Series, or ndarray indicating which bin (interval) each value of x falls in. If labels is specified, the corresponding labels are returned instead of the intervals.
bins: The bin edges, returned only when retbins is True.
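Which of the three return types you get depends on the input and on labels. The sketch below, using a shortened and purely illustrative ages array, checks the type for each case.

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 40, 36])   # illustrative subset of age data
edges = [0, 20, 50]

from_array = pd.cut(ages, edges)               # ndarray in -> Categorical out
from_series = pd.cut(pd.Series(ages), edges)   # Series in -> Series out (category dtype)
as_codes = pd.cut(ages, edges, labels=False)   # labels=False -> ndarray of bin indices

print(type(from_array).__name__)    # Categorical
print(type(from_series).__name__)   # Series
print(type(as_codes).__name__)      # ndarray
```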
Example
Here's an example of age grouping.
import numpy as np
import pandas as pd
ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) # Age data
Divide ages equally into 5 intervals
ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32])
pd.cut(ages, 5)
Output:
[(0.901, 20.8], (0.901, 20.8], (0.901, 20.8], (20.8, 40.6], (20.8, 40.6], ..., (0.901, 20.8], (0.901, 20.8], (20.8, 40.6], (20.8, 40.6], (20.8, 40.6]]
Length: 16
Categories (5, interval[float64]): [(0.901, 20.8] < (20.8, 40.6] < (40.6, 60.4] < (60.4, 80.2] < (80.2, 100.0]]
It can be seen that ages is divided into five equal-width intervals, and that the lowest edge (0.901) lies slightly below the minimum value 1, so the minimum falls inside the first interval.
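To inspect the edges that pandas computed for an int bins value, retbins=True can be combined with the same call. This is a small sketch rather than part of the original example; the names binned and edges are chosen here for illustration.

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 40, 36, 12, 58, 62, 77, 89, 100, 18, 20, 25, 30, 32])

# retbins=True returns a (result, edges) tuple; 5 bins give 6 edges.
binned, edges = pd.cut(ages, 5, retbins=True)
print(edges)

# The lowest edge is below the smallest age, so the minimum is not lost
# on the open left side of the first interval.
print(edges[0] < ages.min())   # True
```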
Divide ages equally into 5 intervals and specify labels
ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) # Age data
pd.cut(ages, 5, labels=["Baby", "Youth", "Middle-aged", "The prime of life", "Old age"])
Output:
[Baby, Baby, Baby, Youth, Youth, ..., Baby, Baby, Youth, Youth, Youth]
Length: 16
Categories (5, object): [Baby < Youth < Middle-aged < The prime of life < Old age]
Divide ages using specified bin edges
ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) # Age data
pd.cut(ages, [0,5,20,30,50,100], labels=["Baby", "Youth", "Middle-aged", "The prime of life", "Old age"])
Output:
[Baby, Baby, Youth, The prime of life, The prime of life, ..., Youth, Youth, Middle-aged, Middle-aged, The prime of life]
Length: 16
Categories (5, object): [Baby < Youth < Middle-aged < The prime of life < Old age]
Instead of being divided equally, ages is divided into the five intervals (0, 5], (5, 20], (20, 30], (30, 50] and (50, 100].
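The same intervals can also be passed in the third form of bins, a pandas.IntervalIndex. The sketch below builds one with IntervalIndex.from_breaks (right-closed by default) and compares it with the edge-list form. Note that labels and right are ignored when bins is an IntervalIndex, and that values outside every interval come back as NaN (code -1). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 40, 36, 12, 58, 62, 77, 89, 100, 18, 20, 25, 30, 32])
edges = [0, 5, 20, 30, 50, 100]

# Build (0, 5], (5, 20], (20, 30], (30, 50], (50, 100] explicitly.
intervals = pd.IntervalIndex.from_breaks(edges)

by_intervals = pd.cut(ages, intervals)
by_edges = pd.cut(ages, edges)

# Both forms assign every age to the same bin.
print(by_intervals.codes.tolist() == by_edges.codes.tolist())   # True

# 0 lies on the open left edge and 150 is past the last edge: both are NaN.
out_of_range = pd.cut([0, 150], edges)
print(out_of_range.codes.tolist())   # [-1, -1]
```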
Return the bin edges
Set retbins=True
ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) # Age data
pd.cut(ages, [0,5,20,30,50,100], labels=["Baby", "Youth", "Middle-aged", "The prime of life", "Old age"], retbins=True)
Output:
([Baby, Baby, Youth, The prime of life, The prime of life, ..., Youth, Youth, Middle-aged, Middle-aged, The prime of life]
 Length: 16
 Categories (5, object): [Baby < Youth < Middle-aged < The prime of life < Old age],
 array([  0,   5,  20,  30,  50, 100]))
Return only the index of the bin each value of x falls in
Set labels=False
ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) # Age data
pd.cut(ages, [0,5,20,30,50,100], labels=False)
Output:
array([0, 0, 1, 3, 3, 1, 4, 4, 4, 4, 4, 1, 1, 2, 2, 3], dtype=int64)
The first 0 means that the first value, 1, falls in bin 0, that is, the interval (0, 5].
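Integer bin indices are convenient for further array work. As one possible use, the sketch below counts how many ages fall in each bin with np.bincount; the names list simply reuses the labels from the earlier examples. It assumes every value lands in some bin, because a NaN result would turn the indices into floats.

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 40, 36, 12, 58, 62, 77, 89, 100, 18, 20, 25, 30, 32])
edges = [0, 5, 20, 30, 50, 100]
names = ["Baby", "Youth", "Middle-aged", "The prime of life", "Old age"]

# labels=False yields the 0-based bin index of every age.
codes = pd.cut(ages, edges, labels=False)

# Count the members of each bin; minlength keeps empty bins in the result.
counts = np.bincount(codes, minlength=len(names))
for name, count in zip(names, counts):
    print(name, count)
```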
Reference
1. https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html