Summary of how to use pandas.cut

  • 2021-07-01 07:42:23
  • OfStack

Use

pandas.cut divides a set of data into discrete intervals. For example, given a set of ages, you can use pandas.cut to split them into age groups and label each group.

Prototype


pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise') #0.23.4

Parameter meaning

x: The array-like data to be binned. It must be 1-dimensional (a DataFrame cannot be used).

bins: The criteria for cutting the data into intervals (also called "buckets" or "bins"). It takes one of three forms: an int, a sequence of scalars, or a pandas.IntervalIndex.

A scalar of int type

When bins is an int, the range of x is divided into that many equal-width bins. The range of x is extended by 0.1% on each side to include the minimum and maximum values of x.
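To see the edges that an int value of bins produces, you can ask for them with retbins=True. The following is a minimal sketch using made-up sample values (not the article's data). With the default right=True, the lowest edge printed is slightly below the minimum of x, at least in recent pandas versions.

```python
import numpy as np
import pandas as pd

values = np.array([1, 5, 10, 40, 100])  # hypothetical sample data

# Ask for 4 equal-width bins and also return the computed edges.
binned, edges = pd.cut(values, 4, retbins=True)

print(edges)         # 5 edges delimit 4 bins
print(binned.codes)  # index of the bin each value falls in
```

Because the lowest edge is pushed below the minimum, every value, including the minimum itself, lands in a bin.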

Scalar sequence

A sequence of scalars defines the edges of each bin, so the bins may have different widths. The range of x is not extended in this case.

pandas.IntervalIndex

Defines the exact intervals to use. The intervals must not overlap.
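As an illustration of the IntervalIndex form, here is a small sketch. The interval boundaries and the input values are invented for the example; pd.IntervalIndex.from_tuples builds right-closed intervals by default.

```python
import pandas as pd

# Hypothetical age brackets: (0, 18], (18, 65], (65, 120]
intervals = pd.IntervalIndex.from_tuples([(0, 18), (18, 65), (65, 120)])

result = pd.cut([10, 18, 40, 70, 130], bins=intervals)

print(result)
print(result.codes)  # -1 marks a value that falls in none of the intervals
```

Here 130 lies outside every interval, so it comes back as NaN (code -1). Note that, according to the pandas documentation, the right and labels arguments are ignored when bins is an IntervalIndex.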

right: bool, default True. Indicates whether each interval includes its right edge. For example, with bins=[1, 2, 3] and right=True the intervals are (1, 2] and (2, 3]; with right=False they are [1, 2) and [2, 3).
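The effect of right is easiest to see on values that sit exactly on the bin edges. A short sketch, using the same edges [1, 2, 3] as the description above:

```python
import pandas as pd

values = [1, 2, 3]
edges = [1, 2, 3]

right_closed = pd.cut(values, edges)              # intervals (1, 2], (2, 3]
left_closed = pd.cut(values, edges, right=False)  # intervals [1, 2), [2, 3)

# A code of -1 means the value is not in any interval (NaN in the result).
print(right_closed.codes)  # 1 is excluded, 2 and 3 close their intervals
print(left_closed.codes)   # 1 and 2 open their intervals, 3 is excluded
```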

labels: Labels for the resulting bins. For example, after dividing ages into bins, you can label the bins "Youth", "Middle-aged", and so on. The number of labels must equal the number of intervals: bins=[1, 2, 3] produces two intervals, (1, 2] and (2, 3], so labels must have length 2. If you specify labels=False, the result contains only the integer index (starting from 0) of the bin each value of x falls in.

retbins: bool, default False. Indicates whether the bin edges are returned as well. This is useful when bins is an int, because it lets you see the edges that were computed.

precision: int, default 3. The precision at which the bin edges are stored and displayed in the interval labels.
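The following sketch compares two precision settings on made-up values. It only shows that the displayed interval edges change; in this example the bin each value is assigned to stays the same.

```python
import pandas as pd

values = [0.0, 0.5, 1.0]  # hypothetical sample data

coarse = pd.cut(values, 3, precision=1)
fine = pd.cut(values, 3, precision=4)

print(coarse.categories)  # edges shown with fewer digits
print(fine.categories)    # edges shown with more digits
print(coarse.codes, fine.codes)
```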

include_lowest: bool, default False. Indicates whether the first interval should be left-inclusive. By default the left edge of the first interval is open, so a value equal to the lowest bin edge is not assigned to any bin.
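A small sketch of include_lowest, using invented values in which one value equals the lowest bin edge:

```python
import pandas as pd

values = [0, 5, 10]
edges = [0, 5, 10]

default = pd.cut(values, edges)                         # first interval is (0, 5]
inclusive = pd.cut(values, edges, include_lowest=True)  # first interval also takes 0

# A code of -1 means the value was not assigned to any bin.
print(default.codes)
print(inclusive.codes)
```

With the default setting the value 0 is left out; with include_lowest=True it is placed in the first bin.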

duplicates: What to do when the bin edges are not unique. There are two options: 'raise' (the default) raises a ValueError; 'drop' removes the duplicate edges.
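To illustrate duplicates, the sketch below passes a hypothetical list of edges that contains a repeated value, first with the default behavior and then with duplicates='drop'.

```python
import pandas as pd

values = [1, 2, 3, 4]
edges = [0, 2, 2, 4]  # hypothetical edges with a repeated value

raised = False
try:
    pd.cut(values, edges)  # duplicates='raise' is the default
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Dropping the repeated edge leaves the edges 0, 2, 4.
deduped = pd.cut(values, edges, duplicates="drop")
print(deduped.categories)
print(deduped.codes)
```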

Return value

out: A pandas.Categorical, Series, or ndarray indicating which bin (interval) each value of x falls in. If labels is specified, the corresponding label is returned instead of the interval.

bins: The bin edges, returned only when retbins=True.
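Which of the three types comes back depends on the input and on labels. A brief sketch with invented edges and values: an array input gives a Categorical, a Series input gives a Series, and labels=False on an array gives an ndarray of integer codes.

```python
import numpy as np
import pandas as pd

edges = [0, 18, 65, 120]  # hypothetical bin edges

from_array = pd.cut(np.array([10, 30, 70]), edges)
from_series = pd.cut(pd.Series([10, 30, 70]), edges)
as_codes = pd.cut(np.array([10, 30, 70]), edges, labels=False)

print(type(from_array).__name__)
print(type(from_series).__name__)
print(type(as_codes).__name__)
```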

Example

Here's an example of age grouping.


import numpy as np
import pandas as pd

ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) # Age data 

Divide ages into 5 equal-width intervals


ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) 
pd.cut(ages, 5)

Output:

[(0.901, 20.8], (0.901, 20.8], (0.901, 20.8], (20.8, 40.6], (20.8, 40.6], ..., (0.901, 20.8], (0.901, 20.8], (20.8, 40.6], (20.8, 40.6], (20.8, 40.6]]
Length: 16
Categories (5, interval[float64]): [(0.901, 20.8] < (20.8, 40.6] < (40.6, 60.4] < (60.4, 80.2] < (80.2, 100.0]]

As the output shows, ages is divided into five equal-width intervals, and the lowest edge is extended slightly below the minimum (0.901 rather than 1) so that the minimum value falls inside the first interval.

Divide ages into 5 equal-width intervals and specify labels


ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) # Age data 
pd.cut(ages, 5, labels=["Baby", "Youth", "Middle-aged", "The prime of life", "Old age"])

Output:

[Baby, Baby, Baby, Youth, Youth, ..., Baby, Baby, Youth, Youth, Youth]
Length: 16
Categories (5, object): [Baby < Youth < Middle-aged < The prime of life < Old age]

Divide ages using specified bin edges


ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) # Age data 
pd.cut(ages, [0, 5, 20, 30, 50, 100], labels=["Baby", "Youth", "Middle-aged", "The prime of life", "Old age"])

Output:

[Baby, Baby, Youth, The prime of life, The prime of life, ..., Youth, Youth, Middle-aged, Middle-aged, The prime of life]
Length: 16
Categories (5, object): [Baby < Youth < Middle-aged < The prime of life < Old age]

Here ages is not divided into equal-width bins. Instead it is divided into the five intervals (0, 5], (5, 20], (20, 30], (30, 50] and (50, 100].

Return the bin edges

Set retbins=True.


ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) # Age data 
pd.cut(ages, [0, 5, 20, 30, 50, 100], labels=["Baby", "Youth", "Middle-aged", "The prime of life", "Old age"], retbins=True)

Output:

([Baby, Baby, Youth, The prime of life, The prime of life, ..., Youth, Youth, Middle-aged, Middle-aged, The prime of life]
Length: 16
Categories (5, object): [Baby < Youth < Middle-aged < The prime of life < Old age],
array([  0,   5,  20,  30,  50, 100]))

Return only the index of the bin each value of x falls in

Set labels=False.


ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) # Age data 
pd.cut(ages, [0,5,20,30,50,100], labels=False)

Output:

array([0, 0, 1, 3, 3, 1, 4, 4, 4, 4, 4, 1, 1, 2, 2, 3], dtype=int64)

The first 0 means that the first value, 1, falls in bin 0, that is, the interval (0, 5].

Reference

1. https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html

