Implementation of pandas Hierarchical Index
- 2021-07-10 20:05:43
- OfStack
Hierarchical indexing is an important feature of pandas: it lets you have multiple (two or more) index levels on a single axis.
Create a Series with a list of lists or arrays as the index:
import numpy as np
from pandas import Series, DataFrame
data=Series(np.random.randn(10),
            index=[['a','a','a','b','b','b','c','c','d','d'],
                   [1,2,3,1,2,3,1,2,2,3]])
data
Out[6]:
a 1 -2.842857
2 0.376199
3 -0.512978
b 1 0.225243
2 -1.242407
3 -0.663188
c 1 -0.149269
2 -1.079174
d 2 -0.952380
3 -1.113689
dtype: float64
This is the formatted output of a Series with a MultiIndex as its index. The gaps in the outer index column mean "use the label directly above".
data.index
Out[7]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
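The levels/labels structure printed above can also be inspected directly. The following is a minimal sketch, assuming a pandas version (0.24 or later) in which the per-row positions are exposed as codes rather than labels; the variable names outer, inner and index are chosen here for illustration only.

```python
import pandas as pd

# Build the same two-level index explicitly from two parallel arrays.
outer = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd']
inner = [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]
index = pd.MultiIndex.from_arrays([outer, inner])

# levels holds the unique labels of each level ...
print(index.levels[0].tolist())   # ['a', 'b', 'c', 'd']
print(index.levels[1].tolist())   # [1, 2, 3]

# ... and codes holds, for every row, the position of its label in that level.
print(index.codes[0].tolist())    # [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
print(index.codes[1].tolist())    # [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]
```

Storing each label once and referring to it by position is what keeps a MultiIndex compact even when the outer labels repeat many times.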
For a hierarchically indexed object, the operation of selecting a subset of data is simple:
data['b']
Out[8]:
1 0.225243
2 -1.242407
3 -0.663188
dtype: float64
data['b':'c']
Out[10]:
b 1 0.225243
2 -1.242407
3 -0.663188
c 1 -0.149269
2 -1.079174
dtype: float64
data.loc[['b','d']]
Out[11]:
b 1 0.225243
2 -1.242407
3 -0.663188
d 2 -0.952380
3 -1.113689
dtype: float64
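The partial selections above can be written with .loc throughout. This is a small sketch that assumes fixed values (np.arange instead of random numbers) so that the results are easy to check by eye; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones, so each selection is easy to follow.
data = pd.Series(np.arange(10.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

one_group = data.loc['b']         # outer label only: the inner labels 1, 2, 3 remain
label_slice = data.loc['b':'c']   # label slices include both endpoints
picked = data.loc[['b', 'd']]     # a list of outer labels keeps both index levels
one_value = data.loc[('c', 2)]    # a full tuple addresses a single element

print(one_group)
print(label_slice)
print(picked)
print(one_value)
```

Note that slicing by label, as in 'b':'c', relies on the index being sorted, which is the case here.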
You can even select from the inner level:
data[:,2]
Out[12]:
a 0.376199
b -1.242407
c -1.079174
d -0.952380
dtype: float64
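Selecting on the inner level can also be spelled out with .loc or with the xs (cross-section) method. The sketch below assumes a Series with fixed values rather than the random data used in the text; both calls are expected to return the same result.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

# Every outer label, inner label 2.
inner_loc = data.loc[:, 2]

# The same cross-section, naming the level explicitly (level 1 is the inner one).
inner_xs = data.xs(2, level=1)

print(inner_loc)
print(inner_xs)
```

xs makes the intent explicit when an index has more than two levels, because the level is named rather than implied by position in the indexer.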
Hierarchical indexing plays an important role in reshaping data and in group-based operations.
For example, the Series can be rearranged into a DataFrame with its unstack method:
data.unstack()
Out[13]:
1 2 3
a -2.842857 0.376199 -0.512978
b 0.225243 -1.242407 -0.663188
c -0.149269 -1.079174 NaN
d NaN -0.952380 -1.113689
# The inverse operation of unstack is stack
data.unstack().stack()
Out[14]:
a 1 -2.842857
2 0.376199
3 -0.512978
b 1 0.225243
2 -1.242407
3 -0.663188
c 1 -0.149269
2 -1.079174
d 2 -0.952380
3 -1.113689
dtype: float64
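The round trip between the long and wide forms can be checked with a short sketch. It assumes fixed values instead of random ones, and it calls dropna() explicitly after stack(), because whether stack() itself discards the NaN cells depends on the pandas version.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

# Inner level becomes the columns; ('c', 3) and ('d', 1) do not exist, so they are NaN.
wide = data.unstack()

# Back to a two-level Series; dropna() removes the cells that were never in the input.
long_again = wide.stack().dropna()

print(wide)
print(long_again)
```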
For DataFrame, each axis can have a hierarchical index:
frame=DataFrame(np.arange(12).reshape((4,3)),
index=[['a','a','b','b'],[1,2,1,2]],
columns=[['Ohio','Ohio','Colorado'],
['Green','Red','Green']])
frame
Out[16]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
Each level can have a name. If names are specified, they appear in the console output (don't confuse index names with axis labels!):
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
frame
Out[22]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
With partial column indexing, you can easily select groups of columns:
frame['Ohio']
Out[23]:
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
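Column selection on a hierarchical column index can go beyond the outer label. The following sketch rebuilds the frame from the text and shows three ways of selecting; the names ohio, ohio_red and green are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

ohio = frame['Ohio']                # outer label: a sub-frame with columns Green, Red
ohio_red = frame[('Ohio', 'Red')]   # full tuple: a single column, returned as a Series
green = frame.xs('Green', axis=1, level='color')  # select on the inner column level

print(ohio)
print(ohio_red)
print(green)
```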
Reordering and sorting levels
Sometimes you need to reorder the levels on an axis or sort the data by the values in a specific level. swaplevel accepts two level numbers or names and returns a new object with the levels interchanged (the data itself is unchanged):
frame.swaplevel('key1','key2')
Out[24]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
sort_index(level=...) sorts the data by the values in a single level; it replaces the older sortlevel method, which is deprecated. When swapping levels, it is common to sort as well, so that the final result is ordered by the new outer level:
frame.swaplevel(0,1)
Out[27]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
# Swap levels 0 and 1 (i.e. key1 and key2),
# then sort along axis=0 by the new outer level.
# sort_index(level=0) replaces the deprecated sortlevel(0).
frame.swaplevel(0,1).sort_index(level=0)
Out[28]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11
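The swap-then-sort pattern can be verified step by step. This sketch rebuilds the frame from the text and uses sort_index(level=0), on the assumption that sortlevel is no longer available in the installed pandas version.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Swapping only relabels the rows; their order is still a/1, a/2, b/1, b/2.
swapped = frame.swaplevel('key1', 'key2')

# Sorting by the new outer level (key2) groups the rows for 1 and 2 together.
sorted_frame = swapped.sort_index(level=0)

print(swapped.index.tolist())
print(sorted_frame.index.tolist())
```

Selection by label slices is generally more reliable and faster on a sorted index, which is one reason the two calls are usually chained.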
Summary statistics by level
Many descriptive and summary statistics on DataFrame and Series can be computed per level of a hierarchical index, so that entries sharing the same label at that level are aggregated together, for example by grouping on the level with groupby(level=...).
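A per-level sum can be sketched with groupby(level=...). The code example that originally belonged to this section is not available, so the following is an assumption about what it illustrated, using the frame defined earlier. For the column level, the sketch transposes, groups the rows, and transposes back, which avoids depending on grouping along axis=1.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum the rows that share the same key2 label (a/1 + b/1 and a/2 + b/2).
by_key2 = frame.groupby(level='key2').sum()

# Sum the columns that share the same color: transpose, group rows, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```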
Using the columns of a DataFrame
You may want to use one or more columns of a DataFrame as the row index, or to turn the row index into columns of the DataFrame.
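The frame used in this section is not defined in the surviving text. A construction consistent with the set_index output shown in this section (columns a, b, c, d with seven rows) might look like the sketch below; the exact constructor call is an assumption reconstructed from that output.

```python
import pandas as pd

# Values chosen to match the set_index output shown in this section:
# a counts up from 0, b counts down from 7, c and d become the index levels.
frame = pd.DataFrame({'a': range(7),
                      'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

print(frame)
```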
The set_index method of DataFrame converts one or more of its columns to a row index and returns a new DataFrame.
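As a minimal sketch of set_index, the example below builds a frame consistent with the output shown in this section (the constructor is an assumption) and moves columns c and d into a two-level row index.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7),
                      'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

# c becomes the outer level and d the inner level of the new row index.
frame2 = frame.set_index(['c', 'd'])

print(frame2)
print(frame2.loc[('two', 3)])   # one row, addressed by both index levels
```

set_index returns a new object; the original frame keeps all four of its columns.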
By default, those columns are removed from DataFrame, but you can also keep them:
frame.set_index(['c','d'],drop=False)
Out[35]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
reset_index does the opposite of set_index: the levels of the hierarchical index are moved back into columns.
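The inverse relationship can be sketched as a round trip. The frame below is an assumed construction consistent with the output shown in this section; restored and partial are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7),
                      'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd'])

# All index levels go back to ordinary columns, placed before the existing ones.
restored = frame2.reset_index()

# Only the named level is moved; c stays as the row index.
partial = frame2.reset_index(level='d')

print(restored)
print(partial)
```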