Summary of Common Function Methods of Python Pandas

  • 2021-11-10 10:13:24
  • OfStack

Motivation

NumPy, Pandas, Matplotlib, and SciPy are among the most commonly used Python libraries. When using a library, we usually run into one of two situations. Take Pandas as an example.

  • I want to perform some operation on a Pandas data structure, but I don't know, or can't recall, whether a method for it exists, and if it does, which one to use.
  • I want to do some data manipulation and I remember using or seeing a function that does it, but I can't recall its name. Or I remember which function does it, but I wonder whether there is a better choice.

At this point, most people start searching by keyword on Baidu, Zhihu, Google, or CSDN. That works, and you will eventually find what you want, but you run into two small problems.

  • I know what operation I want to perform on the data, but I don't know how to describe it. Inaccurate keywords skew the search results, and I end up taking many detours.
  • The results I find are either messily formatted or long-winded, walking through the function interface bit by bit and burying the key point under information I don't need. A problem that could be solved in a second takes a minute of reading someone else's explanation, which wastes a lot of time.

Based on this, I thought about how to solve the problem. My solution is as follows. If you once knew a function but forgot it, you will remember it as soon as you see its name. So I list the names of the most commonly used methods and functions directly; you can skim them or search the page with Ctrl+F to jog your memory. If you don't know whether a function for your task exists, the list still helps: each name is followed by a short explanation, so by skimming the names and explanations you can quickly judge whether one of them meets your needs.

The rest of this article is organized accordingly. The second part lists the commonly used Pandas functions and methods with short explanations. The third part gives usage examples that you can copy into your own code and adapt, which is much easier than working through an interface reference bit by bit. I believe this is what most good programmers want: to get the job done in the shortest time. When the content below does not cover your needs, you can of course search further.

The functions and methods below cover more than 90% of everyday usage and are worth bookmarking as a small lookup dictionary.

Most people don't memorize these things, unless they know them so well that they never need to look anything up. They may only remember that such a function exists and then check other people's code on the spot, either copying and adapting it or retyping it by hand. I work in many languages, so I often mix up their usages; most of the time I look things up and use them right away.

List of the Most Commonly Used Pandas Functions


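## Reading and writing
pd.Series # define a one-dimensional labeled array
pd.DataFrame # define a DataFrame
pd.read_csv # read a comma-separated values (CSV) file
pd.read_excel # read an Excel sheet
pd.read_sql # read SQL data
pd.read_table # read a general delimited text file (tab-separated by default)
pd.read_json # read a JSON file
pd.read_html # read tables from an HTML page
pd.read_clipboard() # read data from the clipboard
df.to_csv # write a CSV file
df.to_excel # write an Excel file (a DataFrame method; there is no pd.to_excel)
df.to_sql # write to a SQL table
df.to_json # write a JSON file
df.to_html # write an HTML table
df.to_clipboard() # write to the clipboard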

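## Displaying data and statistics
df.info() # summary of the DataFrame (column dtypes, non-null counts, memory usage)
df.shape # (number of rows, number of columns); an attribute, not a method
df.index # the row index; an attribute (use len(df.index) for its length)
df.columns # the column labels; an attribute
df.count() # number of non-null values in each column
df.head(n) # first n rows, default 5
df.tail(n) # last n rows, default 5
df.sample(n) # randomly select n rows
df.sample(frac=0.8) # randomly select 80% of the rows
df.dtypes # data type of each column
df.sum() # sum of each column
df.cumsum() # cumulative sum
df.min() # minimum of each column
df.max() # maximum of each column
df['column_name'].idxmin() # index label of the minimum value in a column
mySeries.idxmin() # index label of the minimum value in a Series
df['column_name'].idxmax() # index label of the maximum value in a column
mySeries.idxmax() # index label of the maximum value in a Series
df.describe() # basic descriptive statistics
df.mean() # mean of each column
df.median() # median of each column
df.quantile # quantiles
df.var() # variance of each column
df.std() # standard deviation of each column
df.cummax() # cumulative maximum, i.e. the largest value seen so far
df.cummin() # cumulative minimum
df['column_name'].cumprod() # cumulative product
len(df) # number of rows in the DataFrame
df.isnull() # Boolean mask that is True wherever a value is missing
df.corr() # pairwise correlation coefficients between columns, as a matrix
df['column_name'].value_counts() # number of occurrences of each distinct value in a column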

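## Selecting data
mySeries['label'] # get a Series element by its index label with square brackets
df['column_name'] # select a column
df.column_name # same as above (works only when the name is a valid Python identifier)
df[n0:n1] # rows at positions n0 up to, but not including, n1
df.iloc[[m],[n]] # iloc indexes by position; two lists return the 1x1 DataFrame at row m, column n
df.loc[m:n] # loc indexes by label; returns rows labeled m through n, both ends included
df.loc[:,"col1":"col2"] # all rows of a contiguous range of columns
df.loc[m:n,"col1":"col2"] # selected rows of a contiguous range of columns
df['column_name'][n] # value of a column at index label n (df.loc[n, 'column_name'] is the preferred form)
df[['col1','col2']] # select several columns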

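## Filtering and sorting
df[df.column_name < n] # filter rows; single brackets accept a Boolean mask
df.filter(regex='code') # select the columns whose names match a regular expression
df.sort_values # sort by the values of a column
df.sort_index() # sort by index in ascending order
df['column_name'].unique() # distinct values of a column
df['column_name'].nunique() # number of distinct values in a column
df.nlargest(n,'column_name') # the n rows with the largest values in the column
df.nsmallest(n,'column_name') # the n rows with the smallest values in the column
df.rank # rank of each value, i.e. its position in sorted order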

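## Adding, deleting, and modifying data
df["new_column"] = xxx # define a new column
df.rename # rename columns
df.index.name = "index_name" # set or change the name of the index
df.drop # drop rows or columns
df['column_name'] = df['column_name'].astype('category') # convert the dtype of a column
pd.concat([df, new_rows]) # append rows at the end (df.append was removed in pandas 2.0)
del df['column_to_delete'] # delete a column in place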

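## Special
df.column_name.apply # apply a function to each value of a column
pd.melt # reshape wide data into long data (unpivot); run the example below to see what it does
pd.merge # horizontal join of two tables (inner join, outer join, etc.)
pd.concat # concatenate horizontally or vertically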
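
The selection entries above mix label-based and position-based indexing, and the two treat slice endpoints differently. The following is a minimal sketch of that difference; the sample frame, its column names, and its index labels are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the index labels are chosen only for illustration.
df = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1.5, 2.5, 3.5, 4.5]},
    index=["w", "x", "y", "z"],
)

by_label = df.loc["x":"y"]      # label slice: both "x" and "y" are included
by_position = df.iloc[1:2]      # position slice: the end position is excluded
single_value = df.iloc[1, 0]    # scalar at row position 1, column position 0
sub_frame = df.iloc[[1], [0]]   # lists of positions return a 1x1 DataFrame

print(by_label)
print(by_position)
```

Passing scalars to `iloc` returns a single value, while passing lists keeps the result a DataFrame, which matters when later code expects one or the other.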

Example Usage of Pandas Functions


import numpy as np
import pandas as pd

mySeries = pd.Series([1,2,3,4], index=['a','b','c','d'])

data = {'Country' : ['Belgium', 'India', 'Brazil' ],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [1234,1234,1234]}
df = pd.DataFrame(data, columns=['Country','Capital','Population'])
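
Once a DataFrame exists, a few attributes and methods give a quick overview of it. The sketch below is a small illustration with made-up population numbers; it shows that `shape` and `columns` are attributes (no parentheses) and that `idxmax` returns an index label rather than the maximum value itself.

```python
import pandas as pd

# Made-up numbers, used only to make the example runnable.
df = pd.DataFrame(
    {"Country": ["Belgium", "India", "Brazil"], "Population": [10, 30, 20]}
)

n_rows, n_cols = df.shape                    # attribute, not a method
column_names = list(df.columns)              # also an attribute
largest_value = df["Population"].max()       # the maximum value itself
largest_label = df["Population"].idxmax()    # the index label of that maximum
largest_row = df.loc[largest_label]          # the full row at that label

print(n_rows, n_cols, column_names)
print(largest_row)
```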

pd.DataFrame(np.random.rand(20,5))

df = pd.read_csv('data.csv')

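pd.read_excel('filename.xlsx') # read an Excel file into a DataFrame
df.to_excel('filename.xlsx', sheet_name='Sheet1') # to_excel is a DataFrame method; there is no pd.to_excel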
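
The read and write functions come in pairs, so a written file can be read back. As a rough sketch, the example below round-trips a small made-up DataFrame through CSV using an in-memory buffer instead of a file on disk; the column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False keeps the row index out of the output
buffer.seek(0)                   # rewind so the buffer can be read from the start
restored = pd.read_csv(buffer)

print(restored)
```

Without `index=False`, the row index is written as an extra unnamed column and shows up again when the file is read back.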

df.quantile([0.25, 0.75]) # 25% and 75% quantiles of each column

filters = df.Date > '2021-06-01' # assumes the Date column has a datetime dtype
df[filters] # select all rows with dates after the given date
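
Date filters behave as expected only when the column holds real datetime values rather than strings. The following sketch assumes a made-up frame with a `Date` column stored as text, converts it with `pd.to_datetime`, and then filters with Boolean masks.

```python
import pandas as pd

# Hypothetical data; the column name "Date" follows the snippet above.
df = pd.DataFrame(
    {"Date": ["2021-05-30", "2021-06-01", "2021-06-15"], "value": [1, 2, 3]}
)
df["Date"] = pd.to_datetime(df["Date"])   # parse the strings into datetime values

cutoff = pd.Timestamp("2021-06-01")
after_cutoff = df[df["Date"] > cutoff]
# Combine conditions with & or |, wrapping each condition in parentheses.
in_range = df[(df["Date"] >= cutoff) & (df["value"] < 3)]

print(after_cutoff)
print(in_range)
```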

df.filter(regex='^L') # select the columns whose names start with L

df.sort_values('column_name', ascending=False) # sort in descending order by the values of the specified column

df.rename(columns={'old_column_name': 'new_column_name'}) # rename a column (returns a new DataFrame)

df["new_column"] = df.a - df.b # define a new column as the difference of two columns

df.columns = df.columns.str.lower() # convert all column names to lowercase

df.columns = df.columns.str.upper() # convert all column names to uppercase

df.drop(columns=['column_name']) # drop one column
df.drop(['column_1', 'column_2'], axis=1) # same idea as above, dropping two columns
mySeries.drop(['a']) # drop the Series element with index label 'a'
df.drop([0, 1]) # drop the rows with index labels 0 and 1

def fun(x):
    return x*3
df['column_name'].apply(fun)  # multiply every value in a column by 3

df['column_name'].apply(lambda x: x*3) # the same with an anonymous (lambda) function

df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},'B': {0: 1, 1: 3, 2: 5}, 'C': {0: 2, 1: 4, 2: 6}})
pd.melt(df, id_vars=['A'], value_vars=['B','C']) # unpivot columns B and C into long format, keeping A as the identifier
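
To make the effect of `pd.melt` concrete, the sketch below repeats the same reshaping with the same values and inspects the result. The variable names `wide` and `long` are chosen here for illustration.

```python
import pandas as pd

wide = pd.DataFrame({"A": ["a", "b", "c"], "B": [1, 3, 5], "C": [2, 4, 6]})
long = pd.melt(wide, id_vars=["A"], value_vars=["B", "C"])

# The result has the columns A, variable, value and one row per
# (A, original column) pair: first all rows from B, then all rows from C.
print(long)
```

Each of the three input rows contributes two output rows, one for `B` and one for `C`, so the long table has six rows.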

new = pd.DataFrame({'name':'lisa', 'gender':'F', 'city':'Beijing'}, index=[1])
df = new
df = pd.concat([df, new]) # add one row of data (DataFrame.append was removed in pandas 2.0)

frame = pd.DataFrame({'a':[2.3,-1.7,5,3],'b':[6,2.9,-3.1,8]},index=['one','two','three','four'])
frame.rank(method="min",ascending=False) # rank the values of each column by size; with ascending=False the largest value gets rank 1


# merge performs a horizontal join
df3 = pd.merge(df1,df2,how='inner',on='Stock abbreviation') # on names the join column, how selects the join type
pd.merge(df1,df2,left_on='lkey',right_on='rkey',how='left') # name the join columns separately when they differ between the tables
# concat concatenates
pd.concat([df1,df1])  # vertical concatenation; when the indexes do not overlap, the frames can be stacked directly
pd.concat([df1,df1],axis = 1)  # horizontal concatenation; outer join by default, aligned on the row index
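
The snippets above use `df1` and `df2` without defining them. As a self-contained sketch, the example below builds two small made-up tables (the column names `key`, `left_val`, and `right_val` are hypothetical) and shows how the join type changes the result.

```python
import pandas as pd

# Hypothetical tables; the column names are for illustration only.
df1 = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

inner = pd.merge(df1, df2, how="inner", on="key")    # only keys present in both tables
left = pd.merge(df1, df2, how="left", on="key")      # every key from df1, NaN where df2 has no match
stacked = pd.concat([df1, df1], ignore_index=True)   # rows stacked, index renumbered from 0
side_by_side = pd.concat([df1, df2], axis=1)         # columns placed side by side, aligned on the row index

print(inner)
print(left)
```

Note that `pd.concat` with `axis=1` aligns on the row index and does not match on `key`; use `pd.merge` when rows should be matched by column values.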
