Simple and useful Python data analysis and machine learning code

  • 2021-11-14 06:03:45
  • OfStack

Why choose Python for data analysis?

Python is a dynamic, object-oriented scripting language that is also simple and easy to understand. It is easy to pick up and highly readable: well-written Python code reads almost like plain English. This quality is why Python is often described as "executable pseudocode". It lets you focus on the task you want to accomplish instead of on the syntax of the language.

In addition, Python is open source and has many excellent libraries that can be used for data analysis and other fields. More importantly, Python works well with Hadoop, one of the most widely used open source big data platforms. Learning Python is therefore a cost-effective choice for data analysts who want to move into big data analysis roles.

These advantages have made Python one of the most popular programming languages. Many companies in China and abroad already use it, including YouTube, Google, and Alibaba Cloud.

Simple and useful Python data analysis and machine learning code

After a month of studying data analysis and machine learning with Python, I have summarized some of my experience and collected some excellent blog posts by more experienced developers. Those who are interested can look through my bookmarks. Without further ado, let's get down to business.

Data analysis can be roughly divided into three parts: data processing, model building, and model testing. This article mainly explains how to process data.

To analyze data, we first need to learn pandas, Python's data analysis library. The following are some basic operations. pandas is imported as follows:


import pandas as pd

Reading a CSV file with pandas


df = pd.read_csv("xxx.csv")
# Print the first 5 rows of the file
print(df.head())
# Print the entire contents of the CSV file
print(df)
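To make the reading step easy to try without a file on disk, here is a minimal sketch. The CSV text and its column names (`id`, `price`, `label`) are made up for illustration and stand in for "xxx.csv".

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for "xxx.csv"
csv_text = "id,price,label\n1,10.5,BENIGN\n2,11.0,ATTACK\n3,9.8,BENIGN\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))   # first two rows only
print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # data type inferred for each column
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed the way you expected.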

Viewing specific columns of a CSV file


# Read only the first four columns of the file
df = pd.read_csv('file_name.csv', usecols=[0, 1, 2, 3])
# Simpler: select one column of an already loaded DataFrame by name
df["column_name"]
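The two approaches can be compared side by side in a small sketch. The data and column names here are assumptions chosen for the example, not part of the original dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content used only for this example
csv_text = "id,price,volume,label\n1,10.5,100,BENIGN\n2,11.0,150,ATTACK\n"

# Load only the columns that are needed, by position or by name
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1])
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["id", "label"])

# Or load everything and select columns afterwards
df = pd.read_csv(io.StringIO(csv_text))
one_column = df["price"]               # a single name returns a Series
two_columns = df[["price", "label"]]   # a list of names returns a DataFrame
```

Using `usecols` avoids loading unneeded columns in the first place, which can matter for large files.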

Deleting columns from CSV data with pandas


# Drop the listed columns; drop() returns a new DataFrame and leaves df unchanged
droplabels = ['x_cat4', 'x_cat5', 'x_cat8', 'x_cat9']
data = df.drop(droplabels, axis=1)
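One thing to watch for is that `drop` raises a `KeyError` if a label is not present. The sketch below, on a made-up frame, shows the `columns=` keyword together with `errors="ignore"` as one possible way to handle that.

```python
import pandas as pd

# Small illustrative frame; x_num1 is a hypothetical column name
df = pd.DataFrame({"x_cat4": [1, 2], "x_cat5": [3, 4], "x_num1": [0.1, 0.2]})
droplabels = ["x_cat4", "x_cat5", "x_cat8"]   # x_cat8 does not exist in this frame

# errors="ignore" skips labels that are missing instead of raising KeyError
data = df.drop(columns=droplabels, errors="ignore")
print(list(data.columns))
```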

Cleaning NaN values with pandas


# Drop every row that contains a NaN value; returns a new DataFrame and leaves df unchanged
df.dropna()
'''
dropna(axis=0, how='any', thresh=None): how can be 'any' or 'all'.
With 'all', a row (or column) is dropped only when all of its values are NA.
thresh is an integer, e.g. thresh=3 keeps a row only if it has at least 3 non-NA values.
'''
data.fillna(0)                                       # Replace NaN with 0
print(data.fillna(data.mean(numeric_only=True)))     # Fill missing values with the mean of each column
print(data.fillna(data.median(numeric_only=True)))   # Fill missing values with the median of each column
print(data.bfill())   # Backward fill: use the next valid value (formerly fillna(method='bfill'))
print(data.ffill())   # Forward fill: use the previous valid value (formerly fillna(method='pad'))
# Reference blog: https://blog.csdn.net/qq_21840201/article/details/81008566
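The differences between the `dropna` options are easier to see on a tiny example. This is a sketch on invented data; the column names `a`, `b`, and `c` are placeholders.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

rows_any = data.dropna()              # keep rows with no NaN at all
rows_all = data.dropna(how="all")     # drop only rows that are entirely NaN
rows_thresh = data.dropna(thresh=2)   # keep rows with at least 2 non-NaN values
cols_any = data.dropna(axis=1)        # drop columns that contain any NaN

filled_mean = data.fillna(data.mean())   # each NaN replaced by its column mean
filled_back = data.bfill()               # fill from the next valid value below
filled_fwd = data.ffill()                # fill from the previous valid value above
```

Note that backward and forward fill can still leave NaN values at the end or the start of a column, where there is no neighbor to copy from.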

Changing CSV data with pandas


import numpy as np

# Change the values and the type of a column
df = df[df['Rise and fall'] != 'None'].copy()   # Remove rows holding the placeholder string 'None'
df['Rise and fall'] = df['Rise and fall'].astype(np.float64)
df = pd.DataFrame(a, dtype='float')  # Build a DataFrame from array-like data a, converting it to float
# Reference link: http://www.45fan.com/article.php?aid=19070771581800099094144284
# To traverse the data and change it row by row, use .loc instead of chained indexing
for i in df.index:
    df.loc[i, "id1"] = 1
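As a rough illustration of the same steps without an explicit loop, the sketch below filters out placeholder strings, converts the column, and then assigns values in a vectorized way. The column name `change` and the data are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({"change": ["1.5", "None", "-0.3"], "id1": [7, 8, 9]})

# Remove the placeholder rows, then convert the remaining strings to floats
df = df[df["change"] != "None"].copy()
df["change"] = df["change"].astype("float64")

# Assign to the whole column at once instead of looping over the index
df["id1"] = 1

# Update only the rows that match a condition
df.loc[df["change"] < 0, "id1"] = -1
print(df)
```

Whole-column assignment is usually faster than a Python loop, and `.loc` avoids the ambiguity of chained indexing such as `df["id1"][i] = 1`.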

Using iloc in pandas


X = df.iloc[:, df.columns != 'label']  # Select every column except 'label'

df.iloc[:3, :2]  # Select the first 3 rows and the first 2 columns
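A common use of this selection is splitting a dataset into features and a target. The following is a minimal sketch on invented data; `f1`, `f2`, and `label` are placeholder column names.

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [5, 6, 7, 8],
    "label": ["a", "b", "a", "b"],
})

first_block = df.iloc[:3, :2]          # rows 0-2 and columns 0-1 (the end is exclusive)
last_row = df.iloc[-1]                 # last row, returned as a Series
X = df.loc[:, df.columns != "label"]   # feature columns, selected with a boolean mask
y = df["label"]                        # target column
X_alt = df.drop(columns="label")       # another way to get the same features
```

`iloc` selects by integer position, while `loc` selects by label or boolean mask, so the mask-based selection works with either one.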

Counting the elements in a column


benign_count = len(data[data.label == 'BENIGN'])  # Number of rows whose label is BENIGN
len(df)  # Total number of rows
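If you need the count of every distinct value rather than just one, `value_counts` is a convenient alternative. A short sketch on made-up labels:

```python
import pandas as pd

data = pd.DataFrame({"label": ["BENIGN", "ATTACK", "BENIGN", "BENIGN"]})

# A boolean comparison sums to the number of matches, since True counts as 1
benign_count = int((data["label"] == "BENIGN").sum())

# Count of every distinct value in the column
counts = data["label"].value_counts()

print(benign_count)
print(counts)
```

Checking these counts early also shows whether the classes in a labeled dataset are balanced.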

Saving files with pandas


# df is the data to save; xxx.csv is the output file
df.to_csv('xxx.csv', index=False, sep=',')
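To check that a saved file can be read back unchanged, one option is a quick round trip. This sketch writes to a temporary directory so that it leaves no files behind; the data is invented for the example.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "price": [10.5, 11.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "xxx.csv")
    df.to_csv(path, index=False, sep=",")   # index=False omits the row index column
    restored = pd.read_csv(path)

print(restored.equals(df))
```

Without `index=False`, the row index is written as an extra unnamed column, which then shows up as a stray column when the file is read again.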

The above are some simple pandas data processing functions, along with links to the blog posts I learned from; interested readers can study them further. With this basic knowledge we can process a dataset. The next question is how to put it to use. Here is a simple routine.

1. Look at the data first: check the data type of each column in code, then check for NaN values. Depending on the situation, you can delete the affected column or replace the values.

2. Some columns in the dataset may hold time values. Do not delete such a column outright; convert it to a floating-point type instead.

3. Convert string columns to numeric types. Which strings need to be converted depends on the situation.
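The three steps above can be sketched as follows. The dataset and column names (`timestamp`, `bytes`, `label`) are hypothetical, and converting times to seconds since the Unix epoch is only one possible way to obtain a floating-point value. Likewise, `pd.to_numeric` and `pd.factorize` are one choice among several for turning strings into numbers.

```python
import pandas as pd

# Hypothetical dataset with a time column and two string columns
df = pd.DataFrame({
    "timestamp": ["2021-01-01 00:00:00", "2021-01-01 00:01:00", "2021-01-01 00:02:00"],
    "bytes": ["100", "oops", "300"],
    "label": ["BENIGN", "ATTACK", "BENIGN"],
})

# Step 1: inspect the column types and count missing values per column
print(df.dtypes)
print(df.isna().sum())

# Step 2: turn the time column into seconds since the Unix epoch (a float)
epoch = pd.Timestamp("1970-01-01")
df["timestamp"] = (pd.to_datetime(df["timestamp"]) - epoch).dt.total_seconds()

# Step 3a: strings that should be numbers; unparseable entries become NaN
df["bytes"] = pd.to_numeric(df["bytes"], errors="coerce")

# Step 3b: categorical strings mapped to integer codes
codes, categories = pd.factorize(df["label"])
df["label"] = codes

print(df)
```

After step 3a the unparseable entry appears as NaN, so it can then be handled with the `dropna` or `fillna` methods shown earlier.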


