Using Python for data loading

2021-11-13 01:59:12
OfStack

Preface

Recently I took part in a Datawhale team-learning activity. The group study pushed me to write things up, because producing output forces better input and processing. I started publishing my first article from 6 to 15. I will write about the problems I actually ran into, hoping they help you, the reader in front of the screen.

My job involves a lot of tedious work that can be automated, so I picked up the Python I had touched on at school and pieced together what I needed bit by bit. The scripts are usable for now and save me at least 3 hours of data processing every day. With the hammer of Python in my hand, everything looks like a nail.

First, install Anaconda. After a successful installation, launch Jupyter Notebook and open a notebook, where you can type and run code in cells.

Now you can try to do data analysis.

1. Data loading

1.1 Loading data

Download the data set: https://www.kaggle.com/c/titanic/overview

1.1.1 Import packages

Import numpy and pandas


import pandas as pd
import numpy as np

If the import raises an error, check the capitalization and the spelling of the package names.

1.1.2 Loading data

(1) Loading data using a relative path

df = pd.read_csv('train.csv')
df.head(3)

(2) Loading data using an absolute path

df = pd.read_csv('/Users/Documents/train.csv')
df.head(3)

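Pay attention to the direction of the slashes in an absolute path: use forward slashes "/". On Windows, backslashes in an ordinary string must be doubled, or the string must be prefixed with r to make it a raw string.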
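A relative path is resolved against the current working directory, so when a file is "not found" it often helps to check where the notebook is actually running. The following is only a sketch under stated assumptions: it writes a tiny made-up CSV (the file name sample.csv and its two rows are hypothetical, not the Titanic data) into a temporary folder, then loads it once by absolute path and once by relative path.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

# Made-up sample data, only for illustration (not the real Titanic file)
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / 'sample.csv'
csv_path.write_text('PassengerId,Survived\n1,0\n2,1\n')

# Absolute path: works no matter what the working directory is
df_abs = pd.read_csv(csv_path.resolve())

# Relative path: resolved against the current working directory
old_cwd = os.getcwd()
os.chdir(tmp_dir)
try:
    df_rel = pd.read_csv('sample.csv')
finally:
    os.chdir(old_cwd)  # restore the working directory

# Both calls should load the same table
print(df_abs.equals(df_rel))
```

In a notebook, os.getcwd() shows the folder that relative paths such as 'train.csv' are looked up in.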

1.1.3 Read large files in chunks

Read the file chunk by chunk, with every 1000 rows forming one chunk.


chunker = pd.read_csv('train.csv', chunksize=1000)
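With chunksize set, read_csv does not return a DataFrame but an iterator that yields one DataFrame per chunk, so the chunks are usually processed in a loop. A minimal sketch, assuming a made-up 2500-row stand-in for train.csv that is built in memory (the data and the variable names chunk_sizes and survived_total are for illustration only):

```python
import io

import pandas as pd

# Made-up stand-in for train.csv: 2500 rows with two columns
csv_text = 'PassengerId,Survived\n' + ''.join(
    f'{i},{i % 2}\n' for i in range(1, 2501)
)

# With chunksize set, read_csv returns an iterator of DataFrames
chunker = pd.read_csv(io.StringIO(csv_text), chunksize=1000)

chunk_sizes = []
survived_total = 0
for chunk in chunker:
    chunk_sizes.append(len(chunk))              # rows in this chunk
    survived_total += chunk['Survived'].sum()   # aggregate chunk by chunk

print(chunk_sizes)
print(survived_total)
```

Only one chunk is held in memory at a time, which is the point of reading a large file this way.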

1.1.4 Modify the column names

Rename the columns of the whole table: change the header to Chinese and set the passenger ID as the index. Note that the list of new names must match the 12 columns of the file in both number and order; once the passenger ID becomes the index, 11 data columns remain.

PassengerId => Passenger ID
Survived => Survive or not
Pclass => Passenger class (1/2/3 class)
Name => Passenger name
Sex => Gender
Age => Age
SibSp => Number of siblings/spouses
Parch => Number of parents and children
Ticket => Ticket information
Fare => Fare
Cabin => Passenger cabin
Embarked => Port of boarding


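# New column names, in the same order as the 12 columns in the file
columns = ['Passenger ID', 'Survive or not', 'Passenger class', 'Name',
           'Gender', 'Age', 'Number of siblings/spouses',
           'Number of parents and children', 'Ticket information', 'Fare',
           'Passenger cabin', 'Port of boarding']
# header=0 discards the original header row; names supplies the replacements
df = pd.read_csv('train.csv', names=columns, index_col='Passenger ID', header=0)
df.head()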
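Passing names to read_csv relies on position, which is why the count and order must be exact. A different approach, not the one used above, is to load the file normally and rename with a dictionary through DataFrame.rename, which matches columns by name. A minimal sketch, assuming a made-up three-column sample instead of the real file (the dictionary new_names is hypothetical):

```python
import io

import pandas as pd

# Made-up rows using three of the Titanic column names, for illustration only
csv_text = 'PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n'

df = pd.read_csv(io.StringIO(csv_text))

# Map old names to new names; columns not listed keep their original name
new_names = {
    'PassengerId': 'Passenger ID',
    'Survived': 'Survive or not',
    'Pclass': 'Passenger class',
}
df = df.rename(columns=new_names).set_index('Passenger ID')

print(df.columns.tolist())
print(df.index.name)
```

Because the mapping is by name, the order of the dictionary entries does not matter.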

1.2 Preliminary observation

After importing the data, we can get an overview of its overall structure and a few sample rows: the size of the data, how many columns there are, the type of each column, whether it contains null values, and so on. Note that info with and without parentheses behaves differently: df.info() calls the method and prints the summary, while df.info without parentheses only refers to the method itself.


# info() prints the summary itself and returns None, so no print() is needed
df.info()
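The overview items mentioned above (size, column types, null values) can also be looked at one by one. A minimal sketch, assuming a made-up three-row sample rather than the real train.csv:

```python
import io

import pandas as pd

# Made-up rows for illustration; the empty Age in the second row becomes NaN
csv_text = 'PassengerId,Survived,Age\n1,0,22\n2,1,\n3,1,26\n'
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
df.info()          # prints the summary itself and returns None

# Without parentheses, df.info is only the method object, nothing is printed
print(callable(df.info))
```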

To preview the data, use head() for the first rows and tail() for the last rows.


df.head(10)
df.tail(15)

isnull() checks whether each value is missing: it returns True where the value is empty and False where it is not.


df.isnull().head()
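Since True counts as 1 when summed, chaining sum() after isnull() gives the number of missing values per column, which is often easier to read than the table of True and False. A minimal sketch, assuming made-up rows with a few empty fields:

```python
import io

import pandas as pd

# Made-up rows for illustration; empty fields are read as NaN
csv_text = 'PassengerId,Age,Cabin\n1,22,\n2,,C85\n3,26,\n'
df = pd.read_csv(io.StringIO(csv_text))

# True counts as 1, so the sum is the number of missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```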

1.3 Saving Data

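Save the table as a new file, train_chinese.csv, in the working directory. If you don't want the index written to the file, add index=False. Keep in mind that if the passenger ID was set as the index, as in 1.1.4, index=False leaves it out of the saved file.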


df.to_csv('train_chinese.csv', index=False)
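To see what the index argument changes, the two variants can be compared side by side. The sketch below is an illustration under stated assumptions: it uses a made-up two-row table and writes to a string instead of a file (to_csv returns the CSV text when no path is given).

```python
import io

import pandas as pd

# Made-up rows for illustration, with the ID column used as the index
df = pd.DataFrame(
    {'Survive or not': [0, 1], 'Age': [22.0, 38.0]},
    index=pd.Index([1, 2], name='Passenger ID'),
)

with_index = df.to_csv()                # index column is written
without_index = df.to_csv(index=False)  # index column is left out

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the first version back with index_col restores the index
restored = pd.read_csv(io.StringIO(with_index), index_col='Passenger ID')
print(restored.index.tolist())
```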