Implementing the KNN machine learning algorithm from scratch in Python
- 2021-11-13 01:58:49
- OfStack
1. Import data
Importing data with the pandas library is very simple (pandas is a third-party package and has to be installed separately). The data used here is the wine dataset, downloaded to the local machine as a CSV file.
The code is as follows (example):
import pandas as pd

# Note: despite its name, this function reads a CSV file
def read_xlsx(csv_path):
    data = pd.read_csv(csv_path)
    print(data)
    return data
2. Normalization
KNN relies on distances between samples, so normalization is an important step: it removes the effect of features being measured on different scales. I used min-max normalization here; standardization would also work, but as a novice I find min-max normalization simpler.
The maximum and minimum values are computed with the numpy library. Data imported by pandas is a DataFrame, so np.array() is used to convert it to an ndarray that numpy can operate on.
The code is as follows (example):
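import numpy as np

def MinMaxScaler(data):
    col = data.shape[1]
    # Scale every feature column; the last column holds the labels and is skipped
    for i in range(0, col-1):
        arr = data.iloc[:, i]
        # Convert the DataFrame column to an ndarray so that numpy can operate on it
        arr = np.array(arr)
        # Named col_min/col_max to avoid shadowing the built-in min() and max()
        col_min = np.min(arr)
        col_max = np.max(arr)
        # Assumes col_max > col_min; a constant column would divide by zero
        arr = (arr - col_min) / (col_max - col_min)
        data.iloc[:, i] = arr
    return data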
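The same scaling can also be written without an explicit loop. The following is only a sketch under stated assumptions: the function name min_max_scale and the sample values are made up for illustration, and mapping constant columns to 0 is one possible convention, not something the original code does.

```python
import numpy as np

def min_max_scale(features):
    """Scale each column of a 2-D array to [0, 1]; constant columns become 0."""
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    # Avoid dividing by zero when a column holds a single repeated value.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (features - col_min) / safe_range

# Hypothetical sample: three rows, three feature columns (the last is constant).
sample = np.array([[1.0, 10.0, 5.0],
                   [2.0, 20.0, 5.0],
                   [3.0, 40.0, 5.0]])
scaled = min_max_scale(sample)
print(scaled)
```

After scaling, every non-constant column has minimum 0 and maximum 1, so no single feature dominates the distance calculation just because of its units.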
3. Split into training and test sets
First, the feature values and the label values are separated into x and y. The random seed random_state can then be set; if it is not set, each run produces a different split. test_size is the fraction of the data assigned to the test set.
def train_test_split(data, test_size=0.2, random_state=None):
    col = data.shape[1]
    # Features are every column except the last; the last column is the label
    x = data.iloc[:, 0:col-1]
    y = data.iloc[:, -1]
    x = np.array(x)
    y = np.array(y)
    # Fix the random seed so that the shuffle is reproducible (0 is a valid seed)
    if random_state is not None:
        np.random.seed(random_state)
    # permutation returns the indices 0 .. len(x)-1 in random order
    shuffle_indexs = np.random.permutation(len(x))
    # Number of samples in the test set (20% of the data by default)
    n_test = int(len(x) * test_size)
    # The first n_test shuffled indices form the test set
    test_indexs = shuffle_indexs[:n_test]
    # The remaining indices form the training set
    train_indexs = shuffle_indexs[n_test:]
    # Extract the training set and test set by index
    x_train = x[train_indexs]
    y_train = y[train_indexs]
    x_test = x[test_indexs]
    y_test = y[test_indexs]
    # Return the split data set
    return x_train, x_test, y_train, y_test
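To see why the seed matters, the shuffling step can be looked at in isolation. This is a minimal sketch, not the code from the project: split_indices is a hypothetical helper, and it uses numpy's np.random.default_rng generator instead of the global np.random.seed used above.

```python
import numpy as np

def split_indices(n_samples, test_size=0.2, seed=None):
    """Return (train_indices, test_indices) from a shuffled range of indices."""
    # A local generator keeps the seed from affecting other random calls.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(n_samples * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# The same seed gives the same split on every call.
train_a, test_a = split_indices(10, test_size=0.2, seed=42)
train_b, test_b = split_indices(10, test_size=0.2, seed=42)
print(np.array_equal(test_a, test_b))
```

With seed=None the split changes from run to run, which makes accuracy figures hard to compare between runs.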
4. Calculate the distance
Euclidean distance is used here: the squared differences are computed with the pow() function, summed over all attributes, and the square root of the sum is taken. length is the number of attribute values; it is passed in when the nearest neighbors are calculated.
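def CountDistance(train, test, length):
    distance = 0
    # Sum the squared differences over all attributes
    for x in range(length):
        distance += pow(test[x] - train[x], 2)
    # Take the square root once, after summing, to get the Euclidean distance
    return distance ** 0.5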
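The position of the square root matters. For Euclidean distance it is applied once, to the sum of the squared differences; taking the root of each squared term separately gives the sum of absolute differences, which is the Manhattan distance. The sketch below illustrates the difference; the function names and the two points are made up for this example.

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the sum of squared differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    """Sum of absolute differences (what a per-term square root produces)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

Both are valid distance measures for KNN, but they can rank neighbors differently, so it is worth knowing which one the code actually computes.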
5. Select the nearest neighbors
Compute the distance between one sample in the test set and every sample in the training set, select the k nearest, and assign the label by majority vote. argsort returns the indices that sort the distances in ascending order, which are then used to look up the corresponding labels.
Tip: computing the mode with numpy
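import numpy as np
nums = [1, 2, 2]  # example labels, chosen only for illustration
# bincount() counts how often each non-negative integer occurs; it does not accept floats
counts = np.bincount(nums)
# The index with the highest count is the mode
np.argmax(counts)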
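A worked example makes the tip concrete. The values below are invented for illustration. collections.Counter is shown as an alternative, on the assumption that labels might be strings rather than non-negative integers; the original code does not use it.

```python
from collections import Counter

import numpy as np

# Hypothetical integer labels of six neighbors.
nums = np.array([1, 2, 2, 0, 2, 1])
counts = np.bincount(nums)      # counts[i] is how many times i occurs
mode = int(np.argmax(counts))   # index of the largest count
print(counts, mode)

# Counter also works for labels that bincount cannot handle, such as strings.
labels = ["red", "white", "red"]
mode_label = Counter(labels).most_common(1)[0][0]
print(mode_label)
```

When two labels are tied, np.argmax returns the first maximum, so the smaller label wins the vote.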
Following the majority-vote rule, the function below computes the mode of the neighbors' labels and returns it as the predicted label.
def getNeighbor(x_train, test, y_train, k):
    distance = []
    # Number of attributes in each sample
    length = x_train.shape[1]
    # Distance from the test sample to every training sample
    for x in range(x_train.shape[0]):
        dist = CountDistance(x_train[x], test, length)
        distance.append(dist)
    distance = np.array(distance)
    # Indices of the training samples, ordered from nearest to farthest
    distanceSort = distance.argsort()
    # Collect the labels of the k nearest training samples
    neighbors = []
    for x in range(k):
        labels = y_train[distanceSort[x]]
        neighbors.append(labels)
    # Majority vote: the most frequent label wins (labels must be non-negative integers)
    counts = np.bincount(neighbors)
    label = np.argmax(counts)
    return label
Example call:
getNeighbor(x_train,x_test[0],y_train,3)
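For comparison, the same prediction step can be written with numpy broadcasting instead of Python loops. This is only a sketch under assumptions: knn_predict is a hypothetical name, the tiny training set is invented, and labels are assumed to be non-negative integers as np.bincount requires.

```python
import numpy as np

def knn_predict(x_train, y_train, query, k=3):
    """Predict the label of one query sample by majority vote of its k nearest neighbors."""
    # Broadcasting subtracts the query from every training row at once.
    diffs = x_train - query
    dists = np.sqrt(np.sum(diffs ** 2, axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Invented training data: three samples of class 0 and two of class 1.
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 0, 1, 1])

pred_low = knn_predict(x_train, y_train, np.array([0.05, 0.05]), k=3)
pred_high = knn_predict(x_train, y_train, np.array([0.95, 1.0]), k=3)
print(pred_low, pred_high)
```

On larger data sets the vectorized form is usually much faster, because the distance calculation runs inside numpy rather than in a Python loop.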
6. Calculate the accuracy
The KNN algorithm above is used to predict the label of every sample in the test set, and the predictions are stored in the result list. The predictions are then compared with the true labels; the ratio of correct predictions to the total number of samples is the accuracy.
def getAccuracy(x_test, x_train, y_train, y_test):
    result = []
    k = 3
    # Predict a label for every sample in the test set
    for x in range(len(x_test)):
        arr_label = getNeighbor(x_train, x_test[x], y_train, k)
        result.append(arr_label)
    # Count the predictions that match the true labels
    correct = 0
    for x in range(len(y_test)):
        if result[x] == y_test[x]:
            correct += 1
    # Accuracy as a percentage of the test set
    accuracy = (correct / float(len(y_test))) * 100.0
    print("Accuracy:", accuracy, "%")
    return accuracy
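The comparison loop can be condensed with numpy. This is a minimal sketch; accuracy_percent is a hypothetical name and the labels are invented. It assumes the predictions have already been collected in a list or array.

```python
import numpy as np

def accuracy_percent(y_true, y_pred):
    """Percentage of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Comparing element-wise gives booleans; their mean is the fraction correct.
    return float(np.mean(y_true == y_pred) * 100.0)

# Invented example: three of four predictions are correct.
acc = accuracy_percent([0, 1, 1, 0], [0, 1, 0, 0])
print("Accuracy:", acc, "%")  # Accuracy: 75.0 %
```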
Summary
KNN is one of the simplest algorithms in machine learning and is relatively easy to implement, but for a novice like me it still took a lot of time to get working.
The project is on GitHub: https://github.com/chenyi369/KNN