Implementing the KNN machine learning algorithm from scratch in Python

  • 2021-11-13 01:58:49
  • OfStack

1. Import data

Importing data with the pandas library is very simple. The data used here is the wine dataset, downloaded to a local file.

The code is as follows (example):


import pandas as pd

def read_csv_data(csv_path):
    # Load the locally saved wine dataset from a CSV file into a DataFrame
    data = pd.read_csv(csv_path)
    print(data)
    return data
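
A minimal usage sketch (the file name "wine.csv" is a placeholder for wherever the dataset was saved locally):


data = read_csv_data("wine.csv")  # "wine.csv" is an assumed local path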

2. Normalization

The KNN algorithm relies on distances, so normalization is an important step: it removes the effect of each feature's scale (its units). Standardization would also work for this, but as a novice I find min-max normalization simpler.

The maximum and minimum values are computed with the numpy library. The data imported by pandas is a DataFrame, so np.array() is used to convert each column into an ndarray that numpy can operate on.

The code is as follows (example):


import numpy as np

def MinMaxScaler(data):
    col = data.shape[1]
    # Normalize every column except the last one (the label column)
    for i in range(0, col-1):
        arr = data.iloc[:, i]
        arr = np.array(arr)  # convert the DataFrame column to an ndarray for numpy computation
        arr_min = np.min(arr)
        arr_max = np.max(arr)
        arr = (arr - arr_min) / (arr_max - arr_min)
        data.iloc[:, i] = arr
    return data
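
A quick sanity check of the scaler, assuming data is the DataFrame returned by read_csv_data above: after scaling, every feature column should lie in [0, 1].


data = MinMaxScaler(data)
# Smallest and largest values across all feature columns; expected: 0.0 1.0
print(data.iloc[:, :-1].min().min(), data.iloc[:, :-1].max().max())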

3. Split into training and test sets

First, the feature values and label values are split into x and y, and the random seed random_state is set; if it is not set, each run produces a different split. test_size is the fraction of the data held out for testing.


def train_test_split(data, test_size=0.2, random_state=None):
    col = data.shape[1]
    x = data.iloc[:, 0:col-1]
    y = data.iloc[:, -1]
    x = np.array(x)
    y = np.array(y)
    # Set the random seed; when it is given, the shuffle is reproducible
    if random_state is not None:
        np.random.seed(random_state)
    # permutation generates a random ordering of the indices 0..len(x)-1
    shuffle_indexs = np.random.permutation(len(x))
    # Number of samples to hold out for testing
    test_size = int(len(x) * test_size)
    # The first test_size shuffled indices become the test set
    test_indexs = shuffle_indexs[:test_size]
    # The remaining indices become the training set
    train_indexs = shuffle_indexs[test_size:]
    # Extract the training and test sets by index
    x_train = x[train_indexs]
    y_train = y[train_indexs]
    x_test = x[test_indexs]
    y_test = y[test_indexs]
    # Return the split data sets
    return x_train, x_test, y_train, y_test
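
A minimal usage sketch, assuming data is the normalized DataFrame from the previous step:


x_train, x_test, y_train, y_test = train_test_split(data, test_size=0.2, random_state=1)
# Roughly 80% of the rows go to training, 20% to testing
print(x_train.shape, x_test.shape)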

4. Calculate the distance

Euclidean distance is used here: the pow() function computes the squared difference for each feature, the squares are summed, and the square root is taken at the end. length is the number of feature (attribute) values, and it is used later when finding the nearest neighbors.


def CountDistance(train, test, length):
    # Euclidean distance: sum the squared differences, then take the square root once
    distance = 0
    for x in range(length):
        distance += pow(test[x] - train[x], 2)
    return distance ** 0.5
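
As a cross-check, the same Euclidean distance can be computed in one call with numpy. This vectorized version is a sketch of an equivalent, not part of the original project:


def count_distance_vectorized(train, test, length):
    # np.linalg.norm returns the Euclidean norm of the difference vector
    return np.linalg.norm(np.asarray(test[:length]) - np.asarray(train[:length]))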

5. Choose the nearest neighbors

Calculate the distance between one sample in the test set and every sample in the training set, select the k nearest ones, and take the majority label as the prediction. Here argsort returns the indices that would sort the distances in ascending order, which makes it easy to look up the corresponding labels.

tip: computing the mode with numpy


import numpy as np

# bincount() counts occurrences of non-negative integers (it does not accept floats)
nums = np.array([1, 2, 2, 3, 2])  # example label array
counts = np.bincount(nums)
# The index with the highest count is the mode (here: 2)
np.argmax(counts)

Following this majority-vote principle, the mode of the k neighbor labels is computed and returned as the predicted label.


def getNeighbor(x_train, test, y_train, k):
    distance = []
    # Number of features per sample
    length = x_train.shape[1]
    # Distance from the test sample to every training sample
    for x in range(x_train.shape[0]):
        dist = CountDistance(test, x_train[x], length)
        distance.append(dist)
    distance = np.array(distance)
    # Indices of the distances sorted in ascending order
    distanceSort = distance.argsort()
    # Collect the labels of the k nearest neighbors
    neighbors = []
    for x in range(k):
        labels = y_train[distanceSort[x]]
        neighbors.append(labels)
    # Majority vote: the mode of the neighbor labels is the prediction
    counts = np.bincount(neighbors)
    label = np.argmax(counts)
    return label

Calling the function for a single test sample:


getNeighbor(x_train, x_test[0], y_train, 3)

6. Calculate the accuracy

The KNN routine above is used to predict the label of every sample in the test set. The predictions are stored in the result list and compared with the true values; the ratio of correct predictions to the total number of test samples is the accuracy.


def getAccuracy(x_test, x_train, y_train, y_test):
    result = []
    k = 3
    # Predict a label for every sample in the test set
    for x in range(len(x_test)):
        arr_label = getNeighbor(x_train, x_test[x], y_train, k)
        result.append(arr_label)
    # Count how many predictions match the true labels
    correct = 0
    for x in range(len(y_test)):
        if result[x] == y_test[x]:
            correct += 1
    accuracy = (correct / float(len(y_test))) * 100.0
    print("Accuracy:", accuracy, "%")
    return accuracy
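
Putting the steps together, a minimal end-to-end sketch. The file name "wine.csv" and the column layout (features first, class label last) are assumptions about the locally saved dataset; note that np.bincount in getNeighbor requires the labels to be non-negative integers.


data = read_csv_data("wine.csv")  # "wine.csv" is an assumed local path
data = MinMaxScaler(data)
x_train, x_test, y_train, y_test = train_test_split(data, test_size=0.2, random_state=1)
getAccuracy(x_test, x_train, y_train, y_test)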

Summary

KNN is one of the simplest algorithms in machine learning and is relatively easy to implement, but for a novice like me it still took most of my time to get it working.

The project has been uploaded to GitHub: https://github.com/chenyi369/KNN

