python Implementation of KNN Nearest Neighbor Algorithm

  • 2021-09-05 00:18:44
  • OfStack

Example: "Classification of Movie Types"

Get data source

电影名称 打斗次数 接吻次数 电影类型
California Man 3 104 Romance
He's Not Really into Dudes 8 95 Romance
Beautiful Woman 1 81 Romance
Kevin Longblade 111 15 Action
Roob Slayer 3000 99 2 Action
Amped II 88 10 Action
Unknown 18 90 unknown

Data show: What is the movie type unknown judged by naked eyes


from matplotlib import pyplot as plt
​
#  Used to display Chinese labels normally 
plt.rcParams["font.sans-serif"] = ["SimHei"]
#  Movie title 
names = ["California Man", "He's Not Really into Dudes", "Beautiful Woman",
   "Kevin Longblade", "Robo Slayer 3000", "Amped II", "Unknown"]
#  Type label 
labels = ["Romance", "Romance", "Romance", "Action", "Action", "Action", "Unknown"]
colors = ["darkblue", "red", "green"]
colorDict = {label: color for (label, color) in zip(set(labels), colors)}
print(colorDict)
#  Number of fights, number of kisses 
X = [3, 8, 1, 111, 99, 88, 18]
Y = [104, 95, 81, 15, 2, 10, 88]
​
plt.title(" Judge the movie type by the number of fights and kisses ", fontsize=18)
plt.xlabel(" The number of times fighting scenes appear in movies ", fontsize=16)
plt.ylabel(" The number of kissing scenes in movies ", fontsize=16)
​
#  Drawing data 
for i in range(len(X)):
 #  Scatter plot drawing 
 plt.scatter(X[i], Y[i], color=colorDict[labels[i]])
​
#  Add description information to each point 
for i in range(0, 7):
 plt.text(X[i]+2, Y[i]-1, names[i], fontsize=14)
​
plt.show()

Problem analysis: What is the movie type unknown based on known information

Core idea:

The class of an unlabeled sample is determined by the class of its nearest K neighbors

Distance measurement:

The general distance calculation uses Euclidean distance (Pythagorean theorem to calculate distance), and Manhattan distance (sum of horizontal and vertical distances), cosine value and similarity (this is another expression of distance). Compared with the above distance, Mahalanobis distance is more accurate because it can consider many factors, such as unit, which may not exist in the process of finding the inverse matrix of covariance matrix, and if it encounters three dimensions or more, the solution process is extremely complicated, so Mahalanobis distance can not be used

Knowledge expansion

Mahalanobis distance concept: representing covariance distance of data Variance: The average of the square of the distance from each point in the data set to the mean point Standard deviation: the square of variance Covariance cov (x, y): E represents mean value, D represents variance, x and y represent different data sets, and xy represents the corresponding product of data set elements to form a data set

cov(x, y) = E(xy) - E(x)*E(y)

cov(x, x) = D(x)

cov(x1+x2, y) = cov(x1, y) + cov(x2, y)

cov(ax, by) = abcov(x, y)

Covariance matrix: According to the matrix composed of dimensions, it is assumed that there are three dimensions, a, b and c

Σ ij = [cov (a, a) cov (a, b) cov (a, c) cov (b, a) cov (b, b) cov (b, c) cov (c, a) cov (c, c)]

Algorithm implementation: Euclidean distance

Coding implementation


#  Custom implementation  mytest1.py
import numpy as np
​
#  Create a dataset 
def createDataSet():
 features = np.array([[3, 104], [8, 95], [1, 81], [111, 15],
       [99, 2], [88, 10]])
 labels = ["Romance", "Romance", "Romance", "Action", "Action", "Action"]
 return features, labels
​
def knnClassify(testFeature, trainingSet, labels, k):
 """
 KNN The algorithm is realized by Euclidean distance 
 :param testFeature:  Test data set, ndarray Type, 1 Dimensional array 
 :param trainingSet:  Training data set, ndarray Type, 2 Dimensional array 
 :param labels:  The training set corresponds to labels, ndarray Type, 1 Dimensional array 
 :param k: k Value, int Type 
 :return:  Prediction results, types, and elements in tags 1 To 
 """
 dataSetsize = trainingSet.shape[0]
 """
  Build 1 By dataSet[i] - testFeature New dataset of diffMat
 diffMat Each element in the dataSet Each feature in the testFeature Difference of (Euclidean distance median difference) 
 """
 testFeatureArray = np.tile(testFeature, (dataSetsize, 1))
 diffMat = testFeatureArray - trainingSet
 #  Square each difference 
 sqDiffMat = diffMat ** 2
 #  Calculation dataSet Each attribute in the testFeature The sum of the squares of the difference of 
 sqDistances = sqDiffMat.sum(axis=1)
 #  Calculate each feature And testFeature Euclidean distance between 
 distances = sqDistances ** 0.5
​
 """
  Sort, record in order from small to large distances The location of each data in the 
  Such as distance = [5, 9, 0, 2]
  Then sortedStance = [2, 3, 0, 1]
 """
 sortedDistances = distances.argsort()
​
 #  Choose the one with the smallest distance k Point 
 classCount = {}
 for i in range(k):
  voteiLabel = labels[list(sortedDistances).index(i)]
  classCount[voteiLabel] = classCount.get(voteiLabel, 0) + 1
 #  Right k The results are counted and sorted, the final results are selected, and the dictionary is according to value Sort values from large to small 
 sortedclassCount = sorted(classCount.items(), key=lambda x: x[1], reverse=True)
 return sortedclassCount[0][0]
​
testFeature = np.array([100, 200])
features, labels = createDataSet()
res = knnClassify(testFeature, features, labels, 3)
print(res)
#  Use python Package implementation  mytest2.py
from sklearn.neighbors import KNeighborsClassifier
from .mytest1 import createDataSet
​
features, labels = createDataSet()
k = 5
clf = KNeighborsClassifier(k_neighbors=k)
clf.fit(features, labels)
​
#  Sample value 
my_sample = [[18, 90]]
res = clf.predict(my_sample)
print(res)

Example: "Dating Site Matching Effect Prediction"

Data source: omitted

Data display


import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
​
#  Data loading 
def loadDatingData(file):
 datingData = pd.read_table(file, header=None)
 datingData.columns = ["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek", "label"]
 datingTrainData = np.array(datingData[["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek"]])
 datingTrainLabel = np.array(datingData["label"])
 return datingData, datingTrainData, datingTrainLabel
​
# 3D Graph display data 
def dataView3D(datingTrainData, datingTrainLabel):
 plt.figure(1, figsize=(8, 3))
 plt.subplot(111, projection="3d")
 plt.scatter(np.array([datingTrainData[x][0]
       for x in range(len(datingTrainLabel))
       if datingTrainLabel[x] == "smallDoses"]),
    np.array([datingTrainData[x][1]
       for x in range(len(datingTrainLabel))
       if datingTrainLabel[x] == "smallDoses"]),
    np.array([datingTrainData[x][2]
       for x in range(len(datingTrainLabel))
       if datingTrainLabel[x] == "smallDoses"]), c="red")
 plt.scatter(np.array([datingTrainData[x][0]
       for x in range(len(datingTrainLabel))
       if datingTrainLabel[x] == "didntLike"]),
    np.array([datingTrainData[x][1]
       for x in range(len(datingTrainLabel))
       if datingTrainLabel[x] == "didntLike"]),
    np.array([datingTrainData[x][2]
       for x in range(len(datingTrainLabel))
       if datingTrainLabel[x] == "didntLike"]), c="green")
 plt.scatter(np.array([datingTrainData[x][0]
       for x in range(len(datingTrainLabel))
       if datingTrainLabel[x] == "largeDoses"]),
    np.array([datingTrainData[x][1]
       for x in range(len(datingTrainLabel))
       if datingTrainLabel[x] == "largeDoses"]),
    np.array([datingTrainData[x][2]
       for x in range(len(datingTrainLabel))
       if datingTrainLabel[x] == "largeDoses"]), c="blue")
 plt.xlabel(" Flight mileage ", fontsize=16)
 plt.ylabel(" Percentage of video game time spent ", fontsize=16)
 plt.clabel(" Ice cream consumption ", fontsize=16)
 plt.show()
 
datingData, datingTrainData, datingTrainLabel = loadDatingData(FILEPATH1)
datingView3D(datingTrainData, datingTrainLabel)

Problem analysis: Extract the first 10% of the data set and test it in the last 90% of the data set

Coding implementation


#  Custom method implementation 
import pandas as pd
import numpy as np
​
#  Data loading 
def loadDatingData(file):
 datingData = pd.read_table(file, header=None)
 datingData.columns = ["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek", "label"]
 datingTrainData = np.array(datingData[["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek"]])
 datingTrainLabel = np.array(datingData["label"])
 return datingData, datingTrainData, datingTrainLabel
​
#  Data return 1 Chemical 
def autoNorm(datingTrainData):
 #  Gets the data set per 1 Maximum value of column 
 minValues, maxValues = datingTrainData.min(0), datingTrainData.max(0)
 diffValues = maxValues - minValues
 
 #  Define shapes and datingTrainData Similar minimum and difference matrices 
 m = datingTrainData.shape(0)
 minValuesData = np.tile(minValues, (m, 1))
 diffValuesData = np.tile(diffValues, (m, 1))
 normValuesData = (datingTrainData-minValuesData)/diffValuesData
 return normValuesData
​
#  Core algorithm implementation 
def KNNClassifier(testData, trainData, trainLabel, k):
 m = trainData.shape(0)
 testDataArray = np.tile(testData, (m, 1))
 diffDataArray = (testDataArray - trainData) ** 2
 sumDataArray = diffDataArray.sum(axis=1) ** 0.5
 #  Sort the results 
 sumDataSortedArray = sumDataArray.argsort()
 
 classCount = {}
 for i in range(k):
  labelName = trainLabel[list(sumDataSortedArray).index(i)]
  classCount[labelName] = classCount.get(labelName, 0)+1
 classCount = sorted(classCount.items(), key=lambda x: x[1], reversed=True)
 return classCount[0][0]
 
​
#  Data testing 
def datingTest(file):
 datingData, datingTrainData, datingTrainLabel = loadDatingData(file)
 normValuesData = autoNorm(datingTrainData)
 
 
 errorCount = 0
 ratio = 0.10
 total = datingTrainData.shape(0)
 numberTest = int(total * ratio)
 for i in range(numberTest):
  res = KNNClassifier(normValuesData[i], normValuesData[numberTest:m], datingTrainLabel, 5)
  if res != datingTrainLabel[i]:
   errorCount += 1
 print("The total error rate is : {}\n".format(error/float(numberTest)))
​
if __name__ == "__main__":
 FILEPATH = "./datingTestSet1.txt"
 datingTest(FILEPATH)
# python  No. 1 3 Square package implementation 
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
​
if __name__ == "__main__":
 FILEPATH = "./datingTestSet1.txt"
 datingData, datingTrainData, datingTrainLabel = loadDatingData(FILEPATH)
 normValuesData = autoNorm(datingTrainData)
 errorCount = 0
 ratio = 0.10
 total = normValuesData.shape[0]
 numberTest = int(total * ratio)
 
 k = 5
 clf = KNeighborsClassifier(n_neighbors=k)
 clf.fit(normValuesData[numberTest:total], datingTrainLabel[numberTest:total])
 
 for i in range(numberTest):
  res = clf.predict(normValuesData[i].reshape(1, -1))
  if res != datingTrainLabel[i]:
   errorCount += 1
 print("The total error rate is : {}\n".format(errorCount/float(numberTest)))

The above is python implementation KNN nearest neighbor algorithm details, more about python implementation KNN nearest neighbor algorithm information please pay attention to this site other related articles!


Related articles: