Python Implementation of the KNN Nearest Neighbor Algorithm
- 2021-09-05 00:18:44
- OfStack
Example: "Classification of Movie Types"
Get data source
Movie title | Fight scenes | Kiss scenes | Movie type
---|---|---|---
California Man | 3 | 104 | Romance
He's Not Really into Dudes | 8 | 95 | Romance
Beautiful Woman | 1 | 81 | Romance
Kevin Longblade | 111 | 15 | Action
Robo Slayer 3000 | 99 | 2 | Action
Amped II | 88 | 10 | Action
Unknown | 18 | 90 | unknown
Data visualization: judging the type of the unknown movie by eye
from matplotlib import pyplot as plt

# SimHei lets matplotlib render Chinese labels correctly (harmless if all labels are English)
plt.rcParams["font.sans-serif"] = ["SimHei"]

# Movie titles
names = ["California Man", "He's Not Really into Dudes", "Beautiful Woman",
         "Kevin Longblade", "Robo Slayer 3000", "Amped II", "Unknown"]
# Type labels
labels = ["Romance", "Romance", "Romance", "Action", "Action", "Action", "Unknown"]
colors = ["darkblue", "red", "green"]
# Map each distinct label to a fixed color (avoid zip(set(labels), ...), whose order is not deterministic)
colorDict = dict(zip(["Romance", "Action", "Unknown"], colors))
print(colorDict)

# Number of fight scenes and kiss scenes per movie
X = [3, 8, 1, 111, 99, 88, 18]
Y = [104, 95, 81, 15, 2, 10, 90]

plt.title("Judge the movie type by the number of fights and kisses", fontsize=18)
plt.xlabel("Number of fight scenes in the movie", fontsize=16)
plt.ylabel("Number of kiss scenes in the movie", fontsize=16)

# Draw the scatter plot, one point per movie, colored by type
for i in range(len(X)):
    plt.scatter(X[i], Y[i], color=colorDict[labels[i]])
# Add the movie title next to each point
for i in range(len(X)):
    plt.text(X[i] + 2, Y[i] - 1, names[i], fontsize=14)
plt.show()
Problem analysis: determine the type of the unknown movie from the known samples
Core idea:
The class of an unlabeled sample is decided by a majority vote among the classes of its K nearest labeled neighbors
Distance measurement:
The distance is usually measured with the Euclidean distance (the straight-line distance given by the Pythagorean theorem); the Manhattan distance (the sum of the absolute differences along each axis) and cosine similarity (another way of expressing how close two samples are) are also used. The Mahalanobis distance is in principle more accurate because it takes the scale and correlation of the features into account through the covariance matrix, but the inverse of the covariance matrix may not exist, and with three or more dimensions the computation becomes quite involved, so it is not used here.
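To make the common metrics concrete, here is a small sketch (using NumPy, which the code below already relies on) that compares them on one pair of points from the movie table; the variable names are purely illustrative:

import numpy as np

a = np.array([3, 104])   # California Man
b = np.array([18, 90])   # the unknown movie

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of per-axis absolute differences
cosine_similarity = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))  # closer to 1 means more similar
print(euclidean, manhattan, cosine_similarity)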
Knowledge expansion
Mahalanobis distance: a distance measure based on the covariance of the data.
Variance D(x): the mean of the squared distances from each point in the data set to the mean of the set.
Standard deviation: the square root of the variance.
Covariance cov(x, y): with E denoting the mean, D the variance, x and y two data sets, and xy the data set formed by their element-wise products:
cov(x, y) = E(xy) - E(x)*E(y)
cov(x, x) = D(x)
cov(x1 + x2, y) = cov(x1, y) + cov(x2, y)
cov(ax, by) = ab*cov(x, y)
Covariance matrix: the matrix built from the pairwise covariances of the dimensions; assuming three dimensions a, b and c:
Σ = | cov(a,a)  cov(a,b)  cov(a,c) |
    | cov(b,a)  cov(b,b)  cov(b,c) |
    | cov(c,a)  cov(c,b)  cov(c,c) |
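As a quick sanity check of these definitions, the following sketch evaluates them with NumPy on two small, arbitrary data sets (the numbers carry no meaning):

import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 1.0, 5.0, 6.0])

# cov(x, y) = E(xy) - E(x)*E(y)
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
# cov(x, x) equals the variance D(x)
print(cov_xy, np.mean(x * x) - np.mean(x) ** 2, np.var(x))

# Covariance matrix of the two dimensions stacked as rows;
# bias=True makes np.cov use the E(xy) - E(x)*E(y) definition above
print(np.cov(np.vstack([x, y]), bias=True))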
Algorithm implementation: Euclidean distance
Coding implementation
# Custom implementation: mytest1.py
import numpy as np

# Create the data set
def createDataSet():
    features = np.array([[3, 104], [8, 95], [1, 81], [111, 15],
                         [99, 2], [88, 10]])
    labels = ["Romance", "Romance", "Romance", "Action", "Action", "Action"]
    return features, labels

def knnClassify(testFeature, trainingSet, labels, k):
    """
    KNN classification using the Euclidean distance
    :param testFeature: test sample, 1-D ndarray
    :param trainingSet: training samples, 2-D ndarray
    :param labels: labels of the training samples, one per row of trainingSet
    :param k: number of neighbors, int
    :return: the predicted class, one of the elements of labels
    """
    dataSetSize = trainingSet.shape[0]
    # Build diffMat: the per-feature difference between each training sample
    # and testFeature (the differences inside the Euclidean distance)
    testFeatureArray = np.tile(testFeature, (dataSetSize, 1))
    diffMat = testFeatureArray - trainingSet
    # Square each difference
    sqDiffMat = diffMat ** 2
    # Sum the squared differences for each training sample
    sqDistances = sqDiffMat.sum(axis=1)
    # Euclidean distance between each training sample and testFeature
    distances = sqDistances ** 0.5
    # argsort records, from smallest to largest distance, the index of each sample,
    # e.g. distances = [5, 9, 0, 2]  ->  sortedDistances = [2, 3, 0, 1]
    sortedDistances = distances.argsort()
    # Vote among the k closest samples
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistances[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Sort the vote counts from largest to smallest and return the winning class
    sortedClassCount = sorted(classCount.items(), key=lambda x: x[1], reverse=True)
    return sortedClassCount[0][0]

testFeature = np.array([100, 200])
features, labels = createDataSet()
res = knnClassify(testFeature, features, labels, 3)
# The three nearest samples to [100, 200] are all Romance movies, so this prints "Romance"
print(res)
# Implementation with the scikit-learn package: mytest2.py
from sklearn.neighbors import KNeighborsClassifier
from mytest1 import createDataSet

features, labels = createDataSet()
k = 5
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(features, labels)
# Sample to classify
my_sample = [[18, 90]]
res = clf.predict(my_sample)
# With k=5 the majority of the neighbors are Romance movies, so this prints ['Romance']
print(res)
Example: "Dating Site Matching Effect Prediction"
Data source: omitted
Data display
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the "3d" projection

# Data loading
def loadDatingData(file):
    datingData = pd.read_table(file, header=None)
    datingData.columns = ["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek", "label"]
    datingTrainData = np.array(datingData[["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek"]])
    datingTrainLabel = np.array(datingData["label"])
    return datingData, datingTrainData, datingTrainLabel

# Display the data in a 3D scatter plot, one color per label
def dataView3D(datingTrainData, datingTrainLabel):
    plt.figure(1, figsize=(8, 3))
    ax = plt.subplot(111, projection="3d")
    smallDoses = datingTrainData[datingTrainLabel == "smallDoses"]
    didntLike = datingTrainData[datingTrainLabel == "didntLike"]
    largeDoses = datingTrainData[datingTrainLabel == "largeDoses"]
    ax.scatter(smallDoses[:, 0], smallDoses[:, 1], smallDoses[:, 2], c="red")
    ax.scatter(didntLike[:, 0], didntLike[:, 1], didntLike[:, 2], c="green")
    ax.scatter(largeDoses[:, 0], largeDoses[:, 1], largeDoses[:, 2], c="blue")
    ax.set_xlabel("Flight mileage", fontsize=16)
    ax.set_ylabel("Percentage of time spent playing video games", fontsize=16)
    ax.set_zlabel("Ice cream consumed per week", fontsize=16)
    plt.show()

FILEPATH1 = "./datingTestSet1.txt"
datingData, datingTrainData, datingTrainLabel = loadDatingData(FILEPATH1)
dataView3D(datingTrainData, datingTrainLabel)
Problem analysis: take the first 10% of the data set as test samples and classify them against the remaining 90%, which serves as the training set
Coding implementation
# Custom implementation
import pandas as pd
import numpy as np

# Data loading
def loadDatingData(file):
    datingData = pd.read_table(file, header=None)
    datingData.columns = ["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek", "label"]
    datingTrainData = np.array(datingData[["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek"]])
    datingTrainLabel = np.array(datingData["label"])
    return datingData, datingTrainData, datingTrainLabel

# Data normalization: rescale every column to [0, 1]
def autoNorm(datingTrainData):
    # Minimum and maximum of each column
    minValues, maxValues = datingTrainData.min(0), datingTrainData.max(0)
    diffValues = maxValues - minValues
    # Build minimum and range matrices with the same shape as datingTrainData
    m = datingTrainData.shape[0]
    minValuesData = np.tile(minValues, (m, 1))
    diffValuesData = np.tile(diffValues, (m, 1))
    normValuesData = (datingTrainData - minValuesData) / diffValuesData
    return normValuesData

# Core algorithm
def KNNClassifier(testData, trainData, trainLabel, k):
    m = trainData.shape[0]
    testDataArray = np.tile(testData, (m, 1))
    diffDataArray = (testDataArray - trainData) ** 2
    # Euclidean distance to every training sample
    sumDataArray = diffDataArray.sum(axis=1) ** 0.5
    # Sort the distances and vote among the k nearest samples
    sumDataSortedArray = sumDataArray.argsort()
    classCount = {}
    for i in range(k):
        labelName = trainLabel[sumDataSortedArray[i]]
        classCount[labelName] = classCount.get(labelName, 0) + 1
    classCount = sorted(classCount.items(), key=lambda x: x[1], reverse=True)
    return classCount[0][0]

# Data testing: classify the first 10% of the samples against the last 90%
def datingTest(file):
    datingData, datingTrainData, datingTrainLabel = loadDatingData(file)
    normValuesData = autoNorm(datingTrainData)
    errorCount = 0
    ratio = 0.10
    total = normValuesData.shape[0]
    numberTest = int(total * ratio)
    for i in range(numberTest):
        res = KNNClassifier(normValuesData[i], normValuesData[numberTest:total],
                            datingTrainLabel[numberTest:total], 5)
        if res != datingTrainLabel[i]:
            errorCount += 1
    print("The total error rate is : {}\n".format(errorCount / float(numberTest)))

if __name__ == "__main__":
    FILEPATH = "./datingTestSet1.txt"
    datingTest(FILEPATH)
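A quick aside on why autoNorm matters: flight mileage is several orders of magnitude larger than the other two features, so without rescaling it would dominate the Euclidean distance. The sketch below, on made-up rows, shows the (x - min) / (max - min) rescaling in isolation:

import numpy as np

# Made-up rows: flight mileage, game-time percentage, ice cream consumed
sample = np.array([[40000.0, 8.3, 0.95],
                   [14000.0, 3.1, 0.34],
                   [75000.0, 10.0, 1.70]])
mins, maxs = sample.min(0), sample.max(0)
normed = (sample - mins) / (maxs - mins)
print(normed)   # every column now lies in [0, 1], so no single feature dominates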
# Implementation with the scikit-learn package
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# loadDatingData and autoNorm are the functions defined in the custom implementation above

if __name__ == "__main__":
    FILEPATH = "./datingTestSet1.txt"
    datingData, datingTrainData, datingTrainLabel = loadDatingData(FILEPATH)
    normValuesData = autoNorm(datingTrainData)
    errorCount = 0
    ratio = 0.10
    total = normValuesData.shape[0]
    numberTest = int(total * ratio)
    k = 5
    # Train on the last 90% of the samples
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(normValuesData[numberTest:total], datingTrainLabel[numberTest:total])
    # Classify the first 10% and count the errors
    for i in range(numberTest):
        res = clf.predict(normValuesData[i].reshape(1, -1))
        if res != datingTrainLabel[i]:
            errorCount += 1
    print("The total error rate is : {}\n".format(errorCount / float(numberTest)))
The above covers the details of implementing the KNN nearest neighbor algorithm in Python. For more on KNN in Python, please see the other related articles on this site!