Implementing the kNN Algorithm in Python

  • 2020-06-19 10:35:48
  • OfStack

kNN is short for the k-nearest neighbor algorithm, which is used mainly for classification. The main idea is as follows:

1. There is a training data set in which every sample has a corresponding label. In other words, we know the category of each sample in the set.
2. When a new data point comes in to be classified, its feature values are compared with those of every sample in the training set, and its distance to each training point is calculated (the code below uses the Euclidean distance).
3. The labels of the k training points closest to the new data point are extracted.
4. The label that appears most often among those k neighbors is the predicted label for the new data point (a minimal sketch of these steps follows this list).

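The following is a minimal pure-Python sketch of these four steps, for illustration only; the names used here (such as knn_predict) are not from the original article, and the full NumPy implementation appears further down.

# A minimal pure-Python sketch of the four steps above (names are illustrative).
from collections import Counter
from math import sqrt

def knn_predict(new_point, train_points, train_labels, k):
  # Step 2: Euclidean distance from the new point to every training point
  distances = [sqrt(sum((a - b) ** 2 for a, b in zip(new_point, p)))
               for p in train_points]
  # Step 3: labels of the k closest training points
  nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
  k_labels = [train_labels[i] for i in nearest]
  # Step 4: the most frequent label among them is the prediction
  return Counter(k_labels).most_common(1)[0][0]
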
The Euclidean distance formula is:

distance = sqrt((xA0 - xB0)^2 + (xA1 - xB1)^2 + ... + (xA(n-1) - xB(n-1))^2)   (if the data has n features)
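
For example, using the first training sample defined below, [1.0, 1.1], and the query point [0.9, 0.9], the distance can be computed with NumPy in one line (a small illustrative snippet; numpy.linalg.norm(xA - xB) gives the same value):

from numpy import array, sqrt

xA = array([1.0, 1.1])
xB = array([0.9, 0.9])
distance = sqrt(((xA - xB) ** 2).sum())  # sqrt(0.1**2 + 0.2**2)
print(distance)  # roughly 0.2236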

The following is the code implementation:


#!/usr/bin/env python3
# coding=utf-8
import operator
from numpy import array, tile

def createDataSet():
  group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])  # training data set
  labels = ['A', 'A', 'B', 'B']  # categories corresponding to the training data
  return group, labels

def classify0(inX, dataSet, labels, k):
  '''
  inX:     the input vector to classify
  dataSet: the training sample set
  labels:  the label vector
  k:       the k in k-nearest neighbors
  '''
  dataSetSize = dataSet.shape[0]  # number of rows (samples) in the training set; shape[1] would give the number of columns (features)
  diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # repeat inX dataSetSize times, one row per sample, so it can be subtracted from every training sample at once
  sqDiffMat = diffMat ** 2  # square each per-feature difference
  sqDistances = sqDiffMat.sum(axis=1)  # axis=1 sums along each row (the default axis=0 would sum down the columns)
  distances = sqDistances ** 0.5  # take the square root to get the Euclidean distance from inX to every training sample
  sortedDistIndices = distances.argsort()  # indices that sort the distances from smallest to largest
  classCount = {}  # dictionary: key is a category, value is how many of the k nearest samples belong to that category
  for i in range(k):
    voteIlabel = labels[sortedDistIndices[i]]  # label of the i-th nearest training sample
    classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # increment the count for this label, starting from 0 if unseen
  sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # sort (label, count) pairs by count, from largest to smallest
  return sortedClassCount[0][0]  # the key of the first element, i.e. the predicted category

group, labels = createDataSet()
print(classify0([0.9, 0.9], group, labels, 3))

The result is A
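
As an optional cross-check (not part of the original code, and assuming scikit-learn is installed), the same prediction can be reproduced with scikit-learn's built-in KNeighborsClassifier:

from sklearn.neighbors import KNeighborsClassifier

group, labels = createDataSet()
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(group, labels)
print(clf.predict([[0.9, 0.9]]))  # expected to print ['A'] as well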

