Implementing the kNN (k-nearest neighbors) algorithm in Python
- 2020-06-19 10:58:28
- OfStack
Introduction
The proximity algorithm, or k-nearest neighbors (kNN) classification algorithm, is one of the simplest methods in data-mining classification. "K nearest neighbors" means the k closest samples: each sample can be represented by its k nearest neighbors.
The core idea of the kNN algorithm is that if the majority of the k samples nearest to a given sample in feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in it. The method determines the category of the sample to be classified based only on the category of the nearest one or a few samples. Because it relies on a limited number of neighboring samples rather than on discriminating between class domains, kNN is better suited than other methods to sample sets whose class domains overlap heavily.
This article implements the core of the kNN algorithm using Python and the numpy library, and verifies it with a simple example.
Implementation of KNN core algorithm
To implement the kNN algorithm, we first compute the Euclidean distance from the input to every sample, then sort by distance and take the majority category among the k nearest samples.
from numpy import tile
import operator


def do_knn_classifier(in_array, data_set, labels, k):
    '''Classify in_array according to the data set and its labels.'''
    # Compute the Euclidean distance from in_array to every sample
    data_set_size = data_set.shape[0]
    diff_matrix = tile(in_array, (data_set_size, 1)) - data_set
    sq_diff_matrix = diff_matrix ** 2
    sq_distance = sq_diff_matrix.sum(axis=1)
    distances = sq_distance ** 0.5
    # argsort returns the indices that would sort the distances in ascending order
    sorted_dist_indicies = distances.argsort()
    # Count the labels of the k nearest neighbors
    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_dist_indicies[i]]
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    # Sort by vote count and return the most common label
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
Normalization of values
In most cases, the selected features span rather different value ranges. When handling features with different value ranges, the usual approach is to normalize them, for example mapping every value into the interval 0 to 1 or -1 to 1. The following formula converts a feature value from any range into the interval 0 to 1:
newValue = (oldValue - min) / (max - min)
where min and max are the minimum and maximum values of that feature in the data set, respectively.
from numpy import tile


def auto_normalize_data(data_set):
    '''Normalize each feature of the data set to the range 0..1.'''
    # Passing 0 makes min/max operate per column (per feature) rather than per row
    min_vals = data_set.min(0)
    max_vals = data_set.max(0)
    ranges = max_vals - min_vals
    # Apply newValue = (oldValue - min) / (max - min) element-wise
    m = data_set.shape[0]
    norm_data_set = data_set - tile(min_vals, (m, 1))
    norm_data_set = norm_data_set / tile(ranges, (m, 1))
    return norm_data_set, ranges, min_vals
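As a quick sanity check, the normalization step can be exercised on a small matrix; the sample values below are made up for illustration, and the function is repeated so the snippet runs standalone:

```python
import numpy as np
from numpy import tile


def auto_normalize_data(data_set):
    # Same 0..1 normalization as above, per column
    min_vals = data_set.min(0)
    max_vals = data_set.max(0)
    ranges = max_vals - min_vals
    m = data_set.shape[0]
    norm_data_set = data_set - tile(min_vals, (m, 1))
    norm_data_set = norm_data_set / tile(ranges, (m, 1))
    return norm_data_set, ranges, min_vals


if __name__ == '__main__':
    data = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
    norm, ranges, min_vals = auto_normalize_data(data)
    print(norm)      # each column now spans 0..1: [[0, 0], [0.5, 0.5], [1, 1]]
    print(ranges)    # [ 20. 400.]
    print(min_vals)  # [ 10. 200.]
```

Note that both columns end up on the same scale, so neither feature dominates the distance computation.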
Example
We end this article with a simple example; the data here does not need to be normalized.
from numpy import array
from knn.knn_classifier import do_knn_classifier


def get_data_set():
    '''Build a tiny data set and its labels.'''
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


if __name__ == '__main__':
    data_set, labels = get_data_set()
    t = do_knn_classifier(array([0.2, 0.1]), data_set, labels, 3)
    print(t)  # -> B: the 3 nearest neighbors are [0, 0.1], [0, 0] (both B) and [1.0, 1.0] (A)
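To double-check the result, the same classification can be redone with a compact brute-force helper; `knn_bruteforce` below is a hypothetical cross-check written for this article's toy data, not part of the original implementation:

```python
from collections import Counter

import numpy as np


def knn_bruteforce(point, data_set, labels, k):
    # Euclidean distances from the query point to every sample
    dists = np.linalg.norm(data_set - point, axis=1)
    # Majority vote among the labels of the k nearest samples
    nearest = [labels[i] for i in dists.argsort()[:k]]
    return Counter(nearest).most_common(1)[0][0]


if __name__ == '__main__':
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    print(knn_bruteforce(np.array([0.2, 0.1]), group, labels, 3))  # -> B
```

Both implementations agree: the query point [0.2, 0.1] sits next to the two 'B' samples, so the 3-neighbor vote is 2 to 1 in favor of 'B'.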