Implementing the kNN (k-nearest neighbors) algorithm in Python
- 2020-06-19 10:58:28
- OfStack
Introduction
The proximity algorithm, or k-nearest neighbors (kNN) classification algorithm, is one of the simplest methods in data-mining classification. "K nearest neighbors" means the k closest samples: each sample can be represented by its k nearest neighbors.
The core idea of the kNN algorithm is that if the majority of the k samples nearest to a given sample in feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in it. The method determines the category of the sample to be classified based only on the category of the nearest one or a few samples. Because it relies on a limited number of neighboring samples rather than on discriminating between class domains, kNN is better suited than other methods to sample sets whose class domains overlap heavily.
This article implements the core of the kNN algorithm using Python and the numpy library, and verifies it with a simple example.
Implementation of KNN core algorithm
To implement the kNN algorithm, we first compute the Euclidean distance from the input to every sample, then sort by distance and take the majority category among the k nearest samples.
from numpy import tile
import operator


def do_knn_classifier(in_array, data_set, labels, k):
    '''Classify in_array according to the data set and its labels.'''
    # Compute the Euclidean distance from in_array to every sample
    data_set_size = data_set.shape[0]
    diff_matrix = tile(in_array, (data_set_size, 1)) - data_set
    sq_diff_matrix = diff_matrix ** 2
    sq_distance = sq_diff_matrix.sum(axis=1)
    distances = sq_distance ** 0.5
    # argsort returns the indices that would sort the distances in ascending order
    sorted_dist_indicies = distances.argsort()
    # Count the labels of the k nearest neighbors
    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_dist_indicies[i]]
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    # Sort by vote count and return the most common label
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
Normalization of values
In most cases, the selected features span rather different value ranges. When handling features with different value ranges, the usual approach is to normalize them, for example mapping every value into the interval 0 to 1 or -1 to 1. The following formula converts a feature value from any range into the interval 0 to 1:
newValue = (oldValue - min) / (max - min)
where min and max are the minimum and maximum values of that feature in the data set, respectively.
from numpy import tile


def auto_normalize_data(data_set):
    '''Normalize each feature of the data set to the range 0..1.'''
    # Passing 0 makes min/max operate per column (per feature) rather than per row
    min_vals = data_set.min(0)
    max_vals = data_set.max(0)
    ranges = max_vals - min_vals
    # Apply newValue = (oldValue - min) / (max - min) element-wise
    m = data_set.shape[0]
    norm_data_set = data_set - tile(min_vals, (m, 1))
    norm_data_set = norm_data_set / tile(ranges, (m, 1))
    return norm_data_set, ranges, min_vals
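As a quick sanity check, the normalization step can be exercised on a small matrix; the sample values below are made up for illustration, and the function is repeated so the snippet runs standalone:

```python
import numpy as np
from numpy import tile


def auto_normalize_data(data_set):
    # Same 0..1 normalization as above, per column
    min_vals = data_set.min(0)
    max_vals = data_set.max(0)
    ranges = max_vals - min_vals
    m = data_set.shape[0]
    norm_data_set = data_set - tile(min_vals, (m, 1))
    norm_data_set = norm_data_set / tile(ranges, (m, 1))
    return norm_data_set, ranges, min_vals


if __name__ == '__main__':
    data = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
    norm, ranges, min_vals = auto_normalize_data(data)
    print(norm)      # each column now spans 0..1: [[0, 0], [0.5, 0.5], [1, 1]]
    print(ranges)    # [ 20. 400.]
    print(min_vals)  # [ 10. 200.]
```

Note that both columns end up on the same scale, so neither feature dominates the distance computation.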
Example
We end this article with a simple example; the data here does not need to be normalized.
from numpy import array
from knn.knn_classifier import do_knn_classifier


def get_data_set():
    '''Build a tiny data set and its labels.'''
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


if __name__ == '__main__':
    data_set, labels = get_data_set()
    t = do_knn_classifier(array([0.2, 0.1]), data_set, labels, 3)
    print(t)  # -> B: the 3 nearest neighbors are [0, 0.1], [0, 0] (both B) and [1.0, 1.0] (A)
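To double-check the result, the same classification can be redone with a compact brute-force helper; `knn_bruteforce` below is a hypothetical cross-check written for this article's toy data, not part of the original implementation:

```python
from collections import Counter

import numpy as np


def knn_bruteforce(point, data_set, labels, k):
    # Euclidean distances from the query point to every sample
    dists = np.linalg.norm(data_set - point, axis=1)
    # Majority vote among the labels of the k nearest samples
    nearest = [labels[i] for i in dists.argsort()[:k]]
    return Counter(nearest).most_common(1)[0][0]


if __name__ == '__main__':
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    print(knn_bruteforce(np.array([0.2, 0.1]), group, labels, 3))  # -> B
```

Both implementations agree: the query point [0.2, 0.1] sits next to the two 'B' samples, so the 3-neighbor vote is 2 to 1 in favor of 'B'.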