Python machine learning case tutorial: implementing the k-nearest neighbor algorithm

  • 2020-06-23 00:44:47
  • OfStack

k-nearest neighbors is one of the easiest classification algorithms to explain. The idea is the old saying that a person is known by the company he keeps: to judge what someone is like, we look at his friends. To do this we must decide which aspects of people to compare (the features), how to measure how close two people are, say by how often they eat or shop together (the distance function), how to rate someone from how good his friends are (the classification rule), whether closer friends should count for more (the weights), how many friends to consider (the value of k), and how to turn the votes into scores (the probability distribution).

The k-nearest neighbor concept:

Its working principle is: there is a sample data set, also known as the training set, in which every data point carries a label; that is, we know the correspondence between each point in the sample set and its class. When new, unlabeled data arrives, each feature of the new data is compared with the corresponding feature of every point in the sample set, and the algorithm extracts the class labels of the most similar (nearest) points. In general we select only the top k most similar points from the sample set, which is where the k in k-nearest neighbors comes from; k is usually an integer no greater than 20. Finally, the class that occurs most often among the k most similar points is chosen as the class of the new data.
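
As a minimal sketch of this majority-vote idea (a hypothetical, self-contained illustration; the function knnclassify and its toy inputs are not part of the wine example below, which predicts a number rather than a class):


from collections import Counter
import math

# Hypothetical sketch: classify a new point by majority vote among its k nearest neighbors
def knnclassify(samples,labels,newpoint,k=3):
 # Pair each label with the Euclidean distance from its sample to the new point
 dists=[(math.dist(s,newpoint),lab) for s,lab in zip(samples,labels)]
 dists.sort(key=lambda t:t[0])

 # The most common label among the k nearest samples wins
 votes=Counter(lab for d,lab in dists[:k])
 return votes.most_common(1)[0][0]

print(knnclassify([(1,1),(1,2),(8,8)],['red','red','blue'],(2,2))) # -> red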

Today we will use the k-nearest neighbor algorithm to build a wine price model.

Construct data set

We construct a sample data set of wines. The price of a wine is strongly related to its rating and its age.


from random import random,randint
import math

# Simulate a price based on the wine's rating and age
def wineprice(rating,age):
 peak_age=rating-50

 # Start from a base price derived from the rating
 price=rating/2
 if age>peak_age:
  # Past its peak age, the wine goes bad over the following 10 years
  price=price*(5-(age-peak_age)/2)
 else:
  # Approaching the peak age, the price rises to 5 times the base price
  price=price*(5*((age+1)/peak_age))
 if price<0: price=0
 return price

# Generate a batch of simulated data to serve as the sample data set
def wineset1():
 rows=[]
 for i in range(300):
  # Randomly generate a rating and an age
  rating=random()*50+50
  age=random()*50

  # Get a reference price
  price=wineprice(rating,age)

  # Add some noise
  price*=(random()*0.2+0.9)

  # Add to the data set
  rows.append({'input':(rating,age),'result':price})
 return rows

Distance between data

To find the k nearest neighbors, we first need to know the distance between data points. We use the Euclidean distance.


# Define the distance between two data points as the Euclidean distance
def euclidean(v1,v2):
 d=0.0
 for i in range(len(v1)):
  d+=(v1[i]-v2[i])**2
 return math.sqrt(d)

Obtain the k sample data closest to the new data


# Compute the distance between the item to be predicted and every item in the original data set. data is the original data set, vec1 is the item to predict
def getdistances(data,vec1):
 distancelist=[]

 # Iterate over every item in the data set
 for i in range(len(data)):
  vec2=data[i]['input']

  # Append (distance, index) to the distance list
  distancelist.append((euclidean(vec1,vec2),i))

 # Sort by distance
 distancelist.sort()
 return distancelist # Return the sorted distance list

Predict the properties of the new data from the k nearest samples

1. Simply calculate the mean value


# Average the results of the k items with the smallest distances
def knnestimate(data,vec1,k=5):
 # Get the sorted distances
 dlist=getdistances(data,vec1)
 avg=0.0

 # Average the results of the first k items
 for i in range(k):
  idx=dlist[i][1]
  avg+=data[idx]['result']
 avg=avg/k
 return avg

2. Calculate the weighted average

If we compute the plain mean, some of the k nearest neighbors may still be quite far from the new point even though they rank in the top k. In that situation a simple average of the k samples is biased, so we use a weighted average instead, assigning smaller weights to distant neighbors and larger weights to nearby ones.

So how do we assign weights based on the distance between data points? Here are three decreasing functions that can serve as weighting schemes.

2.1. Use an inverse function to assign weights to the nearest neighbors.


def inverseweight(dist,num=1.0,const=0.1):
 return num/(dist+const)

2.2. Use a subtraction function to assign weights to the nearest neighbors. When the distance exceeds const the weight is 0, so such neighbors contribute nothing.


def subtractweight(dist,const=1.0):
 if dist>const:
  return 0
 else:
  return const-dist

2.3. Use a Gaussian function to assign weights based on distance.


def gaussian(dist,sigma=5.0):
 return math.e**(-dist**2/(2*sigma**2))
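
As a quick sanity check, here is an illustration (the sample distances are made up) of how each function weights a near, a middling, and a far neighbor:


# Illustrative only: compare the three weight functions on a few distances
for dist in (0.5,2.0,10.0):
 print(dist,inverseweight(dist),subtractweight(dist),gaussian(dist))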

Now that you have the weight allocation method, you can calculate the weighted average.


# Take a weighted average of the results of the k items with the smallest distances
def weightedknn(data,vec1,k=5,weightf=gaussian):
 # Get the sorted distances
 dlist=getdistances(data,vec1)
 avg=0.0
 totalweight=0.0

 # Compute the weighted average
 for i in range(k):
  dist=dlist[i][0]
  idx=dlist[i][1]
  weight=weightf(dist)
  avg+=weight*data[idx]['result']
  totalweight+=weight
 if totalweight==0: return 0
 avg=avg/totalweight
 return avg

The first test

Having built the functions that use k-nearest neighbors to predict new data, let's test them.


if __name__=='__main__': # Only runs when this module is executed directly
 data = wineset1()  # Create the first data set
 result=knnestimate(data,(95.0,3.0)) # Predict using the plain nearest-neighbor average
 print(result)

 result=weightedknn(data,(95.0,3.0),weightf=inverseweight) # Weighted average using the inverse function
 print(result)
 result = weightedknn(data, (95.0, 3.0), weightf=subtractweight) # Weighted average using the subtraction function
 print(result)
 result = weightedknn(data, (95.0, 3.0), weightf=gaussian) # Weighted average using the Gaussian function
 print(result)

Cross validation

Cross-validation is used to evaluate an algorithm or its parameters. For instance, we have three weighting schemes above; which is best? Cross-validation can tell us.

We randomly select 95% of the sample data set as the training set and treat the remaining 5% as new data. We then predict the new data and compare the predictions against the known results to see how well the algorithm performs.

To implement cross-validation, we first divide the sample data set into a training set and a test set.


# Divide the data. test is the proportion that goes to the test set; the rest is training data
def dividedata(data,test=0.05):
 trainset=[]
 testset=[]
 for row in data:
  if random()<test:
   testset.append(row)
  else:
   trainset.append(row)
 return trainset,testset

We also need a function that applies an algorithm to the test set and measures the error between the predicted and the actual results.


# Apply the algorithm algf to the test set and return the mean squared error against the known results
def testalgorithm(algf,trainset,testset):
 error=0.0
 for row in testset:
  guess=algf(trainset,row['input']) # Predict using only the training set
  error+=(row['result']-guess)**2 # Accumulate the squared error
 return error/len(testset)

We now have data splitting and error measurement. We can repeat the split-and-test procedure many times on the original data set and report the mean error.


# Run trials rounds of splitting and testing, and return the mean error
def crossvalidate(algf,data,trials=100,test=0.05):
 error=0.0
 for i in range(trials):
  trainset,testset=dividedata(data,test) # Fresh random split each round
  error+=testalgorithm(algf,trainset,testset)
 return error/trials

Cross validation test


if __name__=='__main__': # Only runs when this module is executed directly
 data = wineset1() # Create the first data set

 print(crossvalidate(knnestimate,data)) # Error of the plain k-nearest-neighbor average

 def knninverse(d,v): return weightedknn(d,v,weightf=inverseweight)
 print(crossvalidate(knninverse,data)) # Error when weighting with the inverse function

 def knnsubtract(d,v): return weightedknn(d,v,weightf=subtractweight)
 print(crossvalidate(knnsubtract,data)) # Error when weighting with the subtraction function

 print(crossvalidate(weightedknn,data)) # Error when weighting with the Gaussian function (the default)

Variables with different types and ranges, and useless variables

The attributes in the sample data may have different types and value ranges. For example, the attributes might include the rating (50-100), the age of the wine (0-50), and the bottle size (three capacities: 375.0 ml, 750.0 ml, and 1500.0 ml). The data we obtain may even contain useless attributes, such as the aisle number where the wine is stocked (an integer between 1 and 20). When computing distances between samples, variation in an attribute with a large value range swamps variation in an attribute with a small value range, so the result effectively ignores the small-range attributes. Variation in useless attributes likewise inflates the distances between data points.

So we need to scale the attributes of the sample data to appropriate ranges, and we also need to be able to remove useless attributes.

Construct a new data set


# Build a new data set to simulate the problem of variables of different types
def wineset2():
 rows=[]
 for i in range(300):
  rating=random()*50+50 # The rating of the wine
  age=random()*50   # The age of the wine
  aisle=float(randint(1,20)) # The aisle number of the wine (a useless attribute)
  bottlesize=[375.0,750.0,1500.0][randint(0,2)] # The bottle size of the wine
  price=wineprice(rating,age) # The price of the wine
  price*=(bottlesize/750)
  price*=(random()*0.2+0.9)
  rows.append({'input':(rating,age,aisle,bottlesize),'result':price})
 return rows

Implement a function that scales attribute values


# Multiply each attribute of every item by the corresponding factor in scale
def rescale(data,scale):
 scaleddata=[]
 for row in data:
  scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
  scaleddata.append({'input':scaled,'result':row['result']})
 return scaleddata

That leaves one last question: by how much should each attribute be scaled? This is an optimization problem, and we can find good scale factors with an optimization technique. The cost function to minimize is the error between the predictions made on the rescaled data and the true results: the smaller the error, the better. The error is computed with the same crossvalidate function used for cross-validation above.

Let's build the cost function


# Wrap the cross-validation error in a cost function of the scale factors
def createcostfunction(algf,data):
 def costf(scale):
  sdata=rescale(data,scale)
  return crossvalidate(algf,sdata,trials=10)
 return costf

# The domain to search over: one (lower, upper) range per attribute
weightdomain=[(0,20)]*4

For optimization techniques, see https://www.ofstack.com/article/131719.htm
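
If the optimization module from that article is not at hand, a minimal random-search stand-in is enough to exercise the cost function (randomoptimize below is a hypothetical sketch, not code from the article):


# Hypothetical sketch of an optimizer: try random points in the domain and keep the best
def randomoptimize(domain,costf,trials=100):
 best=None
 bestcost=None
 for i in range(trials):
  # Draw one random scale factor per attribute from its (lower, upper) range
  sol=[domain[j][0]+random()*(domain[j][1]-domain[j][0]) for j in range(len(domain))]
  cost=costf(sol)
  if bestcost is None or cost<bestcost:
   best,bestcost=sol,cost
 return best

# Usage sketch (slow, since every cost evaluation runs a cross-validation):
# costf=createcostfunction(knnestimate,data)
# print(randomoptimize(weightdomain,costf,trials=20))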

The test code


if __name__=='__main__': # Only runs when this module is executed directly
 data=wineset2() # Create the second data set

 print(crossvalidate(knnestimate,data)) # Error on the unscaled data

 sdata=rescale(data,[10,10,0,0.5]) # Scale by hand: aisle number zeroed out, bottle size shrunk
 print(crossvalidate(knnestimate,sdata)) # Error on the rescaled data

Asymmetric distribution

If the sample data set contains multiple underlying distributions, a single output value is not enough. We can instead report a probabilistic result: each possible range of outcomes together with its probability.

For example, some wines may have been bought from a discount store, but this is not recorded in the sample data set. Prices in the sample then follow two distinct distributions.

Construct data set


# Build a data set whose prices follow two distributions
def wineset3():
 rows=wineset1()
 for row in rows:
  if random()<0.5:
   # The wine was bought from a discount store
   row['result']*=0.6
 return rows

To report probabilities, we need a function that estimates the probability that the result falls within a specified range.


# Estimate the probability that the result lies in [low, high]. data is the sample data set, vec1 the data to predict, weightf a function that assigns weights by distance
def probguess(data,vec1,low,high,k=5,weightf=gaussian):
 dlist=getdistances(data,vec1) # Get the sorted distance list
 nweight=0.0
 tweight=0.0

 for i in range(k):
  dist=dlist[i][0] # Distance
  idx=dlist[i][1] # Index into the data set
  weight=weightf(dist) # Weight
  v=data[idx]['result'] # Actual result

  # Is the current data point within the specified range?
  if v>=low and v<=high:
   nweight+=weight # Sum of the weights within the range
  tweight+=weight  # Sum of all the weights
 if tweight==0: return 0

 # The probability equals the weight within the range divided by the total weight
 return nweight/tweight
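
A quick usage sketch (the query point and price ranges are made up for illustration; the printed probabilities depend on the random data):


# Illustrative only: probabilities that a wine rated 99 and aged 20 falls in various price ranges
sdata=wineset3()
print(probguess(sdata,[99,20],40,80))
print(probguess(sdata,[99,20],80,120))
print(probguess(sdata,[99,20],120,1000))
print(probguess(sdata,[99,20],30,120))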

When there are multiple possible outcomes, expressed as values with probabilities, we can visualize them with a cumulative probability distribution or a probability density plot.

Draw the cumulative probability distribution


from pylab import *

# Plot the cumulative probability distribution: the probability that the result is at most v, for v from 0 to high
def cumulativegraph(data,vec1,high,k=5,weightf=gaussian):
 t1=arange(0.0,high,0.1)
 cprob=array([probguess(data,vec1,0,v,k,weightf) for v in t1])
 plot(t1,cprob)
 show()

Draw a probability density plot


# Draw a probability density plot
def probabilitygraph(data,vec1,high,k=5,weightf=gaussian,ss=5.0):
 # Establish a range of prices
 t1=arange(0.0,high,0.1)

 # Get the probability of each narrow price bucket across the whole range
 probs=[probguess(data,vec1,v,v+0.1,k,weightf) for v in t1]

 # Smooth each probability by adding Gaussian-weighted contributions from neighboring buckets
 smoothed=[]
 for i in range(len(probs)):
  sv=0.0
  for j in range(0,len(probs)):
   dist=abs(i-j)*0.1
   weight=gaussian(dist,sigma=ss)
   sv+=weight*probs[j]
  smoothed.append(sv)
 smoothed=array(smoothed)

 plot(t1,smoothed)
 show()

The test code


if __name__=='__main__': # Only runs when this module is executed directly

 data = wineset3() # Create the third data set
 print(data)
 cumulativegraph(data,(1,1),120) # Plot the cumulative probability distribution
 probabilitygraph(data,(1,1),6) # Draw a probability density plot
