Python Machine Learning Tutorial: Implementing the k-Nearest Neighbors Algorithm
- 2020-06-23 00:44:47
- OfStack
k-nearest neighbors is one of the easiest classification algorithms to explain. The intuition is the old saying that a person is known by the company they keep: to judge what someone is like, we look at their friends. In the same spirit, we describe each sample by its attributes (the object's features), measure how close two samples are with a distance function (like judging how close two friends are by how often they eat or shop together), use the labels of the closest samples to classify the new one (the classification algorithm), optionally weight closer neighbors more heavily than distant ones (the weights), decide how many neighbors to consult (the value of k), and can even report the result as a score or probability distribution.
K nearest neighbor concept:
Its working principle: we have a sample data set, also called the training set, in which every data point carries a label; that is, we know which class each sample in the set belongs to. When new, unlabeled data arrives, we compare each of its features with the corresponding features of the samples in the set, and take the classification labels of the most similar (nearest) samples. Typically we look only at the k most similar samples in the data set, which is where the k in k-nearest neighbors comes from; k is usually an integer no greater than 20. Finally, the class that appears most often among those k most similar samples is chosen as the class of the new data.
Today we will use the k-nearest neighbors algorithm to build a price model for wine.
Construct data set
First we construct a wine sample data set. The price of a wine depends strongly on its rating and its age.
from random import random,randint
import math

# Simulate the price from the rating and the age
def wineprice(rating,age):
    peak_age=rating-50
    # Base price derived from the rating
    price=rating/2
    if age>peak_age:
        # Past the peak age, quality deteriorates over the following 5 years
        price=price*(5-(age-peak_age)/2)
    else:
        # Approaching the peak age, the price rises to up to 5 times the base value
        price=price*(5*((age+1)/peak_age))
    if price<0: price=0
    return price

# Generate a batch of simulated data to serve as the sample data set
def wineset1():
    rows=[]
    for i in range(300):
        # Randomly generate a rating and an age
        rating=random()*50+50
        age=random()*50
        # Get a reference price
        price=wineprice(rating,age)
        # Add some noise
        price*=(random()*0.2+0.9)
        # Add to the data set
        rows.append({'input':(rating,age),'result':price})
    return rows
Distance between data
To find the k nearest neighbors, we first need a notion of distance between data points. We use the Euclidean distance.
# Use the Euclidean distance as the distance between data points
def euclidean(v1,v2):
    d=0.0
    for i in range(len(v1)):
        d+=(v1[i]-v2[i])**2
    return math.sqrt(d)
Obtain the k sample data closest to the new data
# Compute the distance between the item to be predicted and every item in the
# original data set. data: the original data set; vec1: the item to predict
def getdistances(data,vec1):
    distancelist=[]
    # Iterate over every item in the data set
    for i in range(len(data)):
        vec2=data[i]['input']
        # Add the distance to the distance list
        distancelist.append((euclidean(vec1,vec2),i))
    # Sort by distance
    distancelist.sort()
    # Return the sorted distance list
    return distancelist
Predict the property of the new data from the k nearest samples
1. Simply calculate the mean value
# Average the results of the k nearest neighbors
def knnestimate(data,vec1,k=5):
    # Get the sorted distance list
    dlist=getdistances(data,vec1)
    avg=0.0
    # Average the results of the first k items
    for i in range(k):
        idx=dlist[i][1]
        avg+=data[idx]['result']
    avg=avg/k
    return avg
2. Calculate the weighted average
If we simply average, some of the k nearest neighbors may in fact lie quite far from the new data while still ranking among the k closest. In that situation a plain average over the k samples is biased, so we use a weighted average instead, assigning smaller weights to distant neighbors and larger weights to nearby ones.
So how do we derive weights from the distances? Here are three decreasing functions that can serve as weight-assignment methods.
2.1. Use an inverse function to assign weights to the nearest neighbors.
def inverseweight(dist,num=1.0,const=0.1):
    return num/(dist+const)
2.2. Use a subtraction function to assign weights to the nearest neighbors. The weight falls to 0 once the distance exceeds const, so very distant neighbors contribute nothing.
def subtractweight(dist,const=1.0):
    if dist>const:
        return 0
    else:
        return const-dist
2.3. Use a Gaussian function to assign weights by distance.
def gaussian(dist,sigma=5.0):
    return math.e**(-dist**2/(2*sigma**2))
Now that you have the weight allocation method, you can calculate the weighted average.
# Take the weighted average of the results of the k nearest neighbors
def weightedknn(data,vec1,k=5,weightf=gaussian):
    # Get the sorted distance list
    dlist=getdistances(data,vec1)
    avg=0.0
    totalweight=0.0
    # Accumulate the weighted average
    for i in range(k):
        dist=dlist[i][0]
        idx=dlist[i][1]
        weight=weightf(dist)
        avg+=weight*data[idx]['result']
        totalweight+=weight
    if totalweight==0: return 0
    avg=avg/totalweight
    return avg
The first test
Having implemented k-nearest-neighbor prediction for new data, let's test it.
if __name__=='__main__': # Run only when this module is executed directly
    data=wineset1() # Create the first data set
    result=knnestimate(data,(95.0,3.0)) # Predict with the plain nearest-neighbor average
    print(result)
    result=weightedknn(data,(95.0,3.0),weightf=inverseweight) # Weighted average with inverse-function weights
    print(result)
    result=weightedknn(data,(95.0,3.0),weightf=subtractweight) # Weighted average with subtraction-function weights
    print(result)
    result=weightedknn(data,(95.0,3.0),weightf=gaussian) # Weighted average with Gaussian weights
    print(result)
Cross validation
Cross validation is used to evaluate an algorithm or its parameters. For example, we have three weight-assignment methods above; which one is best? Cross-validation can tell us.
Randomly select 95% of the sample data set as the training set and treat the remaining 5% as new data; predict that new data and compare the predictions with the known results to see how well the algorithm performs.
To run cross-validation, we first need to split the sample data set into a training set and a set of "new" data.
# Split the data. test is the proportion used as the test set; the rest is training data
def dividedata(data,test=0.05):
    trainset=[]
    testset=[]
    for row in data:
        if random()<test:
            testset.append(row)
        else:
            trainset.append(row)
    return trainset,testset
We also need to apply the algorithm and measure the degree of error between its predictions and the actual results.
With the data split and the error measurement in hand, we can repeat the procedure many times on the original data set and report the mean error.
Cross validation test
Variables of different types and ranges, and useless variables
The attributes in the sample data may not all share the same value range. For example, the samples above might contain the rating (0-100), the age of the wine (0-50), and the bottle size (one of three capacities: 375.0 ml, 750.0 ml, 1500.0 ml); the data may even contain useless attributes, such as the number of the aisle the wine came from (an integer between 1 and 20). When computing distances, a change in an attribute with a large value range overwhelms changes in attributes with small ranges, so those small-range attributes are effectively ignored. In addition, changes in useless attributes inflate the distances between data points.
So we need to scale the attributes of the sample data to suitable ranges, and we need a way to remove useless attributes.
Construct a new data set
# Build a new data set to simulate the problem of heterogeneous variables
def wineset2():
    rows=[]
    for i in range(300):
        rating=random()*50+50 # the wine's rating
        age=random()*50 # the wine's age
        aisle=float(randint(1,20)) # the wine's aisle number (an irrelevant attribute)
        bottlesize=[375.0,750.0,1500.0][randint(0,2)] # the bottle size
        price=wineprice(rating,age) # the wine's price
        price*=(bottlesize/750)
        price*=(random()*0.2+0.9)
        rows.append({'input':(rating,age,aisle,bottlesize),'result':price})
    return rows
Implement a function that scales the attribute values
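The scaling code is missing from this copy. A minimal sketch: multiply each attribute by its own scale factor, so a factor of 0 removes a useless attribute entirely. The name rescale and the parameter layout are assumptions.

```python
# Sketch: scale each input attribute by the matching factor in `scale`;
# a factor of 0 wipes out a useless attribute
def rescale(data, scale):
    scaleddata = []
    for row in data:
        scaled = [scale[i] * row['input'][i] for i in range(len(scale))]
        scaleddata.append({'input': scaled, 'result': row['result']})
    return scaleddata
```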
That leaves one last question: by how much should each attribute be scaled? This is an optimization problem, and we can search for the optimal scales with optimization techniques. The cost function to minimize is the error between the predictions on the rescaled data and the real results; the smaller, the better. The error is computed with the same crossvalidate function used for cross-validation above.
Let's build the cost function
Optimization techniques are covered at https://www.ofstack.com/article/131719.htm
The test code
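The original test block is also missing; it presumably fed the cost function to an optimizer from the linked article. As a stand-in, here is a tiny random-search optimizer applied to a toy cost function, showing the calling pattern. The names randomoptimize and weightdomain, the toy cost, and the domain bounds are all assumptions.

```python
import random

# Tiny stand-in optimizer: sample random scale vectors from the domain and
# keep the cheapest one found
def randomoptimize(domain, costf, trials=1000):
    best_cost, best_sol = None, None
    for _ in range(trials):
        sol = [random.randint(lo, hi) for lo, hi in domain]
        cost = costf(sol)
        if best_cost is None or cost < best_cost:
            best_cost, best_sol = cost, sol
    return best_sol

# Toy cost standing in for createcostfunction(knnestimate, wineset2()):
# lowest when the useless third attribute is scaled down to 0
def toycost(scale):
    target = [10, 10, 0, 10]
    return sum((s - t) ** 2 for s, t in zip(scale, target))

weightdomain = [(0, 20)] * 4
print(randomoptimize(weightdomain, toycost, trials=2000))
```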
Asymmetric distribution
If the sample data set contains several underlying distributions, we may not want a single output value. Instead we can report a probabilistic result: each possible outcome together with its probability of occurring.
For example, some of the wine may have been bought from a discount store, but that feature is not recorded in the sample data set, so the prices in the sample follow two different distributions.
Construct data set
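The data-set code is missing here. A sketch that builds on wineset1 and randomly discounts about half the wines, producing two price distributions; the 50% rate and the 40% discount are assumptions, and wineprice/wineset1 are repeated compactly so it runs standalone.

```python
from random import random

# Compact re-definitions from earlier in the article
def wineprice(rating, age):
    peak_age = rating - 50
    price = rating / 2
    if age > peak_age:
        price *= (5 - (age - peak_age) / 2)
    else:
        price *= (5 * ((age + 1) / peak_age))
    return max(price, 0)

def wineset1():
    rows = []
    for i in range(300):
        rating = random() * 50 + 50
        age = random() * 50
        price = wineprice(rating, age) * (random() * 0.2 + 0.9)
        rows.append({'input': (rating, age), 'result': price})
    return rows

# Sketch: about half the wines were bought at a discount store, so their
# recorded price is lower; the data now mixes two price distributions
def wineset3():
    rows = wineset1()
    for row in rows:
        if random() < 0.5:
            row['result'] *= 0.6
    return rows
```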
If the result is expressed as a probability, we need to be able to compute the probability that it falls within a specified range.
# Compute a probability. data: the sample data set; vec1: the data to predict;
# low, high: the result range; weightf: the distance-based weight function
def probguess(data,vec1,low,high,k=5,weightf=gaussian):
    # Get the sorted distance list
    dlist=getdistances(data,vec1)
    nweight=0.0
    tweight=0.0
    for i in range(k):
        dist=dlist[i][0] # distance
        idx=dlist[i][1] # index
        weight=weightf(dist) # weight
        v=data[idx]['result'] # known result
        # Is the current data point inside the specified range?
        if v>=low and v<=high:
            nweight+=weight # sum of the weights inside the range
        tweight+=weight # sum of all the weights
    if tweight==0: return 0
    # The probability equals the in-range weight divided by the total weight
    return nweight/tweight
For multiple outputs expressed as values with probabilities, we can draw a cumulative probability distribution or a probability density plot.
Draw the cumulative probability distribution
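The plotting code did not survive in this copy. A sketch consistent with probguess above: the cumulative probability at price v is the probability of the range [0, v]. The helper definitions are repeated compactly so the sketch runs standalone; the name cumulativegraph and the 0.1 step size are assumptions, and matplotlib (via pylab) is needed for the actual plot.

```python
import math

# Compact re-definitions of the functions built earlier in the article
def euclidean(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def getdistances(data, vec1):
    return sorted((euclidean(vec1, row['input']), i)
                  for i, row in enumerate(data))

def gaussian(dist, sigma=5.0):
    return math.e ** (-dist ** 2 / (2 * sigma ** 2))

def probguess(data, vec1, low, high, k=5, weightf=gaussian):
    dlist = getdistances(data, vec1)
    nweight = tweight = 0.0
    for dist, idx in dlist[:k]:
        weight = weightf(dist)
        v = data[idx]['result']
        if low <= v <= high:
            nweight += weight
        tweight += weight
    return nweight / tweight if tweight else 0

# Sketch: plot the cumulative probability P(result <= v) for v in [0, high)
def cumulativegraph(data, vec1, high, k=5, weightf=gaussian):
    from pylab import arange, array, plot, show  # imported lazily; needs matplotlib
    t1 = arange(0.0, high, 0.1)
    cprob = array([probguess(data, vec1, 0, v, k, weightf) for v in t1])
    plot(t1, cprob)
    show()
```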
Draw a probability density plot
# Draw a probability density plot (plotting requires matplotlib)
from pylab import arange,array,plot,show

def probabilitygraph(data,vec1,high,k=5,weightf=gaussian,ss=5.0):
    # Build a range of prices
    t1=arange(0.0,high,0.1)
    # Get the probability of each 0.1-wide bucket across the whole range
    probs=[probguess(data,vec1,v,v+0.1,k,weightf) for v in t1]
    # Smooth the probabilities by adding Gaussian-weighted neighboring values
    smoothed=[]
    for i in range(len(probs)):
        sv=0.0
        for j in range(0,len(probs)):
            dist=abs(i-j)*0.1
            weight=gaussian(dist,sigma=ss)
            sv+=weight*probs[j]
        smoothed.append(sv)
    smoothed=array(smoothed)
    plot(t1,smoothed)
    show()
The test code
if __name__=='__main__': # Run only when this module is executed directly
    data=wineset3() # Create the third data set
    print(data)
    cumulativegraph(data,(1,1),120) # Plot the cumulative probability distribution
    probabilitygraph(data,(1,1),6) # Draw the probability density plot