Python implementation of Kmeans++ algorithm example


1. Start with Kmeans

Kmeans is a very basic clustering algorithm that works iteratively. I will not go into its principle here. Here's how to use the kmeans function in MATLAB.

Create 7 2d data points:

x=[randn(3,2)*.4;randn(4,2)*.5+ones(4,1)*[4 4]];

Use the kmeans function:
class = kmeans(x, 2);

x is the data matrix, and each row of x represents a data point; the 2 specifies that there should be 2 center points, that is, the clustering result should have 2 clusters. class will be a column vector with 7 elements that correspond to the 7 data points in turn, with each element value giving the cluster number of the corresponding data point. After one run, the value of class is:

     2
     2
     2
     1
     1
     1
     1
This indicates that the first three data points of x belong to cluster 2, and the last four data points belong to cluster 1. The kmeans function can also be used as follows:

>> [class, C, sumd, D] = kmeans(x, 2)
class =
     2
     2
     2
     1
     1
     1
     1
C =
    4.0629    4.0845
   -0.1341    0.1201
sumd =
    1.2017
    0.2939
D =
   34.3727    0.0184
   29.5644    0.1858
   36.3511    0.0898
    0.1247   37.4801
    0.7537   24.0659
    0.1979   36.7666
    0.1256   36.2149

class again gives the cluster assignment of each data point; C contains the final center points, with each row representing one center point; sumd gives, for each cluster, the sum of the distances between its center point and the data points in that cluster. Each row of D corresponds to a data point, and the values in that row are, in turn, the distances between that data point and each center point. The default distance used by kmeans is the squared Euclidean distance (see reference [3]). The kmeans function can also use the Manhattan distance (L1 distance) and other types of distance, specified by adding parameters.
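
For Python readers, here is a minimal NumPy sketch of my own (it assumes NumPy is installed, and uses hypothetical fixed center points rather than centers computed by kmeans) that reproduces the meaning of class, sumd and D:

import numpy as np

x = np.vstack([np.random.randn(3, 2) * 0.4,
               np.random.randn(4, 2) * 0.5 + np.array([4.0, 4.0])])  # same data as the MATLAB example
C = np.array([[4.0, 4.0],
              [0.0, 0.0]])  # hypothetical center points, one per row

D = ((x[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # D[i, j]: squared distance from point i to center j
cls = D.argmin(axis=1) + 1  # cluster label of each point (1-based, as in MATLAB)
sumd = np.array([D[cls == j + 1, j].sum() for j in range(len(C))])  # per-cluster distance sums
print(cls)
print(sumd)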

Kmeans has several shortcomings (these are covered in many references):

1. The number of clusters in the final result (that is, the number of center points, or seed points) k cannot be known in advance, so choosing an appropriate value for k is a problem.
2. The choice of the initial seed points affects the clustering result.
3. It is sensitive to noise and outliers.
4. And so on.

2. Basic idea of kmeans++ algorithm

The main work of the kmeans++ algorithm lies in the selection of the seed points. The basic principle is to make the distances between the seed points as large as possible, while avoiding the influence of noise. The basic steps are as follows:

1. Randomly select one point from the input data set as the first cluster center (suppose k clusters are required)
2. For each point x in the data set, compute the distance D(x) between it and the nearest of the cluster centers selected so far
3. Select a new data point as the next cluster center, on the principle that points with larger D(x) have a higher probability of being selected
4. Repeat steps 2 and 3 until k cluster centers have been selected
5. Run the standard kmeans algorithm with these k initial cluster centers

Suppose the data set X contains n data points, denoted X(1), X(2), ..., X(n). In step 2, we compute in turn the distance from each data point to its nearest seed point (cluster center), obtaining D(1), D(2), ..., D(n). When choosing from D, in order to avoid noise, we should not simply pick the element with the largest value; instead, we should pick an element with a relatively large value, with high probability, and use its corresponding data point as the seed point.

How do we choose elements with large values? Here is one idea (I have not found the original source; it is mentioned in reference [2] and elsewhere, so I have rephrased it in a way that is easier to understand): think of each element D(x) of the set D as a line segment L(x) whose length is the element's value. Connect the segments in the order L(1), L(2), ..., L(n) to form one long line L; L(1), L(2), ..., L(n) are called the sub-segments of L. By elementary probability, if we pick a point on L uniformly at random, the sub-segment it falls on is likely to be a relatively long one, and the data point corresponding to that sub-segment can be used as the seed point. The two implementations of kmeans++ below are both based on this principle.
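
As a concrete illustration of this "long line" selection (a roulette wheel, in effect), here is a minimal Python sketch; the function name pick_next_center is my own, and d is assumed to hold the D(x) values computed in step 2:

from random import random

def pick_next_center(d):
    """Roulette-wheel selection: index j is returned with probability d[j] / sum(d)."""
    r = random() * sum(d)    # drop a point uniformly on the long line L
    for j, dj in enumerate(d):
        r -= dj              # walk past sub-segment L(j)
        if r <= 0:           # the point fell inside sub-segment L(j)
            return j
    return len(d) - 1        # guard against floating-point round-off

# Example: the point with D(x) = 8 wins about 8/15 of the time
print(pick_next_center([1.0, 8.0, 2.0, 4.0]))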

3. Python version of kmeans++

Implementations of kmeans++ in a variety of programming languages can be found at http://rosettacode.org/wiki/K-means%2B%2B_clustering. The following is the Python implementation (the comments were added by the author):


from math import pi, sin, cos
from collections import namedtuple
from random import random, choice
from copy import copy
FLOAT_MAX = 1e100
class Point:
    __slots__ = ["x", "y", "group"]
    def __init__(self, x=0.0, y=0.0, group=0):
        self.x, self.y, self.group = x, y, group
def generate_points(npoints, radius):
    points = [Point() for _ in range(npoints)]
    # note: this is not a uniform 2-d distribution
    for p in points:
        r = random() * radius
        ang = random() * 2 * pi
        p.x = r * cos(ang)
        p.y = r * sin(ang)
    return points
def nearest_cluster_center(point, cluster_centers):
    """Distance and index of the closest cluster center"""
    def sqr_distance_2D(a, b):
        return (a.x - b.x) ** 2  +  (a.y - b.y) ** 2
    min_index = point.group
    min_dist = FLOAT_MAX
    for i, cc in enumerate(cluster_centers):
        d = sqr_distance_2D(cc, point)
        if min_dist > d:
            min_dist = d
            min_index = i
    return (min_index, min_dist)
'''
points is the list of data points; cluster_centers is pre-allocated with
nclusters Point objects, all initialized to (0, 0, 0)
'''
def kpp(points, cluster_centers):
    cluster_centers[0] = copy(choice(points))  # pick the first center point at random
    d = [0.0 for _ in range(len(points))]  # distance from each point to its nearest chosen center
    for i in range(1, len(cluster_centers)):  # i = 1 ... len(cluster_centers)-1
        dist_sum = 0.0
        for j, p in enumerate(points):
            # minimum (squared) distance from the j-th data point p to the i centers chosen so far
            d[j] = nearest_cluster_center(p, cluster_centers[:i])[1]
            dist_sum += d[j]
        # roulette-wheel selection: points farther from the chosen centers
        # are proportionally more likely to become the next center
        dist_sum *= random()
        for j, di in enumerate(d):
            dist_sum -= di
            if dist_sum > 0:
                continue
            cluster_centers[i] = copy(points[j])
            break
    for p in points:
        p.group = nearest_cluster_center(p, cluster_centers)[0]
'''
points is the list of data points; nclusters is the desired number of clusters
'''
def lloyd(points, nclusters):
    cluster_centers = [Point() for _ in range(nclusters)]  # centers initialized to (0, 0, 0)
    # call k++ init
    kpp(points, cluster_centers)   # select the initial seed points
    # the rest is the standard kmeans (Lloyd's) algorithm
    lenpts10 = len(points) >> 10   # tolerance: roughly 0.1% of the points
    changed = 0
    while True:
        # the group field of each centroid is reused as a counter
        for cc in cluster_centers:
            cc.x = 0
            cc.y = 0
            cc.group = 0
        for p in points:
            cluster_centers[p.group].group += 1  # count the points assigned to this cluster
            cluster_centers[p.group].x += p.x
            cluster_centers[p.group].y += p.y
        for cc in cluster_centers:    # new center point = mean of its cluster's points
            cc.x /= cc.group
            cc.y /= cc.group
        # find the closest centroid for each point
        changed = 0  # number of points whose cluster assignment changed
        for p in points:
            min_i = nearest_cluster_center(p, cluster_centers)[0]
            if min_i != p.group:
                changed += 1
                p.group = min_i
        # stop when 99.9% of points are good
        if changed <= lenpts10:
            break
    for i, cc in enumerate(cluster_centers):
        cc.group = i
    return cluster_centers
def print_eps(points, cluster_centers, W=400, H=400):
    Color = namedtuple("Color", "r g b")
    colors = []
    for i in range(len(cluster_centers)):
        colors.append(Color((3 * (i + 1) % 11) / 11.0,
                            (7 * i % 11) / 11.0,
                            (9 * i % 11) / 11.0))
    max_x = max_y = -FLOAT_MAX
    min_x = min_y = FLOAT_MAX
    for p in points:
        if max_x < p.x: max_x = p.x
        if min_x > p.x: min_x = p.x
        if max_y < p.y: max_y = p.y
        if min_y > p.y: min_y = p.y
    scale = min(W / (max_x - min_x),
                H / (max_y - min_y))
    cx = (max_x + min_x) / 2
    cy = (max_y + min_y) / 2
    print("%%!PS-Adobe-3.0\n%%%%BoundingBox: -5 -5 %d %d" % (W + 10, H + 10))
    print("/l {rlineto} def /m {rmoveto} def\n" +
          "/c { .25 sub exch .25 sub exch .5 0 360 arc fill } def\n" +
          "/s { moveto -2 0 m 2 2 l 2 -2 l -2 -2 l closepath " +
          "   gsave 1 setgray fill grestore gsave 3 setlinewidth" +
          " 1 setgray stroke grestore 0 setgray stroke }def")
    for i, cc in enumerate(cluster_centers):
        print("%g %g %g setrgbcolor" %
              (colors[i].r, colors[i].g, colors[i].b))
        for p in points:
            if p.group != i:
                continue
            print("%.3f %.3f c" % ((p.x - cx) * scale + W / 2,
                                   (p.y - cy) * scale + H / 2))
        print("\n0 setgray %g %g s" % ((cc.x - cx) * scale + W / 2,
                                       (cc.y - cy) * scale + H / 2))
    print("\n%%%%EOF")
def main():
    npoints = 30000
    k = 7  # number of clusters
    points = generate_points(npoints, 10)
    cluster_centers = lloyd(points, k)
    print_eps(points, cluster_centers)
if __name__ == "__main__":
    main()

The algorithm implemented in the above code is for two-dimensional data, so the Point object has three attributes: the x-coordinate, the y-coordinate, and the identifier of the cluster the point belongs to. The function lloyd is the overall implementation of the kmeans++ algorithm: it first selects suitable seed points through the kpp function, and then runs the kmeans algorithm to cluster the data set. The implementation of the kpp function follows exactly steps 2, 3 and 4 of the basic idea of kmeans++.
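
As a quick usage sketch relying only on the definitions above (and skipping the EPS output), clustering a smaller point set and printing the resulting centers might look like this:

points = generate_points(1000, 10)
centers = lloyd(points, 3)
for c in centers:
    print("center %d: (%.3f, %.3f)" % (c.group, c.x, c.y))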


4. MATLAB version of kmeans++


function [L,C] = kmeanspp(X,k)
%KMEANS Cluster multivariate data using the k-means++ algorithm.
%   [L,C] = kmeans_pp(X,k) produces a 1-by-size(X,2) vector L with one class
%   label per column in X and a size(X,1)-by-k matrix C containing the
%   centers corresponding to each class.
%   Version: 2013-02-08
%   Authors: Laurent Sorber (Laurent.Sorber@cs.kuleuven.be)
L = [];
L1 = 0;
while length(unique(L)) ~= k
    % The k-means++ initialization.
    C = X(:,1+round(rand*(size(X,2)-1))); % size(X,2) is the number of data points in X; C holds the center points
    L = ones(1,size(X,2));
    for i = 2:k
        D = X-C(:,L); %-1
        D = cumsum(sqrt(dot(D,D,1))); % cumulative sum of each point's distance to its nearest chosen center
        if D(end) == 0, C(:,i:k) = X(:,ones(1,k-i+1)); return; end
        C(:,i) = X(:,find(rand < D/D(end),1)); % the second argument of find limits how many indices are returned
        [~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).')); % brilliant: this single line assigns every data point to its nearest center
    end
    % The k-means algorithm.
    while any(L ~= L1)
        L1 = L;
        for i = 1:k, l = L==i; C(:,i) = sum(X(:,l),2)/sum(l); end
        [~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).'),[],1);
    end
end

The implementation of this function is somewhat special: the parameter X is the data set, but each column is treated as a data point, and the parameter k is the desired number of clusters. The return value L labels the cluster of each data point, and the return value C holds the resulting center points (each column represents a center point). Let's test it:


>> x=[randn(3,2)*.4;randn(4,2)*.5+ones(4,1)*[4 4]]
x =
   -0.0497    0.5669
    0.5959    0.2686
    0.5636   -0.4830
    4.3586    4.3634
    4.8151    3.8483
    4.2444    4.1469
    4.5173    3.6064
>> [L, C] = kmeanspp(x',2)
L =
     2     2     2     1     1     1     1
C =
    4.4839    0.3699
    3.9913    0.1175

Now let's work through this implementation step by step, and consolidate some MATLAB knowledge along the way.

The unique function returns the distinct values in a matrix. For example:


>> unique([1 3 3 4 4 5])
ans =
     1     3     4     5
>> unique([1 3 3 ; 4 4 5])
ans =
     1
     3
     4
     5

So the loop while length(unique(L)) ~= k ensures that the algorithm ends with exactly k clusters. In general, the loop body runs only once; the real work happens inside it.

rand returns a random number in the interval (0,1). On the line marked with the comment %-1, the expression C(:,L) expands C by column indexing, in a way similar to the following:


>> C =[];
>> C(1,1) = 1
C =
     1
>> C(2,1) = 2
C =
     1
     2
>> C(:,[1 1 1 1])
ans =
     1     1     1     1
     2     2     2     2
>> C(:,[1 1 1 1 2])
Index exceeds matrix dimensions.


Each element 1 in the index vector refers to the first column of C, so C(:,[1 1 1 1]) replicates that column four times. Using the value 2 triggers the "Index exceeds matrix dimensions" error because C does not yet have a second column. If C does have a second column:


>> C(1,2) = 3;
>> C(2,2) = 4;
>> C(:,[1 1 1 1 2])
ans =
     1     1     1     1     3
     2     2     2     2     4

The dot function computes the dot product of two matrices along a given dimension, i.e., it multiplies element-wise and then sums along that dimension:

>> TT = [1 2 3 ; 4 5 6];
>> dot(TT,TT)
ans =
    17    29    45
>> dot(TT,TT,1 )
ans =
    17    29    45

The cumsum function computes cumulative sums:

>> cumsum([1 2 3])
ans =
     1     3     6
>> cumsum([1 2 3; 4 5 6])
ans =
     1     2     3
     5     7     9

The max function returns two values, the second of which is the index position of the maximum:

>> [~, L] = max([1 2 3])
L =
     3
>> [~,L] = max([1 2 3;2 3 4])
L =
     2     2     2

Here, ~ is a placeholder that discards the first output.
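
For comparison, the NumPy counterpart (assuming NumPy is available) splits this into two calls, and its indices are 0-based rather than 1-based:

import numpy as np

A = np.array([[1, 2, 3],
              [2, 3, 4]])
print(np.max(A, axis=0))     # column maxima: [2 3 4]
print(np.argmax(A, axis=0))  # 0-based row index of each maximum: [1 1 1]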

For the bsxfun function, the official documentation states:


C = bsxfun(fun,A,B) applies the element-by-element binary operation specified by the function handle fun to arrays A and B, with singleton expansion enabled

Here the parameter fun is a function handle; see reference [9] for details on function handles. Here's an example of bsxfun:
>> A= [1 2 3;2 3 4]
A =
     1     2     3
     2     3     4
>> B=[6;7]
B =
     6
     7
>> bsxfun(@minus,A,B)
ans =
    -5    -4    -3
    -5    -4    -3

For:

[~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).'));

The argument of max here is a matrix with n columns, where n is the number of data points. Each column corresponds to one data point, and its entries behave like the negatives of the distances from that point to each center point. The "distance" used, however, is a bit special: a distorted version of the squared Euclidean distance.

Suppose the data points are 2-dimensional. Let a data point be (x1,y1) and a center point be (c1,d1). Then bsxfun(@minus,2*real(C'*X),dot(C,C,1).') computes, for this pair, the value 2*c1*x1 + 2*d1*y1 - c1^2 - d1^2. Comparing with the squared Euclidean distance, (x1-c1)^2 + (y1-d1)^2 = (x1^2 + y1^2) - (2*c1*x1 + 2*d1*y1 - c1^2 - d1^2), we see that within one column the term x1^2 + y1^2 is the same for every center point, so it can be ignored: maximizing the computed value is exactly minimizing the squared distance. This explains why using max is reasonable: the cluster of a data point is determined by the center point closest to it, which is the one with the largest value in that column.
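
To check this reasoning, here is a small NumPy sketch of my own: it verifies that maximizing the bsxfun expression assigns each point to the same center as directly minimizing the squared Euclidean distance:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 7))  # 7 data points, one per column, as in the MATLAB code
C = rng.standard_normal((2, 3))  # 3 center points, one per column

score = 2 * (C.T @ X) - np.sum(C * C, axis=0)[:, None]  # the bsxfun expression
L_trick = np.argmax(score, axis=0)

dist2 = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)  # true squared distances, centers x points
L_direct = np.argmin(dist2, axis=0)

print(np.array_equal(L_trick, L_direct))  # prints True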

