C language implements ES1en ES2en algorithm

  • 2020-06-03 07:36:27
  • OfStack

1. Clustering and clustering algorithm

Clustering means that data objects are divided into several classes. Objects in the same class have high similarity while different classes have low similarity. The clustering algorithm divides the data set into several interrelated classes so as to realize the deep analysis of the data and the preliminary processing stage of data value mining. For example, in the field of modern business, clustering analysis algorithm can learn consumers' consumption habits and consumption tendency from a large data set, so as to facilitate decision makers to formulate consumption strategies. In short, as a module in data mining, cluster analysis algorithm can be used as a separate tool to discover some deep information distributed in the database and summarize the characteristics of each class. Cluster analysis algorithm can also be used as a preprocessing step of other analysis algorithms in data mining algorithm.

In the field of data mining, clustering analysis algorithms can be divided into 1 categories, including division method, hierarchical method, density-based method, network-based method and model-based method. The basic idea based on partition is to divide the data set containing N data objects into K clusters by iterative method. Specific steps is that users should be divided by the number of is given first, and then through the repeated iteration 1 set algorithm, makes each group than the former one is more close to the expected goal, criteria to judge whether the optimization is the similarity between different data between group data, group data similarity degree, the greater the degree of groups like the smaller optimization.

The core idea of ES9en-ES10en clustering algorithm is based on the division of data sets. It divides N data objects into K classes to minimize the sum of squares of distance between data points in each class and the clustering center. Next, I will use C language to implement the ES14en-ES15en algorithm, and test the algorithm in three aspects: input different number of clustering, change the density of data points, and select the initial clustering center.

2. Implementation steps of ES18en-ES19en algorithm

Based on the understanding of clustering and K-ES23en algorithm, the realization process of C language algorithm is as follows:

(1) Enter N data points through the file and select K (K) < N) data point as the initial clustering center;

(2) Calculate the Euclidean distance of the remaining data points to the center of each clustering point, and divide the points into the nearest class;

(3) Recalculate the clustering center of each cluster;

(4) Compared with the previous clustering center, if the clustering center changes, go to (2); otherwise, finish the superposition and output the result.

3. Implementation of ES40en-means algorithm

(1) Realization of ideas

Through the above understanding of ES46en-ES47en algorithm, this algorithm mainly solves the center of K clustering through the idea of iteration. Because traditional arrays need to be defined before they are used, and the dynamic increase of array length cannot be realized in the process of using. Considering the design of the algorithm at the same time, did not involve in the process of iteration each data point insertion and deletion, specific to each data point in the clustering, is identified by className of structure member variables, therefore chose Vector as a container for storing data, such as a large amount of data from a file input, storage space by program channel out their own needs. Meanwhile, size and iterator methods provided by Vector vector container can also be used to achieve traversal and output according to the cluster.

Each data point contains X and Y coordinates. At the initial state of the algorithm, the specific number of clustering K is specified, and the K clustering centers in the initial state are specified by the previous K data points in the input file. In each iteration, the algorithm needs to calculate the Euclidean distance between each point and K cluster center coordinates, select the nearest cluster, and identify the current data point with the name of the cluster. After all data points are traversed, the mean values of all data points X and Y are calculated and compared with the coordinates of the center point of the previous clustering. When the error between X and Y is less than or equal to 1ES65en-6, the iteration is finished and the central coordinate of the convergent K song clustering is output.

(2) Description of variables and functions

(1) Define the structure type, which is used to store the coordinates of data points, the clustering location, and the distance from the clustering center


typedef struct point

{

float x,y;    // Coordinates of data points 
string className; // Belongs to the cluster 
float distance;  // Distance from the cluster center 

}Point;

(2) Variable declaration

vector < Point > dataVector: Stores data read from a file

vector < Point > classPoints: Store the cluster coordinates

vector < Point > & totalPoints) : Stores all data points

(3) Function declaration

String conversion function: Converts an integer variable to a string type:


string converToString(int x);

Read data Function: Read coordinate data from a file:


vector<Point> readDataFile(string fileName);

Initialize data set function:


void initDataset(int classNum,vector<Point> dataVector,vector<Point> &classPoints,vector<Point> &totalPoints);

Calculate the Euclidean distance between each data point and the center of the clustering point:


string computerDistance(Point *p_totalPoints,vector<Point> &classPoints);

Functions that divide each point into the corresponding class:


void kMeansClustering(int classNum,vector<Point> totalPoints,vector<Point> classPoints);

(3) Core Code (part)

(1) Initialize the data set function:


void initDataset(int classNum,vector<Point>dataVector,vector<Point>&classPoints, 
         vector<Point>&totalPoints) 
{ 
  int i,j; 
  Point point; 
  for(i=0,j=1; i<dataVector.size(); i++) 
  { 
    if(j<=classNum) //classNum Represents the number of the cluster  
    { 
      point.x=dataVector[i].x; 
      point.y=dataVector[i].y; 
      point.distance=dataVector[i].distance; 
      point.className=converToString(j);// Converts an integer type to a string type  
      classPoints.push_back(point); 
      j++; 
    } 
    point.x=dataVector[i].x; 
    point.y=dataVector[i].y; 
    point.distance=dataVector[i].distance; 
    totalPoints.push_back(point); 
  } 
} 

(2) K-means function:


void kMeansClustering(int classNum,vector<Point> totalPoints,vector<Point> classPoints) 
{ 
  float tempX=0;// Calculate all data points in the cluster X The average of  
  float tempY=0;// Calculate all data points in the cluster Y The average of  
  int count=0; // Record every 1 The number of data points in the class  
  float errorX=INT_MAX; // So let's say we have the maximum error at the beginning  
  float errorY=INT_MAX; 
  vector<Point>::iterator p_totalPoints; 
  vector<Point>::iterator p_classPoints; 
  Point temp; 
  int i; 
  while(errorX > 1e-6 && errorY > 1e-6) 
  { 
    for(p_totalPoints=totalPoints.begin(); p_totalPoints!=totalPoints.end(); p_totalPoints++) 
    { 
      // Sort all the points nearby  
      string className=computerDistance(p_totalPoints,classPoints); 
      (*p_totalPoints).className=className; 
    } 
    errorX=0; 
    errorY=0; 
    // Cluster center points are redivided according to the mean value  
    for(p_classPoints=classPoints.begin(); p_classPoints!=classPoints.end(); p_classPoints++) 
    { 
      count=0; 
      tempX=0; 
      tempY=0; 
      cout<<"Partition to cluster center "<<p_classPoints->className<<":"; 
      for(p_totalPoints=totalPoints.begin(); p_totalPoints!=totalPoints.end(); p_totalPoints++) 
      { 
        if((*p_totalPoints).className==(*p_classPoints).className) 
        { 
          cout<<" ("<<(*p_totalPoints).x<<","<<(*p_totalPoints).y<<") "; 
          count++; 
          tempX+=(*p_totalPoints).x; 
          tempY+=(*p_totalPoints).y; 
        } 
      } 
      cout<<endl; 
      tempX /=count; 
      tempY /=count; 
      errorX +=fabs(tempX - (*p_classPoints).x); 
      errorY +=fabs(tempY - (*p_classPoints).y); 
      // To calculate X with Y The mean  
      (*p_classPoints).x=tempX; 
      (*p_classPoints).y=tempY; 
    } 
    int i=0; 
    for(p_classPoints=classPoints.begin(); p_classPoints!=classPoints.end(); p_classPoints++,i++) 
    { 
      cout<<"Cluster center "<<i+1<<": x="<<(*p_classPoints).x<<" y="<<(*p_classPoints).y<<endl; 
    } 
    cout<<"-----------------------------------------------------------------"<<endl; 
  } 
  cout<<"Result value convergence"<<endl; 
  i=0; 
  for(p_classPoints=classPoints.begin(); p_classPoints!=classPoints.end(); p_classPoints++,i++) 
  { 
    cout<<"Cluster center "<<i+1<<": x="<<(*p_classPoints).x<<" y="<<(*p_classPoints).y<<endl; 
  } 
  cout<<"-----------------------------------------------------------------"<<endl; 
} 

Related articles: