Sorting unduplicated data set instance code with bitmaps (C++ version)

  • 2020-04-02 01:53:42
  • OfStack

The first chapter of "Programming Pearls" (link: #) describes how to use bitmaps to sort data sets without duplicates.

One, main idea

Bitmap sort of thought is in memory for a continuous space as a bitmap, when the initial bitmap would each set to 0, and then in turn read for sorting files of integers, the integer bit is set to 1, the last scan bitmap, if one is 1, then the number, output to the sorted documents. For example, for the data S={3,0,4,1,7,2,5}, Max (S)=7, we can set an eight-bit bitmap B, set each bit of the bitmap as 0, that is, B=[0,0,0,0,0,0,0,0], for each integer d in S, set B[d]=1, that is, B=[1,1,1,1,1,1, 1,1,1,1,1,1], finally scan the bitmap, to each bit of the bitmap I, if B[I]==1, output I to the sorted file, sorted S={0,1,2,3,4,5,7}.
The whole process only needs to go through the files and bitmaps to be sorted once, the time complexity is O(n), and the auxiliary space required is (Max (S)/8)B. Although this sorting algorithm can only be run on no repeat a set of integers, but for some requirements, does effective implementation, such as phone number, mobile number 11, the first is always 1, theoretically can have 10 ^ 10 Numbers, but some number yet, that is some number in the system does not exist, assuming that 50% of the legal system number, each number with long int, said so much number needed space 50% * 10 ^ (10) * 4 b = 20 gb, not placed in memory for quick sort. An alternative is to merge sort in multiple passes, but it takes a long time. We apply for a 10^10 bit bitmap, the memory required is 10^10/8B=1.25GB, it can be fully run on the contemporary PC, when scanning the bitmap, assume a bit I is 1, output file, add a 1 in front, such as I =3885201314, output is 13885201314.

Second, algorithm implementation

  Using c language to achieve, you need to encapsulate bitmap operations, here need to use three operations: set all bitmap bits for 0(setAllZero); Set the specified bit to 1(setOne); To see if the specified bit is 1(find); The code is as follows:

 


 #include <malloc.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#define MAX_NUM 16777216//The largest number, that's the number of bits
#define BYTE_NUM (1+MAX_NUM/8)//The number of bytes
#define MASK 0x07
void setAllZero(unsigned char *p,long size);
void setOne(unsigned char *p,long loc);
int find(unsigned char *p,long loc);
bool getSorted(unsigned char *bitmap,char *fileName);
bool setBitmap(unsigned char *bitmap,char *fileName);
int bitmapSort();
int main(){
    return bitmapSort();
}
int bitmapSort(){
    unsigned char *bitmap;    //Bitmap pointer
    bitmap = (unsigned char *)malloc(BYTE_NUM*sizeof(unsigned char));
    if(bitmap == NULL){
        printf("Malloc failedn");
        return -1;
    }    
    setAllZero(bitmap,BYTE_NUM);//Set all bitmap bits to 0
    setBitmap(bitmap,"phoneNumber.txt");//Scan the pending file and set the bitmap to 1
    getSorted(bitmap,"bitmapSort.txt");    //Scan the bitmap and output the bitmap 1 bit number to the file
    free(bitmap);//Release the bitmap
    return 0;
}

bool setBitmap(unsigned char *bitmap,char *fileName){
    FILE *readFp;
    printf("Setting bitmap...n");
    readFp = fopen(fileName,"r");
    if(readFp == NULL)
        return false;    
    long phoneNum=0;
    while(fscanf(readFp,"%ldn",&phoneNum) != EOF){
        setOne(bitmap,phoneNum);//Will & have spent & have spent & have spent PhoneNum bit set to 1& NBSP; & have spent & have spent
    }
    fclose(readFp);
    return true;
}

bool getSorted(unsigned char *bitmap,char *fileName){
    printf("Search bitmap...n");
    FILE *writeFp;
    writeFp = fopen(fileName,"w");
    if(writeFp == NULL)
        return false;
    long phoneNum=0;
    for(phoneNum = 0; phoneNum < MAX_NUM; phoneNum += 1){
        if(find(bitmap,phoneNum)){
            fprintf(writeFp,"%ldn",phoneNum);
        }
    }
    fclose(writeFp);
    return true;
}

void setAllZero(unsigned char *bitmap,long size){
    for(long i=0;i<size;i++)
        *(bitmap+i) &= 0;
}

void setOne(unsigned char *bitmap,long loc){
    *(bitmap+(loc>>3)) |= (1<<(loc&MASK));//
}

int find(unsigned char *bitmap,long loc){
    return ((*(bitmap+(loc>>3))) & (1<<(loc&MASK))) == (1<<(loc&MASK));    
}
 

  C++ STL has a data structure in the bitset, the operation of bitmap is very convenient.

 


 #include <bitset>
#define MAX_NUM 4000000//The most number, the number of digits required
using namespace std;
int main(){
    FILE *readFp,*writeFp;
    readFp = fopen("phoneNumber1.txt","r");        
    writeFp = fopen("bitsetSorted.txt","w");    
    bitset<MAX_NUM> bitmap;
    for(long i=0;i<MAX_NUM;i++){//First, convert the bitmap to 0
        bitmap.set(i,0);
    }
    printf("Begin set bitmap...n");
    long number = 0;
    while(fscanf(readFp,"%ldn",&number) != EOF){
        bitmap.set(number,1);//Set the number bit to 1& PI; & have spent & have spent & have spent & have spent & have spent & have spent
    }
    printf("Begin search bitmap...n");
    for(long i=0;i<MAX_NUM;i++){
        if(bitmap[i] == 1)//Outputs bits of bit 1 to the sorted file
            fprintf(writeFp,"%ldn",number);
    }
    fclose(writeFp);
    fclose(readFp);
}
 

Sorting algorithms write soon ok, begin to generate test data, to generate 0-2 ^ 31 out-of-order data set to return true not easy, first of all make sure not to repeat, the second will lose 40% of the number (invalid mobile phone number), the third order as far as possible, dao for a long time, finally found a way to achieve that generated the 12 gb of data set, about to generate the data set, welcome to discuss, I will be in the next article to summarize my method.
See github for the complete code.


Related articles: