C++ bronn filter for data structures

  • 2020-05-24 05:50:03
  • OfStack

Bloom filter

1. Historical background

Bloom filter (Bloom Filter) was proposed by bloom in 1970. It's actually a very long base 2 vector and a series of random mapping functions. Bloom filter can be used to retrieve whether an element is in a collection. Its advantage is that the spatial efficiency and query time are much more than the 1-like algorithm, but its disadvantage is that it has a fixed misidentification rate and deletion errors. This drawback is inevitable. However, there is absolutely no identification error (i.e., false counter example False negatives, if an element is not in the set, then Bloom Filter will not report that the element is in the set, so it will not be missed).

In FBI, whether a suspect's name is already on the suspect list; In a web crawler, whether a web address has been visited, etc. The most direct method is to store all the elements in the collection in the computer, and when a new element is encountered, compare it directly with the elements in the collection. 1 generally speaking, collections in a computer are stored in a hash table (hash table). It has the advantage of being fast and accurate, and the disadvantage of consuming storage space. This is not a significant problem when the set is small, but when the set is large, the problem of inefficient hash table storage becomes apparent.

For example, a public email (email) provider like Yahoo,Hotmail, and Gmai will always need to filter spam from spammers (spamer). One way to do this is to record the email addresses of the spammers. Because senders are constantly signing up for new addresses, there are at least a few billion spam addresses around the world, and storing them all requires a lot of web servers. If use hash table, 100 million email per store address, just need 1.6 GB memory (hash table is used to implement the specific measures is to every 1 email address corresponding to 1 8 bytes of fingerprint information, and then the information fingerprint in the hash table, as a result of the hash table storage efficiency 1 only 50%, thus require a email address 106 bytes. 100 million addresses is about 1.6GB, or 10.6 gigabytes of memory). So storing a few billion email addresses might require hundreds of GB of memory. Unless it is a supercomputer, no 1 server can store it [2].

2. Bronn filter principle, advantages and disadvantages

If you want to determine whether or not an element is in a set, one might want to save all the elements in the set and then compare them. Linked lists, trees, hash tables (Hash table), and other data structures are all part of this idea. But as the number of elements in the collection increases, we need more and more storage space. At the same time, the retrieval speed will become slower and slower.

Bloom Filter is a kind of random data structure with high spatial efficiency. Bloom Filter can be seen as an extension of bit-map. Its principle is as follows:

When an element is added to a collection, the element is mapped to K points in a one-bit array (Bit array) by K hash functions and placed as 1. To retrieve, we only need to see if all the points are 1 to know whether it is in the collection or not:

If any one of these points has a 0, the retrieved element 1 is definitely not present.

If both are 1, the retrieved element is most likely in.

Advantages:

It has the advantage of space efficiency and query time is far more than 1 algorithm, bloom filter storage space and insert \ O query time (K), in addition, the hash function has nothing to do with each other, convenient hardware parallel implementation, bloom filter don't need to store the element itself, in some occasions has advantages of very strict confidentiality.

Disadvantages:

1. The disadvantages and advantages of bloom filter are equally obvious. The error rate is one of them. As the number of stored elements increases, the error rate increases. But if the number of elements is too small, a hash is fine.

2, 1 generally can not remove elements from the bloom filter, we are easy to think of a bit array into an integer array, every insert of an element corresponding to the counter 1, so that the delete element counter will be subtracted. However, ensuring that elements are safely deleted is not so simple. First we have to make sure that the deleted element is actually in the bloon filter, and there is also a problem with the counter winding.

3. example

The Google Chrome browser USES Bloom filter to identify malicious links (which can represent a large data set with a smaller storage space, simply mapping every URL to bit) more often, with an error rate of less than 1 in 10,000.

C + + implementation

bit_set.h


#pragma once 
#include<iostream> 
using namespace std; 
#include<vector> 
 
class Bitset 
{ 
public: 
  Bitset(size_t value) 
  { 
    _a.resize((value >> 5) + 1, 0); 
  } 
  bool set(size_t num) 
  { 
    size_t index = num>>5; 
    size_t pos = num % 32; 
    if (_a[index] & (1 << (31 - pos))) 
    { 
      return false; 
    } 
    else 
    { 
      _a[index] |= (1 << (31 - pos)); 
      _size++; 
      return true; 
    } 
     
  } 
  bool Reset(size_t num) 
  { 
    size_t index = num >> 5; 
    size_t pos = num % 32; 
    if (Text(num)) 
    { 
      _a[index] &= ~(1 << (31 - pos)); 
      _size--; 
      return true; 
    } 
    else 
    { 
      return false; 
    } 
  } 
  bool Text(size_t num) 
  { 
    size_t index = num >> 5; 
    size_t pos = num % 32; 
    return _a[index] & (1 << (31-pos)); 
  } 
private: 
  vector<int> _a; 
  size_t _size; 
}; 

Hash.h


#pragma once 
template<class T> // Various hash string conversion functions   
size_t BKDRHash(const char *str) 
{ 
  register size_t hash = 0; 
  while (size_t ch = (size_t)*str++) 
  { 
    hash = hash * 131 + ch; 
  } 
  return hash; 
} 
 
template<class T> 
size_t SDBMHash(const char *str) 
{ 
  register size_t hash = 0; 
  while (size_t ch = (size_t)*str++) 
  { 
    hash = 65599 * hash + ch; 
  } 
  return hash; 
} 
 
template<class T> 
size_t RSHash(const char * str) 
{ 
  size_t hash = 0; 
  size_t magic = 63689; 
  while (size_t ch = (size_t)*str++) 
  { 
    hash = hash * magic + ch; 
    magic *= 378551; 
  } 
  return hash; 
} 
 
 
template<class T> 
size_t APHash(const char *str) 
{ 
  register size_t hash = 0; 
  size_t ch; 
  for (long i = 0; ch = (size_t)*str++; i++) 
  { 
    if ((i & 1) == 0) 
    { 
      hash ^= ((hash << 7) ^ ch ^ (hash >> 3)); 
    } 
    else 
    { 
      hash ^= (~((hash << 11) ^ ch ^ (hash >> 5))); 
    } 
  } 
  return hash; 
} 
 
 
template<class T> 
size_t JSHash(const char* str) 
{ 
  if (!*str) 
  { 
    return 0; 
  } 
  size_t hash = 1315423911; 
  while (size_t ch = (size_t)*str++) 
  { 
    hash ^= ((hash << 5) + ch + (hash >> 2)); 
  } 
  return hash; 
} 

Bloom_Filter.h


#pragma once 
 
#include"bite_set.h" 
#include"Hash.h" 
#include<string> 
 
template<class T> 
struct __HashFunk1 
{ 
  size_t operator()(const T& key) 
  { 
    return BKDRHash<T>(key.c_str()); 
  } 
}; 
 
template<class T> 
struct __HashFunk2 
{ 
  size_t operator()(const T& key) 
  { 
    return SDBMHash<T>(key.c_str()); 
  } 
};  
 
template<class T> 
struct __HashFunk3 
{ 
  size_t operator()(const T& key) 
  { 
    return RSHash<T>(key.c_str()); 
  } 
}; 
 
template<class T> 
struct __HashFunk4 
{ 
  size_t operator()(const T& key) 
  { 
    return APHash<T>(key.c_str()); 
  } 
}; 
 
template<class T> 
struct __HashFunk5 
{ 
  size_t operator()(const T& key) 
  { 
    return JSHash<T>(key.c_str()); 
  } 
}; 
 
 
template<class K = string, 
class HashFunk1 = __HashFunk1<K>, 
class HashFunk2 = __HashFunk2<K>, 
class HashFunk3 = __HashFunk3<K>, 
class HashFunk4 = __HashFunk4<K>, 
class HashFunk5 = __HashFunk5<K>> 
 
class BoolFilter 
{ 
public: 
  BoolFilter(size_t n) 
    :_a(n * 10) 
    , _range(n * 10) 
  {} 
 
  void set(const K& key) 
  { 
    _a.set(HashFunk1()(key) % _range); 
    _a.set(HashFunk2()(key) % _range); 
    _a.set(HashFunk3()(key) % _range); 
    _a.set(HashFunk4()(key) % _range); 
    _a.set(HashFunk5()(key) % _range); 
  } 
 
  bool Text(const K& key) 
  { 
    if (!_a.Text(HashFunk1()(key)% _range)) 
      return false; 
    if (!_a.Text(HashFunk2()(key) % _range)) 
      return false; 
    if (!_a.Text(HashFunk3()(key) % _range)) 
      return false; 
    if (!_a.Text(HashFunk4()(key) % _range)) 
      return false; 
    if (!_a.Text(HashFunk5()(key) % _range)) 
      return false; 
    return true; 
  } 
private: 
  Bitset _a; 
  size_t _range; 
}; 



Related articles: