Bloom filter of Bloom Filter Java implementation method

  • 2020-05-19 04:57:19
  • OfStack

Bloon filter principle is very simple: is to hash 1 string into 1 integer key, then select 1 long bit sequence, start with 0, in key to change the position of 0 to 1; The next time a string comes in, the hash value key, if the value in this bit is also 1, then the string exists.

If you do that, it's no different than a hash algorithm, which is repetitive.

The bloom filter hashes 1 string into multiple key, as I said in the book.

You create 1. 6 billion base 2 constants, and then you zero all of those 1. 6 billion base 2 bits. For each string, use 8 different random generators (F1,F2,... ,F8) produces 8 information fingerprints (f1,f2... Then, a random number generator G is used to map these 8 information fingerprints to 8 natural Numbers in the range of 1 to 1.6 billion. g1,g2... , g8. Now change all the base 2 bits of these 8 positions to 1. So you have a bronn filter.

So how do you detect if a string already exists?

Now use 8 random number generators (F1,F2,... ,F8) generate 8 information fingerprints on this string s1,s2... ,s8, and then correspond these 8 information fingerprints to the 8 binary bits of bloon filter, T1,T2... If the string exists, then obviously T1,T2... ,T8 should all be 1 in base 2. That's how you can tell if a string already exists.

In fact, bloom filter is an extension of the hash algorithm, since the essence is hash, then there must be a deficiency, that is to say, there must be a misjudgment, a string clearly did not appear and bloom filter judgment appeared, although the probability is small, but it does exist.

So how do you reduce this probability? Well, the first thing you can think about is if you expand the 8 information fingerprints to 16 errors the probability will definitely go down, but you have to take into account that the number of strings that a bloom filter can store is also going to go down by a factor of 1; The other thing is you take a good hash function, and there are a lot of ways to hash strings, some of which are good hash functions.

Blom filter is mainly used to filter malicious websites. All malicious websites are built on one blom filter, and then the website visited by users is detected. If it is in the malicious website, users will be notified. In this way, we can also set a whitelist to 1 of the commonly misjudged url, and then to appear to judge the existence of the url and then match the url in the whitelist, if in the whitelist, then release. Of course, the whitelist can't be too big, and it won't be too big. The probability of bloon filter error is very small. Interested readers may refer to the bloon filter's error rate.

Java version of the bloom filter source code:


import java.util.BitSet; 
 
/** 
 * 
 * @author xkey 
 */ 
public class BloomFilter { 
 
  private static final int DEFAULT_SIZE = 2 << 24;// Bit length of bloom filter  
  private static final int[] seeds = {3,5,7, 11, 13, 31, 37, 61};// I'm going to pick a prime number here, which is a good way to reduce the error rate  
  private static BitSet bits = new BitSet(DEFAULT_SIZE); 
  private static SimpleHash[] func = new SimpleHash[seeds.length]; 
 
  public static void addValue(String value) 
  { 
    for(SimpleHash f : func)// The string value Hash for 8 One or more integers, and then in those integers bit On a 1 
      bits.set(f.hash(value),true); 
  } 
   
  public static void add(String value) 
  { 
    if(value != null) addValue(value); 
  } 
   
  public static boolean contains(String value) 
  { 
    if(value == null) return false; 
    boolean ret = true; 
    for(SimpleHash f : func)// There's no need to run it all, as long as 1 time ret==false Then it does not contain the string  
      ret = ret && bits.get(f.hash(value)); 
    return ret; 
  } 
   
  public static void main(String[] args) { 
    String value = "www.ofstack.com"; 
    for (int i = 0; i < seeds.length; i++) { 
      func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]); 
    } 
    add(value); 
    System.out.println(contains(value)); 
  } 
} 
 
class SimpleHash {// This thing is equivalent to C++ Structure in  
 
  private int cap; 
  private int seed; 
 
  public SimpleHash(int cap, int seed) { 
    this.cap = cap; 
    this.seed = seed; 
  } 
 
  public int hash(String value) {// String hash. It's important to choose the right hash function  
    int result = 0; 
    int len = value.length(); 
    for (int i = 0; i < len; i++) { 
      result = seed * result + value.charAt(i); 
    } 
    return (cap - 1) & result; 
  } 
} 

Conclusion: bloom filter is an innovation of hash algorithm, and it needs little space consumption and low error rate. In a word, this innovative idea is worth learning, which is the application of data type bit.


Related articles: