Summary and analysis of massive data processing methods of C++ algorithm

2020-04-02 01:01:20
OfStack
                    
A technique commonly used in mass data processing

1. Bloom Filtering

Basic Bloom Filtering supports fast insertion and lookup operations and is a hash table technology. The basic data structure is very simple, with an array of m bits, k hash functions, and n input elements stored in a bit array.

Every time a new element is inserted, the k hash values of the element are calculated first, and the corresponding hash value of the bit array is set to 1. When searching for an element, the same k hash values are calculated first, and then the query is conducted to see whether the k bits in the corresponding bit array are all 1, and if so, the existence of the element is determined.

The basic Bloom Filtering algorithm can be used for fast weight determination operations that allow for errors. The calculation of the intersection and union of sets.

Bloom Filtering is an improved version of counting Bloom Filtering can support data delete, countering Bloom Filtering and basic Bloom Filtering, compared to an array of each value of expanded into many, basic Bloom Filtering in 1 a bit. When an element is inserted, all k bits are added by 1, while when deleted, all k bits are subtracted by 1. If k values are greater than 0, they are determined to exist. An important parameter in CBF is the number of bits per bit. It can be proved by theory that the number of digits is usually 4 enough to support the insertion of the same data for 16 times.

Bitmap can be seen as a special case of bloom filtering

2. Hash table technology

D-left hash load balancing technique. Divide the hash into d segments, design d hash functions, and select an appropriate segment to store the data with more load. The search computes d hash values, which are found in segment d.

Often used to count times.

3. The reactor technology

Heap has two typical applications:

Multiplexing sort

O TopK

Multiplex merge sort USES maximum heap for descending sort and minimum heap for ascending sort.

When we do TopK, when we do TopK largest, we do the smallest heap, we do TopK smallest heap. When the topK is largest, the minimum heap is used to maintain K values. When the value of the new scan is greater than the heap top element, the heap top element is deleted and the new value is inserted. After scanning the data in this way, we can get the maximum topK.

4. Double-layer barrel (multi-layer barrel) design

Hash table technology is a direct addr technology, but when the data range is too wide, and the data volume is very large, the use of hash table directly direct addr technology is not available, which is the use of multi-tier hash technology. The raw data range is divided into small segments, each memory segment can be loaded, and the direct addr table technique can be used within the segment. Can use the multilayer classification to quickly locate the small segment.