Summary of PHP big data and massive data processing algorithms

  • 2020-05-05 11:01:41
  • OfStack

The following is a general summary of methods for processing massive data. These methods may not cover every problem completely, but they can handle the vast majority of cases. Some of the questions below come directly from companies' written interview questions, and the methods given are not necessarily optimal; if you have a better approach, you are welcome to discuss it with me.

1.Bloom filter

Scope of application: can be used to implement data dictionaries, to check data for duplicates, or to compute set intersections.

Basics and essentials:
The principle is simple: a bit array plus k independent hash functions. To insert an element, set the bits at the positions given by the hash functions to 1. During a lookup, if all the bits corresponding to the hash functions are 1, the element is considered present. Obviously, this process does not guarantee that the result is 100% correct. It also does not support deleting an inserted keyword, because the bits of that keyword may be shared with other keywords. A simple improvement is the counting Bloom filter, which replaces the bit array with a counter array and thus supports deletion.

Another important problem is how to determine the size m of the bit array and the number of hash functions from the number of input elements n. The error rate is minimized when the number of hash functions is k = (ln 2) * (m / n). For the error rate to be no greater than E, m must be at least n * lg(1/E) to represent any set of n elements. But m should be larger still: to ensure that at least half of the bit array is 0, we need m >= n * lg(1/E) * lg(e), which is roughly 1.44 times n * lg(1/E) (lg denotes the logarithm base 2).

For example, assuming an error rate of 0.01, m should be about 13 times n, and k is then about 8.

Note that m and n have different units: m is measured in bits, while n is measured in number of elements (the number of distinct elements, to be exact). Since a single element usually occupies many bits, using a Bloom filter is usually economical in memory.

Extension:
A Bloom filter maps the elements of a set into a bit array and uses whether all k mapped bits are 1 (k being the number of hash functions) to indicate whether an element is in the set. The counting Bloom filter (CBF) extends each bit of the array to a counter, thereby supporting deletion of elements. The spectral Bloom filter (SBF) associates the counters with the number of occurrences of set elements; SBF uses the minimum value among an element's counters to approximate its occurrence frequency.

Problem example: given two files A and B, each holding 5 billion URLs, where each URL occupies 64 bytes and the memory limit is 4 GB, find the URLs common to A and B. What if there are three or even n files?

For this problem, let us work out the memory usage: 4 GB = 2^32 bytes, which is about 4 billion bytes, or about 34 billion bits. With n = 5 billion and an error rate of 0.01, we need roughly 65 billion bits. The 34 billion bits available are not far off, which may raise the error rate somewhat. In addition, if these URLs map one-to-one to IPs, they can be converted to IPs, which simplifies things considerably.
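To make the mechanics concrete, here is a minimal Bloom filter sketch in PHP. It only illustrates the bit array + k hash functions idea, and is not the article's code: the two crc32-based hashes combined by double hashing, the bit-array size and the URL usage are all assumptions of the sketch (it also assumes 64-bit PHP, where crc32() returns a non-negative integer).

<?php
// Minimal Bloom filter sketch (illustrative only).
class BloomFilter
{
    private string $bits;   // bit array packed into a byte string
    private int $m;         // number of bits
    private int $k;         // number of hash functions

    public function __construct(int $m, int $k)
    {
        $this->m = $m;
        $this->k = $k;
        $this->bits = str_repeat("\0", intdiv($m + 7, 8));
    }

    // Derive k bit positions from two base hashes (double hashing);
    // assumes 64-bit PHP so crc32() is non-negative.
    private function positions(string $item): array
    {
        $h1 = crc32($item);
        $h2 = crc32("seed#" . $item);
        $pos = [];
        for ($i = 0; $i < $this->k; $i++) {
            $pos[] = ($h1 + $i * $h2) % $this->m;
        }
        return $pos;
    }

    public function add(string $item): void
    {
        foreach ($this->positions($item) as $p) {
            $byte = intdiv($p, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($p % 8)));
        }
    }

    public function mightContain(string $item): bool
    {
        foreach ($this->positions($item) as $p) {
            if ((ord($this->bits[intdiv($p, 8)]) & (1 << ($p % 8))) === 0) {
                return false;   // definitely not present
            }
        }
        return true;            // possibly present (false positives allowed)
    }
}

// Usage: insert every URL of file A, then probe every URL of file B;
// the URLs reported as "might contain" are candidate common URLs.
$bf = new BloomFilter(1 << 24, 8);
$bf->add("http://example.com/a");
var_dump($bf->mightContain("http://example.com/a")); // bool(true)
var_dump($bf->mightContain("http://example.com/b")); // almost certainly bool(false)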

2.Hashing

Scope of application: quick lookup and deletion; a basic data structure; usually requires that the total data fit in memory.

Fundamentals and key points:
Choice of hash function: there are specific hash methods for strings, integers, permutations, and so on.
Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.

Extension:
In d-left hashing, d means multiple; for simplicity consider 2-left hashing. 2-left hashing splits a hash table into two halves of equal length, called T1 and T2, and equips T1 and T2 with separate hash functions, h1 and h2. When storing a new key, both hash functions are used to compute two addresses, h1[key] and h2[key]. We then check the position h1[key] in T1 and the position h2[key] in T2 to see which one already stores more (colliding) keys, and store the new key in the less loaded position. If both sides are equally loaded, for example both positions are empty or both already hold one key, the new key is stored in the left subtable T1, which is where the name 2-left comes from. When looking up a key, both hashes must be computed and both positions checked.
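A rough PHP sketch of the 2-left hashing scheme just described; the bucket-list representation and the two crc32-based hash functions are assumptions of the sketch, not part of the original description.

<?php
// 2-left hashing sketch (illustrative only): two subtables T1/T2 with
// separate hash functions; a new key goes to the less loaded of its two
// candidate buckets, with ties broken toward T1 (the "left" half).
class TwoLeftHash
{
    private array $t1;
    private array $t2;
    private int $size;

    public function __construct(int $size)
    {
        $this->size = $size;
        $this->t1 = array_fill(0, $size, []);
        $this->t2 = array_fill(0, $size, []);
    }

    private function h1(string $key): int { return crc32($key) % $this->size; }
    private function h2(string $key): int { return crc32("2#" . $key) % $this->size; }

    public function insert(string $key): void
    {
        $p1 = $this->h1($key);
        $p2 = $this->h2($key);
        if (count($this->t1[$p1]) <= count($this->t2[$p2])) {
            $this->t1[$p1][] = $key;     // ties go to the left subtable
        } else {
            $this->t2[$p2][] = $key;
        }
    }

    public function contains(string $key): bool
    {
        // a lookup must check both candidate positions
        return in_array($key, $this->t1[$this->h1($key)], true)
            || in_array($key, $this->t2[$this->h2($key)], true);
    }
}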

Problem example:
1). From a massive log, extract the IP that visited Baidu the most times on a given day.

The number of IPs is limited, at most 2^32, so we can consider using a hash table to store the IPs directly in memory and count them.
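A minimal sketch of that idea in PHP, assuming the IPs are available one per line in a log file (the file name access_ips.log is hypothetical):

<?php
// Count visits per IP with a hash table and report the most frequent one.
$counts = [];                       // a PHP array acts as the hash table
$fh = fopen("access_ips.log", "r");
while (($line = fgets($fh)) !== false) {
    $ip = trim($line);
    if ($ip === "") {
        continue;
    }
    $counts[$ip] = ($counts[$ip] ?? 0) + 1;
}
fclose($fh);

arsort($counts);                    // sort by count, descending
echo key($counts), " => ", current($counts), PHP_EOL;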

3.bit-map

Scope of application: quick lookup, duplicate checking and deletion of data. Generally, the data range should be within about 10 times the range of an int.

Basics and essentials: use a bit array to indicate whether certain elements exist, for example 8-digit phone numbers.

Extension: bloom filter can be seen as an extension of bit-map

Problem example:

1) Given a file that contains some phone numbers, each an 8-digit number, count the number of distinct numbers.

An 8-digit number is at most 99,999,999, so about 99M bits are needed, i.e. a bit more than 10 MB of memory.

2) find out the number of non-repeating integers among 250 million integers. The memory space is not enough to hold the 250 million integers.

Extend the bit-map by using 2 bits per number: state 0 means the number has not appeared, 1 means it appeared once, and 2 means it appeared two or more times. Alternatively, instead of 2 bits, we can simulate this 2-bit-map with two separate bit-maps.
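A rough PHP sketch of the 2-bit-per-value idea; the small demo range and the sample values are assumptions made to keep the example self-contained (the real problem would first partition the full int range, as in section 5).

<?php
// 2-bit map sketch: 0 = unseen, 1 = seen once, 2 = seen two or more times.
$range = 1 << 20;                         // assumed value range for the demo
$map = str_repeat("\0", intdiv($range * 2 + 7, 8));

function bump(string &$map, int $v): void {
    $bitPos = $v * 2;                     // 2 bits per value, never straddles a byte
    $byte   = intdiv($bitPos, 8);
    $shift  = $bitPos % 8;
    $state  = (ord($map[$byte]) >> $shift) & 0b11;
    if ($state < 2) {                     // saturate at "two or more"
        $state++;
        $cleared = ord($map[$byte]) & ~(0b11 << $shift);
        $map[$byte] = chr($cleared | ($state << $shift));
    }
}

foreach ([42, 42, 7, 99, 7, 7] as $v) {   // sample input
    bump($map, $v);
}

// Count values whose state is exactly 1 (seen once => non-repeating).
$unique = 0;
for ($v = 0; $v < $range; $v++) {
    $bitPos = $v * 2;
    if (((ord($map[intdiv($bitPos, 8)]) >> ($bitPos % 8)) & 0b11) === 1) {
        $unique++;
    }
}
echo $unique, PHP_EOL;                    // 1 (only 99 appears exactly once)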

4. Heap

Scope of application: finding the top n of massive data, where n is relatively small, so that the heap fits in memory

Basic principles and key points: use a max-heap to find the n smallest, and a min-heap to find the n largest. For example, to find the n smallest, we compare the current element with the largest element in the max-heap; if it is smaller than that largest element, we replace the largest element. The n elements left at the end are the smallest n. This is suitable for large data volumes where n is relatively small: a single scan obtains all of the top n elements, which is very efficient.

Extension: a double heap, i.e. a max-heap combined with a min-heap, can be used to maintain the median.

Problem example:
1) Find the top 100 largest numbers among 1,000,000 (100w) numbers.

Use a min-heap of 100 elements, as in the sketch below.
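A hedged sketch of this min-heap approach using PHP's built-in SplMinHeap; the random demo input merely stands in for the real data.

<?php
// Keep the 100 largest values seen so far in a min-heap of size 100.
$k = 100;
$heap = new SplMinHeap();

for ($i = 0; $i < 1000000; $i++) {
    $x = mt_rand();                // demo input
    if ($heap->count() < $k) {
        $heap->insert($x);
    } elseif ($x > $heap->top()) {
        $heap->extract();          // drop the smallest of the current top-k
        $heap->insert($x);
    }
}

// The heap now holds the 100 largest values (iterating the heap drains it).
$top = iterator_to_array($heap, false);
rsort($top);
echo $top[0], " ... ", end($top), PHP_EOL;   // largest ... 100th largest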

5. Double-layer bucket partitioning -- essentially the idea of divide and conquer, with the emphasis on the "divide" technique!

Scope of application: finding the k-th largest element, the median, non-repeating or repeating numbers

Rationale and key points: because the element range is too large for a direct addressing table, the range is narrowed step by step through multiple rounds of partitioning until it finally falls within an acceptable range. This can be done over several rounds; the double layer is just one example.

Extension:

Problem example:
1). Find out the number of non-repeating integers among 250 million integers. The memory space is not enough to hold these 250 million integers.

This is a bit like the pigeonhole principle. There are 2^32 possible integers, so we can divide the range of 2^32 numbers into 2^8 regions (for example, one file per region), distribute the data into the different regions, and then handle each region directly with a bitmap. In other words, as long as there is enough disk space, the problem is easy to solve.

2). 500 million int find their median.

This example is clearer than the one above. First we divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From these counts we can determine which region contains the median, and which rank within that region the median has. In a second scan we then only count the numbers falling into that region.

In fact, if int were int64 instead of int, three such rounds of partitioning would reduce the problem to an acceptable size: divide the int64 range into 2^24 regions and determine which region contains the target number, divide that region into 2^20 sub-regions and determine which sub-region contains it; the sub-region then contains only about 2^20 values, which can be counted directly with a direct addressing table.
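A hedged PHP sketch of the first pass of the median example: counting how many values fall into each of 2^16 buckets keyed on the high 16 bits. The source of input values is assumed (e.g. a hypothetical readInts() reader), and values are assumed to be non-negative 32-bit integers.

<?php
// Pass 1 of the two-pass bucket idea: locate the bucket holding the median
// and the median's rank inside that bucket; pass 2 would then scan only
// the numbers falling into that bucket.
function medianBucket(iterable $values, int $total): array
{
    $buckets = array_fill(0, 1 << 16, 0);
    foreach ($values as $v) {
        // top 16 bits of the (assumed non-negative 32-bit) value = bucket id
        $buckets[($v >> 16) & 0xFFFF]++;
    }

    $target = intdiv($total + 1, 2);        // rank of the median
    $seen = 0;
    foreach ($buckets as $b => $count) {
        if ($seen + $count >= $target) {
            // median lies in bucket $b, at rank ($target - $seen) inside it
            return [$b, $target - $seen];
        }
        $seen += $count;
    }
    return [-1, -1];                        // unreachable for non-empty input
}

// Example (hypothetical reader): [$bucket, $rank] = medianBucket(readInts(), 500000000);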

6. Database index

Scope of application: insertion, deletion, update and query of large data volumes

Basic principles and key points: use database design and implementation techniques to handle insertion, deletion, update and query of massive data.
Extension:
Problem example:


7. Inverted index

Scope of application: search engine, keyword query

Rationale and key points: why is it called an inverted index? It is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a set of documents.

In English, for example, here is the text to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We get the following inverted file index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
A query for "what", "is" and "it" corresponds to the intersection of these sets: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.

A forward index, by contrast, stores a list of words for each document. A query against a forward index requires sequentially scanning every document and checking every word it contains. In a forward index the document occupies the central position: each document points to the sequence of index terms it contains. In an inverted index the word points to the documents containing it, which is exactly the inverse relationship.
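A small PHP sketch that builds the inverted index shown above from the three sample texts and intersects the posting lists for a query; the array-based representation is just one simple choice.

<?php
// Build an inverted index (word => list of document ids) and query it.
$docs = [
    0 => "it is what it is",
    1 => "what is it",
    2 => "it is a banana",
];

$index = [];
foreach ($docs as $id => $text) {
    foreach (array_unique(explode(" ", $text)) as $word) {
        $index[$word][] = $id;          // posting list for this word
    }
}

// Query: documents containing all of "what", "is", "it".
$query = ["what", "is", "it"];
$result = $index[$query[0]] ?? [];
foreach (array_slice($query, 1) as $word) {
    $result = array_values(array_intersect($result, $index[$word] ?? []));
}
print_r($result);                       // documents 0 and 1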

Extension:

Problem example: a document retrieval system that needs to find the files containing a particular word, such as keyword search over a collection of academic papers.

8. External sorting

Scope of application: sorting of big data, deduplication

Basic principles and key points: the merge method of external sorting, the principles of replacement selection and the loser tree, and the optimal merge tree.

Extension:

Problem example:
1). There is a 1 GB file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.

Each word is at most 16 bytes, but 1 MB of memory is not enough to build a hash table, so external sorting can be used instead. The available memory serves as the input buffer.
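A hedged PHP sketch of the merge phase only: several already-sorted run files (produced by sorting memory-sized chunks of the input) are merged while keeping just one line per run in memory. The run file names are hypothetical, and a loser tree or min-heap would replace the linear minimum scan in a real implementation.

<?php
// k-way merge of sorted run files (one word per line) into one sorted file.
function mergeRuns(array $runFiles, string $outFile): void
{
    $handles = [];
    $current = [];                            // run index => current word
    foreach ($runFiles as $i => $path) {
        $handles[$i] = fopen($path, "r");
        $line = fgets($handles[$i]);
        if ($line !== false) {
            $current[$i] = rtrim($line, "\n");
        }
    }

    $out = fopen($outFile, "w");
    while ($current !== []) {
        // pick the lexicographically smallest head; a loser tree or min-heap
        // would do this in O(log k) instead of O(k)
        $i = array_keys($current, min($current))[0];
        fwrite($out, $current[$i] . "\n");
        $line = fgets($handles[$i]);
        if ($line !== false) {
            $current[$i] = rtrim($line, "\n");
        } else {
            unset($current[$i]);
            fclose($handles[$i]);             // this run is exhausted
        }
    }
    fclose($out);
}

// Usage with hypothetical run files produced by sorting 1 MB chunks in memory:
// mergeRuns(["run0.txt", "run1.txt", "run2.txt"], "sorted.txt");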

9. Trie tree

Scope of application: large amounts of data with many repetitions, but the number of distinct items is small enough to fit in memory

Basic principles and key points: the implementation approach, and how to represent the children of a node.

Extension: compression implementation.

Problem example:
1). There are 10 files of 1 GB each; every line of every file holds a user query. You are asked to sort the queries by frequency.

2). There are 10 million strings, some of which are identical (duplicates). You need to remove all duplicates and keep the strings that have no duplicates. How would you design and implement this?

3). Finding hot queries: the query strings have a fairly high repetition rate; although the total is 10 million, after removing duplicates there are no more than 3 million, each no longer than 255 bytes.
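A minimal trie sketch in PHP for counting repeated query strings; the plain-array node representation and the sample queries are assumptions of this illustration, not the article's code.

<?php
// Each node is an array: ['children' => [...], 'count' => int].
function trieInsert(array &$node, string $s): void
{
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $c = $s[$i];
        if (!isset($node['children'][$c])) {
            $node['children'][$c] = ['children' => [], 'count' => 0];
        }
        $node = &$node['children'][$c];     // descend by reference
    }
    $node['count']++;                       // terminal node: occurrences of $s
}

function trieCount(array $node, string $s): int
{
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $c = $s[$i];
        if (!isset($node['children'][$c])) {
            return 0;
        }
        $node = $node['children'][$c];
    }
    return $node['count'];
}

$root = ['children' => [], 'count' => 0];
foreach (["foo", "bar", "foo", "foo"] as $q) {   // sample queries
    trieInsert($root, $q);
}
echo trieCount($root, "foo"), PHP_EOL;      // 3
echo trieCount($root, "baz"), PHP_EOL;      // 0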

10. Distributed processing mapreduce

Scope of application: large data volume, but the number of distinct data types is small enough to fit in memory.

Basic principles and key points: hand the data to different machines for processing, i.e. partition the data and reduce the results.

Extension:

Problem example:

1). The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:
void map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        EmitIntermediate(w, 1);

void reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    int result = 0;
    for each v in partialCounts:
        result += ParseInt(v);
    Emit(result);
Here, each document is split into words, and each word is counted initially with a "1" value by the Map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to Reduce, so this function just needs to sum all of its input values to find the total appearances of that word.
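The same word count rendered in PHP, simulating the three phases (map, shuffle by key, reduce) on a single machine; the two sample documents are borrowed from the inverted index example above.

<?php
$documents = [
    "doc1" => "it is what it is",
    "doc2" => "what is it",
];

// Map: emit (word, 1) pairs.
$intermediate = [];
foreach ($documents as $name => $contents) {
    foreach (explode(" ", $contents) as $word) {
        $intermediate[] = [$word, 1];
    }
}

// Shuffle: group partial counts by key.
$grouped = [];
foreach ($intermediate as [$word, $count]) {
    $grouped[$word][] = $count;
}

// Reduce: sum the partial counts for each word.
$result = [];
foreach ($grouped as $word => $partialCounts) {
    $result[$word] = array_sum($partialCounts);
}
print_r($result);   // it => 3, is => 3, what => 2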

2). The massive data set is distributed across 100 computers. Find a way to efficiently compute the TOP 10 items of this batch of data.

3). There are N machines in total, and each machine holds N numbers. Each machine can store at most O(N) numbers and operate on them. How do you find the median of the N^2 numbers?


Classic problem analysis

Tens of millions or hundreds of millions of data items (with duplicates): count the N most frequently occurring items. There are two situations: the data can be read into memory at once, or it cannot.

Available ideas: trie tree + heap, database index, partitioning into subsets and counting, hash, distributed computing, approximate statistics, external sorting.

Whether the data "can be read into memory at once" should actually refer to the amount of data left after removing duplicates. If the de-duplicated data fits in memory, we can build a dictionary for it, such as a map, hashmap or trie, and then simply count. While updating the occurrence count of each item, we can also use a heap to maintain the N most frequent items seen so far; of course, this increases the maintenance cost, so it is less efficient than finding the top N after the counting is fully complete.

If the data cannot fit in memory, on the one hand we can consider whether the dictionary method above can be adapted to this situation; the change would be to store the dictionary on disk instead of in memory, for which the storage methods of databases can serve as a reference.

Of course, a better approach is to use distributed computing, which is basically a map-reduce process. First, partition the data into ranges according to the data values, or to hashes (e.g. md5) of the data values, and hand each range to a different machine; ideally each partition should be small enough to be read into memory at once, so that each machine is responsible for one value range. This is essentially the map step. After the results are obtained, each machine only needs to take out its N most frequent items, and then the N most frequent items among all of these are selected in a final summary, which is essentially the reduce step.

In fact, you might want to simply split the data evenly across the machines, but that does not yield a correct answer: one item may be spread evenly across different machines while another may be concentrated entirely on one machine, and there may also be items with equal counts. For example, suppose we want the 100 most frequent items and distribute 10 million records across 10 machines, finding the top 100 on each machine. After merging, we cannot guarantee that we have found the true 100th item: its total count might be 10,000, but because it was split across the 10 machines, each machine sees only 1,000 of it; meanwhile, items that on a given machine rank ahead of it may each be concentrated on a single machine, for example with 1,001 occurrences, so the item with 10,000 total occurrences gets eliminated. Even if we let each machine select its 1,000 most frequent items and then merge, we can still make an error, because there may be many items with 1,001 occurrences clustered on single machines. Therefore, the data cannot be divided arbitrarily across machines; instead, it must be mapped to the machines by hash value, so that each machine handles one value range.

The external sorting approach consumes a lot of IO and is not very efficient. The distributed approach above can also be used in a single-machine version: divide the total data into subfiles by value range, process the subfiles one by one, and finally merge the words and their frequencies; an external sort merge process can be used for this last step.

Another thing to consider is approximate computation: by exploiting properties of natural language, we can use only the words that actually occur most often as the dictionary, so that this scale can fit in memory.
