Deep understanding of fast pattern matching algorithm of KMP

  • 2020-04-02 01:01:56
  • OfStack

I'm afraid that anyone who has ever used a computer knows that most software with text editing function has a shortcut key, CTRL +f (such as word). This function is mainly used for "find", "replace" and "all replace" functions, which is a typical application of pattern matching, that is, searching for strings in a text file.

1. Pattern matching
The pattern matching model looks something like this: given two string variables S and P, where S becomes the target string, which contains n characters, P is called the pattern string, which contains m characters, where m < = n. The search for pattern P begins at the given location of S (usually the first location of S). If found, returns the position of the pattern P in the target string (that is, the subscript of the first character of P in S). If the pattern string P is not found in the target string S, -1 is returned. This is the definition of pattern matching. Let's see how to implement the pattern matching algorithm.

2. Plain pattern matching
The naive pattern matching algorithm is very simple and easy to understand. The general idea is as follows: start from the first character of S, S0, and compare the characters in P with the characters in S, if S0=P0 &&... When && sm-1 = pm-1, the match is proved to be successful. The rest of the match is unnecessary, and the subscript 0 is returned. If at a certain step Si! = Pi, then the remaining characters in P do not need to be compared, it is impossible to match, and then from the second character in S to the first character in P to compare, the same is also know that Sm = pm-1 or find some I make Si! Is equal to S minus 1. Analogically, if it is known that the n-m starting character in S has not been matched successfully, then it is proved that pattern P is not stored in S. (think about why the emphasis here is n-m) this code implementation should be very simple, starting with the internal implementation of the STRSTR function. Can look at the baidu encyclopedia, give a link (link: http://baike.baidu.com/view/745156.htm), here don't write out, also hurriedly KMP to get down to business?

3. Fast pattern matching algorithm (KMP)
The main reason for the low efficiency of plain pattern matching is the repeated character comparison. There is no connection between the next comparison and the last comparison, which is a shortcoming of naive pattern matching. In fact, the comparison results of the last comparison can be utilized, which results in fast pattern matching. In naive pattern matching, the subscript of the target string S moves step by step, which is not good, and the number of moves does not have to be 1.
Now suppose that the current match goes like this: S0... St St + 1... St + j   With the P0, P1... Pj, the characters that are now trying to match are St+j+1 and Pj+1, and St+j+1! = Pj+1, which means St St+1... St + j and P0, P1... Pj is a perfect match. So at this point, what should be the starting position of the next match in S? According to the naive pattern matching, the next comparison should start from St + 1, and let St + 1 and P0 comparison, but in the fast pattern matching is not so, the fast pattern matching choose St + j + 1 and Pk + 1 comparison, what is K? K is such a value that P0, P1... Pk and Pj - k Pj - k + 1... Pj is a perfect match. Let k=next[j], so P0, P1... Pk and St+j-k St+j-k+1... St plus j is a perfect match. Then the next two characters to be matched should be St+j+1 and Pk+1. Neither S nor P are comparing back to the index 0, which is why KMP is fast.
Now the big question is, how do you get this K? In fact, this K is only related to the pattern string P, and requires m K, K = next[j], so we only need to calculate once to store in the next array, and the time complexity is related to m (linear). So let's see how we evaluate the next array, which is k.
Next [] : set next(0) = -1, if next(j) = k, next[j+1].
(1) if Pk+1 = Pj+1, obviously next[j+1] = k+1. If Pk+1! = Pj+1, then next[j+1] < Next [j], then look for h < K makes P0, P1... Ph = Pj Pj - h - h + 1... Pj = Pk - Pk - h + 1 h... Pk. So h is equal to next of k; See, this is an iterative process. (the previous result is useful for the future value)
(2) if such h is not stored, it means that P0, P1... There are no equal substrings in Pj+1, so next[j+1] =-1.
(3) if such h exists, continue to check whether Ph and Pj are equal. Until we find this equality, or we say minus 1 and we find next[j+1].
Look at the implementation code:

View Code 
int next[20] ={0};
//Note that the return result is an array next, save m k worth, if next[j]=k
//The STR [0] STR [1]... STR = STR [k] [j] - k STR [j - k + 1]... STR [j].
//In this way, when the match between des[t+j+1] and pat[j+1] fails, the next matching position is des[t+j+1] and next[j]+1
void Next(char str[],int len)
{
    next[0] = -1;
    for(int j = 1 ; j < len ; j++)
    {
        int i = next[j-1];
        while(str[j] != str[i+1] && i >= 0)//Iterative process
        {
            i = next[i];
        }
        if(str[j] == str[i+1])
        {
            next[j] = i+1;
        }
        else
        {
            next[j] = -1;
        }
    }
}

Now that we have the k value stored in the next array, we can implement the KMP algorithm:

View Code 
//Des is the target string, pat is the pattern string, and len1 and len2 are the length of the string
int kmp(char des[],int len1,char pat[],int len2)
{
    Next(str2,len2);
    int p=0,s=0;
    while(p < len2  && s < len1)
    {
        if(pat[p] == des[s])
        {
            p++;s++;
        }
        else
        {
            if(p==0) 
            {
                s++;//If the first character fails to match, start with the next character of des
            }
            else
            {
                p = next[p-1]+1;//Use the failure function to determine which characters pat should backtrack to
            }
        }
    }
    if(p < len2)//The whole process failed to match
    {
        return -1;
    }
    return s-len2;
}

Time complexity:
For the Next function approximately close to O(m), the time complexity of KMP algorithm is O(n), so the time complexity of the whole algorithm is O(n+m).
Space complexity:
The space complexity of order (m) is introduced.
4. Apply KMP to an interview question
Given two strings s1 and s2, determine whether s2 can be contained by the string that s1 circulates to. For example, s1=AABCD, s2 =CDAA, returns true, because s1 cyclic shift can become CDAAB. Given s1=ACBD and s2=ACBD returns false.
Analysis: it is not difficult to find that the strings obtained from the shift of s2 will be substrings of the string s1s1. If s2 can be obtained by the cyclic shift of s1, then s2 must be substrings of s1s1, and then the KMP algorithm is not very useful.

Related articles: