Pattern matching of strings BF algorithm and KMP algorithm

  • 2020-04-02 02:40:42
  • OfStack

A BF algorithm
      BF algorithm is a common pattern matching algorithm. The idea of BF algorithm is to match the first character of the target string S with the first character of the pattern string P. If not, the second character of S is compared with the first character of P, and so on, until the final matching result is obtained.

    Examples:


  S: ababcababa
  P: ababa
  BF The steps for algorithm matching are as follows 
      i=0                  i=1               i=2             i=3             i=4
  The first trip :ababcababa      The second trip :ababcababa    The third trip :ababcababa   On the fourth :ababcababa   On the fifth :ababcababa
       ababa              ababa             ababa            ababa            ababa
      j=0                  j=1              j=2             j=3             j=4(i and j back )
       i=1                 i=2              i=3              i=4            i=3
  The sixth visit to :ababcababa      On the seventh :ababcababa     The eighth trip :ababcababa    On the ninth :ababcababa   On the tenth :ababcababa
       ababa               ababa              ababa            ababa            ababa
       j=0                 j=0              j=1              j=2(i and j back )      j=0
       i=4                  i=5             i=6              i=7             i=8
 On the eleventh :ababcababa     On the twelfth :ababcababa   On 13 :ababcababa   On the 14th :ababcababa   On the 15th :ababcababa
           ababa                ababa              ababa             ababa             ababa
        j=0                  j=0             j=1              j=2             j=3
 
          i=9
 On 16 :ababcababa
            ababa
          j=4( The match is successful )

Code implementation:


int BFMatch(char *s,char *p)
{
  int i,j;
  i=0;
  while(i<strlen(s))
  {
    j=0;
    while(s[i]==p[j]&&j<strlen(p))
    {
      i++;
      j++;
    }
    if(j==strlen(p))
      return i-strlen(p);
    i=i-j+1;        //Pointer I backtrace
  }
  return -1;  
}

  In fact, in the above matching process, many comparisons are redundant. If the fifth pass fails, in the sixth pass, I can stay the same, and j is 2. Because in the previous matching process, for string S, we know that s0s1s2s3=p0p1p2p3, and because of p0! = p1! So the sixth match is redundant. Since p0==p2,p1==p3, the match between the seventh and eighth steps is also redundant. These redundant matches are omitted in the KMP algorithm.

2. KMP algorithm

      KMP algorithm is called KMP algorithm because this algorithm is jointly proposed by three people, the three people's first letter as the name of the algorithm. In fact, the difference between KMP algorithm and BF algorithm is that KMP algorithm cleverly eliminates the backtracking problem of pointer I, and only needs to determine the position of j to match next time, so that the complexity of the problem drops from O(mn) to O(m+n).
In KMP algorithm, in order to determine the position of j in the next match when the match fails, next[j] array is introduced, and the value of next[j] indicates that the length of the longest suffix in P[0...j-1] is equal to the prefix of the same character sequence.
The next[] array is defined as follows:
1) the next [j] = 1   J = 0
2) next[j] = Max (k): 0 < K. < j     P = P [0]... k - 1 [j, k, j - 1]
3) the next [j] = 0   other
Such as:
P           a.       b     a.       b     a.
j           0       1     2       3     4
  next       - 1     0     0       1     2
The next [j] = k > At 0, P[0...k-1]=P[j-k,j-1]
Therefore, the idea of KMP algorithm is: in the matching process, if a mismatch occurs, if next[j] > =0, then the pointer I of the target string remains unchanged, and the pointer j of the pattern string is moved to the position of next[j] to continue matching. If next[j]=-1, move I to the right by 1 bit and set j to 0 to continue the comparison.
The code is as follows:


int KMPMatch(char *s,char *p)
{
  int next[100];
  int i,j;
  i=0;
  j=0;
  getNext(p,next);
  while(i<strlen(s))
  {
    if(j==-1||s[i]==p[j])
    {
      i++;
      j++;
    }
    else
    {
      j=next[j];    //Eliminates backtracking of pointer I
    }
    if(j==strlen(p))
      return i-strlen(p);
  }
  return -1;
}

Therefore, the key of KMP algorithm is to calculate the value of next[] array, that is, to calculate the length of the longest suffix and prefix at each position of the pattern string. There are two ways to calculate the value of next[] array.
1. According to the idea of recursion:
    According to the defined next [0] = 1, assumes that the next [j] = k, namely P [0]... k - 1 = = P/j, k, j - 1
    1) if P = = P [j] [k], have P] [0.. k = = P/j, k, j, obviously, next [j + 1) = next [j] + 1 = k + 1;
    2) if P [j]. =P[k], then it can be regarded as the problem of pattern matching, that is, how to move the value of k when the match fails, obviously k=next[k].
    Therefore, it can be realized as follows:


void getNext(char *p,int *next)
{
  int j,k;
  next[0]=-1;
  j=0;
  k=-1;
  while(j<strlen(p)-1)
  {
    if(k==-1||p[j]==p[k])  //In the case of matching,p[j]==p[k]
    {
      j++;
      k++;
      next[j]=k;
    }
    else          //p[j]!=p[k]
      k=next[k];
  }
}

 
    2. Direct solution method


void getNext(char *p,int *next)
{
  int i,j,temp;
  for(i=0;i<strlen(p);i++)
  {
    if(i==0)
    {
      next[i]=-1;   //next[0]=-1
    }
    else if(i==1) 
    {
      next[i]=0;   //next[1]=0
    }
    else
    {
      temp=i-1;
      for(j=temp;j>0;j--)
      {
        if(equals(p,i,j))
        {
          next[i]=j;  //Find the maximum k
          break;
        }
      }
      if(j==0)
        next[i]=0;
    }
  }
}
bool equals(char *p,int i,int j)   //Determine whether p[0...j-1] is equal to p[I -j... I -1]
{
  int k=0;
  int s=i-j;
  for(;k<=j-1&&s<=i-1;k++,s++)
  {
    if(p[k]!=p[s])
      return false;
  }
  return true;
}


Related articles: