Pattern matching of strings BF algorithm and KMP algorithm
- 2020-04-02 02:40:42
- OfStack
A BF algorithm
BF algorithm is a common pattern matching algorithm. The idea of BF algorithm is to match the first character of the target string S with the first character of the pattern string P. If not, the second character of S is compared with the first character of P, and so on, until the final matching result is obtained.
Examples:
S: ababcababa
P: ababa
BF The steps for algorithm matching are as follows
i=0 i=1 i=2 i=3 i=4
The first trip :ababcababa The second trip :ababcababa The third trip :ababcababa On the fourth :ababcababa On the fifth :ababcababa
ababa ababa ababa ababa ababa
j=0 j=1 j=2 j=3 j=4(i and j back )
i=1 i=2 i=3 i=4 i=3
The sixth visit to :ababcababa On the seventh :ababcababa The eighth trip :ababcababa On the ninth :ababcababa On the tenth :ababcababa
ababa ababa ababa ababa ababa
j=0 j=0 j=1 j=2(i and j back ) j=0
i=4 i=5 i=6 i=7 i=8
On the eleventh :ababcababa On the twelfth :ababcababa On 13 :ababcababa On the 14th :ababcababa On the 15th :ababcababa
ababa ababa ababa ababa ababa
j=0 j=0 j=1 j=2 j=3
i=9
On 16 :ababcababa
ababa
j=4( The match is successful )
Code implementation:
int BFMatch(char *s,char *p)
{
int i,j;
i=0;
while(i<strlen(s))
{
j=0;
while(s[i]==p[j]&&j<strlen(p))
{
i++;
j++;
}
if(j==strlen(p))
return i-strlen(p);
i=i-j+1; //Pointer I backtrace
}
return -1;
}
In fact, in the above matching process, many comparisons are redundant. If the fifth pass fails, in the sixth pass, I can stay the same, and j is 2. Because in the previous matching process, for string S, we know that s0s1s2s3=p0p1p2p3, and because of p0! = p1! So the sixth match is redundant. Since p0==p2,p1==p3, the match between the seventh and eighth steps is also redundant. These redundant matches are omitted in the KMP algorithm.
2. KMP algorithm
KMP algorithm is called KMP algorithm because this algorithm is jointly proposed by three people, the three people's first letter as the name of the algorithm. In fact, the difference between KMP algorithm and BF algorithm is that KMP algorithm cleverly eliminates the backtracking problem of pointer I, and only needs to determine the position of j to match next time, so that the complexity of the problem drops from O(mn) to O(m+n).
In KMP algorithm, in order to determine the position of j in the next match when the match fails, next[j] array is introduced, and the value of next[j] indicates that the length of the longest suffix in P[0...j-1] is equal to the prefix of the same character sequence.
The next[] array is defined as follows:
1) the next [j] = 1
J = 0
2) next[j] = Max (k): 0
<
K.
<
j
P = P [0]... k - 1 [j, k, j - 1]
3) the next [j] = 0
other
Such as:
P
a.
b
a.
b
a.
j
0
1
2
3
4
next
- 1
0
0
1
2
The next [j] = k
>
At 0, P[0...k-1]=P[j-k,j-1]
Therefore, the idea of KMP algorithm is: in the matching process, if a mismatch occurs, if next[j]
>
=0, then the pointer I of the target string remains unchanged, and the pointer j of the pattern string is moved to the position of next[j] to continue matching. If next[j]=-1, move I to the right by 1 bit and set j to 0 to continue the comparison.
The code is as follows:
int KMPMatch(char *s,char *p)
{
int next[100];
int i,j;
i=0;
j=0;
getNext(p,next);
while(i<strlen(s))
{
if(j==-1||s[i]==p[j])
{
i++;
j++;
}
else
{
j=next[j]; //Eliminates backtracking of pointer I
}
if(j==strlen(p))
return i-strlen(p);
}
return -1;
}
Therefore, the key of KMP algorithm is to calculate the value of next[] array, that is, to calculate the length of the longest suffix and prefix at each position of the pattern string. There are two ways to calculate the value of next[] array.
1. According to the idea of recursion:
According to the defined next [0] = 1, assumes that the next [j] = k, namely P [0]... k - 1 = = P/j, k, j - 1
1) if P = = P [j] [k], have P] [0.. k = = P/j, k, j, obviously, next [j + 1) = next [j] + 1 = k + 1;
2) if P [j]. =P[k], then it can be regarded as the problem of pattern matching, that is, how to move the value of k when the match fails, obviously k=next[k].
Therefore, it can be realized as follows:
void getNext(char *p,int *next)
{
int j,k;
next[0]=-1;
j=0;
k=-1;
while(j<strlen(p)-1)
{
if(k==-1||p[j]==p[k]) //In the case of matching,p[j]==p[k]
{
j++;
k++;
next[j]=k;
}
else //p[j]!=p[k]
k=next[k];
}
}
2. Direct solution method
void getNext(char *p,int *next)
{
int i,j,temp;
for(i=0;i<strlen(p);i++)
{
if(i==0)
{
next[i]=-1; //next[0]=-1
}
else if(i==1)
{
next[i]=0; //next[1]=0
}
else
{
temp=i-1;
for(j=temp;j>0;j--)
{
if(equals(p,i,j))
{
next[i]=j; //Find the maximum k
break;
}
}
if(j==0)
next[i]=0;
}
}
}
bool equals(char *p,int i,int j) //Determine whether p[0...j-1] is equal to p[I -j... I -1]
{
int k=0;
int s=i-j;
for(;k<=j-1&&s<=i-1;k++,s++)
{
if(p[k]!=p[s])
return false;
}
return true;
}