JAVA implements KMP algorithm theory and sample code

2020-04-01 02:26:37
OfStack

I. theoretical preparation
Why is the KMP algorithm faster than the traditional string matching algorithm? KMP algorithm is through the analysis of the pattern string, in advance of each position when the occurrence of mismatch, can be omitted to re-match the number of characters. Sorting it out and sending it to a next array, and then comparing it, can avoid backtracking, and some of the results in the pattern string can be reused, reducing the number of loops and improving the matching efficiency. In layman's terms, the KMP algorithm mainly USES some characters of the pattern string to avoid repeated comparison of these positions, just like the characters at the beginning of the pattern string. For example, the main string: abcabcabcabed, the mode string: abcabed. There is no need to start at the beginning of the pattern string when comparing to the 'e' character of the pattern string. And you don't need to backtrace the main string.
The traditional matching algorithm does not make use of the matched information (the pattern string is known, so is the partial matching main string), and it starts from scratch every time, which is very slow.
I'll start with how prefix arrays (I don't know if this is correct) are created. First, there are two concepts to understand: "prefix" and "suffix". "Prefix" means all the head combinations of a string except the last character;" Suffix "refers to all tail combinations of a string except the first character.
Here's an example: chi represents the prefix of the first I characters in the pattern string, next[I] = j means that the first j characters in chi are the same as the last j characters (note that the subscript is the number of characters), and such j is the maximum for the prefix chi. Another definition of next[I] = j is to have a string of j characters that is both the true prefix and the true suffix of chi.
Rule: next[1] = next[0] = 0, this rule is not like 0! =1 that way, but it does look like this, and I don't know what the prefix and suffix are. Note: the next array is not a palindrome, but a prefix equal to the suffix, which is important to understand when recursing the next array. Next [I] is a prefix array. Here is an example of how to construct a prefix array.
Example: cacca has 5 prefixes, find the corresponding next array. Next [3] = 1, prefix 4 is cacc, there is a common character c, so next[4] = 1, prefix 5 is cacca, there is a common character ca, so next[5] = 2. If you look closely, you can see that when you construct next[I], you can take advantage of the results of next[i-1]. For example, abcdabc, the pattern has obtained next[7] = 3, in order to find next[8], you can directly compare the fourth character and the eighth character, if they are equal, then next[8] = next[7]+1 = 4, because next[7] = 3 ensures that the first three characters of the last four characters of the prefix ch7 are the same. But what if these two characters don't want to wait? Keep iterating, using (k=3)k = next[k] until k=0(next[8] =0) or characters equal (next[8] = k+1).
Ii. Algorithm implementation


import java.util.ArrayList;
public class KMP {
 //The main string
 static String str = "1kk23789456789hahha";
 //Pattern string
 static String ch = "789";
 static int next[] = new int[20];

 public static void main(String[] args) {
  setNext();
  ArrayList<Integer> arr = getKmp();
  if(arr.size()!=0) {
   for(int i=0; i<arr.size(); i++) {
    System.out.println(" The match occurs when :"+arr.get(i));
   }
  }else {
   System.out.println(" Unsuccessful match ");
  }
 }
 private static void setNext() {
  // TODO Auto-generated method stub
  int lenCh = ch.length();
  next[0] = 0;
  next[1] = 1;
  //K represents the value of next[I -1]
  int k = 0;
  for(int i=2; i<=lenCh; i++) {
   k = next[k];
   
   while(k!=0 && ch.charAt(i-1)!=ch.charAt(k)) {
    k = next[k];
   }
   if(ch.charAt(i-1)==ch.charAt(k)) {
    k++;
   }//The else is k = 0
   //Instead of next[k] = k, I means a prefix with several characters
   next[i] = k;
  }
 }
 private static ArrayList<Integer> getKmp() {
  // TODO Auto-generated method stub
  ArrayList<Integer> arr = new ArrayList<Integer>();
  int lenStr = str.length();
  int lenCh = ch.length();
  //The main string Matches the starting position 
  int pos = 0;
  //Pattern string Match position every time 
  int k = 0;
  //The loop condition is not k<LenCh, in which case there may be a dead loop (no match occurs)
  while(pos<lenStr) {
   
   k = next[k];
   while(k<lenCh && str.charAt(pos)==ch.charAt(k)) {
    pos++;
    k++;
   }
   if(lenCh==k) {
    arr.add(pos-k);
   }else if(0==k) {
    
    pos++;
   }//Else is k = next[k], so k = next[k] on the last line
  }
  return arr;
 }

}

Three. Problem expansion
KMP algorithm efficiency is often reflected in the pattern string a relatively long time out the derivation process of the array (see next), but in fact patterns tend to be short, when thinking about using office suite of string length, the most practical use of BM algorithm to achieve, interested readers can refer to the relevant information, may be able to look at the multimode matching (more than one in the main string search mode string) AC automata, dictmatch algorithm.