C Method for Calculating String Similarity

  • 2021-07-06 11:39:59
  • OfStack

In this paper, an example is given to describe the method of calculating string similarity in C #. Share it for your reference. The details are as follows:

There are many ways to calculate string similarity, and even the dumbest way is to match one by one. What we want to talk about here is to use Levinstein distance to calculate string similarity.
Levinstein distance concept: Assume that the function name is LD

Used to calculate the similarity between two strings. For example, there are two strings A and B. Assuming that A is used as a benchmark, the algorithm calculates how many steps it takes to change B into A by (replacing, deleting, adding characters).

For example:
A = "abcd", B = "abc", then LD (A, B) = 1, just insert one character in the B string and it is exactly equal to A
A = "abcd", B = "abcd", then LD (A, B) =, because these two goods are exactly the same
A = "abcd", B = "abdc", then LD (A, B) = 1, because just swapping the position of "dc" in B equals A.
A= "fwegwegweg @ # 2", B= "ddd* & & %^ & ", then LD (A, B) =????, this uncle really doesn't know, let's use the program.
The larger the calculated value of the Levinstein distance, the more steps there are, and the less similar the two strings are.

For example, if you want to make a simple "article plagiarism" judgment function, you can completely realize a preliminary method with this Levinstein distance.

Algorithm notes:
1. Assume that the string str1 is n and the string str2 is m.
If n = 0, return m and exit; (This is nonsense.)
2. If m=0, return n and exit. (This is still nonsense.)
3. If none of the above is true, the calculation should be started.

Build an array d [0. m, 0. n].
Initialize Row 0 to 0. n and Column 0 to 0. m.
Check each letter of str1 in turn (i=1. n).
Check each letter of str2 in turn (j=1. m).
If str1 [i] = str2 [j], sign=0; (sign is just a tag, without any meaning, to record equality or inequality)
If str1 [i]! = str12 [j], then sign=1.
Set d [i, j] to the smallest of the following three values:
The value of the grid immediately above the current grid plus 1, that is, d [i-1, j] +1
The value of the lattice immediately to the left of the current lattice is added by 1, that is, d [i, j-1] +1
Add sign to the value of the upper left cell of the current cell, that is, d [i-1, j-1] + sign
Repeat the above steps until the end of the loop. d [n, m] is the final value

Next is the realization of a Levinstein distance written in c #.


public class LDMaker// Make it 1 Classes look professional, 
 // In fact, it is to take off your pants and fart. Here, we use the Levinstein distance algorithm 
 // Used to calculate the similarity between strings 
  {
    char[] str1;
    char[] str2;
    public LDMaker(string s1, string s2)
    {
  // Replace   All   Figures   Is a fixed number   Digital jamming   Too serious 
  // Here, it varies from person to person. In the matching of Chinese articles, numbers interfere seriously 
  // So I replaced the numbers in some applications. 
  // The reason is that some articles are actually very similar, but they are deliberately added in them 1 Some numbers 
  // Interference with the execution of this function, so that the machine can see that the two articles are very different. 1 You don't need to do the following 
  //  Steps 
  s1=System.Text.RegularExpressions.Regex.Replace(s1,@"(\d+)","1");
  s2 = System.Text.RegularExpressions.Regex.Replace(s2, @"(\d+)", "1");
  str1 = s1.ToCharArray();
  str2 = s2.ToCharArray();
}
public int GetLD()// This is the algorithm implementation of Levinstein distance 
{
  try
  {
    int m=str1.Length;
    int n=str2.Length;
    int[,] d = new int[m+1, n+1];
    for (int i = 0; i <= m ; i++)
      d[i, 0] = i;
    for (int i = 0; i <= n ; i++)
      d[0, i] = i;
    for (int i = 1; i <= m; i++)
    {
      for (int j = 1; j <= n; j++)
      {
      d[i,j] = d[i - 1,j - 1] + (str1[i - 1] == str2[j - 1] ? 0 : 1);
      // Modify 1 Characters 
       d[i,j] = Math.Min(d[i,j], d[i - 1,j] + 1);
      //  Insert 1 String  
      d[i,j] = Math.Min(d[i,j], d[i,j - 1] + 1); 
      // Delete 1 Characters  
      } 
    } 
    return d[m, n];
    } catch(// Error return 1 A very large value 
    { 
      return 10000;
    }
   } 
}

I hope this article is helpful to everyone's C # programming.


Related articles: