Similarity calculation function of PHP: introduction to the use of levenshtein

  • 2020-06-01 08:24:20
  • OfStack

Directions for use
Read the description of the levenshtein() function in the manual first:

The levenshtein() function returns the Levenshtein distance between two strings.

Levenshtein distance, also known as edit distance, refers to the minimum number of edits required between two strings to convert from one to the other. The permitted edit operations include replacing one character with another, inserting one character, and deleting one character.

For example, convert kitten to sitting:

sitten (k - s)
sittin (e - i)
The sitting (→g) levenshtein() function gives the same weight to each operation (replace, insert, and delete). However, you can define the cost of each operation by setting the optional insert, replace, delete parameters.

Grammar:

levenshtein(string1,string2,insert,replace,delete)

Parameters to describe

string1 required. The first string to compare.
string2 required. The second string to compare.
insert optional. The cost of inserting 1 character. The default is 1.
replace optional. The cost of replacing 1 character. The default is 1.
delete optional. The cost of deleting 1 character. The default is 1.
Hints and comments

If one of the strings is more than 255 characters long, the levenshtein() function returns -1.
The levenshtein() function is case insensitive.
The levenshtein() function is faster than the similar_text() function. However, the similar_text() function provides more precise results that require fewer modifications.
example


<?php
    echo levenshtein("Hello World","ello World");
    echo "<br />";
    echo levenshtein("Hello World","ello World",10,20,30);
    ?>

Output: 1 30

Source code analysis
levenshtein() is a standard function, and there is a file in the /ext/standard/ directory specifically for this function implementation: levenshtein.c.

levenshtein() will select the implementation method according to the number of parameters. If the parameter is 2 and parameter is 5, reference_levdist() function will be called to calculate the distance. The difference is that for the last three parameters, if the parameter is 2, the default value is 1.

And in the implementation source we found a situation that was not documented: the levenshtein() function can also pass three parameters, which will eventually call the custom_levdist() function. It takes the third parameter as the implementation of the custom function, and an example of its invocation is shown below:


echo levenshtein("Hello World","ello World", 'strsub');

Report Warning: The general Levenshtein support is not there yet. This is because this method is not implemented yet, it just puts a hole in there.

The implementation algorithm of reference_levdist() function is a classic DP problem.

Given two strings x and y, find the minimum number of changes to turn x into y. The rules for modification can only be one of three: delete, insert, and change.
Use a[i][j] to represent the minimum number of operations required to convert the first i characters of x into the first j characters of y, and the state transfer equation is:


 when x[i]==y[j] When: a[i][j]  = min(a[i-1][j-1], a[i-1][j]+1, a[i][j-1]+1);
 when x[i]!=y[j] When: a[i][j] =  min(a[i-1][j-1], a[i-1][j], a[i][j-1])+1;

Before using the state transfer equation, we need to initialize the (n+1)(m+1) matrix, d, and increase the value of the first row and column from 0. Scan two strings (nm level), compare the characters, use the state transfer equation, and end up with $a[$l1][$l2] as its result.

The simple implementation process is as follows:


<?PHP
    $s1 = "abcdd";
    $l1 = strlen($s1);
    $s2 = "aabbd";
    $l2 = strlen($s2);

 
    for ($i = 0; $i < $l1; $i++) {
        $a[0][$i + 1] = $i + 1;
    }
    for ($i = 0; $i < $l2; $i++) {
        $a[$i + 1][0] = $i + 1;
    }

    for ($i = 0; $i < $l2; $i++) {
        for ($j = 0; $j < $l1; $j++) {
            if ($s2[$i] == $s1[$j]) {
                $a[$i + 1][$j + 1] = min($a[$i][$j], $a[$i][$j + 1] + 1, $a[$i + 1][$j] + 1);
            }else{
                $a[$i + 1][$j + 1] = min($a[$i][$j], $a[$i][$j + 1], $a[$i + 1][$j]) + 1;
            }
        }
    }

    echo $a[$l1][$l2];
    echo "n";
    echo levenshtein($s1, $s2);

In the implementation of PHP, the implementer clearly indicates in the comments that this function only optimizes the memory usage without considering the speed, and the time complexity is O(m×n) according to its implementation algorithm. The optimization is to change the 2 - dimensional array in the above state - transfer equation into two sets of 1 arrays. The simple implementation is as follows:

<?PHP
    $s1 = "abcjfdkslfdd";
    $l1 = strlen($s1);
    $s2 = "aab84093840932bd";
    $l2 = strlen($s2);

    $dis = 0;
    for ($i = 0; $i <= $l2; $i++){
        $p1[$i] = $i;
    }

    for ($i = 0; $i < $l1; $i++){
        $p2[0] = $p1[0] + 1;

        for ($j = 0; $j < $l2; $j++){
            if ($s1[$i] == $s2[$j]){
                $dis = min($p1[$j], $p1[$j + 1] + 1, $p2[$j] + 1);
            }else{
                $dis = min($p1[$j] + 1, $p1[$j + 1] + 1, $p2[$j] + 1);  //  Notice at the end here 1 The parameters for $p2  
            }
            $p2[$j + 1] = $dis;
        }
        $tmp = $p1;
        $p1 = $p2;
        $p2 = $tmp;  
    }

    echo "n";
    echo $p1[$l2];
    echo "n";
    echo levenshtein($s1, $s2);

The above is an optimization for PHP kernel developers to the previous classic DP. The optimization point is to constantly reuse two 1-dimensional arrays, one recording the result of last time and one recording the result of this time. If you assign different values to each of the three operations according to the parameters of PHP, you can change the corresponding 1 into the corresponding value of the operation in the algorithm above. The first parameter of the min function corresponds to the modification, the second parameter corresponds to the deletion of the source sky, and the third parameter corresponds to the addition.

Levenshtein distance instructions
Levenshtein distance was first invented and named after the Russian scientist Vladimir Levenshtein in 1965. You can't spell it, you can call it edit distance. Levenshtein distance can be used:
Spell checking(spell check)
Speech recognition(statement recognition)
DNA analysis(DNA analysis)
Plagiarism detection(plagiarism detection) LD stores distance values using the matrix of mn.


Related articles: