PHP functions for calculating string similarity

2020-05-27 04:33:49
OfStack

similar_text - calculates the similarity between two strings
int similar_text ( string $first , string $second [, float & $percent ] )
$first required. Specifies the first string to compare.
$second required. Specifies the second string to compare.
$percent optional. A variable, passed by reference, that receives the similarity as a percentage.

The similarity between the two strings is calculated as described by Oliver [1993]. Note that the implementation does not use a stack as in Oliver's pseudocode, but makes recursive calls instead, which may or may not speed up the whole process. Also note that the complexity of the algorithm is O(N**3), where N is the length of the longest string.
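The recursive idea can be sketched in plain PHP as follows. This is only an illustration under assumptions, not PHP's actual C source: the helper names longest_common_substring() and similar_text_sketch() are hypothetical, and the sketch simply finds the longest common substring, then repeats the search on the pieces to its left and to its right.

```php
<?php
// Hypothetical helpers; a sketch of the idea, not PHP's own implementation.

// Finds the longest common substring of $a and $b.
// Returns [start in $a, start in $b, length]; the first longest match wins.
function longest_common_substring(string $a, string $b): array {
    $best = [0, 0, 0];
    $lenA = strlen($a);
    $lenB = strlen($b);
    for ($i = 0; $i < $lenA; $i++) {
        for ($j = 0; $j < $lenB; $j++) {
            $k = 0;
            while ($i + $k < $lenA && $j + $k < $lenB && $a[$i + $k] === $b[$j + $k]) {
                $k++;
            }
            if ($k > $best[2]) {
                $best = [$i, $j, $k];
            }
        }
    }
    return $best;
}

// Counts matching bytes: the longest common substring, plus the same
// count for the pieces to its left and to its right.
function similar_text_sketch(string $a, string $b): int {
    if ($a === '' || $b === '') {
        return 0;
    }
    [$posA, $posB, $len] = longest_common_substring($a, $b);
    if ($len === 0) {
        return 0;
    }
    return $len
        + similar_text_sketch(substr($a, 0, $posA), substr($b, 0, $posB))
        + similar_text_sketch(substr($a, $posA + $len), substr($b, $posB + $len));
}

echo similar_text_sketch("abcdefg", "aeg"), "\n"; // 3
echo similar_text("abcdefg", "aeg"), "\n";        // 3
```

Because the longest common substring is fixed first, characters that fall on the "wrong" side of it are never matched, which is why the result can be lower than a longest-common-subsequence count.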

For example, we want to find the similarity between the string abcdefg and the string aeg:

 

$first = "abcdefg"; 

$second = "aeg"; 

echo similar_text($first, $second);

The output is 3. To get the similarity as a percentage, pass a variable as the third parameter, as follows:

$first = "abcdefg"; 

$second = "aeg"; 

similar_text($first, $second, $percent); 

echo $percent;
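To make the relationship between the return value and the percentage concrete, here is a small sketch. It assumes the percentage is the number of matches times two, divided by the combined length of both strings, times 100; the variable names $matches and $byHand are only for illustration.

```php
<?php
$first = "abcdefg";
$second = "aeg";

// The return value is the number of matching bytes.
$matches = similar_text($first, $second, $percent);

// The same percentage, computed by hand from the return value.
$byHand = $matches * 2 * 100 / (strlen($first) + strlen($second));

echo $matches, "\n"; // 3
echo $percent, "\n"; // 60
echo $byHand, "\n";  // 60
```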

Use and implementation of the similar_text() function

The similar_text() function is primarily used to calculate the number of matching characters in two strings, but it can also calculate the similarity between them as a percentage. The levenshtein() function is faster than similar_text(). However, similar_text() can give more accurate results with fewer modifications needed. Consider using levenshtein() when speed matters more than accuracy and the strings are of limited length.

Directions for use

Read the instructions for the levenshtein() function in the manual first:

The levenshtein() function returns the Levenshtein distance between two strings.

The Levenshtein distance, also known as edit distance, refers to the minimum number of edits required between two strings to convert from one to the other. The permitted edit operations include replacing one character with another, inserting one character, and deleting one character.

For example, converting kitten to sitting takes three edits:

sitten (k → s)
sittin (e → i)
sitting (insert g at the end)

The levenshtein() function gives the same weight to each operation (replace, insert, and delete). However, you can define the cost of each operation by setting the optional insert, replace, and delete parameters.
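The edit distance with per-operation costs can be sketched with the standard dynamic-programming table. This is a minimal illustration under assumptions, not the built-in's source: levenshtein_sketch() is a hypothetical name, and the costs apply to turning the first string into the second.

```php
<?php
// Hypothetical helper; a sketch of the standard dynamic-programming method.
function levenshtein_sketch(string $s1, string $s2, int $ins = 1, int $rep = 1, int $del = 1): int {
    $len1 = strlen($s1);
    $len2 = strlen($s2);

    // Row 0: building the first $j bytes of $s2 from nothing takes $j insertions.
    $prev = [];
    for ($j = 0; $j <= $len2; $j++) {
        $prev[$j] = $j * $ins;
    }

    for ($i = 1; $i <= $len1; $i++) {
        // Column 0: reducing the first $i bytes of $s1 to nothing takes $i deletions.
        $curr = [$i * $del];
        for ($j = 1; $j <= $len2; $j++) {
            $same = $s1[$i - 1] === $s2[$j - 1];
            $curr[$j] = min(
                $prev[$j - 1] + ($same ? 0 : $rep), // keep or replace
                $prev[$j] + $del,                   // delete from $s1
                $curr[$j - 1] + $ins                // insert into $s1
            );
        }
        $prev = $curr;
    }
    return $prev[$len2];
}

echo levenshtein_sketch("kitten", "sitting"), "\n"; // 3
echo levenshtein("kitten", "sitting"), "\n";        // 3
```

Only two rows of the table are kept at a time, since each cell depends only on the current and previous row.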

Syntax:

levenshtein(string1,string2,insert,replace,delete)

Parameter description

• string1 required. The first string to compare.
• string2 required. The second string to compare.
• insert optional. The cost of inserting 1 character. The default is 1.
• replace optional. The cost of replacing 1 character. The default is 1.
• delete optional. The cost of deleting 1 character. The default is 1.
Tips and notes

• In PHP versions before 8.0, if one of the strings exceeds 255 characters, the levenshtein() function returns -1.
• The levenshtein() function is case sensitive.
• The levenshtein() function is faster than the similar_text() function. However, the similar_text() function gives more accurate results with fewer modifications needed.
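A quick way to check how levenshtein() treats letter case is to compare two strings that differ only in case. The snippet below is a small sketch; lower-casing both inputs with strtolower() is one possible way to get a case-insensitive comparison for ASCII text.

```php
<?php
// "H" and "h" are different bytes, so one replacement is needed.
echo levenshtein("Hello", "hello"), "\n"; // 1

// Lower-casing both inputs first makes the comparison case insensitive.
echo levenshtein(strtolower("Hello"), strtolower("hello")), "\n"; // 0
```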
Example

 

<?php 

echo levenshtein("Hello World","ello World"); 

echo "<br />"; 

echo levenshtein("Hello World","ello World",10,20,30); 

?>

Output: 1 30

The following is a supplement:

PHP has a built-in function, similar_text(), that calculates the number of matching characters between two strings and can also report the similarity as a percentage. However, its results can feel very inaccurate for Chinese text, for example:



echo similar_text(" Jilin poultry company fire has been caused 112 People have been killed "," Jilin baoyuanfeng poultry company fire has been caused 112 People have been killed ");

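In fact, these two news headlines are almost identical. If you use similar_text(), the result is 42, which reads as only 42% similar and feels quite unreliable. Note, however, that the return value of similar_text() is the number of matching characters rather than a percentage; the percentage is only available through the optional third parameter.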


<?php 
class LCS {
  public $str1;
  public $str2;
  public $c = array();

  /* Returns the longest common subsequence of string 1 and string 2 */
  function getLCS($str1, $str2, $len1 = 0, $len2 = 0) {
    $this->str1 = $str1;
    $this->str2 = $str2;
    if ($len1 == 0) $len1 = strlen($str1);
    if ($len2 == 0) $len2 = strlen($str2);
    $this->initC($len1, $len2);
    return $this->printLCS($this->c, $len1, $len2);
  }

  /* Returns the similarity between two strings, from 0 to 1 */
  function getSimilar($str1, $str2) {
    $len1 = strlen($str1);
    $len2 = strlen($str2);
    if ($len1 + $len2 == 0) return 1; // two empty strings are identical
    $len = strlen($this->getLCS($str1, $str2, $len1, $len2));
    return $len * 2 / ($len1 + $len2);
  }

  /* Fills $c so that $c[$i][$j] is the LCS length of the first $i bytes
     of string 1 and the first $j bytes of string 2 */
  function initC($len1, $len2) {
    $this->c = array(); // reset the table left over from a previous call
    for ($i = 0; $i <= $len1; $i++) $this->c[$i][0] = 0;
    for ($j = 0; $j <= $len2; $j++) $this->c[0][$j] = 0;
    for ($i = 1; $i <= $len1; $i++) {
      for ($j = 1; $j <= $len2; $j++) {
        if ($this->str1[$i - 1] == $this->str2[$j - 1]) {
          $this->c[$i][$j] = $this->c[$i - 1][$j - 1] + 1;
        } else if ($this->c[$i - 1][$j] >= $this->c[$i][$j - 1]) {
          $this->c[$i][$j] = $this->c[$i - 1][$j];
        } else {
          $this->c[$i][$j] = $this->c[$i][$j - 1];
        }
      }
    }
  }

  /* Walks back through $c from ($i, $j) and rebuilds the subsequence */
  function printLCS($c, $i, $j) {
    if ($i == 0 || $j == 0) return "";
    if ($this->str1[$i - 1] == $this->str2[$j - 1]) {
      return $this->printLCS($c, $i - 1, $j - 1).$this->str2[$j - 1];
    } else if ($c[$i - 1][$j] >= $c[$i][$j - 1]) {
      return $this->printLCS($c, $i - 1, $j);
    } else {
      return $this->printLCS($c, $i, $j - 1);
    }
  }
} 

$lcs = new LCS();
// Returns the longest common subsequence 
$lcs->getLCS("hello word","hello china");
// Return similarity 
echo $lcs->getSimilar(" Jilin poultry company fire has been caused 112 People have been killed "," Jilin baoyuanfeng poultry company fire has been caused 112 People have been killed ");

For the same pair of headlines, the output is 0.90322580645161 (about 90%), which is significantly more accurate.
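The LCS class above compares strings byte by byte, so each Chinese character in UTF-8 counts as several bytes. A character-level variant can be sketched as follows. It is a minimal illustration under assumptions: lcs_similarity_utf8() is a hypothetical name, both inputs are assumed to be valid UTF-8, and only the length of the longest common subsequence is computed, not the subsequence itself.

```php
<?php
// Hypothetical helper; assumes both inputs are valid UTF-8.
function lcs_similarity_utf8(string $a, string $b): float {
    // Split into characters rather than bytes.
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($x);
    $m = count($y);
    if ($n + $m === 0) {
        return 1.0; // two empty strings are treated as identical
    }

    // $prev[$j] is the LCS length of the characters of $a seen so far
    // and the first $j characters of $b; only two rows are kept.
    $prev = array_fill(0, $m + 1, 0);
    for ($i = 1; $i <= $n; $i++) {
        $curr = [0];
        for ($j = 1; $j <= $m; $j++) {
            $curr[$j] = ($x[$i - 1] === $y[$j - 1])
                ? $prev[$j - 1] + 1
                : max($prev[$j], $curr[$j - 1]);
        }
        $prev = $curr;
    }
    return 2 * $prev[$m] / ($n + $m);
}

echo lcs_similarity_utf8("abcdefg", "aeg"), "\n"; // 0.6
echo lcs_similarity_utf8("你好世界", "你好,世界"), "\n"; // 8/9, about 0.889
```

Keeping only two rows avoids both the full table and the deep recursion of the class, at the cost of not being able to rebuild the subsequence itself.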