Comparing text similarity in Python with difflib and Levenshtein

  • 2020-12-18 01:52:24
  • OfStack

My recent work requires sequence matching and similarity detection. What complicates it is that the input length is not fixed, for example:


input_and_output = [1, 2, ' hello ', ' The world ', 12.34, 45.6, -21, ' China ', ' beautiful ']

The task is to take one fixed-length slice of input_and_output as the input, keeping its order, then compare it against the other sequences and find the most similar one. At first I encoded the characters as numbers, but as more and more Chinese text showed up I abandoned that approach and looked for alternatives. I found two widely recommended Python packages. Each has its advantages and disadvantages, which I record below.

1, difflib


import difflib  # part of the Python standard library, no additional installation is required

In [49]: test1
Out[49]: [' hello ', ' Who am I ']

In [50]: test2
Out[50]: [' Hello ', ' I who ']

In [51]: test3
Out[51]: [12, 'nihao']

In [52]: test4
Out[52]: [' hello ', 'woshi']

In [53]: difflib.SequenceMatcher(a=test1, b=test2).quick_ratio()
Out[53]: 0.0

In [54]: difflib.SequenceMatcher(a=test1, b=test4).ratio()
Out[54]: 0.5
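
To make the behaviour in the session above easier to reproduce, here is a minimal sketch with hypothetical sample data (list_a, list_b and the two strings are not from the original session). It assumes only the standard-library difflib and shows that SequenceMatcher treats each list element as one token, while on plain strings it treats each character as one token. quick_ratio() is documented as an upper bound on ratio(), so the two calls used above are not interchangeable.

```python
import difflib

# Hypothetical sample data, not taken from the session above.
list_a = ["hello", "world", 42]
list_b = ["hello", "World", 42]

# On lists every element is a single token. "world" != "World", so only
# "hello" and 42 match: ratio = 2 * matches / total elements = 2 * 2 / 6.
list_score = difflib.SequenceMatcher(a=list_a, b=list_b).ratio()
print(list_score)

# quick_ratio() ignores order and returns an upper bound on ratio().
quick = difflib.SequenceMatcher(a=list_a, b=list_b).quick_ratio()
print(quick)

# On strings every character is a token, so a one-letter change costs little.
str_score = difflib.SequenceMatcher(a="hello world", b="hello World").ratio()
print(str_score)
```

The list score drops to about 0.67 because a whole element is lost to a single capital letter, whereas the string score stays much higher for the same change.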

2, Levenshtein


#pip install python-Levenshtein

import Levenshtein


In [56]: Levenshtein.distance(','.join(test1), ','.join(test2))
Out[56]: 2

In [57]: Levenshtein.distance(','.join(test1), ','.join(test4))
Out[57]: 5
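
For readers who want to see what the distance measures, the following is a minimal pure-Python sketch of the classic dynamic-programming edit distance. It is not the implementation used by the Levenshtein package, and the names edit_distance and join_items are hypothetical helpers introduced here. join_items converts every element with str() first, because ','.join(...) raises a TypeError on a list such as test3 that contains a number.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning s into t."""
    # previous[j] = distance between the processed prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def join_items(items) -> str:
    """Hypothetical helper: join mixed-type elements into one string."""
    return ",".join(str(item) for item in items)


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance(join_items(["hello", 12]), join_items(["hello", 13])))  # 1
```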

Put simply, when difflib is given lists rather than strings, it compares them element by element, and an element counts as a match only when it is exactly equal to its counterpart.

Levenshtein, on the other hand, requires strings as input and compares them as a whole, character by character (this may also be an effect of joining all the elements into one string, which I will make use of later).
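
Coming back to the task described at the start, one possible way to pick the best fixed-length slice is to score every window with difflib and keep the highest. This is only a sketch under assumptions: the slice is taken to be contiguous, best_window is a hypothetical helper, and the items and reference lists are made-up data modeled on input_and_output.

```python
import difflib


def best_window(items, reference, size):
    """Return (score, start, window) for the contiguous slice of `items`
    with length `size` that is most similar to `reference`.
    Returns (-1.0, -1, []) when `items` is shorter than `size`."""
    best = (-1.0, -1, [])
    for start in range(len(items) - size + 1):
        window = items[start:start + size]
        # Element-level comparison: an element matches only if it is equal.
        score = difflib.SequenceMatcher(a=window, b=reference).ratio()
        if score > best[0]:
            best = (score, start, window)
    return best


# Made-up data for illustration.
items = [1, 2, "hello", "world", 12.34, 45.6, -21, "China", "beautiful"]
reference = ["hello", "world", 12.34]

score, start, window = best_window(items, reference, size=3)
print(score, start, window)  # 1.0 2 ['hello', 'world', 12.34]
```

Swapping the scoring line for a character-level distance on the joined strings would give the Levenshtein-style behaviour instead, at the cost of converting every element to text first.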

