A brief look at four Python word segmentation tools: which one is better to use?

  • 2021-10-25 07:35:57
  • OfStack

Contents

1. jieba word segmentation
2. pkuseg word segmentation
3. FoolNLTK word segmentation
4. THULAC

Hello, everyone, I'm Ango!

Word segmentation is a common task in natural language processing. For example, automatically extracting keywords from an article requires a word segmentation tool (a sketch of this appears in the jieba section below), and Chinese search is likewise inseparable from word segmentation.

There are many open-source word segmentation tools for Python; here are some of the commonly used ones.

1. jieba word segmentation

"Stuttering" word segmentation, GitHub most popular word segmentation tool, determined to do the best Python Chinese word segmentation components, support a variety of word segmentation mode, support custom dictionary

GitHub stars: 26k

Code Sample


import jieba

jieba.enable_paddle()  # enable paddle mode (requires jieba >= 0.40 and the paddlepaddle-tiny package)
strs = ["我来到北京清华大学", "乒乓球拍卖完了", "中国科学技术大学"]
for s in strs:
    seg_list = jieba.cut(s, use_paddle=True)  # use paddle mode
    print("Paddle Mode: " + '/'.join(list(seg_list)))

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Accurate mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # the default is accurate mode
print("New word recognition: " + ", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print("Search engine mode: " + ", ".join(seg_list))

Output:


 "Full mode" :  I /  Come to /  Beijing /  Tsinghua /  Tsinghua University /  Huada /  University 

 "Precise Mode" :  I /  Come to /  Beijing /  Tsinghua University 

 "New Word Recognition": He ,  Come to ,  It's over ,  NetEase ,  Hang Yan ,  Mansion     ( Here, "Hangyan" is not in the dictionary, but it is also Viterbi The algorithm recognized it )

 "Search engine mode":   Xiao Ming ,  Master's degree ,  Graduation ,  In ,  China ,  Science ,  College ,  Academy of Sciences ,  Chinese Academy of Sciences ,  Calculation ,  Institute of Computing ,  Posterior ,  In ,  Japan ,  Kyoto ,  University ,  Kyoto University, Japan ,  Further study 

Project address:

https://github.com/fxsjy/jieba

2. pkuseg word segmentation

pkuseg is an open-source word segmentation tool from the Language Computing and Machine Learning Group at Peking University.

Its distinguishing feature is multi-domain segmentation: it currently provides pre-trained models for the news, web, medicine, and tourism domains as well as a mixed-domain model, and users are free to choose among them (a domain-model sketch follows the basic example below).

Compared with general-purpose word segmentation tools, its segmentation accuracy is higher.

GitHub stars: 5.4k

Code Sample


import pkuseg

seg = pkuseg.pkuseg()                     # load the model with the default configuration
text = seg.cut('python是一门很棒的语言')  # perform segmentation
print(text)

Output


['python', '是', '一', '门', '很', '棒', '的', '语言']
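As noted above, pkuseg can also load a domain-specific pre-trained model and a user dictionary. A minimal sketch, assuming the chosen model can be downloaded on first use (the medical sentence and dictionary path are made up for illustration):

import pkuseg

seg = pkuseg.pkuseg(model_name='medicine')   # medicine-domain model, downloaded automatically on first use
print(seg.cut('患者出现发热和咳嗽等症状'))    # made-up medical sentence

seg = pkuseg.pkuseg(user_dict='my_dict.txt', postag=True)  # hypothetical dictionary path; also enable POS tagging
print(seg.cut('python是一门很棒的语言'))      # with postag=True each item is a (word, tag) pair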

Project address:

https://github.com/lancopku/pkuseg-python

3. FoolNLTK word segmentation

FoolNLTK is trained on a BiLSTM model and is said to be perhaps the most accurate open-source Chinese word segmentation tool; it also supports user-defined dictionaries (see the sketch after the example below).

GitHub stars: 1.6k

Code Sample


import fool

text = "一个傻子在北京"
print(fool.cut(text))
# ['一个', '傻子', '在', '北京']
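Besides plain segmentation, FoolNLTK's README also describes part-of-speech tagging, named-entity recognition, and loading a user-defined dictionary. A minimal sketch (the dictionary path is made up; exact return shapes may vary by version):

import fool

text = "一个傻子在北京"

print(fool.pos_cut(text))          # part-of-speech tagging: (word, tag) pairs

words, ners = fool.analysis(text)  # named-entity recognition
print(ners)

# fool.load_userdict('my_dict.txt')  # load a user dictionary from a file (hypothetical path)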

Project address:

https://github.com/rockyzhengwu/FoolNLTK

4. THULAC

THULAC is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory at Tsinghua University.

It includes part-of-speech tagging, so it can tell whether a word is a noun, a verb, an adjective, and so on.

GitHub stars: 1.5k

Code Sample


import thulac

thu1 = thulac.thulac()  # default mode
text = thu1.cut("我爱北京天安门", text=True)  # segment a single sentence
print(text)  # 我_r 爱_v 北京_ns 天安门_ns

Code Sample 2

thu1 = thulac.thulac(seg_only=True)  # segmentation only, no part-of-speech tagging
thu1.cut_f("input.txt", "output.txt")  # segment the contents of input.txt and write the result to output.txt

Project address:

https://github.com/thunlp/THULAC-Python

At present I am still using jieba ("stuttering") segmentation myself, combined with a user-defined dictionary to handle common internet slang; a small sketch of that follows.
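As a rough illustration of that setup, here is a minimal sketch; "绝绝子" stands in for any recent slang term that the default dictionary may split apart, and the dictionary path is made up:

import jieba

print("/".join(jieba.cut("这家店的蛋糕真是绝绝子")))  # the slang term may get split into single characters

jieba.add_word("绝绝子")                # register the word at runtime...
# jieba.load_userdict("user_dict.txt")  # ...or load a user dictionary file: one "word [frequency] [POS]" entry per line (hypothetical path)

print("/".join(jieba.cut("这家店的蛋糕真是绝绝子")))  # the slang term should now stay in one piece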

Which word segmentation tools are you using? Leave a comment and share.

That is a brief look at four Python word segmentation tools and which one is better to use. For more on Python word segmentation tools, please check out the other related articles on this site!

