A brief overview of four Python word segmentation tools: which one is better to use?
2021-10-25 07:35:57
Hello everyone, I'm Ango!
Word segmentation is a common task in natural language processing: automatically extracting keywords from an article requires a segmentation tool, and Chinese search is likewise inseparable from word segmentation.
Python has many open-source word segmentation tools. Here are four common ones.
1. jieba segmentation
jieba ("stutter") segmentation is the most popular word segmentation tool on GitHub. It aims to be the best Python Chinese word segmentation component, supports multiple segmentation modes, and supports custom dictionaries.
GitHub stars: 26k
Code Sample
import jieba

jieba.enable_paddle()  # Paddle mode must be enabled before use (requires the paddlepaddle-tiny package)
strs = ["I came to Tsinghua University in Beijing", "The table tennis auction is over", "University of Science and Technology of China"]
for s in strs:
    seg_list = jieba.cut(s, use_paddle=True)  # Use paddle mode
    print("Paddle mode: " + "/".join(list(seg_list)))

seg_list = jieba.cut("I came to Tsinghua University in Beijing", cut_all=True)
print("Full mode: " + "/ ".join(seg_list))  # Full mode

seg_list = jieba.cut("I came to Tsinghua University in Beijing", cut_all=False)
print("Accurate mode: " + "/ ".join(seg_list))  # Accurate mode

seg_list = jieba.cut("He came to the NetEase Hangyan Building")  # The default is accurate mode
print("New word recognition: " + ", ".join(seg_list))

seg_list = jieba.cut_for_search("Xiao Ming graduated with a master's degree from the Institute of Computing, Chinese Academy of Sciences, and then pursued further study at Kyoto University, Japan")  # Search engine mode
print("Search engine mode: " + ", ".join(seg_list))
Output:
Full mode: I / Come to / Beijing / Tsinghua / Tsinghua University / Huada / University
Accurate mode: I / Come to / Beijing / Tsinghua University
New word recognition: He, Came to, le, NetEase, Hangyan, Building (here "Hangyan" is not in the dictionary, but the Viterbi algorithm still recognized it)
Search engine mode: Xiao Ming, Master's degree, Graduated, From, China, Science, Academy, Academy of Sciences, Chinese Academy of Sciences, Computing, Institute of Computing, After, At, Japan, Kyoto, University, Kyoto University Japan, Further study
Project address:
https://github.com/fxsjy/jieba
2. pkuseg segmentation
pkuseg is an open-source word segmentation tool from the Language Computing and Machine Learning Group at Peking University.
Its distinguishing feature is multi-domain segmentation: it currently provides pre-trained models for the news, web, medicine, and tourism domains as well as a mixed-domain model, and users are free to pick the one that fits their data.
Compared with general-purpose segmentation tools, its segmentation accuracy is higher.
GitHub stars: 5.4k
Code Sample
import pkuseg

seg = pkuseg.pkuseg()  # Load the model with the default (mixed-domain) configuration
text = seg.cut("python is a great language")  # Perform word segmentation
print(text)
Output
['python', 'is', 'a', 'great', 'language']
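To use one of the domain models mentioned above, pass model_name when constructing the segmenter; pkuseg downloads the model on first use. model_name="medicine" follows pkuseg's README, while the dictionary file my_dict.txt is a placeholder of mine:
import pkuseg

# Load the medicine-domain pre-trained model (downloaded automatically on first use)
seg = pkuseg.pkuseg(model_name="medicine")

# A user dictionary (one word per line) can be supplied as well
seg = pkuseg.pkuseg(model_name="medicine", user_dict="my_dict.txt")  # placeholder path

print(seg.cut("python is a great language"))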
Project address:
https://github.com/lancopku/pkuseg-python
3. FoolNLTK segmentation
FoolNLTK is trained with a BiLSTM model and is said to be possibly the most accurate open-source Chinese word segmentation tool; it also supports user-defined dictionaries.
GitHub stars: 1.6k
Code Sample
import fool

text = "A fool in Beijing"
print(fool.cut(text))
# ['A', 'fool', 'in', 'Beijing']
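Loading a user-defined dictionary follows FoolNLTK's README: each line of the dictionary file is a word and a weight, and fool.load_userdict / fool.delete_userdict manage it at runtime. my_dict.txt is a placeholder path of mine:
import fool

fool.load_userdict("my_dict.txt")  # placeholder; each line of the file is "word weight"
print(fool.cut("A fool in Beijing"))

fool.delete_userdict()  # drop the custom dictionary again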
Project address:
https://github.com/rockyzhengwu/FoolNLTK
4. THULAC
THULAC is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University.
Besides segmentation it performs part-of-speech tagging, so it can tell whether a word is a noun, a verb, an adjective, and so on.
GitHub stars: 1.5k
Code Sample
import thulac

thu1 = thulac.thulac()  # Default mode: segmentation plus part-of-speech tagging
text = thu1.cut("I love Tiananmen Square in Beijing", text=True)  # Segment a single sentence, returning a string
print(text)  # I_r Love_v Beijing_ns Tiananmen Square_ns
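Without text=True, cut returns a list of [word, tag] pairs, which makes filtering by part of speech straightforward. A minimal sketch under that assumption (tags beginning with "n" mark nouns):
import thulac

thu1 = thulac.thulac()
pairs = thu1.cut("I love Tiananmen Square in Beijing")  # [[word, tag], ...]
nouns = [word for word, tag in pairs if tag.startswith("n")]  # keep noun-like words (n, ns, ...)
print(nouns)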
Code Sample 2
thu1 = thulac.thulac(seg_only=True)  # Segmentation only, no part-of-speech tagging
thu1.cut_f("input.txt", "output.txt")  # Segment the contents of input.txt and write the result to output.txt
Project address:
https://github.com/thunlp/THULAC-Python
I myself am still using jieba, paired with a user-defined dictionary (see the sketch in section 1) to handle common internet slang.
Which word segmentation tool are you using? Please leave a comment.
That is a brief overview of four Python word segmentation tools and which one is better to use. For more on Python word segmentation tools, please check the other related articles on this site!