The jieba Library in Python

  • 2021-12-13 08:55:40
  • OfStack

Contents

  • 1. jieba Library Installation
  • 2. Introduction to jieba Library Functions
  • 3. Cases
  • 3.1. Accurate Mode
  • 3.2. Full Mode
  • 3.3. Search Engine Mode
  • 3.4. Modifying the Dictionary
  • 3.5. Part-of-Speech Tagging
  • 3.6. Counting Character Appearances in Romance of the Three Kingdoms

jieba is an excellent third-party library for Chinese word segmentation. Chinese text is written without spaces between words, so individual words have to be obtained through word segmentation.

1. jieba Library Installation

Open a cmd window as an administrator and run the command: pip install jieba
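After installation, a quick check from Python can confirm that the package is visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name is_installed is made up for this illustration and is not part of jieba.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Reports whether the interpreter running this script can see jieba.
print("jieba installed:", is_installed("jieba"))
```

If this prints False right after a successful pip install, pip probably installed the package for a different Python interpreter than the one running the script.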

2. Introduction to jieba Library Functions

Features:

  • Three word segmentation modes are supported:
      • Accurate mode: tries to cut the sentence as precisely as possible; suitable for text analysis.
      • Full mode: scans out every fragment of the sentence that can form a word; very fast, but cannot resolve ambiguity.
      • Search engine mode: building on accurate mode, splits long words again to improve recall; suitable for search engine word segmentation.
  • Traditional Chinese word segmentation is supported.
  • Custom dictionaries are supported.

Word segmentation functions:

  • jieba.cut and jieba.lcut take two main parameters: the first is the string to segment, and cut_all controls whether full mode is used (the default, cut_all=False, is accurate mode).
  • jieba.cut returns an iterable generator; jieba.lcut returns the same result as a list.
  • jieba.cut_for_search and jieba.lcut_for_search take one main parameter, the string to segment. They are suited to building the inverted index of a search engine, because their segmentation granularity is fine. jieba.lcut_for_search returns a list.
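The difference between the three modes can be illustrated with a toy sketch. This is not jieba's implementation: the tiny vocabulary TOY_DICT is an assumption made for the example, and forward maximum matching stands in for jieba's accurate mode, which chooses its split differently. The search-mode function only mirrors the idea of re-splitting long words into shorter dictionary words.

```python
# Toy vocabulary (an assumption for illustration; jieba ships a far larger dictionary).
TOY_DICT = {"中华", "华人", "人民", "共和", "共和国", "中华人民共和国",
            "一个", "伟大", "国家"}
MAX_LEN = max(len(w) for w in TOY_DICT)


def full_mode(sentence):
    """List every multi-character dictionary word found at any position (overlaps allowed)."""
    found = []
    for start in range(len(sentence)):
        for end in range(start + 2, min(len(sentence), start + MAX_LEN) + 1):
            if sentence[start:end] in TOY_DICT:
                found.append(sentence[start:end])
    return found


def accurate_mode(sentence):
    """Forward maximum matching: one non-overlapping split, longest dictionary word first."""
    words, pos = [], 0
    while pos < len(sentence):
        for size in range(min(MAX_LEN, len(sentence) - pos), 0, -1):
            piece = sentence[pos:pos + size]
            if size == 1 or piece in TOY_DICT:  # fall back to a single character
                words.append(piece)
                pos += size
                break
    return words


def search_mode(sentence):
    """Emit shorter dictionary words found inside each long word, then the word itself."""
    result = []
    for word in accurate_mode(sentence):
        for n in (2, 3):
            if len(word) > n:
                for i in range(len(word) - n + 1):
                    if word[i:i + n] in TOY_DICT:
                        result.append(word[i:i + n])
        result.append(word)
    return result


# Sample sentence: "The People's Republic of China is a great country"
sentence = "中华人民共和国是一个伟大的国家"
print("Full:    ", "/".join(full_mode(sentence)))
print("Accurate:", "/".join(accurate_mode(sentence)))
print("Search:  ", "/".join(search_mode(sentence)))
```

Full mode returns overlapping candidates, accurate mode returns exactly one split of the sentence, and search mode returns that split plus the shorter words hidden inside its long words, which is why it improves recall when building an index.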
Adding a custom dictionary:

  • Developers can specify their own dictionary to cover words that are missing from jieba's built-in dictionary. Although jieba can recognize new words on its own, adding them explicitly gives higher accuracy.
  • To load a custom dictionary file: jieba.load_userdict(file_name)  # file_name is the path to the custom dictionary
  • To modify the dictionary dynamically in a program:
      jieba.add_word(new_words)  # new_words is the word to add
      jieba.del_word(words)  # delete a word

Keyword extraction:

  • jieba.analyse.extract_tags(sentence, topK)  # requires import jieba.analyse first
  • sentence is the text to extract keywords from.
  • topK is the number of keywords with the highest TF-IDF weights to return; the default is 20.

Part-of-speech tagging:

  • jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom part-of-speech tokenizer; the tokenizer parameter specifies the jieba.Tokenizer used internally.
  • jieba.posseg.dt is the default part-of-speech tokenizer.
  • Each word in the segmented sentence is tagged with its part of speech, using a tag set compatible with ictclas.
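To make the TF-IDF weighting behind keyword extraction more concrete, here is a minimal standard-library sketch. It is not jieba's code: the function name top_keywords, the IDF values, and the fallback default_idf are all assumptions for illustration, and the input is a list of words that has already been segmented.

```python
from collections import Counter


def top_keywords(words, idf, top_k=20, default_idf=1.0):
    """Rank words by term frequency times inverse document frequency."""
    counts = Counter(w for w in words if len(w) > 1)  # ignore single characters
    total = sum(counts.values())
    scores = {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
    # Highest score first; ties are broken alphabetically to keep the order stable.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Already-segmented words and a hypothetical IDF table (values made up for the example).
words = ["自然语言", "处理", "是", "人工智能", "的", "分支", "自然语言", "处理", "很", "有趣"]
idf = {"自然语言": 3.0, "处理": 1.5, "人工智能": 2.5, "分支": 2.0, "有趣": 2.0}

print(top_keywords(words, idf, top_k=3))
```

A word scores highly when it is frequent in the text (high TF) and rare in general usage (high IDF), which is why common function words seldom appear among the top keywords.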

3. Cases

3.1. Accurate Mode

import jieba
# Sample sentence: "The People's Republic of China is a great country"
list1 = jieba.lcut("中华人民共和国是一个伟大的国家")
print("Accurate mode: " + "/".join(list1))

3.2. Full mode

list2 = jieba.lcut("中华人民共和国是一个伟大的国家", cut_all=True)
print("Full mode: " + "/".join(list2))

3.3. Search engine mode

list3 = jieba.lcut_for_search("中华人民共和国是一个伟大的国家")
print("Search engine mode: " + "  ".join(list3))

3.4. Modifying the Dictionary

import jieba
# Sample text: "CITIC Jiantou Investment Company invested in a game, and CITIC also invested in a game company"
text = "中信建投投资公司投资了一款游戏，中信也投资了一个游戏公司"
word = jieba.lcut(text)
print(word)     # Segmentation before modifying the dictionary
# Add words
jieba.add_word("中信建投")
jieba.add_word("投资公司")
word1 = jieba.lcut(text)
print(word1)    # Segmentation after adding the words
# Delete a word
jieba.del_word("中信建投")
word2 = jieba.lcut(text)
print(word2)    # Segmentation after deleting the word
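Instead of calling jieba.add_word repeatedly, the same words can be kept in a custom dictionary file for jieba.load_userdict. As described in jieba's README, such a file holds one entry per line: the word, an optional frequency, and an optional part-of-speech tag, separated by spaces. The sketch below only writes such a file with the standard library; the helper write_user_dict, the frequencies, and the tags are assumptions for illustration, and the jieba call is left as a comment so the sketch runs without jieba installed.

```python
import os
import tempfile


def write_user_dict(entries, path):
    """Write (word, freq, tag) entries, one per line; freq and tag may be None."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            parts = [word]
            if freq is not None:
                parts.append(str(freq))
            if tag is not None:
                parts.append(tag)
            f.write(" ".join(parts) + "\n")


# Illustrative entries; the frequencies and the "nt" tag are made-up example values.
entries = [("中信建投", 10, "nt"), ("投资公司", 5, None), ("游戏公司", None, None)]
dict_path = os.path.join(tempfile.mkdtemp(), "userdict.txt")
write_user_dict(entries, dict_path)

with open(dict_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)

# With jieba installed, the file could then be loaded with:
# import jieba
# jieba.load_userdict(dict_path)
```

A file is easier to maintain than a series of add_word calls once the list of domain terms grows beyond a handful.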

3.5. Part-of-speech tagging

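import jieba.posseg as pseg
# Sample sentence: "I love Tiananmen Square in Beijing"
words = pseg.cut("我爱北京天安门")
for i in words:
    print(i.word, i.flag)    # Each item carries a word and its part-of-speech flag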
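Once words carry part-of-speech flags, a common next step is to group or filter them by flag. The sketch below uses plain (word, flag) tuples as stand-ins for the objects yielded by pseg.cut, which expose the same information as .word and .flag. The sample tags follow the example in jieba's README for the sentence "I love Tiananmen Square in Beijing" (r: pronoun, v: verb, ns: place name); actual output may differ between versions. The helper group_by_flag is a hypothetical name.

```python
from collections import defaultdict

# Stand-in for the (word, flag) pairs produced by part-of-speech tagging.
tagged = [("我", "r"), ("爱", "v"), ("北京", "ns"), ("天安门", "ns")]


def group_by_flag(pairs):
    """Collect words under their part-of-speech flag, preserving order."""
    groups = defaultdict(list)
    for word, flag in pairs:
        groups[flag].append(word)
    return dict(groups)


groups = group_by_flag(tagged)
place_names = groups.get("ns", [])  # keep only place names
print(groups)
print(place_names)
```

With real jieba output, the same function could be fed with [(p.word, p.flag) for p in pseg.cut(text)].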

3.6. Counting Character Appearances in Romance of the Three Kingdoms

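First download the text of Romance of the Three Kingdoms and save it as a UTF-8 encoded text file. The code below reads it from the given file path: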

import jieba
txt = open("File path", "r", encoding='utf-8').read()    # Open and read the file
words = jieba.lcut(txt)     # Segment the text in accurate mode
counts = {}     # Store each word and its number of occurrences as key-value pairs
for word in words:
    if len(word) == 1:    # Single-character words are not counted
        continue
    else:
        counts[word] = counts.get(word, 0) + 1    # Each time a word appears, add 1 to its count
items = list(counts.items())     # Convert the key-value pairs into a list
items.sort(key=lambda x: x[1], reverse=True)    # Sort words by number of occurrences, from most to least
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
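The counting and sorting steps can also be written with collections.Counter from the standard library. The following is a minimal sketch on a short hand-made word list standing in for the output of jieba.lcut; the function name top_words and the sample data are assumptions for illustration.

```python
from collections import Counter


def top_words(words, n=15):
    """Count words longer than one character and return the n most frequent."""
    counts = Counter(w for w in words if len(w) > 1)
    return counts.most_common(n)


# Stand-in for a segmented text (character names plus some single characters).
sample = ["曹操", "曰", "孔明", "曹操", "的", "刘备", "孔明", "曹操"]
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Unlike indexing a sorted list with range(15), most_common(n) simply returns fewer items when the text contains fewer than n distinct words, so short inputs do not raise an IndexError.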

The first version also counts frequent words that are not character names, and it treats different names for the same character as separate words. The version below excludes such words and merges the aliases:

import jieba
# Frequent words that are not character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何"}
txt = open("三国演义.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
# Skip single characters and merge different names for the same character
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for i in excludes:
    counts.pop(i, None)    # Remove excluded words; no error if a word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
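As the list of aliases grows, a chain of elif branches becomes hard to maintain. One alternative is a lookup table that maps each alias to a canonical name. The sketch below is a hedged illustration on hand-made data: the names ALIASES, EXCLUDES, and count_characters are assumptions, the words are written in Chinese as they would appear in the source text, and the word list stands in for the output of jieba.lcut.

```python
from collections import Counter

# Alias -> canonical character name (Zhuge Liang/Kong Ming, Guan Yu, Liu Bei, Cao Cao).
ALIASES = {"诸葛亮": "孔明", "孔明曰": "孔明", "关公": "关羽", "云长": "关羽",
           "玄德": "刘备", "玄德曰": "刘备", "孟德": "曹操", "丞相": "曹操"}
# Frequent words that are not character names ("general", "it is said").
EXCLUDES = {"将军", "却说"}


def count_characters(words, aliases=ALIASES, excludes=EXCLUDES):
    """Count words, skipping single characters and excluded words, merging aliases."""
    counts = Counter()
    for word in words:
        if len(word) == 1 or word in excludes:
            continue
        counts[aliases.get(word, word)] += 1
    return counts


# Stand-in for a segmented text.
sample = ["诸葛亮", "孔明曰", "孔明", "将军", "丞相", "曹操", "曰", "玄德"]
for name, count in count_characters(sample).most_common(10):
    print("{0:<10}{1:>5}".format(name, count))
```

Adding another alias then means adding one dictionary entry rather than another branch in the loop.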
