An example of keyword extraction in Python
- 2020-08-22 22:13:31
- OfStack
Your blogger is back with another post!! ( > _ < )
Today I'll walk through a simple keyword-extraction script.
Extracting keywords from article text takes three steps:
(1) Word segmentation
(2) Stop-word removal
(3) Keyword extraction
There are many segmentation tools; here I use the popular jieba segmenter. For stop-word removal, I use a stop-word list.
The specific code is as follows:
import jieba
import jieba.analyse
# Step 1: word segmentation with jieba (default accurate mode)
text = '''News, also known as reports, is a genre commonly used by newspapers, radio stations, television stations and the Internet to record society, disseminate information and reflect the times. It is characterized by authenticity, timeliness, conciseness, readability and accuracy. News has both a broad and a narrow definition. In the broad sense, apart from commentary and special feature articles, all the written genres commonly published in newspapers, radio and television belong to the news category, including dispatches, correspondence, features and sketches (some classify sketches under features). In the narrow sense, news means dispatches: concise, rapid and timely reports of recent, newsworthy facts at home and abroad. News can also be divided into public news and gossip, among others. In structure, each news item generally comprises five parts: headline, lead, body, background and conclusion. The first three are the main part; the last two are auxiliary. The writing is mainly narrative, sometimes mixed with discussion, description and commentary.
'''
fenci_text = jieba.cut(text)
#print("/ ".join(fenci_text))
# Step 2: stop-word removal
# Read the stop-word list from a file, compare each segmented word against it,
# drop the matches, and join the remaining words into the result string.
stopwords = set(line.rstrip() for line in open('stopwords.txt', encoding='utf-8'))
final = ""
for word in fenci_text:
    if word not in stopwords:
        # also drop full-width punctuation that the stop list may miss
        if word != "。" and word != "，":
            final = final + " " + word
print(final)
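The filtering loop above boils down to set membership. Here is a minimal stand-alone sketch of the same step, using a hard-coded stop list and a pre-segmented token list (my own toy data, so it runs without jieba or a stopwords.txt file):

```python
# Toy stop-word filtering: the same logic as the loop above,
# with an in-memory stop list standing in for stopwords.txt.
stop_list = {"the", "a", "of", "and"}

# Pre-segmented tokens, standing in for jieba.cut() output.
tokens = ["news", "is", "a", "record", "of", "society", "and", "events"]

# Keep every token that is not in the stop list.
filtered = [t for t in tokens if t not in stop_list]
print(" ".join(filtered))  # news is record society events
```

Using a set (rather than a list) for the stop words makes each membership test O(1), which matters when filtering a long article against a stop list of thousands of entries.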
# Step 3: keyword extraction
a = jieba.analyse.extract_tags(text, topK = 5, withWeight = True, allowPOS = ())
print(a)
# text: the text to extract keywords from
# topK: return the topK keywords with the highest TF-IDF weights; default is 20
# withWeight: whether to return each keyword's weight along with it; default is False
# allowPOS: only include words with the specified parts of speech; default is empty, meaning no filtering
The output:
runfile('D:/Data/Text mining/xiaojieba.py', wdir='D:/Data/Text mining')
news message refers newspapers radio television Internet record society disseminate information times genre authenticity timeliness conciseness readability accuracy news concept broad narrow broad published newspapers radio television commentary special articles commonly used written genres news category including dispatch correspondence feature sketch ( sketch classified under feature ) narrow news specifically dispatch dispatch concise narrative text report domestic foreign newly happened newsworthy facts news divided public news gossip news each news structure including headline lead body background conclusion five first three main last two auxiliary writing narrative discussion description commentary
[('news', 0.4804811569680808), ('sketch', 0.2121107125313131), ('dispatch', 0.20363211136040404), ('feature', 0.20023623445272729), ('narrow', 0.16168734917858588)]
Okay, isn't that easy?