An example of keyword extraction in Python

2020-08-22 22:13:31
OfStack

This newbie is back to write another blog post!! Nobody's unhappy about that, right? ~~( > _ < )~~

Today I'm going to write a simple keyword extraction script.

Extracting keywords from an article takes three steps:

(1) Word segmentation

(2) Stop word removal

(3) Keyword extraction

There are many ways to segment words; here I chose the commonly used jieba ("stutter") segmenter. To remove stop words, I used a stop word list.
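
Before the full script, here is a minimal sketch of the stop word step on its own, using only the standard library. It assumes the stop word list stores one word per line; the helper names `load_stopwords` and `remove_stopwords` and the in-memory "file" are made up for illustration and are not part of jieba.

```python
import io

def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines (one word per line)."""
    return {line.strip() for line in lines if line.strip()}

def remove_stopwords(tokens, stopwords):
    """Keep tokens that are not blank and not in the stop word set."""
    return [t for t in tokens if t.strip() and t not in stopwords]

# Hypothetical stand-in for a stop word file such as stopwords.txt
fake_file = io.StringIO("the\nof\n,\n")
stopwords = load_stopwords(fake_file)

# Tokens as a segmenter might return them (made-up sample)
tokens = ["the", "characteristics", "of", "news", ",", " ", "timeliness"]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

A set is used here because membership checks stay fast even when the stop word list has thousands of entries.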

The full code is as follows:


import jieba
import jieba.analyse
# Step 1: word segmentation with jieba. Note that jieba.cut defaults to accurate mode; full mode requires cut_all=True
text = ''' News, also known as a news report, is a style of writing commonly used by newspapers, radio stations, television stations and the Internet to record society, disseminate information and reflect the times. It has the characteristics of authenticity, timeliness, conciseness, readability and accuracy. There are broad and narrow definitions of news. In the broad sense, apart from commentaries and special articles published in newspapers, radio and television, all commonly used texts belong to the news category, including news reports, correspondence, features, sketches (some include sketches in features), and so on. In the narrow sense, news refers specifically to news reports, which report recent and valuable facts at home and abroad rapidly and promptly, in a summarizing narrative style and concise wording. News can also be divided into public news and gossip, etc. In structure, each news story generally includes five parts: title, lead, body, background and conclusion. The first three are the main parts and the last two are auxiliary. The writing style is mainly narrative, and sometimes also includes discussion, description, comment and so on. 
'''
fenci_text = jieba.cut(text)
#print("/ ".join(fenci_text))
# Step 2: remove stop words
# The general approach: one file holds the text to process and another holds the stop word list;
# each word is compared against the list and dropped if it matches.
# Here the text is defined inline above and the result is simply printed.
with open('stopwords.txt', encoding='utf-8') as f:  # assumes the list is UTF-8, one word per line
  stopwords = {line.strip() for line in f}
final = ""
for word in fenci_text:
  # Skip stop words, plus full-width and ASCII full stops
  if word not in stopwords and word not in ("。", "."):
    final = final + " " + word
print(final)
# Step 3: extract keywords
a = jieba.analyse.extract_tags(text, topK=5, withWeight=True, allowPOS=())
print(a)
# text: the text to extract keywords from
# topK: how many keywords with the highest TF-IDF weights to return; the default is 20
# withWeight: whether to also return each keyword's weight; the default is False
# allowPOS: only include words with the specified parts of speech; the default is empty, meaning no filtering

Output:


runfile('D:/Data/ Text mining /xiaojieba.py', wdir='D:/Data/ Text mining ')
  news   The message   Refers to the   The newspaper   ,   radio   ,   television   ,   The Internet   record   social   ,   spread   information   ,   era  1 Kind of   style   authenticity   ,   timeliness   ,   simplicity   ,   readability   ,   accuracy   news   concept   The generalized   A narrow   The points of   The generalized   published   Newspapers and periodicals   ,   radio   ,   TV   comments   panel   outside   The commonly used   The text   news   column   including   The message   ,   communication   ,   A close-up   ,   sketch   (   sketch   Included in the   A close-up   column   )   A narrow   news   Specifically to   The message   The message   summary   The narrative   way   concise   The text   reports   At home and abroad   The newly   happen   ,   The value of   FACTS   news   points   The public   news   trail   news   Each is   news   On the structure   including   The title   ,   Introduction:   ,   The main body   ,   background   conclusion  5  before  3 those  2 those   auxiliary   writing   The narrative   both   Talk about   ,   description   ,   comments  
[(' news ', 0.4804811569680808), (' sketch ', 0.2121107125313131), (' The message ', 0.20363211136040404), (' A close-up ', 0.20023623445272729), (' A narrow ', 0.16168734917858588)]
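
The weights in that list are TF-IDF scores. As a rough sketch of the idea (not jieba's actual code), a word's score is its frequency within the text multiplied by an inverse document frequency looked up from a table. The function name `tfidf_top_k`, the `default_idf` fallback, and the IDF values below are all made up for illustration; jieba uses its own built-in IDF table.

```python
from collections import Counter

def tfidf_top_k(tokens, idf, top_k=5, default_idf=1.0):
    """Rank words by term frequency * inverse document frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Term frequency is the word's share of all tokens in the text
    scores = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    # Highest score first, then keep only the top_k entries
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Made-up tokens and IDF values, for illustration only
tokens = ["news", "news", "news", "sketch", "report", "report"]
toy_idf = {"news": 2.0, "sketch": 3.0, "report": 1.0}

top = tfidf_top_k(tokens, toy_idf, top_k=2)
print(top)
```

The result has the same shape as the list printed above with withWeight = True: (word, weight) pairs sorted from the highest weight to the lowest. In this toy data, "sketch" appears only once but still outranks "report" because its IDF is higher, which is the same reason a rare word can score well in the real output.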

Okay, isn't that easy?