An in-depth look at automatically extracting and generating article topic keywords in PHP

  • 2020-06-03 06:02:53
  • OfStack

In the programs I wrote before, I always sidestepped this problem: whatever tags were needed, the user had to type them in. For lazy users, and for the sake of a better experience, it would be nicer to generate keywords and fetch article tags automatically. A new project needed exactly this, so I spent a night digging into how such a feature works.
There are roughly three steps to automatic keyword extraction:
1. Split the title and content with a word segmentation algorithm to extract candidate words and their frequencies. The two mainstream approaches are ICTCLAS from the Chinese Academy of Sciences and Hidden Markov models. Both are fairly advanced, have a real entry barrier, and support only C++/Java. For PHP there are currently two recommended options, PSCWS and HTTPCWS. SCWS released version 1.0.0 on March 8, 2008, and the latest version is now 1.0.4; PSCWS is its PHP version. HTTPCWS, formerly known as PHPCWS, was developed by Zhang. PHPCWS first calls the API of the "ICTCLAS 3.0 shared-edition Chinese word segmentation algorithm" for an initial segmentation, then applies its own "reverse maximum matching algorithm" to segment and merge words, and finally filters punctuation to produce the segmentation result. Unfortunately, it currently supports only Linux and has not been ported to Windows.
2. Compare the extracted results against an existing thesaurus and remove the useless words, leaving the keywords that best fit the rules. The thesaurus is what matters most here. We can define our own or use an existing, mature one. The Sina and NetEase blogs, for example, have this feature, and as large sites they presumably have a good thesaurus. As a small-time programmer I cannot get hold of any authoritative thesaurus, so I have to start from existing open source programs and look at theirs.
3. From the processed results, select the words that best match the current content as the final keywords. This stage has to be handled case by case. Most current PHP-based CMSs have their own keyword extraction system.
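To make the "reverse maximum matching" idea from step 1 more concrete, here is a minimal sketch. It is not the PHPCWS implementation: the function name `reverseMaxMatch`, the tiny dictionary and the `$maxLen` limit are all assumptions made for illustration. The scan starts at the end of the text and always tries the longest dictionary word that ends at the current position.

```php
<?php
// Minimal sketch of reverse (backward) maximum matching segmentation.
// $dict and $maxLen are illustrative assumptions, not PHPCWS internals.
function reverseMaxMatch(string $text, array $dict, int $maxLen = 4): array
{
    // Split into UTF-8 characters without requiring the mbstring extension.
    $chars  = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $lookup = array_flip($dict);
    $words  = [];
    $end    = count($chars);

    while ($end > 0) {
        $matched = false;
        // Try the longest candidate ending at $end first, then shorter ones.
        for ($len = min($maxLen, $end); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $end - $len, $len));
            if (isset($lookup[$candidate])) {
                array_unshift($words, $candidate);
                $end -= $len;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            // No dictionary word ends here: emit the single character.
            array_unshift($words, $chars[$end - 1]);
            $end--;
        }
    }
    return $words;
}

$dict = ['研究', '研究生', '生命', '起源'];
print_r(reverseMaxMatch('研究生命起源', $dict));
// Segments as: 研究 / 生命 / 起源
```

With this dictionary, a forward scan would greedily take 研究生 first and leave 命 stranded, which is one reason the backward variant is often preferred for Chinese text.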
At present, the most widely circulated code on the web is the DEDECMS word segmentation source. I tested it and found it rather crude, with poor results. It first sets a total keyword length, which determines how many keywords are returned, and then picks words: it treats every word in the title as a required keyword, then reads further keywords from the body until the configured length is reached, and that is the final keyword list. Meaningless words such as "we" are not removed and show up in the results far too often. Sometimes even HTML is extracted as a keyword, so it needs improvement. As an auxiliary tool, though, it is decent. discuz does a little better, but discuz does not provide source code, only an online API.
There are several versions of the dede segmenter, and the latest is the best. Below, the word frequencies produced by dede 5.7 are compared with the results of the discuz API.
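One way to keep HTML out of the keyword list is to clean the text before it ever reaches the segmenter. The sketch below is only an assumption about how that could be done, not what DEDECMS does: `cleanForSegmentation` and `wordFrequencies` are hypothetical helpers, and the tokenizer is a simple stand-in that splits on non-letter characters (real Chinese text would go through a segmenter such as PSCWS instead).

```php
<?php
// Hypothetical helper: strip markup so tag names never reach the segmenter.
function cleanForSegmentation(string $html): string
{
    // Drop script/style bodies first; strip_tags() alone would keep their text.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Put a space before each tag so adjacent blocks do not fuse into one word.
    $text = strip_tags(str_replace('<', ' <', $html));
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Stand-in tokenizer (assumption): splits on anything that is not a letter,
// a digit or a dot, then counts how often each token occurs.
function wordFrequencies(string $text): array
{
    $tokens = preg_split('/[^\p{L}\p{N}.]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    // Remove sentence-final dots but keep version numbers such as "2.0".
    $tokens = array_filter(array_map(fn($t) => trim($t, '.'), $tokens), 'strlen');
    $freq = array_count_values($tokens);
    arsort($freq); // most frequent first
    return $freq;
}

$html = '<p>ThinkPHP 2.0 &amp; ThinkPHP 3.0</p><script>var html = 1;</script>';
$freq = wordFrequencies(cleanForSegmentation($html));
print_r($freq);
// "thinkphp" is counted twice; "p", "script" and "html" never appear.
```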
Test examples:
$title = "THINKPHP will soon stop supporting version 2.0";
$body = "To better develop, maintain and support the ThinkPHP framework, the official team has announced that maintenance and support for version 2.0 and earlier will stop from May 1, 2012. To save energy and reduce carbon emissions, the corresponding versions and documents will also be removed from the official website's downloads.
In memory of the years we spent developing with that version of ThinkPHP!
About ThinkPHP 2.0
ThinkPHP was born in 2006 and is committed to rapid WEB application development. Version 2.0, released on October 1, 2009, was a complete refactoring of and leap beyond the earlier 1.* versions. It was a landmark release at the time: it laid the foundation for the new versions and brought in more users and websites. With the framework updating quickly and versions 2.1, 2.2 and 3.0 released, the ThinkPHP 3.0 era has arrived and the life cycle of 2.0 is coming to an end. Still, most features of 2.0 have been carried over or improved in 2.1, and upgrading from 2.0 to 2.1 or 2.2 is relatively easy. Version 2.2 is the final release of the 2.* series; it will receive no new features, only bug fixes.";
1. dede word segmentation
The results, sorted by frequency, are as follows
Title Array (
[THINKPHP] => 1
[Official] => 1
[soon] => 1
[Stop] => 1
[to] => 1
[2.0] => 1
[Version] => 1
[the] => 1
[Support] => 1
)
Content Array (
[Version] => 12
[the] => 12
[and] => 8
[ThinkPHP] => 5
[2.0] => 5
[also] => 3
[2.2] => 3
[2.1] => 3
[Development] => 3
[3.0] => 2
[is] => 2
[Fast] => 2
[to] => 2
[Publish] => 2
[Maintenance] => 2
[before] => 2
[a] => 2
[New] => 2
[Support] => 2
[Frame] => 2
[Meanwhile] => 2
[from] => 2
)
How do we extract the final keywords from this? My preliminary idea is to remove words such as "of" and "some", then walk through the content words in sorted order and check one by one whether each also appears in the title. Words that do are kept, until a fixed number of them has been collected as the final keywords. From the results above, we get
Version thinkphp 2.0 support stopped
five keywords. The result seems acceptable.
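The selection idea above can be sketched roughly as follows. This is only my reading of it, not finished code: `pickKeywords`, the stop-word list and the limit of five are assumptions, matching is case-insensitive, and the content frequencies are expected to be sorted in descending order already, as in the listing.

```php
<?php
// Sketch of the selection step: skip stop words, walk the content words in
// their given (descending-frequency) order, keep those also found in the title.
// The stop-word list and $limit are illustrative assumptions.
function pickKeywords(array $titleFreq, array $contentFreq, array $stopWords, int $limit = 5): array
{
    $titleWords = [];
    foreach (array_keys($titleFreq) as $word) {
        $titleWords[strtolower((string) $word)] = true;
    }
    $stop   = array_flip(array_map('strtolower', $stopWords));
    $picked = [];

    foreach ($contentFreq as $word => $count) {
        $key = strtolower((string) $word);
        if (isset($stop[$key]) || !isset($titleWords[$key]) || isset($picked[$key])) {
            continue; // stop word, not in the title, or already chosen
        }
        $picked[$key] = true;
        if (count($picked) >= $limit) {
            break;
        }
    }
    // Array keys that look like integers are cast by PHP; return strings.
    return array_map('strval', array_keys($picked));
}

$titleFreq = ['THINKPHP' => 1, 'Official' => 1, 'soon' => 1, 'Stop' => 1, 'to' => 1,
              '2.0' => 1, 'Version' => 1, 'the' => 1, 'Support' => 1];
$contentFreq = ['Version' => 12, 'the' => 12, 'and' => 8, 'ThinkPHP' => 5, '2.0' => 5,
                'also' => 3, '2.2' => 3, '2.1' => 3, 'Development' => 3, '3.0' => 2,
                'is' => 2, 'Fast' => 2, 'to' => 2, 'Publish' => 2, 'Maintenance' => 2,
                'before' => 2, 'a' => 2, 'New' => 2, 'Support' => 2, 'Frame' => 2,
                'Meanwhile' => 2, 'from' => 2];
$stopWords = ['the', 'and', 'to', 'a', 'is', 'also', 'of', 'some'];

$keywords = pickKeywords($titleFreq, $contentFreq, $stopWords);
print_r($keywords);
// → version, thinkphp, 2.0, support
```

Run against the frequencies listed above, the sketch returns four words rather than five, because "Stop" does not occur in the content listing shown here.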
2. discuz: calling the API returns an XML document, and parsing it gives the keywords
Of, fast, version upgrade, development, user
Five words, and the first one is "of"...
Comparing the two methods, dede plus post-processing stays close to the content of the document and should be slightly better than discuz. discuz drifts away from the topic of the article, although the words it selects do have a certain popularity.
