Python full-text search engine details
- 2020-05-30 20:33:00
- OfStack
Recently I have been exploring how to implement a keyword search feature in Python, similar to what Baidu offers. When it comes to keyword retrieval, regular expressions naturally come to mind: they are the basis of most text matching, and Python ships a dedicated re module for them. However, regular expressions alone are not well suited to building a retrieval feature, which also needs word segmentation, indexing, and relevance ranking.
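As a point of comparison, plain regular expressions can only answer "does this document contain the keyword", with no segmentation, ranking, or index. A minimal stdlib sketch (the regex_search helper and sample documents are made up for illustration):

```python
import re

def regex_search(keyword, documents):
    """Return the documents that contain the keyword, via a plain regex match."""
    pattern = re.compile(re.escape(keyword))
    return [doc for doc in documents if pattern.search(doc)]

docs = ["whoosh is a pure python search library",
        "sphinx is written in C++",
        "python regular expressions live in the re module"]
print(regex_search("python", docs))
# prints the first and third documents
```

Every matching document is returned with equal weight, which is exactly the limitation a full-text engine like whoosh addresses.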
Python has a package called whoosh that is dedicated to full-text search.
whoosh is not widely used in China, and it is not as mature or as fast as sphinx/coreseek, but unlike those engines it is a pure Python library, which makes it much more convenient for Python developers. The code is as follows.
Installation
Run pip install whoosh on the command line.
Packages to import include:
import os
import jieba                                   # Chinese word segmentation
from whoosh.index import create_in, open_dir
from whoosh.fields import *
from whoosh.analysis import RegexAnalyzer
from whoosh.analysis import Tokenizer, Token
from whoosh.compat import text_type            # used by the tokenizer below
Chinese word segmentation parser
class ChineseTokenizer(Tokenizer):
    """Tokenizer that segments Chinese text with jieba's search mode."""
    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=True, removestops=True, start_pos=0, start_char=0,
                 mode='', **kwargs):
        assert isinstance(value, text_type), "%r is not unicode" % value
        t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
        for w in jieba.cut_for_search(value):
            t.original = t.text = w
            t.boost = 0.5
            if positions:
                # note: value.find() returns the first occurrence, so
                # repeated words share the same offset
                t.pos = start_pos + value.find(w)
            if chars:
                t.startchar = start_char + value.find(w)
                t.endchar = start_char + value.find(w) + len(w)
            yield t

def chinese_analyzer():
    return ChineseTokenizer()
The function that builds the index
# (a static method of a helper class in the original; shown here as a function)
def create_index(document_dir):
    """Build a whoosh index over all .txt files under document_dir."""
    analyzer = chinese_analyzer()
    # Note: the field name "titel" (sic) is kept as in the original code.
    schema = Schema(titel=TEXT(stored=True, analyzer=analyzer),
                    path=ID(stored=True),
                    content=TEXT(stored=True, analyzer=analyzer))
    ix = create_in("./", schema)
    writer = ix.writer()
    for parents, dirnames, filenames in os.walk(document_dir):
        for filename in filenames:
            title = filename.replace(".txt", "")
            print(title)
            with open(os.path.join(document_dir, filename), 'r',
                      encoding='utf-8') as f:
                content = f.read()
            path = u"/b"
            writer.add_document(titel=title, path=path, content=content)
    writer.commit()
Retrieval function
# (a static method of a helper class in the original; shown here as a function)
def search(search_str):
    """Search the "content" field and return the matching titles."""
    title_list = []
    ix = open_dir("./")
    with ix.searcher() as searcher:
        results = searcher.find("content", search_str)
        for hit in results:
            print(hit['titel'])
            print(hit.score)
            print(hit.highlights("content", top=10))
            title_list.append(hit['titel'])
    return title_list
Thank you for reading. I hope this helps you, and thank you for your support of this site!