Python full-text search engine details

  • 2020-05-30 20:33:00
  • OfStack


Recently I have been exploring how to implement a keyword search feature in Python, similar to what Baidu offers. Keyword retrieval immediately brings regular expressions to mind: they are the basis of most retrieval, and Python ships the re module dedicated to regular matching. However, regular expressions alone are not enough to build a usable search feature.
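As a baseline, keyword matching with the stdlib re module looks like this (the sample documents are made up for illustration):

```python
import re

# A few sample "documents" (made up for illustration).
docs = {
    "intro.txt": "Whoosh is a pure Python search library.",
    "notes.txt": "Regular expressions match patterns, not ranked documents.",
}

# Naive keyword retrieval: scan every document for the keyword.
keyword = "Python"
pattern = re.compile(re.escape(keyword), re.IGNORECASE)
hits = [name for name, text in docs.items() if pattern.search(text)]
print(hits)  # only intro.txt mentions "Python"
```

This rescans every document on every query and gives no ranking or highlighting, which is why a dedicated search library is worth the effort.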

Python has a package called Whoosh that is dedicated to full-text search.

Whoosh is not widely used in China, and its performance is not as mature as Sphinx/Coreseek, but unlike those it is a pure-Python library, which makes it far more convenient for Python developers. The code is as follows.

Installation

From the command line, run pip install whoosh. The code below also uses the jieba segmenter, installed the same way: pip install jieba.

Packages to import:

  from whoosh.index import create_in, open_dir
  from whoosh.fields import *
  from whoosh.analysis import RegexAnalyzer
  from whoosh.analysis import Tokenizer, Token
  from whoosh.compat import text_type
  import jieba
  import os

Chinese word segmentation analyzer

  class ChineseTokenizer(Tokenizer):
      """Tokenizer that segments Chinese text with jieba."""
      def __call__(self, value, positions=False, chars=False,
                   keeporiginal=True, removestops=True, start_pos=0, start_char=0,
                   mode='', **kwargs):
          assert isinstance(value, text_type), "%r is not unicode" % value
          t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
          # cut_for_search segments text the same way search queries are cut
          for w in jieba.cut_for_search(value):
              t.original = t.text = w
              t.boost = 0.5
              if positions:
                  t.pos = start_pos + value.find(w)
              if chars:
                  t.startchar = start_char + value.find(w)
                  t.endchar = start_char + value.find(w) + len(w)
              yield t

  def chinese_analyzer():
      return ChineseTokenizer()
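One caveat in the tokenizer above: value.find(w) always returns the first occurrence, so if the segmenter yields the same word twice, both tokens get the first occurrence's character offsets. Tracking a moving search cursor fixes this; here is a pure-Python sketch, with a hand-written token list standing in for jieba's output:

```python
# Sketch: compute character offsets with a moving cursor so that
# repeated tokens do not all map to the first occurrence.
value = "good code good docs"
tokens = ["good", "code", "good", "docs"]  # stand-in for jieba.cut_for_search

offsets = []
cursor = 0  # start each search here so repeated words advance
for w in tokens:
    start = value.find(w, cursor)
    offsets.append((w, start, start + len(w)))
    cursor = start + 1  # advance just past the start, so overlapping tokens are still found

print(offsets)
```

Advancing the cursor by one (rather than past the whole word) matters because cut_for_search can emit overlapping sub-words that begin inside a longer token.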

The function that builds the index

  def create_index(document_dir):
      analyzer = chinese_analyzer()
      # "titel" is the field name used throughout this article's schema
      schema = Schema(titel=TEXT(stored=True, analyzer=analyzer),
                      path=ID(stored=True),
                      content=TEXT(stored=True, analyzer=analyzer))
      ix = create_in("./", schema)
      writer = ix.writer()
      for parents, dirnames, filenames in os.walk(document_dir):
          for filename in filenames:
              title = filename.replace(".txt", "")
              print(title)
              filepath = os.path.join(parents, filename)
              with open(filepath, 'r', encoding='utf-8') as f:
                  content = f.read()
              writer.add_document(titel=title, path=filepath, content=content)
      writer.commit()  # nothing is written to the index until commit()
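What create_index produces is, conceptually, an inverted index: a map from each term to the documents that contain it. A minimal in-memory sketch in plain Python (whitespace tokenization stands in for the Chinese analyzer, and the corpus is made up; no Whoosh involved):

```python
from collections import defaultdict

# Sample corpus; titles and texts are made up for illustration.
docs = {
    "whoosh": "pure python search engine library",
    "jieba": "chinese word segmentation for python",
}

# Build the inverted index: term -> set of document titles.
index = defaultdict(set)
for title, text in docs.items():
    for term in text.split():
        index[term].add(title)

# A query is now a dictionary lookup instead of a corpus scan.
print(sorted(index["python"]))  # both documents mention "python"
```

Whoosh persists this kind of structure to disk, which is why queries do not have to re-read the original files.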

Retrieval function

  def search(search_str):
      title_list = []
      ix = open_dir("./")
      with ix.searcher() as searcher:
          # find() parses search_str against the "content" field and runs the query
          results = searcher.find("content", search_str)
          for hit in results:
              print(hit['titel'])
              print(hit.score)
              print(hit.highlights("content", top=10))
              title_list.append(hit['titel'])
      return title_list

Thank you for reading. I hope this article helps you, and thank you for supporting this site!
