The Python Trie tree implements dictionary sort

  • 2020-04-02 13:29:51
  • OfStack

Common languages provide lexicographical sort apis, such as lexicographical sort when docking with the WeChat public platform. There are many algorithms for sorting by dictionary. The easiest one to think of is string search, but this is cumbersome and not very good. Trie tree is a very common tree structure, it is widely used in various aspects, such as string retrieval, Chinese word segmentation, the longest string public prefix and dictionary sort, and so on, and in the input method can also see the Trie tree figure.


What is a Trie tree

Trie trees, also known as dictionary trees, word lookup trees, or prefix trees, are a multi-tree structure for quick retrieval. The dictionary of the figure is a 10-fork tree:

< img border = 0 SRC = "/ / files.jb51.net/file_images/article/201403/201403282342282.jpg" >

Similarly, the dictionary number of lowercase or uppercase English letters is a 26-fork tree. As can be seen from the figure above, the root node of the Trie tree does not store data, and all data is stored in its child node. There are strings go, golang, PHP, python, perl, and its Trie tree can be constructed as shown in the following figure:

< img border = 0 SRC = "/ / files.jb51.net/file_images/article/201403/201403282342283.jpg" >

So let's look at the top picture. Except for the root node, each child node stores only one character. Go and golang share the go prefix, while PHP, perl, and python share only the p prefix. In order to implement lexicographical sort, the characters stored on each layer of nodes are stored in lexicographical sort (depending on how they are traversed). Let's first look at how to lexicographically sort individual characters. This article only considers lowercase letters, otherwise similar. 'a' comes before 'b', and 'a' has less ASCII than 'b', so you can get the dictionary order by subtracting their ASCII. And python has built-in lexicographical sort apis, such as:


#!/usr/bin/env python
#coding: utf8
if __name__ == '__main__':
 arr = [c for c in 'python']
 arr.sort()
 print arr

It can also be implemented using bitmap, which I introduced in a previous article: (link: #). The implementation code is as follows:


#!/usr/bin/env python
#coding: utf8
class Bitmap(object):
 def __init__(self, max):
  self.size  = self.calcElemIndex(max, True)
  self.array = [0 for i in range(self.size)]
 def calcElemIndex(self, num, up=False):
  '''up for True Is rounded up ,  Otherwise, it is rounded down '''
  if up:
   return int((num + 31 - 1) / 31) # Take up the whole 
  return num / 31
 def calcBitIndex(self, num):
  return num % 31
 def set(self, num):
  elemIndex = self.calcElemIndex(num)
  byteIndex = self.calcBitIndex(num)
  elem      = self.array[elemIndex]
  self.array[elemIndex] = elem | (1 << byteIndex)
 def clean(self, i):
  elemIndex = self.calcElemIndex(i)
  byteIndex = self.calcBitIndex(i)
  elem      = self.array[elemIndex]
  self.array[elemIndex] = elem & (~(1 << byteIndex))
 def test(self, i):
  elemIndex = self.calcElemIndex(i)
  byteIndex = self.calcBitIndex(i)
  if self.array[elemIndex] & (1 << byteIndex):
   return True
  return False
if __name__ == '__main__':
 MAX = ord('z')
 suffle_array = [c for c in 'python']
 result       = []
 bitmap = Bitmap(MAX)
 for c in suffle_array:
  bitmap.set(ord(c))

 for i in range(MAX + 1):
  if bitmap.test(i):
   result.append(chr(i))
 print ' The original array is :    %s' % suffle_array
 print ' The sorted array is : %s' % result

The bitmap sort cannot have duplicate characters. In fact, there are already many sophisticated algorithms for sorting dictionaries based on ASCII subtraction, such as insertion sort, hill sort, bubble sort and heap sort. For the sake of simplicity, this article USES Python's built-in sorted method to do single-character dictionary sorting. This is also possible if the reader implements the sorting of a single-character array, and this will allow you to customize the sorting of strings.

Implementation approach

The entire implementation consists of two classes: the Trie class and the Node class. The Node class represents the nodes in the Trie tree and is organized by the Trie class into a Trie tree. Let's start with the Node class:


#!/usr/bin/env python
#coding: utf8
class Node(object):
 def __init__(self, c=None, word=None):
  self.c          = c    #  A single character stored by a node 
  self.word       = word #  The node stores the words 
  self.childs     = []   #  Child of this node 

Node contains three member variables. C is the character stored on each node. Word means a complete word, in this case a string. Childs contains all children of this node. Since c is stored in each node, what is the use of storing word? And which node should this word be on? Let's use the same example: go and golang, for example, share the go prefix, but it's a good idea to do a string search, because it will provide the original string, just search the Trie tree by the path. However, for sorting, no input is provided, so there is no way to know where the word boundary is, and word in the Node class is the word boundary. Specifically, it is stored on the last node of the word, as shown in the figure:

< img border = 0 SRC = "/ / files.jb51.net/file_images/article/201403/201403282342284.jpg" >

The c member of the Node class can be undefined if the tree is not searched, because it is not required in the sort.

Next let's look at the definition of the Trie class:


#!/usr/bin/env python
#coding: utf8
'''Trie Tree implements string array dictionary sort '''
class Trie(object):
 def __init__(self):
  self.root  = Node() # Trie The tree root Node reference 
 def add(self, word):
  ''' Add string '''
  node = self.root
  for c in word:
   pos = self.find(node, c)
   if pos < 0:
    node.childs.append(Node(c))
    # For the sake of simplicity, use it directly here Python The built-in sorted To sort 
    #pos Question, because sort After the pos Get off , So again find To get real pos
    # Custom single character array sort way can achieve arbitrary rules of the string array sort 
    node.childs = sorted(node.childs, key=lambda child: child.c)
    pos = self.find(node, c)
   node = node.childs[pos]
  node.word = word
 def preOrder(self, node):
  ''' First order output '''
  results = []
  if node.word:
   results.append(node.word)
  for child in node.childs:
   results.extend(self.preOrder(child))
  return results
 def find(self, node, c):
  ''' Find the position where the character is inserted '''
  childs = node.childs
  _len   = len(childs)
  if _len == 0:
   return -1
  for i in range(_len):
   if childs[i].c == c:
    return i
  return -1
 def setWords(self, words):
  for word in words:
   self.add(word)

Trie contains 1 member variable and 4 methods. Root is used to refer to a root node. It does not store specific data, but it has child nodes. The setWords method is used for initialization, and the add method is called to initialize the Trie tree on a per-string basis. The add method adds each character to the child node, shares it if it exists and looks for the next child node, and so on. Find is used to find if a child node has been established to store a character, while preOrder is used to get the stored word in order first. There are three ways to traverse a tree: in order, in order, and in order, if you don't understand, you can Google to understand. Let's test it out:


#!/usr/bin/env python
#coding: utf8
'''Trie Tree implements string array dictionary sort '''
class Trie(object):
 def __init__(self):
  self.root  = Node() # Trie The tree root Node reference 
 def add(self, word):
  ''' Add string '''
  node = self.root
  for c in word:
   pos = self.find(node, c)
   if pos < 0:
    node.childs.append(Node(c))
    # For the sake of simplicity, use it directly here Python The built-in sorted To sort 
    #pos Question, because sort After the pos Get off , So again find To get real pos
    # Custom single character array sort way can achieve arbitrary rules of the string array sort 
    node.childs = sorted(node.childs, key=lambda child: child.c)
    pos = self.find(node, c)
   node = node.childs[pos]
  node.word = word
 def preOrder(self, node):
  ''' First order output '''
  results = []
  if node.word:
   results.append(node.word)
  for child in node.childs:
   results.extend(self.preOrder(child))
  return results
 def find(self, node, c):
  ''' Find the position where the character is inserted '''
  childs = node.childs
  _len   = len(childs)
  if _len == 0:
   return -1
  for i in range(_len):
   if childs[i].c == c:
    return i
  return -1
 def setWords(self, words):
  for word in words:
   self.add(word)
class Node(object):
 def __init__(self, c=None, word=None):
  self.c          = c    #  A single character stored by a node 
  self.word       = word #  The node stores the words 
  self.childs     = []   #  Child of this node 
if __name__ == '__main__':
 words = ['python', 'function', 'php', 'food', 'kiss', 'perl', 'goal', 'go', 'golang', 'easy']
 trie = Trie()
 trie.setWords(words)
 result = trie.preOrder(trie.root)
 print ' Raw string array :     %s' % words
 print 'Trie Tree sorted :       %s' % result
 words.sort()
 print 'Python the sort After ordering : %s' % words

conclusion

There are so many kinds of trees. In the implementation of tree structure, tree traversal is difficult and requires more practice. The above code was written in a hurry, without any optimization, but on this basis you can achieve any kind of string sorting, string search, and so on.


Related articles: