A Simple Implementation of Chinese Error Correction in Python

2021-11-14 06:13:21
OfStack

Introduction

This article uses Python to implement simple homophone-based error correction for Chinese words. In its current form, the approach handles only words in which a single character is wrong. If you are interested, you can optimize it further. The steps are as follows:

1. Prepare a file containing one Chinese word per line. In the code below, my file is /Users/wys/Desktop/token.txt; change the path to your own before running the code.
2. Construct a prefix tree (trie) class, insert all the standard words into it, and implement a search function for looking words up.
3. For each character of the misspelled input word, find up to 10 homophones and substitute each of them for that character in turn. This yields at most n*10 candidate words, where n is the length of the input word; there may be fewer, because some sounds have fewer than 10 homophones.
4. Look up each candidate in the prefix tree; a candidate that is found is returned as the correction.
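
The candidate-generation and lookup steps can be sketched without any third-party packages. This is only a minimal illustration under stated assumptions: the two lookup tables are hand-made toy data (the article's code obtains them from the pinyin and Pinyin2Hanzi packages), the function names are hypothetical, and a plain set stands in for the prefix tree.

```python
# Hand-made toy tables (assumptions for illustration only); the article's
# code derives this information from the pinyin and Pinyin2Hanzi packages.
PINYIN_OF = {"转": "zhuan", "塘": "tang", "街": "jie",
             "姐": "jie", "道": "dao", "到": "dao"}
HANZI_FOR = {
    "zhuan": ["转", "专", "砖"],
    "tang": ["塘", "糖", "堂"],
    "jie": ["街", "姐", "接"],
    "dao": ["道", "到", "倒"],
}


def single_char_candidates(word):
    """Return every string obtained by swapping one character for a homophone."""
    candidates = set()
    for i, ch in enumerate(word):
        # Characters missing from the toy table simply yield no candidates
        for homophone in HANZI_FOR.get(PINYIN_OF.get(ch, ""), []):
            candidates.add(word[:i] + homophone + word[i + 1:])
    return candidates


def correct(word, dictionary):
    """Return a candidate that is in the dictionary, or None if there is none."""
    for candidate in sorted(single_char_candidates(word)):
        if candidate in dictionary:
            return candidate
    return None


dictionary = {"转塘街道"}  # a set stands in for the prefix tree here
candidates = single_char_candidates("转塘姐道")
# With 3 homophones per sound, there are at most len(word) * 3 candidates
print(len(candidates) <= len("转塘姐道") * 3)
print(correct("转塘姐道", dictionary))
print(correct("转塘街到", dictionary))
```

With 10 homophones per sound instead of 3, the same bound becomes the n*10 mentioned in step 3. The set is smaller than the bound whenever substitutions coincide, for example when a character is replaced by itself.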

Code


import re

# Third-party packages: pinyin and Pinyin2Hanzi
import pinyin
from Pinyin2Hanzi import DefaultDagParams, dag

class corrector():
    def __init__(self):
        self.re_compile = re.compile(r'[\u4e00-\u9fff]')
        self.DAG = DefaultDagParams()

    # Read the words from the dictionary file
    def getData(self):
        words = []
        with open("/Users/wys/Desktop/token.txt", encoding="utf-8") as f:
            for line in f:
                # Keep the first space-separated field, without the trailing newline
                word = line.strip().split(" ")[0]
                # Only words longer than two characters are kept
                if word and len(word) > 2:
                    res = self.re_compile.findall(word)
                    # Keep the word only if every character is a Chinese character
                    if len(res) == len(word):
                        words.append(word)
        return words

    # Convert a list of pinyin syllables into up to 10 candidate Chinese characters (homophones)
    def pinyin_2_hanzi(self, pinyinList):
        result = []
        words = dag(self.DAG, pinyinList, path_num=10)
        for item in words:
            res = item.path  # conversion result
            result.append(res[0])
        return result

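    # Build candidate words by replacing each character with its homophones
    def getCandidates(self, phrase):
        replaces = []
        for i, c in enumerate(phrase):
            # Pinyin of this character, without tone marks, as a list of syllables
            syllables = pinyin.get(c, format='strip', delimiter=',').split(',')
            for x in self.pinyin_2_hanzi(syllables):
                # Replace only position i, so each candidate differs in one character
                replaces.append(phrase[:i] + x + phrase[i + 1:])
        return set(replaces)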

    # Return the first valid correction found for each input word
    def getCorrection(self, words):
        result = []
        for word in words:
            for candidate in self.getCandidates(word):
                # Tree is the module-level Trie built in the __main__ block
                if Tree.search(candidate):
                    result.append(candidate)
                    break
        return result

class Node:
    def __init__(self):
        self.word = False
        self.child = {}


class Trie(object):
    def __init__(self):
        self.root = Node()

    def insert(self, words):
        for word in words:
            cur = self.root
            for w in word:
                if w not in cur.child:
                    cur.child[w] = Node()
                cur = cur.child[w]

            cur.word = True

    def search(self, word):
        cur = self.root
        for w in word:
            if w not in cur.child:
                return False
            cur = cur.child[w]

        # True only if a complete word ends at this node
        return cur.word

if __name__ == '__main__':
    # Initialize the corrector
    c = corrector()
    # Load the words
    words = c.getData()
    # Initialize the prefix tree
    Tree = Trie()
    # Insert all words into the prefix tree
    Tree.insert(words)
    # Test (the inputs must be Chinese strings; they appear here in English translation)
    print(c.getCorrection(['Zhuantang Street', 'Turn Tong Sister Road', 'Turn Tong Street to']))

Results

The printed result is:
['Zhuantang Street', 'Zhuantang Street', 'Zhuantang Street']

As you can see, all three inputs were corrected successfully, so the approach works to some extent. I will continue to optimize it.
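
One detail of the prefix tree is worth checking in isolation: search must return True only for complete words, not for prefixes of stored words, which is what the word flag on each node is for. The following is a minimal self-contained sketch of that behavior. The names TrieNode, children, and is_word are illustrative choices, not the identifiers used in the listing above, and insert here takes a single word rather than a list.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next node
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Insert a single word, creating nodes as needed."""
        cur = self.root
        for ch in word:
            cur = cur.children.setdefault(ch, TrieNode())
        cur.is_word = True

    def search(self, word):
        """Return True only if the exact word was inserted."""
        cur = self.root
        for ch in word:
            if ch not in cur.children:
                return False
            cur = cur.children[ch]
        return cur.is_word


trie = Trie()
trie.insert("转塘街道")
print(trie.search("转塘街道"))  # True: the complete word is stored
print(trie.search("转塘"))      # False: only a prefix of a stored word
print(trie.search("转塘街到"))  # False: the last character differs
```

Without the flag, any prefix of a dictionary word would be accepted as a valid correction, so shorter candidates could be returned by mistake.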