A Simple Implementation of Chinese Error Correction in Python
- 2021-11-14 06:13:21
- OfStack
Introduction
This article uses Python to implement simple homophone-based error correction for Chinese words. In its current form it only handles a single wrong character per word; if you are interested, you can optimize it further. The steps are as follows:
1. Prepare a file containing one Chinese word per line. The file in the code below is /Users/wys/Desktop/token.txt; change the path to your own before running.
2. Build a prefix-tree (trie) class, insert all the standard dictionary words into it, and implement a search function for looking words up.
3. For each character of a misspelled input word, find up to 10 homophones and substitute each of them in turn. This yields at most n*10 candidate words, where n is the length of the word (some pronunciations have fewer than 10 homophones).
4. Look up each candidate in the prefix tree; any candidate that is found is returned as a correction.
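Step 3, substituting homophone candidates for each character, can be sketched without the pinyin libraries by using a small hand-made homophone table (the table and words below are hypothetical stand-ins, for illustration only; the real script builds this mapping with the pinyin and Pinyin2Hanzi libraries):

```python
# Hypothetical homophone table: each character maps to a few
# "same-sounding" candidates. In the real script this mapping
# comes from the pinyin / Pinyin2Hanzi libraries.
HOMOPHONES = {
    "a": ["a", "A"],
    "b": ["b", "B", "8"],
    "c": ["c", "C"],
}

def get_candidates(phrase):
    """Replace each character with each of its homophones in turn,
    producing at most len(phrase) * k candidate spellings."""
    replaces = set()
    for i, ch in enumerate(phrase):
        for alt in HOMOPHONES.get(ch, [ch]):
            # Replace only the character at position i, so repeated
            # characters elsewhere in the word are left untouched.
            replaces.add(phrase[:i] + alt + phrase[i + 1:])
    return replaces

print(sorted(get_candidates("ab")))  # → ['Ab', 'a8', 'aB', 'ab']
```

Note that the positional slice replaces a single occurrence only; `str.replace` (which the article's `getCandidates` uses) would replace every occurrence of a repeated character at once.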
Code
import re
import pinyin
from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag


class corrector():
    def __init__(self):
        self.re_compile = re.compile(r'[\u4e00-\u9fff]')
        self.DAG = DefaultDagParams()

    # Read the words from the dictionary file
    def getData(self):
        words = []
        with open("/Users/wys/Desktop/token.txt") as f:
            for line in f.readlines():
                word = line.split(" ")[0]
                if word and len(word) > 2:
                    res = self.re_compile.findall(word)
                    if len(res) == len(word):  # keep only words made up entirely of Chinese characters
                        words.append(word)
        return words

    # Convert a list of pinyin syllables into up to 10 candidate Chinese characters
    def pinyin_2_hanzi(self, pinyinList):
        result = []
        words = dag(self.DAG, pinyinList, path_num=10)
        for item in words:
            res = item.path  # conversion result
            result.append(res[0])
        return result

    # Get the candidate words obtained by replacing each character with its homophones
    def getCandidates(self, phrase):
        chars = {}
        for c in phrase:
            chars[c] = self.pinyin_2_hanzi(pinyin.get(c, format='strip', delimiter=',').split(','))
        replaces = []
        for c in phrase:
            for x in chars[c]:
                replaces.append(phrase.replace(c, x))
        return set(replaces)

    # Get the corrected results
    def getCorrection(self, words):
        result = []
        for word in words:
            for candidate in self.getCandidates(word):
                if Tree.search(candidate):  # Tree is the prefix tree built in __main__
                    result.append(candidate)
                    break
        return result


class Node:
    def __init__(self):
        self.word = False  # True if a dictionary word ends at this node
        self.child = {}


class Trie(object):
    def __init__(self):
        self.root = Node()

    def insert(self, words):
        for word in words:
            cur = self.root
            for w in word:
                if w not in cur.child:
                    cur.child[w] = Node()
                cur = cur.child[w]
            cur.word = True

    def search(self, word):
        cur = self.root
        for w in word:
            if w not in cur.child:
                return False
            cur = cur.child[w]
        return cur.word


if __name__ == '__main__':
    # Initialize the corrector
    c = corrector()
    # Load the dictionary words
    words = c.getData()
    # Initialize the prefix tree
    Tree = Trie()
    # Insert all words into the prefix tree
    Tree.insert(words)
    # Test (the original post passes three Chinese words, each with one wrong
    # character; the translation renders them in romanized form)
    print(c.getCorrection(['Zhuantang Street', 'Turn Tong Sister Road', 'Turn Tong Street to']))
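The prefix-tree lookup that decides which candidates survive can be exercised in isolation. A minimal self-contained sketch of the same insert/search logic, checked with made-up dictionary entries:

```python
class Node:
    def __init__(self):
        self.word = False  # True if a dictionary word ends at this node
        self.child = {}

class Trie:
    def __init__(self):
        self.root = Node()

    def insert(self, words):
        for word in words:
            cur = self.root
            for ch in word:
                cur = cur.child.setdefault(ch, Node())
            cur.word = True

    def search(self, word):
        cur = self.root
        for ch in word:
            if ch not in cur.child:
                return False
            cur = cur.child[ch]
        return cur.word

# Quick check with made-up entries:
t = Trie()
t.insert(["cat", "car"])
print(t.search("car"))  # True
print(t.search("ca"))   # False: a prefix only, not a whole word
```

Only candidates that end exactly on a node with `word == True` count as hits, which is why a bare prefix is rejected.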
Results
The printed result is:
['Zhuantang Street', 'Zhuantang Street', 'Zhuantang Street']
As you can see, all three inputs were corrected successfully, so the approach has some effect; further optimization is left for later.