Implementing Chinese Sorting in Python

  • 2020-06-01 10:10:31
  • OfStack

This article works through an example of sorting Chinese text in Python. I share it here for your reference.

When Python compares strings, it compares the encoded values returned by the ord function. Its sort function therefore handles digits and letters easily, because they appear in order in the code table:


>>> print ',' < '1' < 'A' < 'a' < '中'
True

But Chinese is not so easy. Chinese characters are usually sorted by pinyin or by stroke count. In GB2312, the most widely used Chinese character set standard, the 3,755 level-1 characters are coded in pinyin order, while the 3,008 level-2 characters are arranged by radical and stroke.


>>> print '孙' < '佘'
True

Here (with the source saved as GB2312) 孙 (sūn) is a level-1 character and 佘 (shé) a level-2 character, so 孙 compares smaller even though by pinyin shé should come before sūn. Every level-2 character is coded after every level-1 character, and for backward compatibility the extended GBK and GB18030 encodings kept these positions unchanged, so the result of sort is jumbled for such pairs.

On the other hand, Unicode arranges CJK characters by radical and stroke count following the Kangxi Dictionary, so sorting Unicode strings gives yet another order than the GB codes. Save the following script as UTF-8:


# encoding=utf8
char = ['赵', '钱', '孙', '李', '佘']
char.sort()
for item in char:
  print item.decode('utf-8').encode('gb2312')

The output is 佘 孙 李 赵 钱 (she, sun, li, zhao, qian). Now save the same script as GB2312:


# encoding=gb2312
char = ['赵', '钱', '孙', '李', '佘']
char.sort()
for item in char:
  print item

This time the output is 李 钱 孙 赵 佘 (li, qian, sun, zhao, she). Obviously, neither result is what we want. So how do we sort Chinese correctly?
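As a side note, the Unicode ordering above is easy to reproduce in modern Python 3, where str is Unicode and sorted() compares code points directly (this is a Python 3 adaptation, not the article's Python 2 script):

```python
# -*- coding: utf-8 -*-
# Python 3 strings are Unicode, so sorted() orders by code point,
# which follows the Kangxi radical/stroke arrangement, not pinyin.
surnames = ['赵', '钱', '孙', '李', '佘']
print(sorted(surnames))
print([hex(ord(c)) for c in sorted(surnames)])
```

Running this shows 佘 (U+4F58) first and 钱 (U+94B1) last, matching the UTF-8 result above.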

First, be clear about how a Chinese dictionary collates entries: characters are sorted by pinyin, distinguishing the four tones; characters with identical pinyin are ordered by stroke count; and characters with the same stroke count are ordered by the type of each stroke in writing order. The Xinhua Dictionary uses the stroke order 一 丨 丿 丶 乙 (horizontal, vertical, left-falling, dot, turning). So sorting Chinese correctly requires not only a character-to-pinyin table with tones, but also per-character stroke-order data.

I assumed a ready-made module existed, but the few I tried were not ideal. The pyzh conversion code supports fewer than 7,000 characters and has no tones; roy's code from the Shuimu forum covers more than 20,000 characters but requires pysqlite... Better to do it yourself.

The most complete data I found is the Unicode Chinese character table uploaded by slowwind9999 to CSDN. It covers all 20,902 CJK characters with pinyin, Wubi codes, Zhengma codes, Unicode and GBK code points, stroke counts, and stroke-order sequences (the pinyin part has no tones, and some entries are wrong, so use it with care). I extracted the stroke-order data from it, generated a toned pinyin table for the Unicode characters with Kong Zhijian's "Practical Chinese Characters to Pinyin" program (characters are marked with the four tones; the 319 Japanese- and Korean-only characters carry no tone, which distinguishes them), and made some corrections against dictionary data (errors may remain). With these two tables, the remaining work is simple.


#  Build the pinyin dictionary: each line of py.txt is "<character>\t<pinyin>"
dic_py = dict()
f_py = open('py.txt', 'r')
lines_py = f_py.read().split('\n')
f_py.close()
for line in lines_py:
  if not line:
    continue  # skip the empty string left by a trailing newline
  word_py, mean_py = line.split('\t', 1)
  dic_py[word_py] = mean_py

The stroke dictionary (dic_bh) is built the same way. Even though each file has about 20,000 lines, loading is fast, around 0.5 seconds. If you combine the two files and process them together, it should be faster still.
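A minimal Python 3 sketch of such a combined loader, assuming a hypothetical file layout of `character<TAB>pinyin<TAB>stroke-codes` per line (the layout and the stroke values below are illustrative, not from the article's data files):

```python
def load_tables(lines):
    """Parse 'char<TAB>pinyin<TAB>strokes' lines into two lookup dicts."""
    dic_py, dic_bh = {}, {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines, e.g. after a trailing newline
        char, pinyin, strokes = line.split('\t', 2)
        dic_py[char] = pinyin
        dic_bh[char] = strokes
    return dic_py, dic_bh

# Inline sample data standing in for the real 20,000-line file;
# the stroke codes here are made-up placeholders.
sample = ['李\tli3\t1234', '钱\tqian2\t31115\n']
dic_py, dic_bh = load_tables(sample)
print(dic_py, dic_bh)
```

In real use you would pass `open('combined.txt')` directly, since file objects iterate line by line.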


#  Dictionary lookup function
def searchdict(dic, uchar):
  if isinstance(uchar, str):
    uchar = unicode(uchar, 'utf-8')
  if uchar >= u'\u4e00' and uchar <= u'\u9fa5':  # CJK range
    value = dic.get(uchar.encode('utf-8'))
    if value is None:
      value = '*'  # character missing from the table
  else:
    value = uchar  # non-Chinese characters pass through unchanged
  return value

For lookup, Chinese characters are uniformly converted to UTF-8 strings; characters other than Chinese are not processed and are returned as-is. If you only need the initial, just output the first character of the pinyin. As long as the data is accurate, comparison is easy. A tone digit sorts before any letter, so a complete syllable precedes a longer one sharing its prefix. In the stroke data, the number of digits equals the stroke count and each digit encodes the stroke type, so comparing the values numerically yields the correct order. The code is as follows:
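The digit-before-letter property the comparison relies on is just ASCII ordering, and can be checked quickly (Python 3 here; the pinyin strings are illustrative syllables):

```python
# Tone digits ('0'-'9', 0x30-0x39) precede letters ('a'-'z', 0x61-0x7a)
# in ASCII, so a finished syllable sorts before any extension of it:
print('an4' < 'ang2')   # True: '4' < 'g', so "an" (any tone) precedes "ang"
print('ai4' < 'ang2')   # True: 'i' < 'n'
print(ord('4'), ord('g'))
```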


# Compare two single characters
def comp_char_PY(A, B):
  if A == B:
    return -1        # identical characters: caller keeps scanning
  pyA = searchdict(dic_py, A)
  pyB = searchdict(dic_py, B)
  if pyA > pyB:
    return 1         # A sorts after B
  elif pyA < pyB:
    return 0         # A sorts before B
  else:
    # same pinyin and tone: fall back to the stroke-order numbers
    bhA = int(searchdict(dic_bh, A))
    bhB = int(searchdict(dic_bh, B))
    if bhA > bhB:
      return 1
    elif bhA < bhB:
      return 0
    else:
      return 'Are you kidding?'  # different characters, same pinyin and strokes
# Compare two strings
def comp_char(A, B):
  charA = A.decode('utf-8')
  charB = B.decode('utf-8')
  n = min(len(charA), len(charB))
  dd = len(charA) > len(charB)  # fallback when one string is empty or a prefix
  i = 0
  while i < n:
    dd = comp_char_PY(charA[i], charB[i])
    if dd == -1:
      i = i + 1
      if i == n:
        dd = len(charA) > len(charB)
    else:
      break
  return dd
#  Sorting function (insertion sort)
def cnsort(nline):
  n = len(nline)
  for i in range(1, n):
    tmp = nline[i]
    j = i
    while j > 0 and comp_char(nline[j-1], tmp):
      nline[j] = nline[j-1]
      j -= 1
    nline[j] = tmp
  return nline
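In Python 3 the same dictionary collation can be re-expressed as a sort key instead of a comparison function, which avoids the hand-written insertion sort. This is only a sketch: the pinyin values are real, but the stroke numbers below are illustrative stand-ins for the article's stroke-order data:

```python
# Map each character to a (pinyin-with-tone, stroke-order-number) tuple;
# tuples compare lexicographically, reproducing the dictionary collation.
pinyin = {'赵': 'zhao4', '钱': 'qian2', '孙': 'sun1', '李': 'li3', '佘': 'she2'}
strokes = {'赵': 1, '钱': 2, '孙': 3, '李': 4, '佘': 5}  # placeholder values

def cn_key(word):
    # Unknown characters fall back to themselves, like searchdict above.
    return [(pinyin.get(ch, ch), strokes.get(ch, 0)) for ch in word]

surnames = ['赵', '钱', '孙', '李', '佘']
print(sorted(surnames, key=cn_key))
```

With full pinyin and stroke tables loaded into the two dicts, this gives the same order as cnsort, sorted by pinyin first and by strokes on ties.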

Now we can sort Chinese according to the dictionary convention.


char = ['赵', '钱', '孙', '李', '佘']
char = cnsort(char)
for item in char:
  print item.decode('utf-8').encode('gb2312')

This finally gives 李 钱 佘 孙 赵 (li, qian, she, sun, zhao), the correct pinyin order.

Polyphonic characters are not handled here. To have the program disambiguate them automatically, you could add a table of polyphonic words and judge by context. I don't know where such data exists, but when polyphones are few, manual adjustment is enough.

PS: here are two more practical online sorting tools for your reference:

Online alphabetical sorting tool:
http://tools.ofstack.com/aideddesign/zh_paixu

Online reverse text sorting tool:
http://tools.ofstack.com/aideddesign/flipped_txt


I hope this article helps you with your Python programming.

