Converting Chinese characters to GBK codes in Python, with implementation code

  • 2020-04-02 09:42:27
  • OfStack

<img alt="" border="0" src="http://files.jb51.net/upload/201202/20120219202439166.png">
As the figure shows, "广" ("wide") is encoded as %B9%E3. Let's call %B9 the section code and %E3 the character code (the second byte).
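
To make the two parts concrete, here is a minimal sketch that splits the GBK bytes of one character. It assumes Python 3 (the article's own code is Python 2), and `split_gbk` is a hypothetical helper name, not something from the original.

```python
# Python 3 sketch: split the GBK bytes of one character into the two
# parts named above. `split_gbk` is a hypothetical helper name.
def split_gbk(ch):
    # A Chinese character occupies two bytes in GBK; iterating over
    # bytes in Python 3 yields integers.
    first, second = ch.encode('gbk')
    return '%{:02X}'.format(first), '%{:02X}'.format(second)

section_code, char_code = split_gbk('广')
print(section_code, char_code)   # %B9 %E3
```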

Ideas:
Collect the Chinese characters from the GBK code page at http://ff.163.com/newflyff/gbk-list/
For practical purposes, I selected only the "GBK/2: GB2312 Chinese characters" section, 3755 characters in total.
Look at the pattern: section codes run from B0 to D7 and character codes from A1 to FE, that is, 16*6-2=94 per section (the 96 values A0-FF minus A0 and FF), very regular.
Step 1: extract the commonly used Chinese characters in Python and store them in order in a dictionary file, separated by spaces.
Step 2: following the A1-FE rule of 94 characters per section, locate the section code first, then use the character's position within its section to locate the character code.
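
The counts above can be checked without the web page. The sketch below is not the author's method: it assumes Python 3 and relies on the built-in `gb2312` codec to enumerate sections B0-D7, skipping any slot the codec rejects. `level1_chars` is a hypothetical name.

```python
# Python 3 sketch: enumerate sections B0-D7 from the built-in gb2312 codec
# (an assumption; the article collects the characters from a web page).
def level1_chars():
    chars = []
    for section in range(0xB0, 0xD8):        # section codes B0..D7
        for code in range(0xA1, 0xFF):       # character codes A1..FE
            try:
                chars.append(bytes([section, code]).decode('gb2312'))
            except UnicodeDecodeError:
                pass                         # slot with no character assigned
    return chars

chars = level1_chars()
print(len(range(0xA1, 0xFF)))   # number of character codes per section
print(len(chars), chars[0])     # total characters found, and the first one
```

Joining the result with `' '.join(chars)` would give text in the same space-separated layout as the dictionary file described in step 1.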

Implementation:
Step 1: extract Chinese characters
 
with open('E:/GBK.txt') as f:
    # read() returns one string; split() breaks it on any whitespace
    s = f.read().split()

The list obtained by splitting still contains the repeated section codes (B0, B1, ...), which must be removed, along with similar symbols and the full-width 0-9/A-F characters.
Decode the obtained characters to inspect them:

<img alt="" border="0" src="http://files.jb51.net/upload/201202/20120219202439682.png">

<img alt="" border="0" src="http://files.jb51.net/upload/201202/20120219202439398.png">
Delete these characters:
First decode every item in the list obtained from splitting, then

 
gbk.remove(u'\uff10')

To delete these characters I generated a series of strings with range and then tidied them up in Notepad++; I couldn't find an easier way.
 
for t in [u'\uff10',u'\uff11',u'\uff12',u'\uff13',u'\uff14',u'\uff15',u'\uff16',u'\uff17',u'\uff18',u'\uff19',u'\uff21',u'\uff22',u'\uff23',u'\uff24',u'\uff25',u'\uff26']:
    gbk.remove(t)
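
One possibly easier way to build that list is to generate the code points with `range` directly. This is a sketch assuming Python 3; the small `gbk` list here is made-up stand-in data, not the real decoded list.

```python
# Python 3 sketch; `gbk` is a tiny stand-in for the decoded token list.
gbk = ['\uff10', '啊', '\uff21', '阿', '\uff19', '\uff26']

# Full-width digits U+FF10..U+FF19 and full-width letters U+FF21..U+FF26 (A-F)
unwanted = ({chr(c) for c in range(0xFF10, 0xFF1A)}
            | {chr(c) for c in range(0xFF21, 0xFF27)})

# Keep everything that is not one of the unwanted characters
gbk = [ch for ch in gbk if ch not in unwanted]
print(gbk)   # ['啊', '阿']
```

Unlike `list.remove`, which deletes only the first match and raises `ValueError` when the item is absent, the comprehension drops every occurrence and tolerates missing ones.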

Next, section codes such as B0-D7 have to be removed, and codes such as A1-FE are needed again when extracting the character code. So I wanted to generate such a list to make both deletion and indexing easier.

Generating the code sequence:
The rows are coded 0-9 and A-F, and the columns A-F.
Increment from A1, handle the boundary (A9 to AA) by hand, and use the ord() and chr() functions to convert between ASCII codes and numbers.
 
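t = ['A1']
while True:
    if t[-1] == 'FE':
        break
    # second digit is 0-8 or A-E: step to the next character
    if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
        t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
        continue
    # second digit is 9: jump over the ASCII gap to A
    if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
        t.append(t[-1][0] + chr(65))
        continue
    # second digit is F: carry into the first digit and restart at 0
    if ord(t[-1][1]) >= 70:
        t.append(chr(ord(t[-1][0]) + 1) + chr(48))
        continue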

The resulting list:

<img alt="" border="0" src="http://files.jb51.net/upload/201202/20120219202439698.png">
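
The same A1-FE sequence can also be produced by formatting integers as two-digit hexadecimal, which avoids the boundary handling. This is an alternative sketch, not the author's approach, and the name `codes` is only illustrative.

```python
# Format every value from 0xA1 to 0xFE as two uppercase hex digits
codes = ['{:02X}'.format(b) for b in range(0xA1, 0xFF)]
print(len(codes), codes[0], codes[9], codes[-1])   # 94 A1 AA FE
```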

With this code sequence in place, the B0-D7 section codes can be removed from the GBK list.
Finally, I found that some spaces had still not been deleted; the Unicode code point of this full-width space is \u3000:
gbk.remove(u'\u3000')
Last, encode the result as UTF-8 and save it to a dictionary file.

<img alt="" border="0" src="http://files.jb51.net/upload/201202/20120219202439897.png">
I uploaded the dictionary file to a network drive; external link: http://dl.dbank.com/c0m9selr6h

Step 2: index Chinese characters

Indexing is a simple algorithm. Because the dictionary file stores the characters in their original order, and the 3755 characters of GBK code table 2 strictly follow the rule of 94 characters per section, integer-dividing a character's index (counted from 0) by 94 locates its section code. The character's index minus section index * 94 then gives its position within the section, and the A1-FE list generated above maps that position to the second byte.
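
The arithmetic can be sketched on its own, without the dictionary file. The function name `code_from_position` is hypothetical, and the index is assumed to be 0-based.

```python
# Sketch of the index arithmetic; `code_from_position` is a hypothetical name.
def code_from_position(index):
    # 94 characters per section: quotient picks the section,
    # remainder picks the position inside it
    section, offset = divmod(index, 94)
    return '%{:02X}%{:02X}'.format(0xB0 + section, 0xA1 + offset)

print(code_from_position(0))      # %B0%A1
print(code_from_position(93))     # %B0%FE
print(code_from_position(94))     # %B1%A1
print(code_from_position(3754))   # %D7%F9
```

Index 93 is the case to watch: it is the last slot of section B0, so a 1-based index divided by 94 would already spill into B1.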
With the idea settled, the rest is writing the code and debugging it.
The Python code, with comments, follows:

 
import io

def getGBKCode(gbkFile='E:/GBK1.1.txt', s=''):
    # gbkFile: the dictionary file, 3755 Chinese characters in total
    # s: the Chinese characters to convert; text typed into IDLE under
    #    Python 2 arrives as a GB2312 byte string and is decoded below

    # Read in the dictionary (saved as UTF-8 in step 1)
    with io.open(gbkFile, encoding='utf-8') as f:
        gbk = f.read().split()

    # Generate the A1-FE index codes
    t = ['A1']
    while True:
        if t[-1] == 'FE':
            break
        if (ord(t[-1][1]) >= 48 and ord(t[-1][1]) < 57) or (ord(t[-1][1]) >= 65 and ord(t[-1][1]) < 70):
            t.append(t[-1][0] + chr(ord(t[-1][1]) + 1))
            continue
        if ord(t[-1][1]) >= 57 and ord(t[-1][1]) < 65:
            t.append(t[-1][0] + chr(65))
            continue
        if ord(t[-1][1]) >= 70:
            t.append(chr(ord(t[-1][0]) + 1) + chr(48))
            continue

    # Accept a GB2312 byte string as well as already decoded text
    if isinstance(s, bytes):
        s = s.decode('gb2312')

    # Index each character in turn
    l = list()
    for st in s:
        # 0-based position in the dictionary, 94 characters per section;
        # a 1-based index here would push the last character of every
        # section into the next section
        section, pos = divmod(gbk.index(st), 94)
        # Section codes start at B0: get the section code of the character
        t1 = '%' + t[t.index('B0'):][section]
        # The position within the section selects the second byte
        t2 = '%' + t[pos]
        l.append(t1 + t2)
    # Finally, the output is separated by spaces
    return ' '.join(l)

<img alt="" border="0" src="http://files.jb51.net/upload/201202/20120219202439173.png">
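
One way to check the output of getGBKCode is to compare it with what the standard library produces. This sketch assumes Python 3, where `urllib.parse.quote` accepts an `encoding` argument; `reference_gbk_code` is a hypothetical helper, not part of the original code.

```python
from urllib.parse import quote

# Hypothetical cross-check helper: percent-encode each character with the
# gb2312 codec and join the results with spaces, matching the output
# format of getGBKCode.
def reference_gbk_code(text):
    return ' '.join(quote(ch, encoding='gb2312') for ch in text)

print(reference_gbk_code('广州'))   # %B9%E3 %D6%DD
```

If the dictionary-based function and this reference disagree for some character, the dictionary file's ordering is the first thing to inspect.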


Admittedly, my Python code is not that neat.
Attached is my twitter ID: olam Cooper

