Solving Chinese text problems in Python (a summary of past experience that beginners must read)

  • 2020-04-02 09:28:18
  • OfStack

Python ships with its own documentation, and the help function can be used to look up the usage of every built-in function. In general, that documentation states the key usage and the points to watch clearly. I searched online for a Chinese version of these function descriptions but could not find one, so I decided to learn from the English version.
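As a minimal sketch of pulling that documentation into a program, the help text can be captured as a string instead of being printed. This assumes Python 3, where pydoc.render_doc accepts a renderer argument; the helper name get_help_text is hypothetical and not part of the original tool.

```python
import pydoc


def get_help_text(obj):
    """Return the text that help(obj) would display, as a plain string."""
    # pydoc.plaintext renders without the overstrike sequences used for bold in pagers.
    return pydoc.render_doc(obj, renderer=pydoc.plaintext)


info = get_help_text(len)
print(info)
```

A string obtained this way can be stored under its keyword and edited later, which is the situation the rest of this article deals with.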

If you want to program with Tkinter or wxPython and need to know the common widget methods and properties, but your English is not strong, I recommend the book Python and Tkinter Programming. Appendix B and Appendix C (pages 392-538) cover the commonly used functions and almost all of the attributes. They are not to be missed.

The tool I mentioned earlier was quick to build. It looks up a function that has not been queried before and saves the keyword (key) together with the query result (info), so that the entry can be pulled from the list next time; anything missing can be added to the list by hand. It is a simple gadget, and everything seemed to go well. Then the problem appeared: after opening the English info, I did not know some of the words, so after looking them up I wanted to write their Chinese meanings into info and save it, so that it could be opened directly from disk next time. But once Chinese had been typed into the English info, saving failed during encoding. On reaching the Chinese part, the following error appeared:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u6211' in position 61: ordinal not in range(128)

The position (61 here) is not fixed; it is wherever Chinese was added to info. This error appeared almost every time I tried to write the modified info to a file:

 
fp = open('tt.txt','w')
fp.write(info.encode("UTF-8"))  # this line raises the error
fp.close()

The three lines look error-free, yet the middle line fails. Was it the wrong encoding? I tried many encodings, such as ANSI, UTF-8, SHIFT_JIS, GB2312, and GBK, without success. I was confused.
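The same class of error can be reproduced in isolation. The sketch below is not the original tool; it uses a hypothetical stand-in for info (61 ASCII characters followed by one Chinese character) and should run on Python 2.6+ and Python 3. It shows that the failure comes from the ASCII codec meeting a non-ASCII character, while an explicit UTF-8 encode of a text string succeeds.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the edited help text:
# 61 ASCII characters, then one Chinese character.
info = u"a" * 61 + u"\u6211"

try:
    info.encode("ascii")  # the ASCII codec cannot represent u"\u6211"
    error = None
except UnicodeEncodeError as exc:
    error = exc
    print("failed at position %d using codec %s" % (exc.start, exc.encoding))

# Encoding the same text explicitly as UTF-8 succeeds.
utf8_bytes = info.encode("utf-8")
```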

Now I know what went wrong. The problem is the modified string variable info. Its content is a composite: the text I looked up through the help function (the original, pure English info) plus the Chinese I typed in by hand. When I first queried the documentation, I saved the original info like this:

 
fp = open('tt.txt','w')
fp.write(info)
fp.close()

Note that the mistake is writing the original info straight to the file. What encoding does the file end up in? Open tt.txt and check: it is ANSI. So the error arises as follows: I look up a keyword, the ANSI-encoded info is read into the control, and then I add Chinese characters by hand, which are in UTF-8. The combined info is therefore a jumble of differently encoded strings, and the system, which can use only one encoding per write, cannot write this mixed info back to tt.txt. In Python 2, combining a byte string with a unicode string triggers an implicit conversion through the ASCII codec, which is why the error message names 'ascii'.
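To make the mixing problem concrete, here is a small sketch with made-up sample data that should run on Python 2.6+ and Python 3. The same character is encoded in two different encodings, the two byte sequences are concatenated, and the result is tested against each codec. The helper can_decode is hypothetical.

```python
# -*- coding: utf-8 -*-
text = u"\u6211"  # the character from the error message

gbk_bytes = text.encode("gbk")     # two bytes
utf8_bytes = text.encode("utf-8")  # three bytes
mixed = gbk_bytes + utf8_bytes     # bytes in two different encodings, concatenated


def can_decode(data, encoding):
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


for encoding in ("gbk", "utf-8"):
    print("%s decodes mixed data: %s" % (encoding, can_decode(mixed, encoding)))
```

Each piece decodes cleanly with its own codec, but the concatenation is not valid UTF-8, so no single UTF-8 read can recover it.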

So the bottom line is this: while you are working in memory, you need not worry about the encoding, since the system handles it case by case. However, if you use Chinese characters and save data or strings to a file, be sure to write them as UTF-8 from the very first write, like this:


fp = open('tt.txt','w') 
fp.write(info.encode("UTF-8")) 
fp.close() 

This ensures that the next time you read the file, the text can be printed and displayed, even as control text, without converting the encoding. Keep this in mind.
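An alternative sketch of the same rule, not the code used in this article: io.open (available in Python 2.6+ and Python 3) takes an encoding argument, so the file object does the UTF-8 conversion on every write and read. The helper names save_info and load_info and the sample text are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def save_info(path, info):
    """Write text to path, always encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as fp:
        fp.write(info)


def load_info(path):
    """Read UTF-8 text back from path."""
    with io.open(path, "r", encoding="utf-8") as fp:
        return fp.read()


# Hypothetical sample: English help text with a Chinese note appended.
info = u"len(obj): return the number of items. \u6211"
path = os.path.join(tempfile.mkdtemp(), "tt.txt")
save_info(path, info)
restored = load_info(path)
```

Because both directions name the encoding explicitly, the text read back should equal the text written, regardless of the machine's default encoding.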

That is the problem found. Now for some related discussion.

Some people say that adding # -*- coding: utf-8 -*- at the top of the file is enough. It is not.

Here is what my tests showed (I used IDLE, the Python 2.5.4 GUI). [1] Whether or not the file started with # -*- coding: utf-8 -*-, and whether or not UTF-8 was set as the default encoding in the software, passing Chinese between controls and files caused no problem. [2] Assigning a Chinese string literal to info and then performing all of these operations also worked; the usual way of reading is enough. The reason, I think, is that the tooling has been upgraded to handle displaying and using Chinese. The early situation, in which Chinese could not be used at all, no longer exists.
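What the declaration does control is how the interpreter decodes the bytes of the source file itself, which is separate from file I/O at run time. The sketch below assumes Python 3, where compile accepts source as bytes and honours the coding line; the two one-line "source files" and the helper run_source are hypothetical.

```python
# Two hypothetical one-line source files that define the same string,
# stored as bytes in different encodings.
gbk_source = b"# -*- coding: gbk -*-\ns = '\xce\xd2'\n"
utf8_source = b"# -*- coding: utf-8 -*-\ns = '\xe6\x88\x91'\n"


def run_source(source_bytes):
    """Compile and run source bytes, returning the value bound to s."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace["s"]


gbk_value = run_source(gbk_source)
utf8_value = run_source(utf8_source)
```

Both sources yield the same character because each declaration matches the bytes that follow it. Nothing here affects how data is written to or read from other files.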


#coding=utf-8
try:
    JAP = open("jap.txt", "r")
    CHN = open("chn.txt", "r")
    UTF = open("utf.txt", "w")

    jap_text = JAP.readline()
    chn_text = CHN.readline()
    # Decode to a unicode object first, then encode as UTF-8
    jap_text_utf8 = jap_text.decode("SHIFT_JIS").encode("UTF-8")
    # Not converting to UTF-8 also works
    chn_text_utf8 = chn_text.decode("GB2312").encode("UTF-8")
    # Codec names are case-insensitive: utf-8 is the same as UTF-8
    UTF.write(jap_text_utf8)
    UTF.write(chn_text_utf8)
    UTF.close()
    JAP.close()
    CHN.close()
except IOError, e:
    print "open file error", e

This code is taken from the article "learning python to handle python coding". As an explanation: jap_text_utf8 and chn_text_utf8 above must both end up in the same encoding, whether that is the machine's default encoding or UTF-8; the most important thing is consistency. Once everything is unified as UTF-8, it can be written to a file and read back without any problems. To read it, use the usual approach:

 
filen = open('tt.txt')
info = filen.read()
print info
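The decode-then-encode step used in the example above can be sketched without input files. The helper to_utf8 and the sample strings are hypothetical stand-ins for the lines read from jap.txt and chn.txt; the code should run on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
def to_utf8(raw, source_encoding):
    """Decode raw bytes from source_encoding, then re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical sample data standing in for lines read from jap.txt and chn.txt.
jap_raw = u"\u65e5\u672c\u8a9e".encode("shift_jis")
chn_raw = u"\u4e2d\u6587".encode("gb2312")

# After conversion both pieces share one encoding and can be joined safely.
combined = to_utf8(jap_raw, "shift_jis") + to_utf8(chn_raw, "gb2312")
```

Since both pieces are UTF-8 after conversion, the joined bytes decode as a single UTF-8 string.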

In addition, some people use the following method to convert encodings:


import sys
reload(sys)
sys.setdefaultencoding('utf8')

def ConvertCN(s):
    return s.encode('gb18030')

def PrintFile(filename):
    f = file(filename, 'r')
    for f_line in f.readlines():
        print ConvertCN(f_line)
    f.close()

if __name__ == "__main__":
    PrintFile('1.txt')
    print ConvertCN("\n******  Press any key to exit ! ******")
    print sys.stdin.readline()

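In my tests this approach did not work. If the second line (reload(sys)) is removed, the call to setdefaultencoding on the third line fails, because that function is no longer available on sys once the interpreter has finished starting up. If the second line is kept, nothing from the third line onward appears to run, though no error is reported (under IDLE this is probably because reload(sys) resets sys.stdout, so the output no longer reaches the IDLE shell). Please try it yourself.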
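One way to avoid setdefaultencoding altogether is to name both encodings explicitly at the point of conversion. This is a sketch under assumptions, not a drop-in replacement for the script above: it assumes the input file is UTF-8, uses io.open (Python 2.6+ and Python 3), and the helper read_lines_as and the sample file are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_lines_as(filename, source_encoding, target_encoding):
    """Read text in source_encoding; return each line encoded as target_encoding."""
    with io.open(filename, "r", encoding=source_encoding) as f:
        return [line.rstrip(u"\n").encode(target_encoding) for line in f]


# Hypothetical stand-in for 1.txt: a UTF-8 file containing one Chinese line.
path = os.path.join(tempfile.mkdtemp(), "1.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\u4e2d\u6587\n")

converted = read_lines_as(path, "utf-8", "gb18030")
```

Because nothing depends on the interpreter's default encoding, the result does not change with the environment the script runs in.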

In addition, the article "in-depth analysis of python Chinese garbled code problem" has a lot to say about how text is encoded, which was an eye-opener for me. The principle is this: a marker is added at the beginning of the text to indicate its internal encoding, so that the interpreter can translate the bytes according to the corresponding rules, in fixed or variable-length steps, and recover the original text. The step length and the rules are what the marker at the beginning specifies. So whatever byte format your text is encoded in, you can add an appropriate marker at the beginning that tells others how to decode it. The material on byte order marks such as BOM_UTF8 (in the codecs module) is also very interesting; similarly, there is BOM_UTF16 and so on. These marks at the start of the text differ from one encoding to another.
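A minimal sketch of reading such a mark, using the BOM constants from the standard codecs module. The helper detect_bom and the sample bytes are hypothetical, and a file without a mark simply yields no answer. It should run on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs

# Longer marks first: the UTF-32 LE mark begins with the same two bytes
# as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0) if none."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


# Hypothetical sample: UTF-8 text preceded by the UTF-8 mark.
sample = codecs.BOM_UTF8 + u"\u4e2d\u6587".encode("utf-8")
encoding, skip = detect_bom(sample)
text = sample[skip:].decode(encoding)
```

The mark only identifies the Unicode encodings listed here; legacy encodings such as GBK or SHIFT_JIS carry no such mark, so their encoding still has to be known in advance.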

