Python implementation: converting files from UTF-8 to GBK

  • 2020-04-02 14:31:54
  • OfStack

Requirement: convert files encoded in UTF-8 to files encoded in GBK.

The implementation code is as follows:


import codecs

def ReadFile(filePath,encoding="utf-8"):
    # Decode the file's bytes and return the content as a Unicode string.
    with codecs.open(filePath,"r",encoding) as f:
        return f.read()
 
def WriteFile(filePath,u,encoding="gbk"):
    # Encode the Unicode string u and write it to filePath.
    with codecs.open(filePath,"w",encoding) as f:
        f.write(u)
 
def UTF8_2_GBK(src,dst):
    content = ReadFile(src,encoding="utf-8")
    WriteFile(dst,content,encoding="gbk")

Code description:

The second argument of ReadFile specifies that the file is decoded as UTF-8, so the content it returns is a Unicode string. WriteFile then encodes that Unicode string as GBK when writing it to the destination file.

This meets the requirement.
However, if the file to be converted contains characters that are not in the GBK character set, an error is raised, as follows:


UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 4813: illegal multibyte sequence

This error means that, while encoding the Unicode string to GBK, the character u'\xa0' (the no-break space, U+00A0) could not be represented in GBK.
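
The failure is easy to reproduce in isolation. The sketch below uses Python 3 syntax (an assumption; the traceback above comes from Python 2) and a made-up sample string:

```python
# U+00A0 (no-break space) has no representation in Python's gbk codec.
text = "price:\xa0100"  # hypothetical sample text

failed = False
bad_pos = None
bad_char = None
try:
    text.encode("gbk")
except UnicodeEncodeError as exc:
    failed = True
    # exc.start is the index of the first character that could not be encoded.
    bad_pos = exc.start
    bad_char = text[bad_pos]
    print("cannot encode", repr(bad_char), "at position", bad_pos)
```

Reading the position from the exception is a quick way to find the offending character in a large file.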

To understand why, we need to look at the relationship between GB2312, GBK, and GB18030:


GB2312: 6,763 Chinese characters
GBK: 21,003 Chinese characters
GB18030-2000: 27,533 Chinese characters
GB18030-2005: 70,244 Chinese characters

So GBK is a superset of GB2312, and GB18030 is a superset of GBK.
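
The nesting can be checked directly with Python's built-in codecs. This is a small sketch in Python 3 syntax; the helper name can_encode and the three sample characters are chosen here for illustration only:

```python
def can_encode(ch, codec):
    """Return True if the character is representable in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

samples = ["中", "镕", "\xa0"]  # illustrative characters, not from the article
codecs_to_try = ["gb2312", "gbk", "gb18030"]

# For each character, list the codecs that can represent it.
coverage = {ch: [c for c in codecs_to_try if can_encode(ch, c)] for ch in samples}
for ch, supported in coverage.items():
    print(repr(ch), supported)
```

A character that GB2312 covers is accepted by all three codecs, while the other two samples are only accepted by the larger sets.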
After clarifying the relationship, we further improved the code:

def UTF8_2_GBK(src,dst):
    content = ReadFile(src,encoding="utf-8")
    WriteFile(dst,content,encoding="gb18030")

With this change the code runs without errors, because u'\xa0' can be encoded in GB18030.

Alternatively, we can keep GBK as the target encoding and modify the WriteFile function instead:


def WriteFile(filePath,u,encoding="gbk"):
    with codecs.open(filePath,"w") as f:
        f.write(u.encode(encoding,errors="ignore"))

Here we encode the Unicode string to GBK ourselves. Note the second parameter of the encode function, errors, which we set to "ignore": characters that cannot be encoded are skipped instead of raising an error. The decode function accepts the same parameter.
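
To see what the error handler does to the data, here is a minimal sketch in Python 3 syntax with a made-up sample string. It also shows "replace", a standard alternative handler that the article itself does not use:

```python
text = "a\xa0b中"  # hypothetical sample containing a no-break space

ignored = text.encode("gbk", errors="ignore")    # unencodable character is dropped
replaced = text.encode("gbk", errors="replace")  # unencodable character becomes "?"

print(ignored)
print(replaced)
```

Keep in mind that "ignore" discards characters silently, so the output file may differ from the source without any warning, whereas "replace" leaves a visible marker.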

When we ran this version, the UTF-8 file was successfully converted to ANSI format (which means GBK on Simplified Chinese Windows). However, the generated file now has an extra blank line after every line.
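
A likely cause, assuming the source file uses Windows (CRLF) line endings and the script runs on Windows: text mode turns every "\n" into "\r\n" on write, so an existing "\r\n" becomes "\r\r\n". The sketch below, in Python 3 syntax, imitates that translation on any platform with newline="\r\n"; the scratch file path is illustrative only:

```python
import os
import tempfile

data = "line1\r\nline2\r\n"  # content as read from a file with CRLF line endings

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file

# newline="\r\n" imitates Windows text mode: each "\n" written becomes "\r\n".
with open(path, "w", encoding="gbk", newline="\r\n") as f:
    f.write(data)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```

Many editors display the stray "\r" as an additional line break, which matches the blank lines described above.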

To fix this, open the file in binary mode so that the encoded bytes are written exactly as they are. The modified code is as follows:


def WriteFile(filePath,u,encoding="gbk"):
    with codecs.open(filePath,"wb") as f:
        f.write(u.encode(encoding,errors="ignore"))
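
For readers on Python 3, the same conversion can be sketched with the built-in open function. This is an adaptation under stated assumptions, not the article's code: the name convert_encoding, its parameters, and the throwaway file paths are all hypothetical.

```python
import os
import tempfile

def convert_encoding(src, dst, target="gbk", errors="strict"):
    """Re-encode a UTF-8 text file; newline="" leaves line endings untouched."""
    with open(src, "r", encoding="utf-8", newline="") as fin:
        content = fin.read()
    with open(dst, "w", encoding=target, errors=errors, newline="") as fout:
        fout.write(content)

# Usage with throwaway files (paths are illustrative only).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in_utf8.txt")
dst = os.path.join(workdir, "out_gbk.txt")
with open(src, "wb") as f:
    f.write("中文\r\nabc\xa0\r\n".encode("utf-8"))

# Drop characters GBK cannot represent, as in the "ignore" variant above.
convert_encoding(src, dst, target="gbk", errors="ignore")
with open(dst, "rb") as f:
    result = f.read()
print(result)
```

Passing target="gb18030" with the default errors="strict" corresponds to the first fix, which keeps every character instead of dropping the ones GBK lacks.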

