Introduction to the python character encoding conversion module codecs

  • 2020-05-07 19:56:08
  • OfStack

python has very good support for multi-language processing: it can handle text in arbitrary character encodings. Below is a closer look at how python handles many different languages.

One thing to be clear about: whenever python converts between encodings, it goes through its internal encoding. The conversion process is as follows:


source encoding -> internal encoding (unicode) -> target encoding
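
As a minimal sketch of this chain (assuming the source file is saved as GB2312, so the literal below is a GB2312 byte string):

# -*- coding: gb2312 -*-
gb_bytes = "中文"                  # byte string in the source encoding (gb2312)
u = gb_bytes.decode("gb2312")      # step 1: decode into the internal unicode
utf8_bytes = u.encode("utf-8")     # step 2: encode unicode into the target encoding
print type(gb_bytes), type(u), type(utf8_bytes)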

Internally, python handles text as unicode, but using unicode requires knowing that it comes in two storage formats: one is UCS-2, which has 65,536 code points, and the other is UCS-4, which has 2,147,483,648 code points. python supports both formats; the choice is made when the interpreter is compiled, via --enable-unicode=ucs2 or --enable-unicode=ucs4. So which build does our python use by default? One way to tell is from the value of sys.maxunicode:

import sys
print sys.maxunicode

If the output value is 65535, it is UCS-2, and if the output value is 1114111, it is UCS-4.
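For readability, the same check can be wrapped in a small conditional (a trivial sketch):

import sys

# report which unicode build this interpreter was compiled with
if sys.maxunicode == 65535:
    print "UCS-2 (narrow) build"
else:
    print "UCS-4 (wide) build"
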
We need to realize one thing: when a string is converted to the internal encoding, it is no longer of type str! It is of type unicode:


a = " aside "
print type(a)
b = a.unicode(a, "gb2312")
print type(b)

Output:

<type 'str'>
<type 'unicode'>
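
Incidentally, unicode(a, "gb2312") is equivalent to calling the byte string's own decode method; both go through the same codec:

b = a.decode("gb2312")   # same result as unicode(a, "gb2312")
print type(b)            # <type 'unicode'>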

At this time, b can be easily converted to any other encoding, such as utf-8:

c = b.encode("utf-8")
print c

The output of c looks garbled, and that's expected: c is now a UTF-8 byte string, printed to a console that expects a different encoding.
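
To inspect the raw bytes instead of the mojibake, repr helps (a quick check; the byte values shown assume the string "中文" from above):

print repr(c)   # '\xe4\xb8\xad\xe6\x96\x87' -- the utf-8 bytes of "中文"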

Now it's time to talk about the codecs module, which is closely related to the concepts above. codecs is dedicated to encoding conversion; through its interfaces it can also be extended to other codec-related transformations, but those are not covered here.


#-*- encoding: gb2312 -*-
import codecs, sys

print '-'*60
# create the gb2312 codec
look = codecs.lookup("gb2312")
# create the utf-8 codec
look2 = codecs.lookup("utf-8")

a = "我爱北京天安门"
print len(a), a
# decode a into the internal unicode; the method is called decode because
# the gb2312 byte string is being decoded into unicode
b = look.decode(a)
# the returned b[0] is the data and b[1] is the length consumed; b[0] is of type unicode
print b[1], b[0], type(b[0])
# encode the internal unicode back into a gb2312 byte string; the encode
# method returns a str
b2 = look.encode(b[0])
# notice anything odd? after converting back, the reported length went from 14
# to 7! the length returned here is the number of characters, not bytes
print b2[1], b2[0], type(b2[0])
# although a character count is returned above, len(b2[0]) is not 7 -- it is
# still 14; only codecs' encode counts characters
print len(b2[0])
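
For comparison (a minimal sketch, under the same GB2312 source-file assumption), the same round trip can be written with the string methods decode/encode, which go through the same codec machinery:

#-*- encoding: gb2312 -*-
a = "我爱北京天安门"
u = a.decode("gb2312")        # byte string -> unicode (14 bytes -> 7 characters)
s = u.encode("gb2312")        # unicode -> byte string
print len(a), len(u), len(s)  # 14 7 14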

The lookup code above is the most common use of codecs. Another question: what if the file we need to process uses some other character encoding? Reading it also requires special handling, and codecs provides a method for that too.


#-*- encoding: gb2312 -*-
import codecs, sys

# codecs provides an open method that lets us specify the encoding of the file
# being opened; the content is automatically decoded to unicode on read
bfile = codecs.open("dddd.txt", 'r', "big5")
#bfile = open("dddd.txt", 'r')
ss = bfile.read()
bfile.close()
# print the decoded result; if the file were opened with the built-in open
# function instead, the output would be garbled
print ss, type(ss)

To try the code above, find any big5-encoded file and use it as dddd.txt.
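
As a variation (a sketch; the output file name is a placeholder), codecs.open works for writing as well, so a file can be transcoded from one encoding to another:

#-*- encoding: gb2312 -*-
import codecs

# read a big5-encoded file and write its content back out as utf-8
src = codecs.open("dddd.txt", 'r', "big5")        # decodes to unicode on read
dst = codecs.open("dddd_utf8.txt", 'w', "utf-8")  # encodes from unicode on write
dst.write(src.read())
src.close()
dst.close()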

