A thorough understanding of encoding issues in Python


Python is very powerful for processing text, but if you are a beginner and don't understand Python's encoding mechanism, you will often run into garbled text or decode errors. The purpose of this article is to briefly explain Python's encoding mechanism and to give some suggestions.

Question 1: What's the problem?

Questions are our targets: if we don't have a question in mind while studying, we will miss the point.

The programming environment used in this article is CentOS 6.7 with Python 2.7. We type python in the shell to open the Python command line and enter the following two lines:


 s = " China zg"
 e = s.encode("utf-8")

Now the question is: does this code work?

The answer is no. The following errors will be reported:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

Pay attention to the 0xe4 in the error message; it is the key to analyzing what went wrong.

I believe many people have encountered this error, which brings us to a new question.

Question 2: Why?

To understand why, let's take a closer look at how these two lines are executed:

First, we typed 中国zg into the Python command-line interpreter using the keyboard, put English double quotation marks around it, and assigned the result to the variable s. Doesn't that seem perfectly ordinary? In fact, there is a lot going on underneath.

When we type characters into a program through the keyboard, we do so through the operating system. The 中国zg we see on the screen is actually feedback from the operating system to us humans, saying: "Hey, buddy, you typed the characters 中国zg into the program."

What does the operating system feed to the program? The answer is a string of 0s and 1s. What does this 01 string look like, and how is it produced?

The answer is that the operating system uses its own default encoding to encode 中国zg and hands the resulting 01 string to the program.

The default encoding of the CentOS system we are using is UTF-8, so as long as we know the UTF-8 encoding of each character in 中国zg, we know what the 01 string is.

After looking them up, we obtain their encodings (in hexadecimal and binary):

Character   UTF-8 (hex)   UTF-8 (binary)
中          E4 B8 AD      11100100 10111000 10101101
国          E5 9B BD      11100101 10011011 10111101
z           7A            01111010
g           67            01100111
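If you want to check these bytes yourself, here is a minimal sketch under Python 2.7 on the UTF-8 CentOS shell described above; binascii is a standard-library module, and the printed values assume the terminal really sends UTF-8 bytes.

# -*- coding: utf-8 -*-
# Minimal sketch (Python 2.7): inspect the raw bytes the OS handed to the program.
import binascii

s = "中国zg"                      # a byte string: 8 bytes under UTF-8
print len(s)                      # 8
print binascii.hexlify(s)         # e4b8ade59bbd7a67
print [bin(ord(b)) for b in s]    # the same bytes in binary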

Now we know what the 01 string that the operating system passes to the program looks like. And then, what does the program do with it?

When the program sees the 01 string enclosed in double quotes, it knows that the 01 string is a string. The string is then assigned to s.

So, that's the logic of the first line.

Now let's move on to the second line.

e = s.encode("utf-8") means: encode the string s with UTF-8 and assign the encoded result to e. The problem is that the program knows the 01 string in s is a string, but it does not know what encoding that string is in. We must know the existing encoding of the 01 string in order to work out which characters it contains, and only then can we re-encode it with a new encoding such as UTF-8. Yet the operating system handed the 01 string to the program without saying what encoding it used.

At this point, the Python program falls back to its own default encoding to interpret the contents of s. That default encoding is ASCII, so it tries to interpret the 01 string as ASCII, recover the characters, and then convert them to UTF-8.

The first byte the program hits is E4 (11100100). Oops! There is no such value in ASCII, because every ASCII byte has its highest bit set to 0.

What can it do?

Raise an error, which is exactly the error we saw above.

The 0xe4 in the error message is the first byte of the UTF-8 encoding of the character 中.
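To make the implicit step visible, here is a sketch under Python 2.7: calling encode on a byte string first decodes it with the default ASCII codec and only then encodes, which is where the UnicodeDecodeError comes from. sys.getdefaultencoding is a real standard-library call; the rest is only an illustration.

# -*- coding: utf-8 -*-
# Sketch (Python 2.7): encode() on a byte string first decodes it with the
# default codec, which is where the UnicodeDecodeError actually comes from.
import sys

print sys.getdefaultencoding()        # 'ascii'

s = "中国zg"                           # byte string containing UTF-8 bytes
try:
    s.encode("utf-8")                 # roughly: s.decode('ascii').encode('utf-8')
except UnicodeDecodeError as e:
    print e                           # 'ascii' codec can't decode byte 0xe4 ...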

Question 3: How?

Now that we know what the problem is, how do we solve it?

Obviously, we just have to tell the program that the 01 string in s is encoded in UTF-8, and it should work correctly.
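In code, "telling the program the encoding" simply means decoding explicitly before re-encoding. A minimal sketch for the interactive session above:

# -*- coding: utf-8 -*-
# Sketch (Python 2.7): decode with the real encoding first, then re-encode.
s = "中国zg"                 # UTF-8 bytes from the terminal
u = s.decode("utf-8")        # now a unicode object
e = u.encode("utf-8")        # works; no implicit ASCII step involved
print e == s                 # True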

But there is one problem with this approach: it's just not general enough.

Suppose I have a program that reads many text files, each with a different encoding. Would it have to keep track of the encoding of every file it reads? That would be quite complicated.

What's more, if the contents of these text files have to be compared with each other, concatenated, and so on, while their encodings differ, isn't that even more troublesome?

How does python solve this problem intelligently?

Simple, decode!

decode means: you have a string and you know its encoding, so as long as you decode the string with that encoding, Python will recognize the characters in it and build an array of ints storing the Unicode code point of each character.

By doing this for every string, you ensure that all strings obtained from various sources share a uniform representation inside the program, so they can be compared, joined, and otherwise manipulated very easily.

The int array described above is wrapped by Python into an object: the unicode object.
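You can look at this array of code points directly: calling the built-in ord() on each character of a unicode object returns its Unicode code point. A small sketch under Python 2.7:

# -*- coding: utf-8 -*-
# Sketch (Python 2.7): a unicode object holds code points, not encoded bytes.
s = "中国zg"
u = s.decode("utf-8")
print type(u)                      # <type 'unicode'>
print [ord(c) for c in u]          # [20013, 22269, 122, 103]
print [hex(ord(c)) for c in u]     # ['0x4e2d', '0x56fd', '0x7a', '0x67']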

Question 4: How to solve it?

Next, enter the following two lines of code on the python command line:


e = s.decode("utf-8")
isinstance(e, unicode)

The output is True, which shows that the e returned by decode is indeed a unicode object.

unicode here is a class; it is a built-in type in Python 2.

e is called a unicode string, meaning that it holds the Unicode code points of the characters and is not tied to any particular encoding.

We can then encode e with any suitable encoding, for example:


e.encode("utf-8")
e.encode("gbk")

As long as the encoding you choose can represent every character in e, this works; if not, an error is raised.

For example, if you try this:

e.encode("ascii")

Since ASCII cannot encode the two characters 中国, a UnicodeEncodeError is raised.
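A quick sketch of that failure under Python 2.7; the try/except is only there to show the exception type:

# -*- coding: utf-8 -*-
# Sketch (Python 2.7): encoding fails when the target codec cannot represent a character.
e = "中国zg".decode("utf-8")
try:
    e.encode("ascii")
except UnicodeEncodeError as err:
    print err                      # 'ascii' codec can't encode characters in position 0-1 ...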

So far, we've seen two types of errors, decode error and encode error, and solved them.

Question 5: How do you evaluate the character encoding process of python?

First of all, this approach is very simple. Any text that enters the program is decoded once into a unicode object, which stores the Unicode code point of each character as an int. When the text is ready to be output, it is encoded once more into whatever encoding we need.

The question is: isn't it a waste of space to use an int for every character? After all, under ASCII an English character takes only one byte.

It does cost a little more space, but memory is plentiful these days, and the unicode objects are only used inside the program; when strings are written to a file or sent over the network, we encode them appropriately.
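For example, when writing to a file, we keep unicode inside the program and encode only at the I/O boundary. A sketch under Python 2.7; the file name out.txt is just an example:

# -*- coding: utf-8 -*-
# Sketch (Python 2.7): keep unicode inside the program, encode at the I/O boundary.
u = "中国zg".decode("utf-8")       # unicode inside the program

with open("out.txt", "wb") as f:   # "out.txt" is a hypothetical example name
    f.write(u.encode("utf-8"))     # encode once, right before writing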

One more question: what about string literals written in the program itself? Do we have to decode them every time we use them? Different operating systems use different default encodings: on Linux we usually need to decode with UTF-8, while on Windows we would decode with GBK. Written that way, our code would only run on one particular platform.

Python gives us a very simple way: prefix the string literal with u, and Python automatically decodes it into a unicode object, using the encoding declared for the source file (or the terminal's encoding in interactive mode).
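For example, in a script saved as UTF-8 with the usual coding declaration, the u prefix yields a unicode object directly (Python 2.7):

# -*- coding: utf-8 -*-
# Sketch (Python 2.7): the u prefix makes the literal a unicode object,
# decoded using the encoding declared above (here UTF-8).
u = u"中国zg"
print isinstance(u, unicode)       # True
print [hex(ord(c)) for c in u]     # ['0x4e2d', '0x56fd', '0x7a', '0x67']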

Question 6: To sum up, what have we learned?

This article has examined Python's encoding problems in detail, starting from a very common error. We have seen how simply Python handles character issues, and we can understand why Python has such powerful text-processing capabilities.

Quiz question: See if you really understand.

Suppose there is a file a.txt on a Linux machine. The file contains the two characters "中国" and its encoding is UTF-8.

Now write the following statements in a Python program:


import codecs

s = ""
with codecs.open("a.txt", encoding="utf-8") as f:
    s = f.readline().strip()
with open("b.txt", "w") as f:
    f.write(s)

Can this code run successfully? Why or why not?

Answer: No!

s is a unicode object, and Python must encode it when writing it out. The default ASCII encoding cannot encode the two characters "中国", so an error is raised!
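One way to make the quiz code work is to encode the unicode string explicitly before writing it out (opening b.txt with codecs.open would also do). A sketch, assuming the same a.txt as above:

# -*- coding: utf-8 -*-
# Sketch (Python 2.7): encode explicitly before writing, so the default
# ASCII codec is never involved.
import codecs

with codecs.open("a.txt", encoding="utf-8") as f:
    s = f.readline().strip()          # s is a unicode object

with open("b.txt", "wb") as f:
    f.write(s.encode("utf-8"))        # explicit encode; no UnicodeEncodeError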
