An in-depth analysis of Unicode and bytes in Python 3

  • 2021-07-09 08:35:02
  • OfStack

Recently, I wrote a few Python 3 programs and kept running into the bytes type in many places. It has no real counterpart in Python 2 (where bytes is merely an alias for str), and it is one of the most notable differences between Python 3 and Python 2.

In the past, when writing Python 2 code, we often ran into encoding errors, because Python 2 did not support Unicode very well. In Python 3, string literals are Unicode by default: unless you explicitly define a bytes object, the interpreter treats text as Unicode at run time, which reduces the chance of errors.

In Python 3 there are two string-like types. The default is str, that is, Unicode, also called the text type. However, a program always performs I/O (disk, network), and I/O deals with binary data, which Python 3 represents with the bytes type. A bytes object is a sequence of single bytes, each an integer from 0 to 255 (0 <= value < 256).
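As a small sketch of what "a sequence of integers" means in practice (the variable names here are only illustrative), indexing a bytes object gives an int, slicing gives bytes, and values outside the single-byte range are rejected:

```python
data = bytes([72, 105, 255])   # build bytes from integers, each in range(256)
first = data[0]                # indexing returns an int, not a 1-byte string
chunk = data[0:2]              # slicing returns a bytes object

try:
    bytes([256])               # 256 does not fit in a single byte
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(first, chunk, list(data), out_of_range_rejected)
```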

There are two ways to define a bytes object explicitly, for example:


# A bytes literal may contain only ASCII characters; other bytes need \x escapes
x = b'abc'
y = b'\xe6\x88\x91'
print(x, y)
# Encode a Unicode string using a specific encoding
t = bytes("我们", "UTF-8")
# Prints b'\xe6\x88\x91\xe4\xbb\xac'
# Each of these Chinese characters occupies 3 bytes in UTF-8
print(t)
# Prints 6: for bytes, len() is the length of the byte sequence
print(len(t))
# Prints 2: for str, len() counts characters
print(len("我们"))

Next, consider conversion between str and bytes. For example, after reading binary data from the network, you must explicitly convert it to str. Python never converts implicitly between str and bytes. This may seem like extra trouble, but it reduces the chance of mistakes and forces you to be clear about what you want.
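A minimal sketch of what "no implicit conversion" looks like (the variable names are illustrative): mixing the two types raises an error, and a str never compares equal to a bytes object, even when they look alike.

```python
text = "abc"
raw = b"abc"

try:
    text + raw                 # str + bytes is not allowed
    mixed_ok = True
except TypeError:
    mixed_ok = False

equal = (text == raw)          # comparison is allowed but is always False
explicit = text + raw.decode("ascii")   # convert explicitly, then combine

print(mixed_ok, equal, explicit)
```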

To convert str to bytes, you must choose an encoding, which determines how the text is represented as binary data, for example:


x = "我"
y = x.encode("UTF-8")
z = x.encode("GBK")
# Prints b'\xe6\x88\x91' b'\xce\xd2'
print(y, z)

To convert bytes to str, you also need an encoding, and you must know which encoding the binary data actually uses. If you pick the wrong one, decoding to Unicode fails or yields garbled text. Python itself does not care which encoding the data uses: a bytes object is just a sequence of bytes. For example:


x = b'\xe6\x88\x91'
# Decodes correctly, because the bytes are UTF-8
print(x.decode("UTF-8"))
# Raises UnicodeDecodeError: these bytes are not a valid GBK sequence
print(x.decode("GBK"))
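The following sketch shows how such a decoding failure might be handled. It uses the standard errors parameter of bytes.decode, and the variable names are only illustrative. It also shows that a wrong codec does not always raise: sometimes it silently produces garbled text.

```python
raw = "我".encode("UTF-8")      # b'\xe6\x88\x91'

# Strict decoding (the default) raises on bytes that are invalid in the codec.
try:
    raw.decode("GBK")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for whatever cannot be decoded.
lenient = raw.decode("GBK", errors="replace")

# These six bytes happen to form valid GBK, so decoding succeeds,
# but the result is not the original text.
garbled = "我们".encode("UTF-8").decode("GBK")

print(strict_failed, "\ufffd" in lenient, garbled == "我们")
```

Catching UnicodeDecodeError is usually preferable to errors="replace" when the data matters, since replacement discards information.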

In short: use Unicode internally and bytes externally. Many functions in Python's standard library document whether they require str or bytes (strictly speaking, bytes-like objects such as bytes and bytearray), so read the documentation carefully when writing code. For example, the new function of the hmac module requires:


hmac.new(key, msg=None, digestmod=None)
key is a bytes or bytearray object giving the secret key
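A brief sketch of calling it with properly encoded arguments follows. The key and message values are made up for illustration, and digestmod is passed explicitly because recent Python versions require it.

```python
import hashlib
import hmac

key = "secret-key".encode("UTF-8")   # hypothetical key, encoded to bytes
msg = "我们".encode("UTF-8")          # the message must be bytes as well

mac = hmac.new(key, msg, digestmod=hashlib.sha256)
digest = mac.hexdigest()             # hex digest is returned as a str

# Passing a str key is rejected instead of being encoded implicitly.
try:
    hmac.new("secret-key", msg, digestmod=hashlib.sha256)
    str_key_ok = True
except TypeError:
    str_key_ok = False

print(len(digest), str_key_ok)
```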

Many libraries, especially third-party ones such as requests, do a lot of conversion internally in order to support both Python 2 and Python 3, so you may not even notice that the bytes type exists. This improves productivity, but it does not help you understand Python.

To fully understand how bytes and str are used, look at the built-in open function and the write method of the file objects it returns.

When a file is opened in text mode, Python automatically decodes its contents to str, for example:


file = "t.txt"
with open(file, mode="r") as f:
    t = f.read()   # t is a str

If a file is opened in binary mode, its contents are read as bytes and must be converted to str before being displayed as text at the terminal, for example:


file = "t.txt"
with open(file, mode="rb") as f:
    t = f.read()    # t is bytes
print(t.decode())   # decode() defaults to UTF-8
print(t, type(t))

If a file is opened for writing in binary mode, bytes data is written directly, for example:


file = "t.txt"
# The with block closes the file, so the data is flushed to disk
with open(file, mode="wb") as f:
    f.write(b'\xe6\x88\x91')

None of the file examples above specifies an encoding. If no encoding is given explicitly, text-mode open generally uses the value returned by locale.getpreferredencoding(), while bytes.decode() defaults to UTF-8.
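To avoid depending on the platform default, the encoding can be passed to open explicitly. The sketch below assumes a throwaway temporary file (the path is hypothetical) and checks that text written as UTF-8 comes back as the expected bytes in binary mode:

```python
import locale
import os
import tempfile

# The default a text-mode open() would generally use on this machine.
default_encoding = locale.getpreferredencoding(False)

path = os.path.join(tempfile.mkdtemp(), "t.txt")   # hypothetical temp file

with open(path, mode="w", encoding="UTF-8") as f:
    f.write("我们")              # str in, encoded to UTF-8 on disk

with open(path, mode="rb") as f:
    raw = f.read()               # bytes out, exactly as stored

with open(path, mode="r", encoding="UTF-8") as f:
    text = f.read()              # decoded back to str

print(default_encoding, raw, len(text))
```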

Summary

