Effective Python: Differences Between bytes and str

2021-12-12 09:19:45
OfStack

Contents
1. Python has two types that represent character sequences
2. Converting between Unicode data and binary data
3. Working with raw 8-bit values and Unicode strings
3.1 Problem 1: bytes and str instances are incompatible with each other
3.2 Problem 2: File handles require Unicode strings by default

1. Python has two types that represent character sequences

A bytes instance contains raw data, that is, unsigned 8-bit values (usually displayed using the ASCII encoding).
A str instance contains Unicode code points, which represent text characters from human languages.


a = b'h\x6511o'
print(list(a))
print(a)

a = 'a\\u300 propos'
print(list(a))
print(a)


#  Output result 
[104, 101, 49, 49, 111]
b'he11o'

['a', '\\', 'u', '3', '0', '0', ' ', 'p', 'r', 'o', 'p', 'o', 's']
a\u300 propos
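
As a complementary sketch (the names raw and text are illustrative and not from the original article), the following inspects a str containing a combining accent. It suggests that the elements of a bytes instance are integers from 0 to 255, while the elements of a str are code points that may take more than one byte once encoded.

```python
# Illustrative names; not taken from the original article.
raw = b'h\x65llo'          # \x65 is the byte value for 'e'
text = 'a\u0300 propos'    # U+0300 is a combining grave accent

byte_values = list(raw)                      # integers in the range 0..255
code_points = [hex(ord(ch)) for ch in text]  # one entry per code point

print(byte_values)
print(code_points)
# Number of code points versus number of bytes after UTF-8 encoding
print(len(text), len(text.encode('utf-8')))
```

The str has 9 code points but needs 10 bytes in UTF-8, because U+0300 is encoded as two bytes.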

2. Converting between Unicode data and binary data

To convert Unicode data to binary data, call the encode method of str (encoding).
To convert binary data to Unicode data, call the decode method of bytes (decoding).
When calling these methods, you can specify the character encoding explicitly or accept the default, which is UTF-8 in Python 3.
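
A common way to apply this is to convert at the boundaries of a program with two small helpers. The sketch below is one possible shape, assuming UTF-8 throughout; the function names to_str and to_bytes are illustrative and not part of the standard library.

```python
def to_str(bytes_or_str):
    """Return a str, decoding from UTF-8 if given bytes."""
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode('utf-8')
    return bytes_or_str  # already a str


def to_bytes(bytes_or_str):
    """Return bytes, encoding to UTF-8 if given a str."""
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode('utf-8')
    return bytes_or_str  # already bytes


print(repr(to_str(b'foo')))
print(repr(to_bytes('caf\u00e9')))
```

With helpers like these, the core of a program can work only with str, while bytes appear only where data enters or leaves.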

3. Working with raw 8-bit values and Unicode strings

There are two problems to watch for when working with raw 8-bit values and Unicode strings, that is, when using bytes and str:

3.1 Problem 1: bytes and str instances are incompatible with each other

Using the + operator:


# bytes+bytes
print(b'a' + b'1')

# str+str
print('b' + '2')


#  Output result 
b'a1'
b2

bytes + bytes and str + str are both allowed, but bytes + str raises an error


# str+bytes
print('c' + b'2')


#  Output result 
print('c' + b'2')
TypeError: can only concatenate str (not "bytes") to str
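
One way around this error is to convert one operand so that both have the same type. The following is a minimal sketch, assuming the bytes hold UTF-8 encoded text; the variable names are illustrative.

```python
prefix = 'c'
suffix = b'2'

# Decode the bytes so that both operands are str
as_text = prefix + suffix.decode('utf-8')

# Or encode the str so that both operands are bytes
as_bytes = prefix.encode('utf-8') + suffix

print(as_text)
print(as_bytes)
```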

You can also use binary operators to compare instances of the same type


assert b'c' > b'a'

assert 'c' > 'a'

But using a binary comparison operator between a bytes instance and a str instance also raises an error


assert b'c' > 'a'


#  Output result 
    assert b'c' > 'a'
TypeError: '>' not supported between instances of 'bytes' and 'str'

Testing bytes and str instances for equality
Comparing instances of the two types for equality always evaluates to False, even when they contain exactly the same characters


# Compare a str with a bytes instance
print('a' == b'a')


#  Output result 
False

Formatting with %s in a format string

Instances of both types can appear to the right of the % operator to replace %s in the format string on the left

However, if the format string is of type bytes, you cannot use a str instance to replace %s, because Python does not know which encoding to use to convert the str


# bytes format string with a str argument
print(b'red %s' % 'blue')


#  Output result 
    print(b'red %s' % 'blue')
TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

Conversely, if the format string is of type str, you can use a bytes instance to replace %s, but the result may not be what you expect


# str format string with a bytes argument
print('red %s' % b'blue')


#  Output result 
red b'blue'

This makes Python call the __repr__ method on the bytes instance.
The result of that call replaces %s in the format string, so the program prints b'blue' literally instead of the text blue
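
To get the text itself into the result, the bytes can be decoded before formatting, or the whole operation can stay in bytes. The sketch below assumes the bytes hold UTF-8 text; the variable names are illustrative.

```python
color = b'blue'

# str format string with bytes: repr(color) is inserted
unexpected = 'red %s' % color

# Decode first so the text itself is inserted
expected = 'red %s' % color.decode('utf-8')

# Alternatively, keep both sides as bytes
all_bytes = b'red %s' % color

print(unexpected)
print(expected)
print(all_bytes)
```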

3.2 Problem 2: File handles require Unicode strings by default

By default, a file handle returned by open works in text mode, which accepts str and cannot be given raw bytes

Writing binary data to a file opened in text mode raises an error:


# Write binary data
with open('test.txt', "w+") as f:
    f.write(b"\xf1\xf2")


#  Output result 
    f.write(b"\xf1\xf2")
TypeError: write() argument must be str, not bytes

The error occurs because w mode writes in text mode, which requires str. Change the mode to wb and binary data can be written normally


with open('test.txt', "wb") as f:
    f.write(b"\xf1\xf2")

Reading binary data from the file in text mode also raises an error:


# Read binary data
with open('test.txt', "r+") as f:
    f.read()


#  Output result 
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: invalid continuation byte

The error occurs because r mode reads in text mode. When a file handle is in text mode, the system uses the default text encoding to process binary data: data read from the file is decoded into a str with bytes.decode, and a str being written is encoded into binary values with str.encode. On most systems the default text encoding is UTF-8, so the system tries to decode b'\xf1\xf2' as UTF-8. These bytes are not valid UTF-8, so the error above occurs

Change the mode to rb to read binary data normally:


# Read binary data
with open('test.txt', "rb") as f:
    print(f.read())


#  Output result 
b'\xf1\xf2'

Another modification is to set the encoding parameter to specify the string encoding:


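# Read in text mode with an explicit encoding
# (cp1252 is used here as an example of an encoding that can decode these bytes)
with open('test.txt', "r", encoding="cp1252") as f:
    print(f.read())


#  Output result 
ñò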

This way no exception is raised
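
The write and read cases above can be combined into one round trip. This is a sketch under stated assumptions: the temporary directory, the file name test.bin, and the choice of cp1252 as the explicit encoding are all illustrative, not taken from the original article.

```python
import os
import tempfile

payload = b'\xf1\xf2'

# A temporary directory keeps the sketch from touching real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.bin')  # hypothetical file name

    # Binary mode accepts bytes directly
    with open(path, 'wb') as f:
        f.write(payload)

    # Binary mode returns the bytes unchanged
    with open(path, 'rb') as f:
        binary_data = f.read()

    # Text mode with an encoding that can decode these bytes
    with open(path, 'r', encoding='cp1252') as f:
        text_data = f.read()

    # Text mode with UTF-8 cannot decode these bytes
    try:
        with open(path, 'r', encoding='utf-8') as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

print(binary_data, repr(text_data), utf8_failed)
```

Passing encoding explicitly, rather than relying on the system default, makes the behavior the same on every machine.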

Note: to check the operating system's current default character encoding, a one-line Python command is enough

In cmd:
python -c "import locale; print(locale.getpreferredencoding())"
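
The same check can be made from inside a script. The sketch below is an assumption about how one might inspect it, not the article's own code: it compares the encoding that open picks when none is given with the encoding reported by the locale module, using a hypothetical probe file in a temporary directory.

```python
import locale
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'probe.txt')  # hypothetical file name

    # No encoding given: Python chooses a platform-dependent default
    with open(path, 'w') as f:
        default_file_encoding = f.encoding

    # Encoding given: the handle reports exactly what was requested
    with open(path, 'w', encoding='utf-8') as f:
        explicit_file_encoding = f.encoding

print(default_file_encoding)
print(explicit_file_encoding)
print(locale.getpreferredencoding(False))
```

The first and last values depend on the machine and its configuration, which is why they are printed rather than assumed.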