Effective Python: differences between bytes and str
- 2021-12-12 09:19:45
- OfStack
1. Python has two types that can represent sequences of characters
bytes
An instance contains raw data, that is, 8-bit unsigned values (usually displayed using the ASCII encoding standard)
str
An instance contains Unicode code points, which correspond to text characters in human languages
a = b'h\x6511o'
print(list(a))
print(a)
a = 'a\\u300 propos'
print(list(a))
print(a)
# Output result
[104, 101, 49, 49, 111]
b'he11o'
['a', '\\', 'u', '3', '0', '0', ' ', 'p', 'r', 'o', 'p', 'o', 's']
a\u300 propos
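For comparison, here is a small sketch of how the same text looks in each type. The string 'a\u0300 propos' (an 'a' followed by the combining grave accent U+0300) is chosen here for illustration and is not the literal used in the listing above.

```python
# str holds Unicode code points; bytes holds raw 8-bit values.
text = 'a\u0300 propos'      # 'a' followed by U+0300 COMBINING GRAVE ACCENT
raw = text.encode('utf-8')   # the same text as UTF-8 encoded bytes

print(len(text))       # 9 code points
print(len(raw))        # 10 bytes, because U+0300 takes two bytes in UTF-8
print(list(raw)[:3])   # iterating over bytes yields integers
```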
2. Converting between Unicode data and binary data
To convert Unicode data to binary data, you must call the encode method of str (encoding).
To convert binary data to Unicode data, you must call the decode method of bytes (decoding).
When you call these methods, you can specify the character encoding explicitly, or rely on the default, which is UTF-8 in Python 3.
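A minimal sketch of both directions follows. The helper names to_str and to_bytes are hypothetical, introduced here only to show one way of normalizing input at the edges of a program; the sample text is also an assumption.

```python
def to_str(bytes_or_str, encoding='utf-8'):
    """Return a str, decoding the argument if it is bytes."""
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode(encoding)
    return bytes_or_str


def to_bytes(bytes_or_str, encoding='utf-8'):
    """Return bytes, encoding the argument if it is str."""
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode(encoding)
    return bytes_or_str


encoded = 'café'.encode('utf-8')    # str -> bytes
decoded = encoded.decode('utf-8')   # bytes -> str

print(repr(encoded))
print(decoded)
print(repr(to_str(b'foo')), repr(to_bytes('bar')))
```

With helpers like these, the core of a program can work on str only, while encoding and decoding happen at the boundaries.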
3. Using raw 8-bit values together with Unicode strings
There are two issues to watch for when mixing raw 8-bit values with Unicode strings, that is, when using bytes and str together.
3.1 Issue 1: bytes and str instances are incompatible
Using the + operator:
# bytes+bytes
print(b'a' + b'1')
# str+str
print('b' + '2')
# Output result
b'a1'
b2
bytes + bytes and str + str are both allowed, but bytes + str raises an error.
# bytes+str
print('c' + b'2')
# Output result
print('c' + b'2')
TypeError: can only concatenate str (not "bytes") to str
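One way to avoid this error is to convert one operand so that both sides have the same type before concatenating. The sketch below assumes the bytes value holds UTF-8 text; the variable names are illustrative.

```python
prefix = 'c'
suffix = b'2'

# Decode the bytes so both operands are str.
as_text = prefix + suffix.decode('utf-8')

# Or encode the str so both operands are bytes.
as_bytes = prefix.encode('utf-8') + suffix

print(as_text)
print(as_bytes)
```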
You can also use binary operators to compare instances of the same type:
assert b'c' > b'a'
assert 'c' > 'a'
But comparing a bytes instance with a str instance using a binary operator also raises an error:
assert b'c' > 'a'
# Output result
assert b'c' > 'a'
TypeError: '>' not supported between instances of 'bytes' and 'str'
Testing bytes and str instances for equality
Comparing instances of the two types always evaluates to False, even if they contain exactly the same characters.
# Compare str and bytes for equality
print('a' == b'a')
# Output result
False
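To compare the contents rather than the types, convert one side first. This is a sketch that assumes the bytes value is UTF-8 encoded text.

```python
text = 'a'
raw = b'a'

print(text == raw)                   # different types: False
print(text == raw.decode('utf-8'))   # compare as str
print(text.encode('utf-8') == raw)   # compare as bytes
```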
Using %s in format strings
Instances of both types can appear on the right side of the % operator, to replace %s in the format string on the left.
However, if the format string is of type bytes, you cannot use a str instance to replace %s, because Python doesn't know which character encoding to use to encode the str.
# bytes format string % str
print(b'red %s' % 'blue')
# Output result
print(b'red %s' % 'blue')
TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'
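Encoding the str explicitly before formatting removes the ambiguity. The sketch below assumes UTF-8 is the intended encoding.

```python
color = 'blue'

# Encode the str first so the bytes format string receives bytes.
message = b'red %s' % color.encode('utf-8')

print(message)
```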
Conversely, if the format string is of type str, you can use a bytes instance to replace %s, but the result may not be what you expect.
# str format string % bytes
print('red %s' % b'blue')
# Output result
red b'blue'
This makes the system call the __repr__ method on the bytes instance and substitute the result for %s in the format string, so the program outputs b'blue' literally instead of the text blue.
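If the plain text is what you want in the result, decode the bytes before formatting. This sketch again assumes the bytes value contains UTF-8 text.

```python
raw_color = b'blue'

# Decode first so %s receives a str rather than the repr of a bytes object.
message = 'red %s' % raw_color.decode('utf-8')

print(message)
```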
3.2 Issue 2: file handles expect Unicode strings by default
You cannot write raw bytes data to a file opened in text mode; doing so raises an error:
# Write binary data
with open('test.txt', "w+") as f:
    f.write(b"\xf1\xf2")
# Output result
f.write(b"\xf1\xf2")
TypeError: write() argument must be str, not bytes
The error occurs because w mode opens the file in text mode, where write() expects a str.
Changing the mode to wb lets the binary data be written normally:
with open('test.txt', "wb") as f:
    f.write(b"\xf1\xf2")
# Read the binary data back from the file
with open('test.txt', "r+") as f:
    f.read()
# Output result
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: invalid continuation byte
The error occurs because r mode reads the file in text mode.
When a file handle is in text mode, the system uses the default text encoding to interpret binary data: reads decode the data into a str with bytes.decode, and writes encode a str into binary data with str.encode.
On most systems the default text encoding is UTF-8, so the system tries to decode b'\xf1\xf2' as UTF-8, which fails with the error above because those bytes are not valid UTF-8.
Changing the mode to rb reads the binary data normally.
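A minimal sketch of the binary round trip follows. It uses a temporary directory instead of the test.txt in the working directory, which is an assumption made only to keep the example self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.txt')

    # 'wb' accepts bytes; no text encoding is applied.
    with open(path, 'wb') as f:
        f.write(b'\xf1\xf2')

    # 'rb' returns the raw bytes without trying to decode them.
    with open(path, 'rb') as f:
        data = f.read()

print(data)
```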
Another fix is to pass the encoding parameter to open to state explicitly which encoding the text uses, so that no exception is raised.
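The sketch below reads the same two bytes in text mode with an explicit encoding. The choice of cp1252 is an assumption for illustration; use whatever encoding the file was actually written in. A temporary directory stands in for the working directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.txt')

    with open(path, 'wb') as f:
        f.write(b'\xf1\xf2')

    # Text mode with an explicit encoding (cp1252 is assumed here),
    # so the platform default encoding is not used.
    with open(path, 'r', encoding='cp1252') as f:
        text = f.read()

print(text)
```

In cp1252 the bytes 0xf1 and 0xf2 map to the characters ñ and ò, so the read succeeds and returns a two-character str.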
Note: one line of Python run from the command line (for example, in cmd) shows the default character encoding of the current operating system.
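The original command was lost from this page, so the following is only a sketch of one way to check, using the standard locale and sys modules. locale.getpreferredencoding(False) reports the locale encoding that open() falls back to in text mode, while sys.getdefaultencoding() reports the default used by str.encode and bytes.decode.

```python
import locale
import sys

preferred = locale.getpreferredencoding(False)   # default for open() in text mode
default_codec = sys.getdefaultencoding()         # default for str.encode / bytes.decode

print(preferred)
print(default_codec)
```

The first call can also be run as a one-liner from a shell, for example: python -c "import locale; print(locale.getpreferredencoding(False))"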