String manipulation and Unicode encoding in Python in detail
- 2020-05-24 05:47:43
- OfStack
This article introduces Python string operations and Unicode encoding. Without further ado, let's take a look.
String types
str
: Unicode string. Literals written as '' or r'' (raw) are str, and the single quotes can be replaced by double quotes or triple quotes. Whichever form you use, the string is stored the same way inside Python.
bytes
: binary string. The contents of non-text files such as jpg cannot be represented as str, so bytes is used for them. Each element of a bytes object is an integer between 0 and 255. When printed, Python shows the bytes that fall in the ASCII range as ASCII characters for easier reading. bytes supports almost all str methods except format() (% formatting works on bytes since Python 3.5), and it can even be used with the re module.
bytearray()
: a binary string that can be modified in place.
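The differences between the three types can be illustrated with a short sketch. The sample text 'héllo' and the variable names are chosen here for illustration only and do not come from the original article.

```python
import re

text = 'héllo'                 # str: a sequence of Unicode code points
data = text.encode('utf-8')    # bytes: immutable sequence of integers 0..255
buf = bytearray(data)          # bytearray: mutable counterpart of bytes

text_len = len(text)           # counts characters
data_len = len(data)           # counts bytes; 'é' needs two bytes in UTF-8
first_byte = data[0]           # indexing bytes yields an int, not a 1-char string

buf[0] = ord('H')              # in-place assignment works on bytearray
try:
    data[0] = ord('H')         # ...but bytes is immutable
    bytes_mutable = True
except TypeError:
    bytes_mutable = False

# Many str methods and the re module also accept bytes (with bytes patterns).
upper = data.upper()           # only ASCII letters are affected
match = re.search(rb'l+', data)

print(text_len, data_len, first_byte, bytes(buf), bytes_mutable, upper, match.group())
```

Note that re patterns and their subjects must be of the same type: a bytes pattern for bytes data, a str pattern for str data.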
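UTF-8 encoding ranges
Range | Bytes | Storage format |
0x0000~0x007F (0 ~ 127) | 1 byte | 0xxxxxxx |
0x0080~0x07FF (128 ~ 2047) | 2 bytes | 110xxxxx 10xxxxxx |
0x0800~0xFFFF (2048 ~ 65535) | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx |
0x10000~0x1FFFFF (65536 ~ 2097151) | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
0x200000~0x3FFFFFF | 5 bytes | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
0x4000000~0x7FFFFFFF | 6 bytes | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
The 5-byte and 6-byte forms belong to the original UTF-8 design. The current standard (RFC 3629) stops at U+10FFFF, so valid UTF-8 uses at most 4 bytes per character and Python never produces the longer forms.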
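To make the bit layout concrete, here is a minimal sketch of a hand-written encoder covering the 1-byte to 4-byte rows. The function name utf8_encode_code_point is hypothetical; in real code you would simply call str.encode('utf-8'), which the sketch uses as a cross-check.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point by hand, following the 1- to 4-byte layouts."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError('not an encodable code point')
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

samples = ['A', 'é', '中', '\U0001F600']   # one sample per row of the layout
lengths = []
for ch in samples:
    encoded = utf8_encode_code_point(ord(ch))
    assert encoded == ch.encode('utf-8')  # cross-check against the built-in codec
    lengths.append(len(encoded))

print(lengths)  # [1, 2, 3, 4]
```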
Byte order mark (BOM)
BOM is short for byte order mark.
Rules when writing
Python does not write a BOM header when writing a file with the 'utf-8' encoding, but if you specify the 'utf-8-sig' encoding, Python writes a BOM header at the start of the file.
Using 'utf-16-be' does not write a BOM header, but using 'utf-16' does.
>>> open('h.txt','w',encoding='utf-8-sig').write('aaa')
3
>>> open('h.txt','rb').read()
b'\xef\xbb\xbfaaa'
>>> open('h.txt','w',encoding='utf-16').write('bbb')
3
>>> open('h.txt','rb').read()
b'\xff\xfeb\x00b\x00b\x00'
>>> open('hh.txt','w',encoding='utf-16-be').write('ccc')
3
>>> open('hh.txt','rb').read()
b'\x00c\x00c\x00c'
>>> open('h.txt','w',encoding='utf-8').write('ddd')
3
>>> open('h.txt','rb').read()
b'ddd'
Rules when reading
If the correct encoding is specified, the BOM is skipped. Otherwise the BOM shows up as garbled characters or an exception is raised.
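>>> open('h.txt','w',encoding='utf-8-sig').write('ddd')
3
>>> open('h.txt','r').read() # decoded with the platform default (GBK on this machine), so the BOM turns into garbage
'锘縟dd'
>>> open('h.txt','r',encoding='utf-8-sig').read()
'ddd'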
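When the encoding of a file is not known in advance, one option is to look at its first bytes and pick a codec from the BOM. The following is a minimal sketch under that assumption; sniff_bom is a hypothetical helper name, and the BOM constants come from the standard codecs module.

```python
import codecs
import os
import tempfile

def sniff_bom(path):
    """Return a codec name guessed from the BOM, or None if no BOM is found."""
    with open(path, 'rb') as f:
        head = f.read(4)
    # UTF-32 must be tested first: its little-endian BOM starts with the UTF-16 one.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'utf-32'
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    return None

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'h.txt')

    with open(path, 'w', encoding='utf-8-sig') as f:   # file with a BOM
        f.write('ddd')
    guessed = sniff_bom(path)
    with open(path, 'r', encoding=guessed or 'utf-8') as f:
        content = f.read()                             # BOM is consumed by the codec

    with open(path, 'w', encoding='utf-8') as f:       # file without a BOM
        f.write('ddd')
    guessed_plain = sniff_bom(path)

print(guessed, content, guessed_plain)
```

A missing BOM says nothing about the encoding, so the fallback to 'utf-8' here is only a default for this sketch, not a reliable detection.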
Encoding and decoding
>>> ord('中') #20013
>>> chr(20013) #'中'
Writing Unicode code points directly in a string literal.
'\xhh' : 1 character given by 2 hexadecimal digits
'\uhhhh' : 1 character given by 4 hexadecimal digits
'\Uhhhhhhhh' : 1 character given by 8 hexadecimal digits
>>> s = 'py\x74h\u4e2don' #'pyth中on'
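The three escape forms can be compared side by side. This is a small illustrative sketch; the emoji code point U+1F600 and the variable names are examples picked here, not taken from the original text.

```python
a = '\x74'           # 2 hex digits: U+0074, the letter 't'
b = '\u4e2d'         # 4 hex digits: U+4E2D, '中'
c = '\U0001F600'     # 8 hex digits: U+1F600, beyond the 4-digit range
s = 'py\x74h\u4e2don'
raw = r'\u4e2d'      # in a raw string the escape is kept as six literal characters

print(a == 't', ord(b), len(c), s, len(raw))
```

Each escape produces exactly one character regardless of how many hex digits it takes to write, which is why len(c) is 1.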
str, bytes, bytearray
str.encode(encoding='utf-8')
bytes(s,encoding='utf-8')
bytes.decode(encoding='utf-8')
str(B, encoding='utf-8')
bytearray(string, encoding='utf-8')
bytearray(bytes)
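The conversions listed above can be exercised as a round trip. A minimal sketch, with variable names chosen for illustration:

```python
s = 'pyth中on'

b1 = s.encode(encoding='utf-8')        # str -> bytes
b2 = bytes(s, encoding='utf-8')        # same result through the constructor
ba = bytearray(s, encoding='utf-8')    # str -> bytearray
ba2 = bytearray(b1)                    # bytes -> bytearray

s1 = b1.decode(encoding='utf-8')       # bytes -> str
s2 = str(b1, encoding='utf-8')         # same result through the constructor
s3 = str(b1)                           # no encoding given: returns the repr, not decoded text

print(b1, s1 == s, s3)
```

The last line is a common pitfall: calling str() on bytes without an encoding does not decode anything, it returns the text of the b'...' literal.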
Source file encoding declaration
Python uses UTF-8 as the default source file encoding.
# -*- coding: latin-1 -*-
: declares that the source file is encoded in latin-1.
Helper functions
sys.platform #'win32'
sys.getdefaultencoding() # 'utf-8'
sys.byteorder #'little'
s.isalnum() # s is a string
s.isalpha()
s.isdecimal()
s.isdigit()
s.isnumeric()
s.isprintable()
s.isspace()
s.isidentifier() # Returns True if the string is a valid identifier (can be used as a variable name)
s.islower()
s.isupper()
s.istitle()
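The three numeric predicates differ only for non-ASCII characters, which is easy to overlook. The sketch below compares them on a few sample characters chosen here for illustration; usable_as_variable_name is a hypothetical helper showing that isidentifier() alone does not exclude keywords.

```python
import keyword

# ASCII digit, fullwidth digit, superscript two, vulgar fraction, CJK numeral
samples = ['5', '５', '²', '½', '五']
table = {ch: (ch.isdecimal(), ch.isdigit(), ch.isnumeric()) for ch in samples}
for ch, flags in table.items():
    print(repr(ch), flags)   # (isdecimal, isdigit, isnumeric)

def usable_as_variable_name(name):
    """isidentifier() accepts keywords such as 'class', so filter them out too."""
    return name.isidentifier() and not keyword.iskeyword(name)

checks = {name: usable_as_variable_name(name)
          for name in ['my_var', '2x', 'class', '变量']}
print(checks)
```

In short, isdecimal() is the strictest of the three and isnumeric() the most permissive.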
Conclusion
That is all for this article. I hope it is of some help in your study or work. If you have questions, feel free to leave a message.