Details of string operations and Unicode encoding in Python

2020-05-24 05:47:43
OfStack

This article introduces some basics of string operations and Unicode encoding in Python. Without further ado, readers who need it can follow along below.

String types

str : Unicode string. Literals written as '' or r'' are str, and the single quotes can be replaced by double quotes or triple quotes. Whichever form you use, the string is stored the same way inside Python.

bytes : binary string. Because the contents of non-text files such as jpg images cannot be represented as str, bytes is used for them. Each element of a bytes object is an integer between 0 and 255. When printed, Python shows the bytes that fall in the ASCII range as ASCII characters for easier reading. bytes supports almost all str methods except format(), and it can even be used with the re module.

bytearray : a binary string that can be modified in place.
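
The difference between the three types can be seen in a short sketch. The sample text and the names s, b, and ba are only illustrative assumptions, not part of the original article.

```python
import re

# Illustrative values only; s, b, and ba are hypothetical names.
s = 'héllo'              # str: a sequence of Unicode code points
b = s.encode('utf-8')    # bytes: an immutable sequence of integers 0..255
ba = bytearray(b)        # bytearray: the same data, but mutable

print(len(s), len(b))    # 'é' needs two bytes in UTF-8, so the lengths differ
print(b[0], b[:1])       # indexing yields an int, slicing yields bytes

ba[0] = ord('H')         # change the first byte in place
print(ba.decode('utf-8'))

# The re module accepts bytes as long as the pattern is bytes too.
matches = re.findall(rb'l+', b)
print(matches)
```

Trying the same item assignment on b would raise a TypeError, because bytes is immutable.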

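UTF-8 encoding ranges

Range	Bytes	Storage format
0x0000~0x007F (0 ~ 127)	1 byte	0xxxxxxx
0x0080~0x07FF (128 ~ 2047)	2 bytes	110xxxxx 10xxxxxx
0x0800~0xFFFF (2048 ~ 65535)	3 bytes	1110xxxx 10xxxxxx 10xxxxxx
0x10000~0x1FFFFF (65536 ~ 2097151)	4 bytes	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x200000~0x3FFFFFF (2097152 ~ 67108863)	5 bytes	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x4000000~0x7FFFFFFF (67108864 ~ 2147483647)	6 bytes	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The 5-byte and 6-byte forms come from the original UTF-8 design. Since RFC 3629, UTF-8 is limited to code points up to 0x10FFFF, so valid text never needs more than 4 bytes per character.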
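
The byte counts and bit patterns in the table can be checked from Python. This is a minimal sketch; the sample characters and the names samples and utf8_info are chosen here for illustration.

```python
# One sample character per row of the 1- to 4-byte ranges (illustrative choice).
samples = ['A', 'é', '中', '😀']   # U+0041, U+00E9, U+4E2D, U+1F600

utf8_info = []
for ch in samples:
    encoded = ch.encode('utf-8')
    # Show every byte as 8 binary digits to compare with the storage format column.
    bits = ' '.join(f'{byte:08b}' for byte in encoded)
    utf8_info.append((hex(ord(ch)), len(encoded), bits))
    print(hex(ord(ch)), len(encoded), bits)
```

Each printed line shows the code point, the number of bytes, and the bits, whose leading prefixes (0, 110, 1110, 11110, and 10 for continuation bytes) match the table.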

Byte order mark (BOM)

BOM is short for byte order mark.

Rules at write time

Python does not write a BOM when writing a file with the 'utf-8' encoding, but if you specify the 'utf-8-sig' encoding, Python always writes a BOM at the start of the file.

Using 'utf-16-be' does not write a BOM, but using 'utf-16' does.


>>> open('h.txt','w',encoding='utf-8-sig').write('aaa')
3
>>> open('h.txt','rb').read()
b'\xef\xbb\xbfaaa'
>>> open('h.txt','w',encoding='utf-16').write('bbb')
3
>>> open('h.txt','rb').read()
b'\xff\xfeb\x00b\x00b\x00'
>>> open('hh.txt','w',encoding='utf-16-be').write('ccc')
3
>>> open('hh.txt','rb').read()
b'\x00c\x00c\x00c'
>>> open('h.txt','w',encoding='utf-8').write('ddd')
3
>>> open('h.txt','rb').read()
b'ddd'

Rules at read time

If the correct encoding is specified, the BOM is skipped; otherwise the BOM shows up as garbled characters or an exception is raised.


>>> open('h.txt','w',encoding='utf-8-sig').write('ddd')
3
>>> open('h.txt','r').read()  # platform default encoding (GBK on this Windows machine)
'锘縟dd'
>>> open('h.txt','r',encoding='utf-8-sig').read()
'ddd'
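
The same read-time behavior can be reproduced without touching the file system by decoding bytes directly. This is a small sketch; the variable names are hypothetical.

```python
import codecs

# Bytes as they would appear in a file written with 'utf-8-sig'.
raw = codecs.BOM_UTF8 + 'ddd'.encode('utf-8')

kept = raw.decode('utf-8')           # plain utf-8 keeps the BOM as U+FEFF
stripped = raw.decode('utf-8-sig')   # utf-8-sig drops a leading BOM
plain = b'ddd'.decode('utf-8-sig')   # utf-8-sig also accepts data with no BOM

# 'utf-16' writes a BOM when encoding and consumes it when decoding.
roundtrip = 'ddd'.encode('utf-16').decode('utf-16')

print(repr(kept), repr(stripped), repr(plain), repr(roundtrip))
```

Because 'utf-8-sig' tolerates input without a BOM, it is a reasonable choice for reading files whose origin is unknown.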

Encoding and decoding

chr and ord


>>> ord('中') #20013
>>> chr(20013) #'中'

Writing Unicode escapes directly in a string literal.

'\xhh' : one character given by 2 hexadecimal digits

'\uhhhh' : one character given by 4 hexadecimal digits

'\Uhhhhhhhh' : one character given by 8 hexadecimal digits

>>> s = 'py\x74h\u4e2don' #'pyth中on'
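
A quick sketch of the three escape forms side by side, with a raw string for contrast. The variable names are illustrative only.

```python
two_digits = '\x74'           # 2 hex digits: 't'
four_digits = '\u4e2d'        # 4 hex digits: '中'
eight_digits = '\U0001F600'   # 8 hex digits: '😀', outside the 16-bit range

# In a raw string the backslash is kept literally, so no escape is processed.
raw = r'\u4e2d'

s = 'py\x74h\u4e2don'
print(two_digits, four_digits, eight_digits, len(raw), s)
```

The 8-digit form is needed for any code point above 0xFFFF, since '\u' only accepts exactly four digits.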

str, bytes, bytearray

str.encode(encoding='utf-8')

bytes(s,encoding='utf-8')

bytes.decode(encoding='utf-8')

str(B, encoding='utf-8')

bytearray(string, encoding='utf-8')

bytearray(bytes)
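
The calls listed above are different spellings of the same two directions, str to bytes and bytes to str. A minimal round-trip sketch follows; the sample text and names are assumptions made for illustration.

```python
s = '中文abc'

# str -> bytes: two equivalent spellings
b1 = s.encode(encoding='utf-8')
b2 = bytes(s, encoding='utf-8')

# bytes -> str: two equivalent spellings
s1 = b1.decode(encoding='utf-8')
s2 = str(b1, encoding='utf-8')

# bytearray can be built from a str plus an encoding, or directly from bytes
ba1 = bytearray(s, encoding='utf-8')
ba2 = bytearray(b1)

# Pitfall: str() without an encoding does not decode, it returns the repr text.
not_decoded = str(b1)

print(b1 == b2, s1 == s2 == s, ba1 == ba2, not_decoded)
```

Passing the encoding explicitly, as in all the calls above, avoids depending on platform defaults.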

Source file encoding declaration

Python uses UTF-8 as the default source file encoding.

# -*- coding: latin-1 -*- : declares that the source file is encoded in latin-1.

Helper functions


sys.platform  #'win32'
sys.getdefaultencoding() # 'utf-8'
sys.byteorder  #'little'
s.isalnum()  # s is a string
s.isalpha()
s.isdecimal()
s.isdigit()
s.isnumeric()
s.isprintable()
s.isspace()
s.isidentifier() # returns True if the string is a valid identifier
s.islower()
s.isupper()
s.istitle()
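
Several of these predicates look alike but differ on non-ASCII input. The sketch below compares them on a few sample characters chosen here for illustration; the names numeric_checks and candidates are hypothetical.

```python
import keyword

# isdecimal() is the strictest of the three numeric checks, isnumeric() the loosest.
numeric_checks = {
    ch: (ch.isdecimal(), ch.isdigit(), ch.isnumeric())
    for ch in ['5', '²', '½', '五']
}
for ch, flags in numeric_checks.items():
    print(ch, flags)

# isidentifier() only checks the syntax of a name; reserved words pass it too,
# so combine it with keyword.iskeyword() when validating variable names.
candidates = ['name_1', '2abc', 'class']
usable = [c for c in candidates if c.isidentifier() and not keyword.iskeyword(c)]
print(usable)

print('Hello World'.istitle(), 'Hello world'.istitle())
```

For parsing user input as an integer, isdecimal() is usually the safest of the three numeric checks, since int() accepts decimal digits but rejects characters such as '²'.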

Conclusion

That is all for this article. I hope it is of some help in your study or work. If you have questions, feel free to leave a comment.