A Detailed Explanation of Python's re Module, with Examples

2021-11-29 07:59:01
OfStack

1. Regular expressions

re is Python's string matching module. Most of the functions it provides are built on regular expressions, which describe patterns for fuzzy-matching strings and extracting the parts you need, and which are common across languages. Note:

The re module is specific to Python.
Regular expressions are available in most programming languages.
Both the re module and regular expressions operate on strings.

Because most of the methods in the re module rely on regular expressions, we cover regular expressions first.

(1) Common regular expression syntax

1. Character groups

The various characters that may appear at the same position form a character group, which is written with [] in regular expressions.

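Regex	String to match	Result	Notes
[0123456789]	8	True	A character group enumerates every allowed character; the string matches if it equals any one character in the group
[0123456789]	a	False	"a" is not in the character group, so there is no match
[0-9]	7	True	A hyphen denotes a range; [0-9] means the same as [0123456789]
[a-z]	s	True	Likewise, [a-z] matches any lowercase letter
[A-Z]	B	True	[A-Z] matches any uppercase letter
[0-9a-fA-F]	e	True	Matches a digit or a letter a-f in either case; useful for validating hexadecimal characters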
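
The rows of the table above can be checked directly in Python. This is a minimal sketch; the names cases and results are illustrative and do not come from the original article.

```python
import re

# (pattern, candidate string) pairs taken from the table above
cases = [
    ("[0123456789]", "8"),
    ("[0123456789]", "a"),
    ("[0-9]", "7"),
    ("[a-z]", "s"),
    ("[A-Z]", "B"),
    ("[0-9a-fA-F]", "e"),
]

# fullmatch() succeeds only if the entire string matches the pattern
results = [re.fullmatch(pattern, text) is not None for pattern, text in cases]
print(results)  # [True, False, True, True, True, True]
```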

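2. Metacharacters

Metacharacter	Matches
.	Any character except a newline
\w	A letter, digit, or underscore
\s	Any whitespace character
\d	A digit
\n	A newline
\t	A tab
\b	A word boundary (the start or end of a word)
^	The start of the string
$	The end of the string
\W	Any character that is not a letter, digit, or underscore
\D	Any non-digit
\S	Any non-whitespace character
a|b	The character a or the character b
()	The expression inside the parentheses; also defines a group
[...]	Any character in the character group
[^...]	Any character not in the character group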
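
A few of these metacharacters in action, as a rough sketch. The sample strings and variable names are made up for illustration.

```python
import re

text = "id_42\tok\n"

word_runs = re.findall(r"\w+", text)       # letters, digits, underscore
digits = re.findall(r"\d", text)           # single digits
whitespace = re.findall(r"\s", text)       # tab and newline
either = re.findall(r"o|k", text)          # the character o or k
others = re.findall(r"[^\w\s]", "a-b_c!")  # anything that is not \w or \s
whole_words = re.findall(r"\bcat\b", "cat concat cat.")  # \b is a word boundary

print(word_runs)    # ['id_42', 'ok']
print(digits)       # ['4', '2']
print(whitespace)   # ['\t', '\n']
print(either)       # ['o', 'k']
print(others)       # ['-', '!']
print(whole_words)  # ['cat', 'cat']
```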

3. Quantifiers

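Quantifier	Meaning
*	Repeat zero or more times
+	Repeat one or more times
?	Repeat zero or one time
{n}	Repeat exactly n times
{n,}	Repeat n or more times
{n,m}	Repeat n to m times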
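
To see how each quantifier constrains the number of repetitions, the sketch below tests a handful of sample strings against patterns that differ only in the quantifier applied to b. The helper which_match and the sample list are hypothetical, introduced only for this illustration.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]

def which_match(pattern):
    """Return the samples that match `pattern` in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = which_match(r"ab*c")           # zero or more b
plus = which_match(r"ab+c")           # one or more b
optional = which_match(r"ab?c")       # zero or one b
exactly_two = which_match(r"ab{2}c")  # exactly two b
two_or_more = which_match(r"ab{2,}c") # two or more b
one_to_two = which_match(r"ab{1,2}c") # one or two b

print(star)         # ['ac', 'abc', 'abbc', 'abbbc']
print(plus)         # ['abc', 'abbc', 'abbbc']
print(optional)     # ['ac', 'abc']
print(exactly_two)  # ['abbc']
print(two_or_more)  # ['abbc', 'abbbc']
print(one_to_two)   # ['abc', 'abbc']
```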

(2) Use of regular expressions

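1. . ^ $

Regex	String to match	Result	Notes
a.	abacad	ab ac ad	Matches every occurrence of "a."
^a.	abacad	ab	Matches "a." only at the start of the string
a.$	abacad	ad	Matches "a." only at the end of the string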

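2. * + ? {}

Regex	String to match	Result	Notes
a.?	abefacgad	ab ac ad	? repeats zero or one time, so this matches "a" followed by at most one arbitrary character
a.*	abefacgad	abefacgad	* repeats zero or more times, so this matches "a" followed by zero or more arbitrary characters
a.+	abefacgad	abefacgad	+ repeats one or more times, so this matches "a" followed by one or more arbitrary characters
a.{1,2}	abefacgad	abe acg ad	{1,2} matches one or two arbitrary characters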

Note: the quantifiers *, +, ? and so on are greedy by default, that is, they match as much as possible. Appending ? to a quantifier makes it lazy (non-greedy).

Regex	String to match	Result	Notes
a.*?	abefacgad	a a a	Lazy matching
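
The results in the two tables above can be reproduced with re.findall. This is a small sketch; the variable names are illustrative.

```python
import re

text = "abefacgad"

greedy_optional = re.findall(r"a.?", text)
greedy_star = re.findall(r"a.*", text)
greedy_plus = re.findall(r"a.+", text)
bounded = re.findall(r"a.{1,2}", text)
lazy_star = re.findall(r"a.*?", text)  # lazy: takes as few characters as possible

print(greedy_optional)  # ['ab', 'ac', 'ad']
print(greedy_star)      # ['abefacgad']
print(greedy_plus)      # ['abefacgad']
print(bounded)          # ['abe', 'acg', 'ad']
print(lazy_star)        # ['a', 'a', 'a']
```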

3. Character set [] [^]

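Regex	String to match	Result	Notes
a[befcgd]*	abefacgad	abef acg ad	Matches "a" followed by any number of characters from [befcgd]
a[^f]*	abefacgad	abe acgad	Matches "a" followed by any number of characters other than "f"
[\d]	412a3bc	4 1 2 3	Matches any single digit; four results
[\d]+	412a3bc	412 3	Matches a run of one or more digits; two results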
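
The same four rows, checked with re.findall as a quick sketch (variable names are illustrative):

```python
import re

from_set = re.findall(r"a[befcgd]*", "abefacgad")
not_f = re.findall(r"a[^f]*", "abefacgad")
single_digits = re.findall(r"[\d]", "412a3bc")
digit_runs = re.findall(r"[\d]+", "412a3bc")

print(from_set)       # ['abef', 'acg', 'ad']
print(not_f)          # ['abe', 'acgad']
print(single_digits)  # ['4', '1', '2', '3']
print(digit_runs)     # ['412', '3']
```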

4. Grouping () and alternation |

An ID card number is a string of 15 or 18 characters. If it has 15 characters, all of them are digits and the first cannot be 0; if it has 18 characters, the first 17 are digits and the last may be a digit or x. Let's try to express this as a regular expression:

Regex	String to match	Result	Notes
^[1-9]\d{13,16}[0-9x]$	110101198001017032	110101198001017032	Matches a valid ID card number
^[1-9]\d{13,16}[0-9x]$	1101011980010170	1101011980010170	Also matches this string, which is not a valid ID card number: it is a 16-digit number
^[1-9]\d{14}(\d{2}[0-9x])?$	1101011980010170	False	The invalid number no longer matches. () forms a group: \d{2}[0-9x] is grouped so that the group as a whole may appear 0 or 1 times
^([1-9]\d{16}[0-9x]|[1-9]\d{14})$	110105199812067023	110105199812067023	Tries [1-9]\d{16}[0-9x] first; if that does not match, tries [1-9]\d{14}
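
As a sketch of how the loose and strict patterns from the table behave in Python: the names LOOSE, STRICT and looks_like_id are hypothetical, and the check covers only the format described above, not any checksum or date validity.

```python
import re

# Loose pattern from the table: accepts any length from 15 to 18
LOOSE = re.compile(r"^[1-9]\d{13,16}[0-9x]$")
# Strict pattern from the table: exactly 18 characters, or exactly 15 digits
STRICT = re.compile(r"^([1-9]\d{16}[0-9x]|[1-9]\d{14})$")

def looks_like_id(candidate: str) -> bool:
    """Format check only: 15 digits, or 17 digits plus a digit or x."""
    return STRICT.match(candidate) is not None

sixteen_digits = "1101011980010170"

print(LOOSE.match(sixteen_digits) is not None)  # True  (the loose pattern is fooled)
print(looks_like_id(sixteen_digits))            # False
print(looks_like_id("110105199812067023"))      # True  (18 characters)
print(looks_like_id("11010119800101703x"))      # True  (18 characters ending in x)
print(looks_like_id("110101800101703"))         # True  (15 digits)
print(looks_like_id("010101800101703"))         # False (leading 0)
```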

5. The escape character \

In regular expressions, many sequences have special meanings, such as \n and \s. If you want to match a literal "\n" rather than a newline character, you need to escape the "\" as "\\".

In Python, both the regular expression and the text to be matched are written as strings, in which \ also has a special meaning and needs to be escaped. So to match a literal "\n", which is written '\\n' as a string, the regular expression has to be written "\\\\n", which is cumbersome. This is where raw strings come in: with the r prefix, the text to match is r'\n' and the regular expression is simply r'\\n'.

Regex	String to match	Result	Notes
\n	\n	False	\ has a special meaning in regular expressions, so the expression \n cannot match the literal text \n
\\n	\n	True	Escaping \ as \\ makes the match succeed
"\\\\n"	'\\n'	True	In Python, '\' in a string literal must itself be escaped, so each '\' is escaped once more
r'\\n'	r'\n'	True	Prefixing the string with r turns off string escaping for the whole literal
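
The escaping rules above can be verified in a few lines. This is a minimal sketch; target and the other names are illustrative.

```python
import re

# The text to match: two characters, a backslash followed by the letter n
target = r"\n"

# The pattern "\n" is a real newline character, which target does not contain
newline_pattern = re.search("\n", target)

# Without a raw string, each backslash must be written twice in the literal
doubled = re.search("\\\\n", target)

# With a raw string, the regex \\n can be written as-is
raw = re.search(r"\\n", target)

print(len(target))            # 2
print(newline_pattern)        # None
print(doubled.group() == target)  # True
print(raw.group() == target)      # True
print("\\\\n" == r"\\n")      # True: both literals denote the same pattern
```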

6. Greedy matching

Greedy matching: when a match is possible, match as long a string as possible. Greedy matching is the default.

Regex	String to match	Result	Notes
<.*>	<script>... <script>	<script>... <script>	Greedy by default, so it matches the longest possible string
<.*?>	<script>... <script>	<script> <script>	Adding ? switches from greedy to non-greedy matching, so it matches the shortest possible string

Several commonly used non-greedy patterns


*?  Repeat any number of times, but as few times as possible
+?  Repeat one or more times, but as few times as possible
??  Repeat zero or one time, but as few times as possible
{n,m}?  Repeat n to m times, but as few times as possible
{n,}?  Repeat n or more times, but as few times as possible

Usage of .*?


.  matches any character
*  repeats it zero to unlimited times
?  switches to non-greedy mode
Together they take as few arbitrary characters as possible. The combination is rarely written on its own; it is mostly used as:
.*?x

which takes characters of any length up to the first occurrence of x
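
A short sketch contrasting greedy and non-greedy matching, using the string from the table above and a made-up string "abxcdx" for the .*?x case:

```python
import re

html = "<script>... <script>"

greedy_tags = re.findall(r"<.*>", html)   # longest possible match
lazy_tags = re.findall(r"<.*?>", html)    # shortest possible matches

up_to_first_x = re.match(r".*?x", "abxcdx").group()  # stops at the first x
up_to_last_x = re.match(r".*x", "abxcdx").group()    # runs on to the last x

print(greedy_tags)    # ['<script>... <script>']
print(lazy_tags)      # ['<script>', '<script>']
print(up_to_first_x)  # abx
print(up_to_last_x)   # abxcdx
```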

2. re Module

(1) Constants, attributes

1. re.A (re.ASCII)

Make \w, \W, \b, \B, \d, \D, \s, and \S perform ASCII-only matching instead of full Unicode matching. This is meaningful only for Unicode (str) patterns and is ignored for bytes patterns.

2. re.I (re.IGNORECASE)

Perform case-insensitive matching; expressions such as [A-Z] will also match lowercase letters.

3. re.L (re.LOCALE)

Make \w, \W, \b, \B, and case-insensitive matching dependent on the current locale. This flag can be used only with bytes patterns. Its use is discouraged because the locale mechanism is very unreliable, handles only one "culture" at a time, and works only with 8-bit locales. Unicode matching is already enabled by default for Unicode (str) patterns in Python 3 and can handle different locales and languages.

4. re.M (re.MULTILINE)

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately after each newline), and the pattern character '$' matches at the end of the string and at the end of each line (immediately before each newline). By default, '^' matches only at the beginning of the string, and '$' matches only at the end of the string and immediately before the newline (if any) at the end of the string.

5. re.S (re.DOTALL)

Make the '.' special character match any character at all, including a newline; without this flag, '.' matches anything except a newline.
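
The effect of these flags is easiest to see side by side. The following is an illustrative sketch with made-up sample text; re.L is left out because it applies only to bytes patterns and depends on the current locale.

```python
import re

text = "first line\nSecond line"

# re.I: [A-Z] also matches lowercase letters
ignore_case = re.findall(r"[A-Z]+", "abc DEF", flags=re.I)

# re.M: '^' matches at the start of every line, not just of the string
first_words_default = re.findall(r"^\w+", text)
first_words_multiline = re.findall(r"^\w+", text, flags=re.M)

# re.S: '.' also matches the newline, so the pattern can span both lines
spans_default = re.search(r"first.*Second", text)
spans_dotall = re.search(r"first.*Second", text, flags=re.S)

# re.A: \w is restricted to ASCII, so the accented letter is excluded
unicode_word = re.findall(r"\w+", "caf\u00e9")
ascii_word = re.findall(r"\w+", "caf\u00e9", flags=re.A)

print(ignore_case)            # ['abc', 'DEF']
print(first_words_default)    # ['first']
print(first_words_multiline)  # ['first', 'Second']
print(spans_default)          # None
print(spans_dotall is not None)  # True
print(ascii_word)             # ['caf']
```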

(2) Common methods

1. re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can then be used for matching with its match(), search(), and other methods described below.


>>> import re
>>> prog = re.compile(r'\d{2}')  # compiled regular expression object

>>> prog.search('12abc')
<_sre.SRE_Match object; span=(0, 2), match='12'>
>>> prog.search('12abc').group()  # group() returns the matched text; search() itself returns None when nothing matches
'12'

>>> prog.match('123abc')
<_sre.SRE_Match object; span=(0, 2), match='12'>
>>> prog.match('123abc').group()
'12'
>>>

2. re.search(pattern, string, flags=0)

Scan through the string looking for the first location where the regular expression pattern produces a match, and return the corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.


# search() scans the whole string but returns only the first match
>>> re.search(r'\w+', 'abcde').group()
'abcde'
>>> re.search('a', 'abcde').group()
'a'
>>>

3. re.match(pattern, string, flags=0)

If zero or more characters at the beginning of the string match the regular expression pattern, return the corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.


# Like search(), but matches only at the beginning of the string; it also returns a single match object
>>> re.match('a', 'abcade').group()
'a'
>>> re.match(r'\w+', 'abc123de').group()
'abc123de'
>>> re.match(r'\D+', 'abc123de').group()  # \D matches non-digits
'abc'
>>>

4. re.fullmatch(pattern, string, flags=0)

If the entire string matches the regular expression pattern, return the corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.


>>> re.fullmatch(r'\w+', 'abcade').group()
'abcade'
>>> re.fullmatch('abcade','abcade').group()
'abcade'
>>>

5. re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, the text of all groups in the pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.


>>> re.split('[ab]', 'abcd')  # split on 'a' first to get '' and 'bcd', then split each of those on 'b'
['', '', 'cd']
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

If the separator contains capturing groups and it matches at the start of the string, the result starts with an empty string. The same holds for the end of the string:


>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

6. re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left to right, and matches are returned in the order found. If one or more groups are present in the pattern, a list of groups is returned; this is a list of tuples if the pattern has more than one group. Empty matches are included in the result.


>>> re.findall('a', 'This is a beautiful place!')
['a', 'a', 'a']
>>>
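
How the return value of findall() changes with the number of groups, as a brief sketch (the sample text and names are illustrative):

```python
import re

text = "x=1, y=22"

no_group = re.findall(r"\w=\d+", text)        # whole matches
one_group = re.findall(r"\w=(\d+)", text)     # only the group's text
two_groups = re.findall(r"(\w)=(\d+)", text)  # one tuple per match

print(no_group)    # ['x=1', 'y=22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('x', '1'), ('y', '22')]
```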

7. re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches for the pattern in string. The string is scanned left to right, and matches are returned in the order found. Empty matches are included in the result.


>>> re.finditer('[ab]', 'This is a beautiful place!')
<callable_iterator object at 0x0000000000DCDA90>  # an iterator object
>>> ret = re.finditer('[ab]', 'This is a beautiful place!')
>>> next(ret).group()  # get the next match
'a'
>>> [i.group() for i in ret]  # collect all remaining matches
['b', 'a', 'a']
>>>

8. re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern is not found, the string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so on. The count parameter is the maximum number of occurrences to replace; the default of 0 replaces all of them.
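
The article's own example for sub() is not available here, so the following is a stand-in sketch. The sample strings and the helper double are made up for illustration.

```python
import re

text = "a1b22c333"

replace_all = re.sub(r"\d+", "#", text)
replace_two = re.sub(r"\d+", "#", text, count=2)  # at most two replacements
not_found = re.sub(r"x", "#", text)               # no match: string unchanged

# Backreferences in repl refer to groups of the match
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

# Backslash escapes in repl are processed: \n becomes a real newline
with_newline = re.sub(r"-", r"\n", "a-b")

def double(match):
    """Replacement function: receives the match object, returns a string."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)

print(replace_all)  # a#b#c#
print(replace_two)  # a#b#c333
print(not_found)    # a1b22c333
print(swapped)      # host@user
print(doubled)      # a2b44c666
```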


9. re.subn(pattern, repl, string, count=0, flags=0)

Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).
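
A stand-in sketch for subn(), since the article's own example is not available here; the sample string is made up.

```python
import re

text = "a1b22c333"

all_replaced = re.subn(r"\d+", "#", text)
one_replaced = re.subn(r"\d+", "#", text, count=1)
none_replaced = re.subn(r"x", "#", text)

print(all_replaced)   # ('a#b#c#', 3)
print(one_replaced)   # ('a#b22c333', 1)
print(none_replaced)  # ('a1b22c333', 0)
```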


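10. re.escape(pattern)

Escape all the characters in pattern except ASCII letters, numbers, and '_'. This is useful if you want to match an arbitrary literal string that may contain regular expression metacharacters. (From Python 3.7 on, only characters that can have special meaning in a regular expression are escaped.)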
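
A stand-in sketch for escape(), since the article's own example is not available here. The literal and sentence are made up; the exact escaped string is not printed because it differs between Python versions.

```python
import re

literal = "1+1=2 (really?)"
sentence = "I think 1+1=2 (really?) is true"

# Used directly as a pattern, '+', '(', ')' and '?' act as metacharacters,
# so the literal text is not found
unescaped = re.search(literal, sentence)

# After escaping, every character is matched literally
escaped = re.search(re.escape(literal), sentence)

print(unescaped)        # None
print(escaped.group())  # 1+1=2 (really?)
```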


11. search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (which is what Perl does by default).


>>> re.match("c", "abcdef")   # No match
>>> re.search("c", "abcdef")  # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>
>>>

A regular expression beginning with '^' can be used with search() to restrict the match to the beginning of the string.
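
A small sketch of anchoring search() with '^'. The MULTILINE part at the end is an additional illustration not taken from the article: in that mode '^' in search() also matches at the start of each line, while match() still looks only at the start of the string.

```python
import re

no_match = re.match("c", "abcdef")           # match() starts at position 0
anchored_miss = re.search("^c", "abcdef")    # '^' pins search() to the start
anchored_hit = re.search("^a", "abcdef")

multiline_match = re.match("X", "A\nB\nX", re.MULTILINE)
multiline_search = re.search("^X", "A\nB\nX", re.MULTILINE)

print(no_match)               # None
print(anchored_miss)          # None
print(anchored_hit.group())   # a
print(multiline_match)        # None
print(multiline_search.span())  # (4, 5)
```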


Reference:

https://docs.python.org/3.6/library/re.html

https://www.cnblogs.com/Eva-J/articles/7228075.html#_label7