Introduction to Python regular expressions (primer)

  • 2020-05-19 05:00:47
  • OfStack

Preface

First of all, what is a regular expression?

Regular expressions (often abbreviated as regex, regexp or RE in code) are a concept in computer science. A regular expression uses a single string to describe and match a set of strings that conform to a syntactic rule. In many text editors, regular expressions are used to search for and replace text that matches a pattern.

Many programming languages support string manipulation using regular expressions. For example, Perl has a powerful regular expression engine built in. The concept of regular expressions was first popularized by Unix tools such as sed and grep. "Regular expression" is usually shortened to "regex"; the singular forms are regexp and regex, and the plural forms are regexps, regexes and regexen.

Quoted from Wikipedia (zh.wikipedia.org, article on regular expressions).

A definition is just a definition: too formal to be of much use. Let's take an example. Suppose you are writing a web crawler and you have fetched the HTML source of a web page. It contains this line:

<html><body><h1>hello world<h1></body></html>

You want to extract this "hello world". If all you know is Python string manipulation, your first reaction might be:


s = "<html><body><h1>hello world<h1></body></html>"
start_index = s.find('<h1>')

Then you search onward from there for the next <h1>. This works, but it's a hassle, isn't it? You have to handle more than one tag, a moment of carelessness matches more than you wanted, and matching precisely takes extra loops and checks, which is far too inefficient.
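
To make the hassle concrete, here is a minimal sketch of the pure string-method approach. The helper name extract_between is made up for illustration and is not part of the original article; it only handles the simplest case of one marker pair.

```python
def extract_between(text, marker):
    """Return the text between the first two occurrences of marker, or None."""
    start = text.find(marker)
    if start == -1:
        return None                   # marker not present at all
    start += len(marker)              # skip past the first marker
    end = text.find(marker, start)    # look for the next marker
    if end == -1:
        return None                   # no second marker
    return text[start:end]

s = "<html><body><h1>hello world<h1></body></html>"
print(extract_between(s, "<h1>"))  # hello world
```

Even this small version needs two explicit checks for a missing marker, and it would need more work to cope with several tags.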

In this case, regular expressions are the helper of choice.

Now for the real content

Entry level

Let's continue with our example. How would we handle it with a regular expression?


import re
key = r"<html><body><h1>hello world<h1></body></html>"  # This is the text you want to match
p1 = r"(?<=<h1>).+?(?=<h1>)"  # This is our regular expression; don't worry if you can't read it yet
pattern1 = re.compile(p1)  # Compile the regular expression
matcher1 = re.search(pattern1, key)  # Search the source text for the part that matches the regular expression
print(matcher1.group(0))  # Print it out

Try running the code above and see whether it is as simple as we hoped (the author is using Python 2.7). Read on: regular expressions are actually much simpler than they look.

Let's start with the most basic regular expressions.

Suppose we want to match every "python" in a string. Let's try it and see how it's done:


import re
key = r"javapythonhtmlvhdl"  # This is the source text
p1 = r"python"  # This is the regular expression we wrote
pattern1 = re.compile(p1)  # Again, compile it
matcher1 = re.search(pattern1, key)  # Again, search
print(matcher1.group(0))

After reading this code, are you thinking: oh my god, that's a regular expression? You just write the text down as it is?

Exactly. Regular expressions aren't as weird as they seem: as long as you don't deliberately change the meaning of certain characters, what you see is exactly what gets matched.

So, to clear your head, think of a regular expression as looking just like the string you want to match. We'll build on that as we go along.

Beginner level

0. Both Python and regular expressions are case sensitive, so if you replace "python" with "Python" in the example above, you will no longer match your beloved python.
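
A short sketch of what case sensitivity means in practice. It reuses the source text from the example above; the variable names lower and upper are chosen here for illustration.

```python
import re

key = "javapythonhtmlvhdl"
lower = re.search(r"python", key)
upper = re.search(r"Python", key)

print(lower.group(0))  # python
print(upper)           # None

# re.search returns None when nothing matches, so calling
# upper.group(0) directly would raise AttributeError. Check first.
if upper is None:
    print("no match")
```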

1. Go back to matching <h1>hello world<h1> from the first example. What if I wrote it like this?


import re
key = r"<h1>hello world<h1>"  # The source text
p1 = r"<h1>.+<h1>"  # The regular expression we wrote; its parts are explained below
pattern1 = re.compile(p1)
print(pattern1.findall(key))  # Did you notice findall here? What's changed?

From the entry-level section, we know that the two <h1> are just ordinary characters, but what on earth is that in the middle?

In a regular expression, the "." character can stand for any single character (including "." itself). By default it does not match a newline.

findall returns a list of all the matches; it returns a list even when there is only one match.
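
A small sketch of both points, using made-up sample strings. Note that with Python's default settings "." does not match a newline character, which is why "h\nt" is skipped.

```python
import re

# "." stands for any single character except a newline
dot_matches = re.findall(r"h.t", "hat hot h.t h\nt hit")
print(dot_matches)  # ['hat', 'hot', 'h.t', 'hit']

# findall still returns a list when there is only one match
single = re.findall(r"h.t", "only hat here")
print(single)  # ['hat']
```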

Quick-witted readers may suddenly ask: what if I only want to match a literal "." but everything comes back instead? Regular expressions have the character \, which anyone with some programming experience will recognize as the "escape character" used in many places. In regular expressions, this symbol turns special symbols into ordinary ones and ordinary symbols into special ones (for example, an ordinary "d" becomes the special "\d").

For example, if you really want to match the email address "chuxiuhong@hit.edu.cn" (my email address), you can write the regular expression like this:


import re
key = r"afiouwehrfuichuxiuhong@hit.edu.cnaskdjhfiosueh"
p1 = r"chuxiuhong@hit\.edu\.cn"
pattern1 = re.compile(p1)
print(pattern1.findall(key))

See? We put the escape character \ in front of each ".", but that does not mean "match \."; it means "match only a literal .".
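
To see what the escape buys us, this sketch compares an unescaped and an escaped pattern on a made-up string in which other characters sit where the dots would be.

```python
import re

key = "hit.edu.cn hitXeduYcn"

# Unescaped: each "." matches any character, so both words match
unescaped = re.findall(r"hit.edu.cn", key)
print(unescaped)  # ['hit.edu.cn', 'hitXeduYcn']

# Escaped: "\." matches only a literal dot
escaped = re.findall(r"hit\.edu\.cn", key)
print(escaped)  # ['hit.edu.cn']
```
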

I don't know how observant you are, but did you notice that the first time we used ".", it was followed by a "+"? So what does this plus sign do?

It's not hard to work out. We said that "." can stand for any single character (including itself), but "hello world" is more than a single character.

+ repeats the preceding character or subexpression one or more times.

For example, the expression "ab+" can match "abbbbb", but it cannot match "a": it requires at least one b. If you ask whether there is a "take it or leave it" version, the answer is yes.

* after a character or subexpression means it can be matched 0 or more times.

Say a web page contains links that may start with either http:// or https://. What do we do?


import re
key = r"http://www.nsfbuhwe.com and https://www.auhfisna.com"  # Made-up URLs, never mind them
p1 = r"https*://"  # Look at that asterisk!
pattern1 = re.compile(p1)
print(pattern1.findall(key))

The output

['http://', 'https://']
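
The difference between "+" and "*" can be sketched side by side on made-up strings. The last two lines also show that "s*" accepts any number of s characters; the "?" quantifier (0 or 1 time, listed in the metacharacter table of this article) is a tighter choice if exactly one optional "s" is intended.

```python
import re

key = "a ab abbb"
plus = re.findall(r"ab+", key)   # needs at least one b
star = re.findall(r"ab*", key)   # b may be absent
print(plus)  # ['ab', 'abbb']
print(star)  # ['a', 'ab', 'abbb']

urls = "http:// https:// httpss://"
star_urls = re.findall(r"https*://", urls)
optional_urls = re.findall(r"https?://", urls)
print(star_urls)      # ['http://', 'https://', 'httpss://']
print(optional_urls)  # ['http://', 'https://']
```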

2. Say we have the string "cat hat mat qat". The first three are real words, and the last one is something I made up. If you already know that a real word has "c", "h" or "m" in front of "at", you want to match just those. Based on what you've learned so far, would you write three regular expressions? There's no need, because there is a way to do it in one:

[] Matches any one of the characters inside the brackets
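
Applied to the cat/hat/mat/qat string, a character set could look like this (a minimal sketch; the variable name words is chosen here):

```python
import re

key = "cat hat mat qat"
# [chm] matches exactly one character: c, h or m
words = re.findall(r"[chm]at", key)
print(words)  # ['cat', 'hat', 'mat']
```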

Here's another example. Some programmers go overboard and mix upper and lower case in tags like <html></html>, so we can't catch what we want. What should we do? Write 16*16 regular expressions and try them one by one? No.


import re
key = r"lalala<hTml>hello</Html>heiheihei"
p1 = r"<[Hh][Tt][Mm][Ll]>.+?</[Hh][Tt][Mm][Ll]>"
pattern1 = re.compile(p1)
print(pattern1.findall(key))

The output

['<hTml>hello</Html>']
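
As an aside that the article itself does not cover: Python's re module also has the re.IGNORECASE flag, which may be a shorter alternative when every letter in the pattern should ignore case. The sketch below only checks that both spellings give the same result on the sample text.

```python
import re

key = "lalala<hTml>hello</Html>heiheihei"

# Character sets, as in the example above
by_class = re.findall(r"<[Hh][Tt][Mm][Ll]>.+?</[Hh][Tt][Mm][Ll]>", key)

# The same idea using the re.IGNORECASE flag
by_flag = re.findall(r"<html>.+?</html>", key, re.IGNORECASE)

print(by_class)             # ['<hTml>hello</Html>']
print(by_class == by_flag)  # True
```

Character sets remain the more precise tool when only some positions should be flexible.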

Now that we can match a set of characters, naturally we can also exclude a set.

[^] Matches any character except those listed inside the brackets

Back to the cat, hat, mat example, this time with "pat" added: to match everything except "pat", write it like this:


import re
key = r"mat cat hat pat"
p1 = r"[^p]at"  # Match "at" preceded by any character other than p
pattern1 = re.compile(p1)
print(pattern1.findall(key))

The output

['mat', 'cat', 'hat']

To make simple regular expressions easier to write, the syntax itself provides the following shorthands:

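Regular expression    Characters matched
[0-9]                 Any one of 0123456789
[a-z]                 Any one lowercase letter
[A-Z]                 Any one uppercase letter
\d                    Same as [0-9]
\D                    Same as [^0-9]; matches a non-digit
\w                    Same as [a-z0-9A-Z_]; matches upper- and lowercase letters, digits and underscore
\W                    Same as [^a-z0-9A-Z_]; the negation of the previous entry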
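
A quick sketch of the shorthands on a made-up ASCII string. One caveat: in Python 3, with str patterns, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is given, so the [0-9] and [a-z0-9A-Z_] equivalences hold for ASCII text.

```python
import re

key = "order_42 costs $7.50"

digits = re.findall(r"\d", key)
non_word = re.findall(r"\W", key)
words = re.findall(r"\w+", key)

print(digits)    # ['4', '2', '7', '5', '0']
print(non_word)  # [' ', ' ', '$', '.']
print(words)     # ['order_42', 'costs', '7', '50']
```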

3. After this introduction you probably have a grasp of the general structure of regular expressions, but in practice we often run into inaccurate matches. For example:


import re
key = r"chuxiuhong@hit.edu.cn"
p1 = r"@.+\."  # I want to match from "@" up to the next ".", which here should give @hit.
pattern1 = re.compile(p1)
print(pattern1.findall(key))

The output

['@hit.edu.']

Whoa! Why did we get more than we asked for? The ideal result was @hit., so where did the extra part come from? This is because regular expressions are "greedy" by default. As we said before, "+" repeats the preceding character 1 or more times, but we never specified how many times. So it "greedily" matches as much as it can, in this case all the way to the last ".".

How do we solve this? Just add a "?" after the "+" and all is well.


import re
key = r"chuxiuhong@hit.edu.cn"
p1 = r"@.+?\."  # I want to match from "@" up to the next ".", which here should give @hit.
pattern1 = re.compile(p1)
print(pattern1.findall(key))

The output

['@hit.']

By adding a "?", we changed the greedy "+" into the lazy "+?". The same applies to [abc]+, \w* and so on.
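
Greedy and lazy matching can be compared directly. The sample strings below are made up for illustration.

```python
import re

key = "<b>one</b><b>two</b>"

greedy = re.findall(r"<b>.+</b>", key)   # runs to the last </b>
lazy = re.findall(r"<b>.+?</b>", key)    # stops at the first </b>
print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']

# The "?" suffix works the same way after a character set
greedy_set = re.search(r"[abc]+", "abcabc").group(0)
lazy_set = re.search(r"[abc]+?", "abcabc").group(0)
print(greedy_set)  # abcabc
print(lazy_set)    # a
```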

Quiz: can you get the same result in the example above without using lazy matching?
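
One possible answer to the quiz, offered as a sketch rather than the author's intended solution: replace "." with a negated character set that refuses to cross a dot, so the greedy "+" has nowhere further to go.

```python
import re

key = "chuxiuhong@hit.edu.cn"

lazy = re.findall(r"@.+?\.", key)       # lazy version from the article
negated = re.findall(r"@[^.]+\.", key)  # greedy, but [^.] cannot match a dot

print(lazy)     # ['@hit.']
print(negated)  # ['@hit.']
```

Inside square brackets the "." is an ordinary character, so it needs no backslash there.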

Personal advice: whenever you use "+" or "*", be sure to decide whether you want the greedy or the lazy form, especially when the pattern covers a wide range of characters, because it may well hand you back more characters than you wanted!

To control the number of repetitions precisely, regular expressions also provide

{a,b} (meaning a <= number of repetitions <= b; write it without a space after the comma)

So for example, we have sas, saas and saaas, and we want only sas and saas. What do we do?


import re
key = r"saas and sas and saaas"  # Sample text containing saas, sas and saaas
p1 = r"sa{1,2}s"  # "a" repeated at least once and at most twice
pattern1 = re.compile(p1)
print(pattern1.findall(key))

The output

['saas', 'sas']

If you omit the 2 in {1,2}, leaving {1,}, it means at least 1 repetition. Which symbol is that equivalent to?

If you omit the 1 in {1,2}, leaving {,2}, it means at most 2 repetitions.
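
The open-ended forms can be tried out on a made-up string. The last pattern shows a pitfall in Python's re module: with a space after the comma, the braces are treated as literal text rather than as a repetition count.

```python
import re

key = "ss sas saas saaas"

at_least_one = re.findall(r"sa{1,}s", key)  # one or more a
at_most_two = re.findall(r"sa{,2}s", key)   # zero, one or two a
exactly_two = re.findall(r"sa{2}s", key)    # exactly two a
with_space = re.findall(r"sa{1, 2}s", key)  # literal "{1, 2}", not a count

print(at_least_one)  # ['sas', 'saas', 'saaas']
print(at_most_two)   # ['ss', 'sas', 'saas']
print(exactly_two)   # ['saas']
print(with_space)    # []
```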

Here are some metacharacters in regular expressions and what they do

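Metacharacter    Description
.                Matches any character (except a newline by default)
[ ]              Matches any one of the characters inside the brackets
[^]              Negates the character set
-                Defines a range (inside [ ])
\                Escapes the next character (usually ordinary becomes special, or special becomes ordinary)
*                Matches the preceding character or subexpression 0 or more times
*?               Lazy version of *
+                Matches the preceding character or subexpression 1 or more times
+?               Lazy version of +
?                Matches the preceding character or subexpression 0 or 1 time
{n}              Matches the preceding character or subexpression exactly n times
{m,n}            Matches the preceding character or subexpression at least m and at most n times
{n,}             Matches the preceding character or subexpression at least n times
{n,}?            Lazy version of {n,}
^                Matches the start of the string
\A               Matches the start of the string
$                Matches the end of the string
[\b]             Backspace character
\c               Matches a control character (in some regex flavors; Python's re module does not support it)
\d               Matches any digit
\D               Matches any character other than a digit
\t               Matches a tab
\w               Matches any letter, digit or underscore
\W               Matches any character that is not a letter, digit or underscore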
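
A few of the entries in the table that were not demonstrated earlier, sketched on made-up strings: the anchors "^", "\A" and "$", and the optional quantifier "?".

```python
import re

key = "python is fun, I love python"

first = re.search(r"^python", key)   # anchored to the start of the string
same = re.search(r"\Apython", key)   # \A also anchors to the start
last = re.search(r"python$", key)    # anchored to the end of the string
print(first.start(), same.start(), last.start())  # 0 0 22

# "?" makes the preceding character optional (0 or 1 time)
optional = re.findall(r"colou?r", "color colour colouur")
print(optional)  # ['color', 'colour']
```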

