Introduction to Python regular expressions (intermediate)

  • 2020-05-17 05:55:42
  • OfStack

Primary article links: / / www. ofstack. com article / 99372. htm

As promised in the last post, this post introduces subexpressions, lookahead and lookbehind (looking forward and backward), and backreferences. With what the primer covered you can already write most regular expressions; of the three topics here, only backreferences are truly irreplaceable in some cases.

1. Subexpression

The concept of a subexpression is particularly easy to understand: it simply treats a combination of several characters as one large "character." Still hard to picture? For example, suppose we want to match text in the form of an IP address (leaving aside whether the numeric ranges are valid for now; that is left as an exercise). How do we write an expression for addresses that look like 192.168.1.1?

Answer 1: \d+\.\d+\.\d+\.\d+

This won't do. For one thing it is too verbose, and for another it cannot even limit the number of digits.

Answer 2: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Mediocre and still verbose, but at least it keeps the number of digits in a reasonable range.

Answer 3: (\d{1,3}\.){3}\d{1,3}

With a subexpression, something like 123. (a number plus the dot) is treated as a single unit, and we specify how many times that unit repeats. The result is both concise and correct. So whenever you put a combination of characters in parentheses, you can treat everything inside the parentheses as one character and apply any of the metacharacters we covered before to control the match.
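As a rough sketch of the idea, the snippet below tries the grouped pattern (\d{1,3}\.){3}\d{1,3} against a few strings: one to three digits plus a literal dot, repeated three times, then one to three digits. The names ip_pattern and looks_like_ip are made up for illustration, and, as in the text, the numeric range of each part is not checked.

```python
import re

# Hypothetical names for illustration; the 0-255 range is NOT checked here.
ip_pattern = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

def looks_like_ip(text):
    """Return True if the whole string has the shape d.d.d.d (1-3 digits each)."""
    return ip_pattern.fullmatch(text) is not None

for candidate in ["192.168.1.1", "192.168.1", "1234.1.1.1", "999.999.999.999"]:
    print(candidate, looks_like_ip(candidate))
```

Note that 999.999.999.999 is accepted, which is exactly the range problem the text sets aside.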

2. Lookahead and lookbehind

Now we finally come to lookahead and lookbehind (looking forward and backward). Why "finally"? Remember the example at the beginning of the primer?

Suppose you are writing a crawler and you have fetched the HTML source of a web page. It contains this piece of HTML:

<html><body><h1>hello world</h1></body></html>

You want to extract the hello world text:


import re
key = r"<html><body><h1>hello world</h1></body></html>"  # The text to search
p1 = r"(?<=<h1>).+?(?=</h1>)"  # Our regular expression; don't worry if you can't read it yet
pattern1 = re.compile(p1)  # Compile the regular expression
matcher1 = pattern1.search(key)  # Search the text for the part that matches the pattern
print(matcher1.group(0))  # Print the match: hello world

In this regular expression

p1 = r"(?<=<h1>).+?(?=</h1>)"

do you see the (?<=<h1>) and the (?=</h1>)? The first, ?<=, means that the matched text must be preceded by <h1>; the second, ?=, means that it must be followed by </h1>.

In simple terms, the text you want to match is XX, but it must appear inside a string of the form AXXB, so you can write a regular expression like this:

p = r"(?<=A)XX(?=B)"

The string that matches is XX. Also, lookahead and lookbehind need not appear together. If you like, you can write just one of them and impose only that condition.
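A small sketch of using only one of the two conditions at a time. The sample text and the variable names amounts and labels are invented for illustration.

```python
import re

# Made-up sample text for illustration.
text = "price: $42, discount: $7"

# Prefix condition only: digits that are preceded by a dollar sign.
amounts = re.findall(r"(?<=\$)\d+", text)

# Suffix condition only: words that are followed by a colon.
labels = re.findall(r"\w+(?=:)", text)

print(amounts)  # ['42', '7']
print(labels)   # ['price', 'discount']
```

In both cases the dollar sign and the colon are required for a match but are not part of what is returned.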

So you don't have to remember which one looks forward and which one looks backward. Just remember that ?<= is followed by the prefix requirement and ?= is followed by the suffix requirement.

In effect, lookahead and lookbehind behave as if the entire string AXXB were matched but only XX returned. In other words, if you prefer, you can do without them: match the string together with its prefix and suffix, and then slice them off.
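To make the comparison concrete, here is a minimal sketch that extracts the same text three ways. The variable names are made up, and the capturing-group variant is an extra option not described above; it reuses the subexpression idea from section 1.

```python
import re

html = "<html><body><h1>hello world</h1></body></html>"

# With prefix and suffix conditions: only the inner text is returned.
with_lookaround = re.search(r"(?<=<h1>).+?(?=</h1>)", html).group(0)

# Without them: match the prefix and suffix too, then slice them off.
whole = re.search(r"<h1>.+?</h1>", html).group(0)
by_slicing = whole[len("<h1>"):-len("</h1>")]

# Another option: put the wanted part in a subexpression and read group 1.
by_group = re.search(r"<h1>(.+?)</h1>", html).group(1)

print(with_lookaround, by_slicing, by_group, sep=" | ")
```

All three give hello world for this input; which one reads best is largely a matter of taste.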

3. Backreferences

Unlike lookahead and lookbehind, this is one you sometimes cannot work around. In some cases you simply have to use backreferences, so they are something you need to know if you want to use regular expressions in real work.

Let's go back to the very first example.

Originally you wanted to match the content between <h1> and </h1>. Now you know that HTML has several levels of headings, and you want to extract the headings at every level. You might write something like this:

p = r"<h[1-6]>.*?</h[1-6]>"

This way you can match all of the headings on the HTML page, that is, everything from <h1></h1> to <h6></h6> can be extracted. But as we said before, the hard part of writing regular expressions is not matching what you want; it is avoiding, as far as possible, matching what you don't want. Here there is a good chance you will be tripped up by a case like the one below.

For example,

<h1>hello world</h3>

Did you notice the </h3> at the end? Never mind how someone came to write a heading like this; the point is that our regular expression will match this hello world as well. This is where backreferences come in. Here's an example:


import re
key = r"<h1>hello world</h3>"
p1 = r"<h([1-6])>.*?</h\1>"
pattern1 = re.compile(p1)
m1 = pattern1.search(key)
print(m1.group(0))  # Raises AttributeError: nothing matches, so m1 is None

If you change the source string so that it ends with </h1>, you will see the match printed.
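The difference between the two patterns can be checked side by side without triggering the error. This is only a sketch: the names loose, strict and samples are made up, and the third sample string is invented for illustration.

```python
import re

loose = re.compile(r"<h[1-6]>.*?</h[1-6]>")    # closing level may differ from opening level
strict = re.compile(r"<h([1-6])>.*?</h\1>")    # closing level must equal what group 1 matched

samples = ["<h1>hello world</h1>", "<h1>hello world</h3>", "<h2>sub</h2>"]
results = []
for s in samples:
    # search() returns None when nothing matches, so test for that
    # instead of calling group() directly.
    loose_hit = loose.search(s) is not None
    strict_hit = strict.search(s) is not None
    results.append((s, loose_hit, strict_hit))
    print(s, loose_hit, strict_hit)
```

Only the mismatched <h1>...</h3> string separates the two: the loose pattern accepts it and the backreference pattern rejects it.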

See the \1? That position would originally have been [1-6], but we wrote \1 instead. As we said before, the escape character \ turns special characters into ordinary ones and ordinary characters into special ones. So what does the ordinary digit 1 turn into? Here \1 stands for the first subexpression, which makes it dynamic: it changes with whatever the first subexpression actually matched. For example, the first subexpression is [1-6]; if it matched 1 in the actual string, then \1 means 1, and if it matched 2, then \1 means 2.

Similarly, \2, \3, ... refer to the second, third, ... subexpressions.

So a backreference is a "dynamic" piece of a regular expression within the regular expression itself, one that lets the match adapt to what was actually found.
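A sketch with two backreferences in one pattern, assuming a made-up heading format with a class attribute: \1 must repeat the tag name and \2 must repeat the opening quote character. The pattern, the sample strings and the name heading are all invented for illustration.

```python
import re

# Group 1: tag name (h1-h6). Group 2: the quote character that opens the
# attribute value. \2 requires the same quote to close it, and \1 requires
# the closing tag to repeat the tag name.
heading = re.compile(r"""<(h[1-6]) class=(["'])(.*?)\2>(.*?)</\1>""")

samples = [
    '<h2 class="title">News</h2>',    # consistent quotes and tags
    "<h2 class='title'>News</h2>",    # single quotes, still consistent
    '<h2 class="title\'>News</h2>',   # opening " closed by ': no match
    '<h2 class="title">News</h3>',    # h2 closed by h3: no match
]
matched = [heading.search(s) is not None for s in samples]
print(matched)  # [True, True, False, False]

m = heading.search(samples[0])
print(m.group(1), m.group(3), m.group(4))  # h2 title News
```

Subexpressions are numbered by the position of their opening parenthesis, which is why the quote character is group 2 here.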

That's the end of the intermediate part. There are many details of regular expressions that I haven't covered and many metacharacters that I haven't explained, but once you have the general picture, what remains is mostly a matter of looking things up in a reference table.

If you want to go further, take a look at regular expressions, from which several of the examples in the primer and in this article are drawn.
