Python regular expressions tutorial 3: greedy and non-greedy matching

2020-05-26 09:33:29
OfStack

After a brief introduction to the basics and to capturing in Python regular expressions, this article summarizes their greedy and non-greedy behavior.

Greedy matching

By default, regular expression quantifiers are greedy: when a match could have several different lengths, the engine picks the longest one. For example, the following regular expression is meant to pick out the words each speaker says, but because of this greedy behavior it matches too much:


>>> import re
>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'"(.*)"', sentence)
['why?" and I say "I don\'t know']

The following examples show the same greedy behavior with different quantifiers:


>>> re.findall('hi*', 'hiiiii')
['hiiiii']
>>> re.findall('hi{2,}', 'hiiiii')
['hiiiii']
>>> re.findall('hi{1,3}', 'hiiiii')
['hiii']
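
The same behavior shows up in the span of each match. The following is a minimal sketch using only the standard re module; the variable names and the extra hi+ pattern are illustrative additions, not part of the examples above.

```python
import re

text = 'hiiiii'
patterns = ['hi*', 'hi+', 'hi{2,}', 'hi{1,3}']

# Map each pattern to (matched text, span) so the lengths can be compared.
greedy_results = {}
for pattern in patterns:
    match = re.search(pattern, text)
    greedy_results[pattern] = (match.group(), match.span())

for pattern, (matched, span) in greedy_results.items():
    print(f'{pattern!r:12} matched {matched!r} at {span}')
```

Only hi{1,3} stops early, because its upper bound of three repetitions is reached before the input runs out.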

Non-greedy matching

When we want a quantifier to match non-greedily, that is, as little as possible, we have to say so explicitly in the syntax:

{2,5}? Match 2 to 5 repetitions, preferring as few as possible

The question mark here is a little confusing, because ? already has a meaning of its own: the preceding item occurs 0 or 1 times. Just remember that when a question mark follows a quantifier that allows a variable number of repetitions (such as *, + or {m,n}), it makes that quantifier non-greedy.
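
The two roles of the question mark can be compared side by side. This is a small sketch with made-up sample strings (the "colour" and tag examples are assumptions chosen for illustration, not taken from this article):

```python
import re

# '?' directly after an ordinary item: that item is optional (0 or 1 times).
optional = re.findall(r'colou?r', 'color colour')

# '?' directly after a quantifier: the quantifier becomes non-greedy.
greedy = re.search(r'<.+>', '<a><b>').group()
lazy = re.search(r'<.+?>', '<a><b>').group()

# '??' combines both roles: the item is optional, and zero occurrences
# are preferred when the rest of the pattern allows it.
lazy_optional = re.search(r'colou??', 'colour').group()

print(optional, greedy, lazy, lazy_optional)
```

Here optional holds both spellings, greedy spans '<a><b>', lazy stops at '<a>', and lazy_optional is 'colo' because the optional u is skipped.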

Applying this to the examples above gives the following results:


>>> re.findall('hi*?', 'hiiiii')
['h']
>>> re.findall('hi{2,}?', 'hiiiii')
['hii']
>>> re.findall('hi{1,3}?', 'hiiiii')
['hi']
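
Note that hi*? returns only 'h' because nothing follows the quantifier. A non-greedy quantifier still matches more characters when the rest of the pattern requires it. The sketch below illustrates this with an assumed sample string that ends in '!':

```python
import re

text = 'hiiiii!'

# With nothing after the quantifier, the lazy version stops at zero 'i's.
alone = re.search(r'hi*?', text).group()

# With '!' required afterwards, the lazy quantifier is extended one
# character at a time until the rest of the pattern can match.
followed = re.search(r'hi*?!', text).group()

print(repr(alone), repr(followed))
```

So "non-greedy" means "as few repetitions as still allow an overall match", not "always the minimum number".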

Returning to the quotation example, non-greedy matching now gives the intended result:


>>> sentence = """You said "why?" and I say "I don't know"."""
>>> re.findall(r'"(.*?)"', sentence)
['why?', "I don't know"]

Lookahead and non-greedy matching

Strictly speaking, this part is not about non-greedy matching. But since the behavior is similar, it is covered here together with it so that both are easier to remember.

(?=abc) Lookahead: matches only if abc comes next, but does not consume any characters

(?!abc) Negative lookahead: matches only if abc does not come next, and does not consume any characters

Regular expression matching involves a process of "character consumption": once a character has been matched (consumed), later matching will not examine that character again.
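
Consumption can be made visible by comparing an ordinary pattern with the same pattern wrapped in a lookahead. This is a minimal sketch; the string 'aaaa' and the variable names are assumptions chosen for illustration.

```python
import re

text = 'aaaa'

# Each match consumes two characters, so the search resumes at index 2
# and only two non-overlapping matches are found.
consumed = [m.span() for m in re.finditer(r'aa', text)]

# A lookahead consumes nothing; the group inside it still records 'aa',
# so overlapping occurrences are reported at indexes 0, 1 and 2.
overlapping = [m.start() for m in re.finditer(r'(?=(aa))', text)]

print(consumed)
print(overlapping)
```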

What is the use of knowing this? As an example, suppose we want to find the words that appear more than once in a string:


>>> sentence = "Oh what a day, what a lovely day!"
>>> re.findall(r'\b(\w+)\b.*\b\1\b', sentence)
['what']

Such a regular expression obviously does not do the job. Why? When the first (\w+) matches what and \1 then matches the second what, the whole substring "what a day, what" has been consumed, so the search for the next match starts right after the second what. In the remaining text, " a lovely day!", no word occurs twice, so only one repeated word is found.

The solution is the (?=abc) syntax described above. It lets us check the text that follows without consuming it. So the correct way to write the expression is:


>>> re.findall(r'\b(\w+)\b(?=.*\b\1\b)', sentence)
['what', 'a', 'day']
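
One caveat: a word that occurs three or more times is reported more than once, because each occurrence except the last has a later copy. A possible way to wrap the pattern is sketched below; the helper name repeated_words is hypothetical, and the comparison stays case-sensitive and limited to a single line because . does not match a newline by default.

```python
import re

# Hypothetical helper built on the lookahead pattern shown above.
REPEATED = re.compile(r'\b(\w+)\b(?=.*\b\1\b)')

def repeated_words(text):
    """Return each word that occurs again later in text, once, in order."""
    hits = REPEATED.findall(text)
    # A word occurring three times is reported twice by findall,
    # so drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(hits))

print(repeated_words("Oh what a day, what a lovely day!"))
print(repeated_words("a b a c a"))
```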

If we need to match a word that contains at least two different letters, we can use the (?!abc) syntax:


>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'aa', re.IGNORECASE)
>>> re.search(r'([a-z]).*(?!\1)[a-z]', 'ab', re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 2), match='ab'>
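
The first call prints nothing because re.search returns None for 'aa'. On Python 3.7 and later the match object is displayed as re.Match rather than _sre.SRE_Match. The check can also be wrapped in a small function, as in this sketch; the name has_two_different_letters is hypothetical, and the input is assumed to be a single word.

```python
import re

# Same pattern as above, compiled once. (?!\1) rejects a second letter
# that repeats the first captured letter (case-insensitively).
TWO_LETTERS = re.compile(r'([a-z]).*(?!\1)[a-z]', re.IGNORECASE)

def has_two_different_letters(word):
    """Return True if word contains at least two distinct letters a-z."""
    return TWO_LETTERS.search(word) is not None

for word in ('aa', 'ab', 'aA', 'aba'):
    print(word, has_two_different_letters(word))
```

Because of re.IGNORECASE, 'aA' counts as the same letter twice and is rejected.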

Conclusion

That is all about greedy matching in Python regular expressions. I hope this article is of some help in learning or using Python; if you have any questions, you can leave a message. In the next article, I will continue by summarizing the API of Python's re module. Please stay tuned to this site.