Examples of using Python regular expressions

  • 2020-04-02 13:55:35
  • OfStack

As a concept, regular expressions are not unique to Python. However, there are some subtle differences in the way regular expressions are used in Python.

This article is part of a series of articles on Python regular expressions. In the first article in this series, we'll focus on how to use regular expressions in Python and highlight some features that are unique to Python.

We'll look at some ways to search for and find strings in Python. Then we'll discuss how to use groups to work with the sub-parts of the match objects that we find.

The regular expression module in Python that we're interested in is called 're'.


>>> import re

1. Raw strings in Python

The Python compiler uses '\' (backslash) to introduce escape sequences in string constants.

If the backslash is followed by a character sequence that the compiler recognizes, the entire escape sequence is replaced with the corresponding special character (for example, '\n' is replaced by the compiler with a newline character).

This presents a problem for regular expressions in Python, however, because the 're' module also uses backslashes to escape special characters in regular expressions (such as * and +).

The mix of the two means that sometimes you have to escape the backslash itself (when the escape sequence is recognized by both the Python compiler and the regular expression compiler), but at other times you don't have to (when the sequence is meaningful only to the regular expression compiler).

Instead of trying to figure out how many backslashes are needed, we can use a raw string.

A raw string is created by prefixing the opening quotation mark of a normal string with the character 'r'. When a string is raw, the Python compiler does not attempt any substitutions in it. Essentially, you're telling the compiler not to interfere with your string at all.


>>> string = 'This is a\nnormal string'
>>> rawString = r'and this is a\nraw string'
>>> print(string)
This is a
normal string
>>> print(rawString)
and this is a\nraw string

That is what a raw string looks like.
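As a rough illustration of why raw strings matter for regular expressions, the sketch below compares the two string forms and then matches a literal backslash. The variable names and the sample path are made up for this example.

```python
import re

# A normal string turns '\n' into a single newline character;
# a raw string keeps the backslash and the 'n' as two characters.
normal = 'a\nb'
raw = r'a\nb'
print(len(normal))  # 3
print(len(raw))     # 4

# To match one literal backslash, the regex engine must receive '\\'.
# Written as a raw string that is two characters; as a normal string it takes four.
path = 'C:\\temp'            # hypothetical sample text: C:\temp
raw_pattern = r'\\'
normal_pattern = '\\\\'
print(raw_pattern == normal_pattern)          # True
print(re.search(raw_pattern, path).start())   # 2
```

Both patterns are the same string once Python has parsed them; the raw form is simply easier to read and harder to get wrong.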
Using regular expressions to search in Python

The 're' module provides several methods for searching an input string. The methods we will discuss are:


re.match()
re.search()
re.findall()

Each method receives a regular expression and a string to match. Let's look at each of these methods in more detail to see how they work and how they differ.

2. Use re.match to match at the start of a string

Let's look at the match() method first. The match() method finds a match only if the pattern matches at the beginning of the string being searched.

For example, if we call match() on the string 'dog cat dog', the pattern 'dog' will match:


>>> re.match(r'dog', 'dog cat dog')
<_sre.SRE_Match object at 0xb743e720>
>>> match = re.match(r'dog', 'dog cat dog')
>>> match.group(0)
'dog'

We'll talk more about the group() method later. For now, all we need to know is that calling it with 0 as its argument returns the text matched by the pattern.

I've also skipped over the returned SRE_Match object, which we'll discuss shortly.

However, if we call the match() method on the same string and look for the pattern 'cat', we will not find a match.


>>> re.match(r'cat', 'dog cat dog')
>>>
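When match() finds nothing it returns None, which is why the interpreter prints nothing above. Calling group() on that None would raise an AttributeError, so the result is usually checked first. A minimal sketch, with a hypothetical helper name starts_with that is not part of the 're' module:

```python
import re

def starts_with(pattern, text):
    """Return the matched text if pattern matches at the start of text, else None."""
    m = re.match(pattern, text)
    if m is None:
        # No match at position 0, so there is no match object to query.
        return None
    return m.group(0)

print(starts_with(r'dog', 'dog cat dog'))  # dog
print(starts_with(r'cat', 'dog cat dog'))  # None
```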

3. Use re.search to find a match anywhere in the string

The search() method is similar to match(), except that the search() method does not restrict us to looking for a match only from the beginning of the string, so looking for 'cat' in our example string will find a match:


>>> match = re.search(r'cat', 'dog cat dog')
>>> match.group(0)
'cat'

The search() method, however, stops looking after it finds a match, so in our example string searching for 'dog' finds only its first occurrence.


>>> match = re.search(r'dog', 'dog cat dog')
>>> match.group(0)
'dog'

4. Use re.findall to get all matches

By far the lookup method I use most often in Python is findall(). When we call findall(), we simply get a list of all the matching substrings instead of a match object (we'll talk more about the match object later), which I find easier to work with. Calling the findall() method on the sample string we get:


>>> re.findall(r'dog', 'dog cat dog')
['dog', 'dog']
>>> re.findall(r'cat', 'dog cat dog')
['cat']
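To put the three methods side by side, the following sketch runs each of them against the same sample string. The variable names are chosen only for this illustration.

```python
import re

text = 'dog cat dog'

at_start = re.match(r'cat', text)       # None: 'cat' is not at position 0
first = re.search(r'dog', text)         # match object for the first 'dog' only
everything = re.findall(r'dog', text)   # list of every non-overlapping match

print(at_start)        # None
print(first.start())   # 0
print(everything)      # ['dog', 'dog']
```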

5. Use the match.start and match.end methods

So what is the 'match' object that the search() and match() methods returned to us earlier?

Instead of simply returning the matching part of a string, search() and match() return a "match object", which is a wrapper around the matching substring.

Earlier you saw that I could get the matching substring by calling the group() method (which, as we'll see in the next section, is quite useful when dealing with groups), but the match object also contains more information about the matching substring.

For example, the match object can tell us the start and end of the match in the original string:


>>> match = re.search(r'dog', 'dog cat dog')
>>> match.start()
0
>>> match.end()
3

Knowing these positions can be very useful, for example when you need to slice the original string around the match.
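One possible use of these positions is slicing the original string, as in the sketch below. It also uses span(), which returns the start and end together as a tuple.

```python
import re

text = 'dog cat dog'
match = re.search(r'cat', text)

start, end = match.start(), match.end()
print(start, end)                  # 4 7
print(text[start:end])             # cat

# Everything except the matched part (note the two spaces that remain).
without_match = text[:start] + text[end:]
print(repr(without_match))         # 'dog  dog'

print(match.span())                # (4, 7)
```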

6. Use match.group to access groups by number

As I mentioned earlier, match objects are very handy when dealing with groups.

Grouping is the ability to locate a specific substring of an entire regular expression. We can define a group as part of the entire regular expression, and then individually locate the content that matches that part.

Let's take a look at how it works:


>>> contactInfo = 'Doe, John: 555-1212'

The string I just created resembles an entry pulled out of someone's address book. We can match this line with a regular expression:


>>> re.search(r'\w+, \w+: \S+', contactInfo)
<_sre.SRE_Match object at 0xb74e1ad8>

By surrounding specific parts of the regular expression with parentheses (the characters '(' and ')'), we can group the content and then work with the subgroups separately.


>>> match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)

These groups can be retrieved with the group() method of the match object. They are numbered in the order in which they appear in the regular expression from left to right (starting at 1):


>>> match.group(1)
'Doe'
>>> match.group(2)
'John'
>>> match.group(3)
'555-1212'

The group numbers start at 1 because group 0 is reserved for the entire match (as we saw when we learned about the match() and search() methods earlier).


>>> match.group(0)
'Doe, John: 555-1212'
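When all the numbered groups are needed at once, the match object's groups() method returns them as a tuple, and group() accepts several numbers in one call. A short sketch using the same address-book line (the snake_case variable names are this example's own):

```python
import re

contact_info = 'Doe, John: 555-1212'
match = re.search(r'(\w+), (\w+): (\S+)', contact_info)

# groups() returns groups 1..N as a tuple (group 0 is not included).
print(match.groups())        # ('Doe', 'John', '555-1212')

# The tuple can be unpacked directly into names.
last, first, phone = match.groups()
print(first, last, phone)    # John Doe 555-1212

# group() with several arguments returns a tuple in the requested order.
print(match.group(2, 1))     # ('John', 'Doe')
```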

7. Use match.group to access groups by name

Sometimes, especially when a regular expression has many groups, it can be impractical to refer to groups by their position. Python also allows you to give a group a name using the (?P<name>...) syntax:


>>> match = re.search(r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)', contactInfo)

We can still use the group() method to get the contents of the group, but we need to use the group name we specified instead of the group number we used before.


>>> match.group('last')
'Doe'
>>> match.group('first')
'John'
>>> match.group('phone')
'555-1212'

This greatly enhances the clarity and readability of the code. You can imagine that as regular expressions become more and more complex, it will become more and more difficult to figure out what a group is capturing. Naming your groups will tell you and your readers exactly what you want.
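Named groups can also be collected in one step with the match object's groupdict() method, which returns a dictionary keyed by group name. The sketch below assumes the same address-book line as before; named groups remain reachable by number as well.

```python
import re

contact_info = 'Doe, John: 555-1212'
pattern = r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)'
match = re.search(pattern, contact_info)

# groupdict() maps each group name to the text it captured.
info = match.groupdict()
print(info['first'], info['last'])   # John Doe
print(info['phone'])                 # 555-1212

# A named group still has a number, based on its position in the pattern.
print(match.group(1) == match.group('last'))   # True
```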

Although the findall() method does not return a match object, it can also use groups. When the pattern contains groups, findall() returns a list of tuples, where the Nth element in each tuple corresponds to the Nth group in the regular expression.


>>> re.findall(r'(\w+), (\w+): (\S+)', contactInfo)
[('Doe', 'John', '555-1212')]
[('Doe', 'John', '555-1212')]

However, group names have no effect on the findall() method: named groups are still returned by position in each tuple.
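As a small illustration of findall() with groups, the sketch below uses a pattern with named groups on a two-line string. The second contact entry is invented for this example; note that the result contains plain tuples, with no trace of the group names.

```python
import re

# The second line is made-up sample data, added only to show multiple results.
contacts = 'Doe, John: 555-1212\nRoe, Jane: 555-3434'
pattern = r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)'

rows = re.findall(pattern, contacts)
print(rows)
# [('Doe', 'John', '555-1212'), ('Roe', 'Jane', '555-3434')]

# Elements are reached by position, not by group name.
for last, first, phone in rows:
    print(first, last, phone)
```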

In this article, we covered some of the basics of using regular expressions in Python. We learned about raw strings (and some of the headaches they can help you avoid when using regular expressions). We also learned how to use the match(), search(), and findall() methods for basic queries, and how to use groups to work with the sub-parts of match objects.

As always, the official Python documentation for the re module is a great resource if you want to see more on this topic.

In future articles, we will discuss the use of regular expressions in Python in more depth. We'll learn more about match objects, how to use them to make substitutions in strings, and even how to use them to parse Python data structures from text files.

