Escaping problems in Python regular expressions

  • 2020-04-02 14:26:30
  • OfStack

Let's start with some background: I ran into a problem while writing the Xiami music demo downloader. Because the saved files are named after the song titles, some titles produce characters that are illegal in file names (ahem, I'm looking at you, Windows -_-) when the file is saved. Then I remembered the solution used by Xunlei (Thunder): replace all illegal characters with underscores.

Hence the use of regular expressions. After a quick search, I wrote down this function:


import re

def sanitize_filename(filename):
    return re.sub('[\/:*?<>|]', '_', filename)

Recently I discovered (link: https://github.com/timothyqiu/xiami-downloader/commit/18817b0646e5e81fd75a9b55ec3a94e8e1f2dd1f) that this function has quite a few problems:

Unlike the shell, Python treats the backslash as an escape character in both single-quoted and double-quoted strings. Luckily, Python leaves a meaningless escape such as \/ as is. Even so, sanitize_filename('\\/:*?<>|') still returns \_______ rather than a string made up entirely of underscores.

So I decided it was time to actually read the documentation.

Raw strings

After reading the documentation, I realized that Python's regular expression module does its own escaping, independently of the string literal. For example, to match a single backslash character, the argument has to be written as '\\\\':

Python unescapes the string literal: \\\\ becomes \\
The re module receives \\, interprets it as a regular expression, and unescapes it to \ according to the escape rules of regular expressions

With that in mind, a raw string is pretty much what its name says: a string in which Python does no escaping (except that it cannot end with a single backslash). So the pattern that matches a backslash character can be written as r'\\'.
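
To make the two layers of escaping concrete, here is a small sketch. The variable names and the sample path are made up for illustration. It shows that the ordinary literal and the raw literal hand the same two-character pattern to the re module, which then reads it as one literal backslash:

```python
import re

# Ordinary literal: Python's own escaping turns four backslashes into two.
plain_pattern = '\\\\'
# Raw literal: Python leaves the two backslashes alone.
raw_pattern = r'\\'

# Both spellings give the re module the same two-character pattern.
assert plain_pattern == raw_pattern
assert len(raw_pattern) == 2

# The re module then reads those two characters as one literal backslash.
sample_path = r'C:\music\demo.mp3'  # made-up example input
backslashes = re.findall(raw_pattern, sample_path)
print(len(backslashes))  # 2
```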

So the above sanitize_filename changes to:


def sanitize_filename(filename):
    return re.sub(r'[\\/:*?<>|]', '_', filename)
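
As a quick check, the sketch below compares a character class containing \/ with one containing \\/. The helper names and the test string are hypothetical, not taken from the downloader. The point is only that the regular expression engine reads \/ as an escaped slash, so the backslash itself is matched only when it is doubled:

```python
import re

def sanitize_single(filename):
    # Inside the class, \/ is just an escaped slash: the backslash is NOT matched.
    return re.sub(r'[\/:*?<>|]', '_', filename)

def sanitize_double(filename):
    # \\ is a literal backslash, followed by a plain slash.
    return re.sub(r'[\\/:*?<>|]', '_', filename)

illegal = '\\/:*?<>|'  # a backslash followed by seven other illegal characters

print(sanitize_single(illegal))  # \_______
print(sanitize_double(illegal))  # ________
```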

Regex and Match

So I took a serious look at the re module. What follows is a running account, so impatient readers be warned.

The main objects in Python's regular expression module re are actually these two:

RegexObject: the regular expression
MatchObject: the match

A RegexObject is a regular expression object, and it owns all the operations such as match and sub. It is created by re.compile(pattern, flags).


>>> import re
>>> email_pattern = re.compile(r'\w+@\w+\.\w+')
>>> email_pattern.findall('My email is abc@def.com and his is user@example.com')
['abc@def.com', 'user@example.com']

Its methods:

search scans from any position in the string and returns either a MatchObject or None
match matches only from the first character and returns either a MatchObject or None
split returns the list of pieces of the string split at each match
findall returns a list of all matches
finditer returns an iterator over MatchObjects
sub returns the string with the matches replaced
subn returns a tuple (replaced string, number of replacements)

The functions provided at module level, such as re.sub, re.match, and re.findall, can be thought of as shortcuts that save you from creating a regular expression object yourself. A RegexObject can be reused, which is its advantage over these shortcut functions.
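
The methods listed above can be tried on one compiled pattern. This is only a sketch: the pattern, the sample text, and the variable names are made up for illustration.

```python
import re

number = re.compile(r'\d+')   # one compiled pattern, reused below
text = 'track 12 of 30'

print(number.match(text))            # None: the text does not start with a digit
print(number.search(text).group())   # 12
print(number.split(text))            # ['track ', ' of ', '']
print(number.findall(text))          # ['12', '30']
print([m.span() for m in number.finditer(text)])  # [(6, 8), (12, 14)]
print(number.sub('#', text))         # track # of #
print(number.subn('#', text))        # ('track # of #', 2)

# The module-level shortcut gives the same result without compiling first.
print(re.findall(r'\d+', text))      # ['12', '30']
```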

A MatchObject is a match object that represents the result of a regular expression match. It is returned by some of the RegexObject methods. A match object always evaluates to True, and it has a bunch of methods for getting information about the groups in the regular expression.


>>> for m in re.finditer(r'(\w+)@\w+\.\w+', 'My email is abc@def.com and his is user@example.com'):
...     print('%d-%d %s %s' % (m.start(0), m.end(0), m.group(1), m.group(0)))
...
12-23 abc abc@def.com
35-51 user user@example.com
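
Because a match object is always true and a failed search returns None, the result can be tested directly in an if statement. A minimal sketch, with a hypothetical helper name and made-up input:

```python
import re

email_pattern = re.compile(r'(\w+)@\w+\.\w+')

def first_user(text):
    """Return the part before @ of the first address in text, or None."""
    m = email_pattern.search(text)
    if m:  # a match object is truthy, None is not
        return m.group(1)
    return None

print(first_user('My email is abc@def.com'))  # abc
print(first_user('no address here'))          # None
```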

Reference: The Python Standard Library (link: http://docs.python.org/2/library/re.html)
