Python Regular Expression Function Tutorial [Classic]

2020-06-03 07:10:17
OfStack

This article describes the regular expression functions provided by Python, shared here for your reference. The details are as follows:

Introduction:

First, what is a regular expression?

For example, we may want to determine whether the string "adi_e32fv,Ls" contains the substring "e32f". Or we may want to search a txt file containing millions of names for every name with the surname "Wang" that ends in "5", and print them out. The results are: "Wang 5", "Wang Xiao 5", "Wang Da 5", "Wang Xiao 5"...

We used to do this with string functions, but that code can be complicated to write. With regular expressions, a single statement such as re.findall('Wang.*?5',txt1) is enough. Regular expressions are basic knowledge for writing web crawlers, where they can be used to collect the URLs in an HTML page that meet certain string requirements. Here is a personal summary of the basics of regular expressions.
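As a rough illustration of the name search described above, here is a minimal sketch. The contents of txt1 are hypothetical sample data standing in for the large names file, with one name per line.

```python
import re

# Hypothetical sample data standing in for the large names file.
txt1 = "Wang 5\nLi 4\nWang Xiao 5\nZhao 6\nWang Da 5"

# 'Wang', then as few characters as possible, then '5'.
# '.' does not match a newline, so each match stays on one line.
matches = re.findall('Wang.*?5', txt1)
print(matches)  # ['Wang 5', 'Wang Xiao 5', 'Wang Da 5']
```

Doing the same with plain string methods would need a loop plus separate startswith/endswith checks for every line.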

(Operating environment: 32-bit Win8 system, operating tool: python2.7.9+Eclipse.)

Body:

1. First, import the re module of python.

2. Metacharacters: . ^ $ * + ? {} [] \ | ()

The findall(pattern, string) method in the re module returns, as a list, every substring of string that matches pattern. For example, matching 'dit' in the string 'dit dot det,dct dit dot' gives the following:


str1 = 'dit dot det,dct dit dot'
print re.findall('dit',str1)

Results:


['dit', 'dit']

| usage: 'dit|dct' matches dit or dct.


str1 = 'dit dot det,dct dit dot'
print re.findall('dit|dct',str1)

Results:


['dit', 'dct', 'dit']

[] usage: [ic] matches i or c. For example, 'd[ic]t' matches both dit and dct, so it is equivalent to 'dit|dct':


str1 = 'dit dot det,dct dit dot'
print re.findall('d[ic]t',str1)

Results:


['dit', 'dct', 'dit']
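The equivalence between alternation and a character class can be checked directly. This is only a small sketch; the variable names are made up for illustration.

```python
import re

str1 = 'dit dot det,dct dit dot'

# Two ways of saying "d, then i or c, then t".
by_alternation = re.findall('dit|dct', str1)
by_char_class = re.findall('d[ic]t', str1)

print(by_alternation)                   # ['dit', 'dct', 'dit']
print(by_alternation == by_char_class)  # True
```

A character class only chooses between single characters, while | can choose between alternatives of any length.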

^ usage 1: inside brackets, [^ic] matches any character except i and c:


str1 = 'dit dot det,dct dit dot'
print re.findall('d[^ic]t',str1)

Results:


['dot', 'det', 'dot']

^ usage 2: ^dit matches only when the substring dit is at the beginning of the string. Here dit is at the beginning and dct is not:


str1 = 'dit dot det,dct dit dot'
print re.findall('^dit',str1)
print re.findall('^dct',str1)

Results:


['dit']
[]

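$ usage: dot$ matches only when the substring dot is at the end of the string. Here dot is at the end and dct is not:


str1 = 'dit dot det,dct dit dot'
print re.findall('dot$',str1)
print re.findall('dct$',str1)

Results:


['dot']
[]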
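The two anchors can also be combined to test a whole string at once. The sketch below reuses the sample string from above; the variable names are illustrative only.

```python
import re

str1 = 'dit dot det,dct dit dot'

starts_with_dit = re.findall('^dit', str1)   # anchored at the beginning
ends_with_dct = re.findall('dct$', str1)     # anchored at the end

# Both anchors together: the string must start with dit AND end with dot.
whole = re.findall('^dit.*dot$', str1)

print(starts_with_dit)  # ['dit']
print(ends_with_dct)    # []
print(whole)            # ['dit dot det,dct dit dot']
```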

. usage: d.t means exactly one arbitrary character between d and t:


str1 = 'dit dot det,dct dit dot'
print re.findall('d.t',str1)

Results:


['dit', 'dot', 'det', 'dct', 'dit', 'dot']

+ usage: di+t means one or more 'i' characters between d and t:


str1 = 'd dt dit diit det'
print re.findall('di+t',str1)

Results:


['dit', 'diit']

* usage: di*t means zero or more 'i' characters between d and t:


str1 = 'd dt dit diit det'
print re.findall('di*t',str1)

Results:


['dt', 'dit', 'diit']

Often, '.' is combined with '+' or '*'. '.+' means one or more of any character, and '.*' means zero or more of any character:


str1 = 'dit dot det,dct dit dot'
print re.findall('d.+t',str1)
print re.findall('d.*t',str1)

Results:


['dit dot det,dct dit dot']
['dit dot det,dct dit dot']

? usage 1: Look at the result of '.+'. The substrings 'dit' and 'dot' also satisfy 'd.+t', yet the output is the longest string that satisfies the pattern, 'dit dot det,dct dit dot'. This is called greedy matching. To get the shortest possible match instead, add '?' after '+'. (Note: the same is true for '*'; just add '?' after '*'.)


str1 = 'dit dot det,dct dit dot'
print re.findall('d.+?t',str1)

Results:


['dit', 'dot', 'det', 'dct', 'dit', 'dot']
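To see greedy and non-greedy matching side by side, here is a short sketch. The second input string, 'dt dit', is a hypothetical example chosen to show that '*' behaves the same way as '+'.

```python
import re

str1 = 'dit dot det,dct dit dot'

greedy = re.findall('d.+t', str1)    # grabs as much as possible
lazy = re.findall('d.+?t', str1)     # grabs as little as possible

print(greedy)  # ['dit dot det,dct dit dot']
print(lazy)    # ['dit', 'dot', 'det', 'dct', 'dit', 'dot']

# The same applies to '*' (hypothetical second input).
greedy_star = re.findall('d.*t', 'dt dit')
lazy_star = re.findall('d.*?t', 'dt dit')

print(greedy_star)  # ['dt dit']
print(lazy_star)    # ['dt', 'dit']
```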

? usage 2: di?t means the i is optional, i.e. both dt and dit satisfy the pattern:


str1 = 'd dt dit diit det'
print re.findall('di?t',str1)

Results:


['dt', 'dit']

{} usage 1: di{n}t means exactly n 'i' characters between d and t:


str1 = 'dt dit diit diiit diiiit'
print re.findall('di{2}t',str1)

Results:


['diit']

{} usage 2: di{n,m}t means n to m 'i' characters between d and t:


str1 = 'dt dit diit diiit diiiit'
print re.findall('di{1,3}t',str1)

Results:


['dit', 'diit', 'diiit']

n and m can be omitted. {n,} means n or more; {,m} means 0 to m; {,} means any number, the same as '*':


str1 = 'dt dit diit diiit diiiit'
print re.findall('di{1,}t',str1)
print re.findall('di{,3}t',str1)
print re.findall('di{,}t',str1)

Results:


['dit', 'diit', 'diiit', 'diiiit']
['dt', 'dit', 'diit', 'diiit']
['dt', 'dit', 'diit', 'diiit', 'diiiit']
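The brace form can express the other quantifiers as well. The sketch below compares them on the same sample string. It spells the lower bound out as {0,} rather than leaving it empty, purely for readability.

```python
import re

str1 = 'dt dit diit diiit diiiit'

# {1,} behaves like '+', {0,} like '*', and {0,1} like '?'.
same_as_plus = re.findall('di{1,}t', str1) == re.findall('di+t', str1)
same_as_star = re.findall('di{0,}t', str1) == re.findall('di*t', str1)
same_as_question = re.findall('di{0,1}t', str1) == re.findall('di?t', str1)

print(same_as_plus)      # True
print(same_as_star)      # True
print(same_as_question)  # True
```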

\ usage 1: Escape a metacharacter so that it is treated as an ordinary character:


str1 = '^abc ^abc'
print re.findall('^abc',str1)
print re.findall(r'\^abc',str1)

Results:


[]
['^abc', '^abc']

\ usage 2: Predefined character classes, such as \d (a digit) and \w (a letter, digit, or underscore):


str1 = '12 abc 345 efgh'
print re.findall(r'\d+',str1)
print re.findall(r'\w+',str1)

Results:


['12', '345']
['12', 'abc', '345', 'efgh']
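A few more predefined classes follow the same idea: \s matches a whitespace character, and the uppercase form of a class (for example \D) matches its complement. The sketch below writes the patterns as raw strings (r'...') so that Python passes the backslash to the re module unchanged.

```python
import re

str1 = '12 abc 345 efgh'

digits = re.findall(r'\d+', str1)       # runs of digits
non_digits = re.findall(r'\D+', str1)   # runs of anything except digits
spaces = re.findall(r'\s', str1)        # single whitespace characters

print(digits)      # ['12', '345']
print(non_digits)  # [' abc ', ' efgh']
print(spaces)      # [' ', ' ', ' ']
```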

() usage: when the pattern contains groups, findall returns only the text captured inside '()' instead of the whole match. With several groups, each result is a tuple:


str1 = '12abcd34'
print re.findall('12abcd34',str1)
print re.findall('1(2a)bcd34',str1)
print re.findall('1(2a)bc(d3)4',str1)

Results:


['12abcd34']
['2a']
[('2a', 'd3')]
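Groups are what make findall useful for the crawler task mentioned in the introduction. The sketch below pulls link targets out of a small HTML fragment. The html string and its example.com URLs are hypothetical, and a regular expression this simple only suits markup with a known, regular shape.

```python
import re

# Hypothetical HTML fragment for illustration.
html = ('<a href="http://example.com/a">A</a> '
        '<a href="http://example.com/b">B</a>')

# One group: only the URL inside the quotes is returned.
urls = re.findall(r'href="(.*?)"', html)

# Two groups: each result is a (url, link text) tuple.
pairs = re.findall(r'href="(.*?)">(.*?)</a>', html)

print(urls)   # ['http://example.com/a', 'http://example.com/b']
print(pairs)  # [('http://example.com/a', 'A'), ('http://example.com/b', 'B')]
```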

3. Main methods in re module: findall(), finditer(), match(), search(), compile(), split(), sub(), subn().

re.findall(pattern,string,flags = 0)

How it works: Searches string from left to right for substrings that match pattern. The result is returned as a list.


str1 = 'ab cd'
print re.findall('\w+',str1)

Results: ['ab', 'cd']

re.finditer(pattern,string,flags = 0)

How it works: It does the same thing as re.findall(), but the result is returned as an iterator of match objects.


str1 = 'ab cd'
iter1 = re.finditer('\w+',str1)
for a in iter1:
  print a.group(),a.span()

Results:

ab (0, 2)
cd (3, 5)

(Note: a.group() returns the matched string, and a.span() returns the start and end positions of the match.)
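The relationship between group() and span() can be verified by slicing the original string with the reported positions. This is a minimal sketch using the same input as above.

```python
import re

str1 = 'ab cd'

spans = []
for m in re.finditer(r'\w+', str1):
    start, end = m.span()
    # Slicing with the span gives back exactly the matched text.
    print(m.group() == str1[start:end])  # True
    spans.append((start, end))

print(spans)  # [(0, 2), (3, 5)]
```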

re.search(pattern,string,flags = 0)

Effect: Searches string from left to right for the first substring that matches pattern. If there is no match, it returns None; otherwise, it returns a match object.


str1 = 'ab cd'
result = re.search('cd',str1)
if result == None:
  print 'None'
else:
  print result.group(),result.start(),result.end()

Results: cd 3 5

re.match(pattern,string,flags = 0)

Effect: Checks whether the beginning of string matches pattern. It returns a match object if it does, or None if it does not.


str1 = 'ab cd'
result = re.match('cd',str1)
if result == None:
  print 'None'
else:
  print result.group(),result.start(),result.end()

Results: None
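The difference between search() and match() is easiest to see when both are run on the same input. The sketch below uses illustrative variable names.

```python
import re

str1 = 'ab cd'

by_search = re.search('cd', str1)  # scans the whole string
by_match = re.match('cd', str1)    # only tries the beginning
at_start = re.match('ab', str1)    # 'ab' is at the beginning

print(by_search.span())  # (3, 5)
print(by_match)          # None
print(at_start.group())  # ab
```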

re.compile(pattern,flags = 0)

Effect: Compiles pattern and returns a pattern object, whose methods (findall(), search(), and so on) can then be called directly. Compiling first avoids re-parsing a pattern that is used many times, which can improve matching speed.


str1 = 'ab cd'
pattern1 = re.compile('ab')
print pattern1.findall(str1)

Results: ['ab']
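A compiled pattern is most useful when the same expression is applied to many inputs. The sketch below uses a hypothetical list of lines. Note that the re module also keeps an internal cache of recently used patterns, so the gain from compiling by hand depends on the workload.

```python
import re

# Compile once, reuse for every line.
word = re.compile(r'\w+')

# Hypothetical input lines for illustration.
lines = ['ab cd', 'dit dot det', '']
counts = [len(word.findall(line)) for line in lines]

print(counts)                         # [2, 3, 0]
print(word.search('  ab cd').span())  # (2, 4)
```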

re.split(pattern,string,maxsplit = 0,flags = 0)

Effect: Splits string wherever pattern matches:


str1 = 'ab.c.de'
str2 = '12+34-56*78/90'
print re.split(r'\.',str1)
print re.split(r'[+\-*/]',str2)

Results:

['ab', 'c', 'de']
['12', '34', '56', '78', '90']
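The maxsplit parameter in the signature above limits how many splits are made. In the sketch below, str2 is a hypothetical input chosen so that a full split reproduces the second result list shown above.

```python
import re

# Hypothetical input: numbers separated by arithmetic operators.
str2 = '12+34-56*78/90'

all_parts = re.split(r'[+\-*/]', str2)
first_two = re.split(r'[+\-*/]', str2, maxsplit=2)

print(all_parts)  # ['12', '34', '56', '78', '90']
print(first_two)  # ['12', '34', '56*78/90']
```

With maxsplit=2 only the first two separators are used, and the rest of the string is returned as the final element.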

re.sub(pattern,repl,string,count = 0,flags = 0)

Effect: Replaces every part of string that matches pattern with repl:


str1 = 'abcde'
print re.sub('bc','123',str1)

Results: a123de

re.subn(pattern,repl,string,count = 0,flags = 0)

Effect: It works the same as re.sub(), but also returns the number of substitutions made:


str1 = 'abcdebce'
print re.subn('bc','123',str1)

Results: ('a123de123e', 2)
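The count parameter shared by sub() and subn() limits how many matches are replaced. The sketch below uses a hypothetical input string that contains 'bc' twice.

```python
import re

# Hypothetical input containing the substring 'bc' twice.
str1 = 'abcdebce'

replaced_all = re.sub('bc', '123', str1)
replaced_once = re.sub('bc', '123', str1, count=1)
new_text, n = re.subn('bc', '123', str1)

print(replaced_all)   # a123de123e
print(replaced_once)  # a123debce
print(n)              # 2
```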

PS: Here are two more handy regular expression tools for your reference:

JavaScript Regular Expression online test tool:
http://tools.ofstack.com/regex/javascript

Regular expression online generation tool:
http://tools.ofstack.com/regex/create_reg


I hope this article helps you with your Python programming.