Extracting strings with regular expressions in Python

2020-05-17 05:55:06
OfStack

Preface

This article does not cover the basics of regular expressions. It describes two ways to extract strings: extracting a string from a single position in the text, and extracting strings from multiple consecutive positions. Both situations come up in log analysis, and the corresponding methods are described one by one below.

1. String extraction of a single position

In this case we can use the regular expression (.+?) to extract the value. For example, given the string "a123b", to extract the value 123 between a and b we can call findall with this regular expression; it returns a list containing every matching value.

The code is as follows:


import re
text = "a123b"
print(re.findall(r"a(.+?)b", text))
# Output: ['123']
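
Since findall returns every non-overlapping match, the same pattern also works when the delimiters appear more than once. The following is a minimal sketch; the sample strings are made up for illustration and are not from the original example.

```python
import re

# Hypothetical sample text with three a...b pairs.
text = "a123b a456b a789b"

# findall returns the captured group of every non-overlapping match.
values = re.findall(r"a(.+?)b", text)
print(values)  # ['123', '456', '789']

# When nothing matches, the result is an empty list, not an error.
empty = re.findall(r"a(.+?)b", "xyz 123")
print(empty)  # []
```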

1.1 Greedy and non-greedy matching

Given the string "a123b456b", suppose we want everything between a and the last b rather than between a and the first b. The ? after a quantifier controls whether the match is greedy or non-greedy: (.+?) stops at the nearest b, while (.+) extends to the last b.

The code is as follows:


import re
text = "a123b456b"

print(re.findall(r"a(.+?)b", text))
# Output: ['123']
# The ? after + makes the match non-greedy, so it stops at the nearest b

print(re.findall(r"a(.+)b", text))
# Output: ['123b456']

print(re.findall(r"a(.*)b", text))
# Output: ['123b456']
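
To see where each variant stops, it can help to look at the span of the whole match. The sketch below uses re.search on the same string; the negated character class at the end is an alternative technique added here for comparison, not something the example above uses.

```python
import re

text = "a123b456b"

lazy = re.search(r"a(.+?)b", text)   # +? takes as few characters as possible
greedy = re.search(r"a(.+)b", text)  # + takes as many characters as possible

# group(0) is the whole match; span() gives its start and end offsets.
print(lazy.group(0), lazy.span())      # a123b (0, 5)
print(greedy.group(0), greedy.span())  # a123b456b (0, 9)

# Alternative: exclude the closing delimiter from the group itself.
no_b = re.findall(r"a([^b]+)b", text)
print(no_b)  # ['123']
```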

1.2 Multi-line matching

To match across multiple lines, add the re.S or re.M flag. With re.S, . also matches a newline; by default, . does not match a newline.

The code is as follows:


import re
text = "a23b\na34b"

print(re.findall(r"a(\d+)b.+a(\d+)b", text))
# Output: []
# By default . does not match the \n in text

print(re.findall(r"a(\d+)b.+a(\d+)b", text, re.S))
# Output: [('23', '34')]
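
The same flag can be written in other ways. As a small sketch: re.DOTALL is the long name for re.S, and the inline form (?s) at the start of the pattern has the same effect without passing a flag argument.

```python
import re

text = "a23b\na34b"
pattern = r"a(\d+)b.+a(\d+)b"

# re.DOTALL is the long name of re.S.
with_flag = re.findall(pattern, text, re.DOTALL)
print(with_flag)  # [('23', '34')]

# The inline flag (?s) at the start of the pattern does the same thing.
with_inline = re.findall("(?s)" + pattern, text)
print(with_inline)  # [('23', '34')]

print(re.S == re.DOTALL)  # True
```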

With re.M, ^ and $ match at the start and end of every line. By default, ^ matches only at the start of the string and $ only at its end (or just before a trailing newline).

The code is as follows:


import re
text = "a23b\na34b"

print(re.findall(r"^a(\d+)b", text))
# Output: ['23']

print(re.findall(r"^a(\d+)b", text, re.M))
# Output: ['23', '34']
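
The $ anchor behaves the same way, and the two flags can be combined with |. A minimal sketch, using made-up sample text in which every line ends with a number:

```python
import re

# Hypothetical multi-line text; each line ends with a number.
text = "size 10\nsize 20\nsize 30"

# Without re.M, $ only matches at the end of the whole string.
last_only = re.findall(r"(\d+)$", text)
print(last_only)  # ['30']

# With re.M, $ matches at the end of every line.
every_line = re.findall(r"(\d+)$", text, re.M)
print(every_line)  # ['10', '20', '30']

# Combined flags: ^ and $ work per line, and . matches the newline between lines.
pairs = re.findall(r"^size (\d+)$.^size (\d+)$", text, re.M | re.S)
print(pairs)  # [('10', '20')]
```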

2. String extraction of multiple consecutive positions

In this case we can use the regular expression (?P<name>…) to extract the values. For example, given one line of a web server access log: '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"', to extract every field in this line we can write several (?P<name>expr) groups, where name is the variable name you choose for the string at that position and expr is the regular expression that matches it.

The code is as follows:


import re
line = '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"'
# One named group per field; quoted fields are matched with [^"]*
reg = re.compile(r'^(?P<remote_ip>[^ ]*) (?P<date>[^ ]*) "(?P<request>[^"]*)" (?P<status>[^ ]*) (?P<size>[^ ]*) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"')
regMatch = reg.match(line)
linebits = regMatch.groupdict()
print(linebits)
for k, v in linebits.items():
    print(k + ": " + v)

The loop prints the following (the order of the keys may vary with the Python version):


status: 200
referrer: http://abc.com/search
request: GET /api HTTP/1.1
user_agent: Mozilla/5.0
date: 25/Oct/2012:14:46:34
size: 44
remote_ip: 192.168.0.1
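
In log analysis the pattern is usually applied to many lines, some of which may not match. The sketch below reuses the same named groups; the helper parse_line, the constant LOG_PATTERN, and the second sample line are hypothetical names and data introduced only for illustration.

```python
import re

# Same named groups as in the example above, split over two raw strings.
LOG_PATTERN = re.compile(
    r'^(?P<remote_ip>[^ ]*) (?P<date>[^ ]*) "(?P<request>[^"]*)" '
    r'(?P<status>[^ ]*) (?P<size>[^ ]*) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of the named fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

lines = [
    '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"',
    'not a log line',  # hypothetical malformed line
]
records = [parse_line(entry) for entry in lines]

print(records[0]["status"], records[0]["request"])  # 200 GET /api HTTP/1.1
print(records[1])  # None
```

Checking for None before calling groupdict() avoids an AttributeError on lines that do not follow the expected format.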

Conclusion

That is all for this article. I hope it is of some help in your study or work; if you have questions, you can leave a message to discuss them.