How to use regular expressions in Swift

  • 2020-06-15 10:21:57
  • OfStack

Preface

Regular expressions (often abbreviated to regex, regexp, or RE in code) are a concept in computer science. They are typically used to search for and replace text that conforms to a pattern (rule).

Regular expressions let us perform complex searches and substitutions across thousands of documents in a matter of seconds, and they are still widely used more than 50 years after their creation.

Although Swift is a new language, at the time of writing it does not provide a dedicated syntax or type for regular expressions, so we have to use the older NSRegularExpression class for matching.

In this article, I'll cover the basic use of regular expressions in Swift. We'll cover some of the most important regular expression syntax and some useful extensions, from easy to difficult.

NSRegularExpression: How do I match regular expressions in a string

The NSRegularExpression class lets you find and replace substrings using regular expressions, which are concise and flexible descriptions of text. For example, if you want to extract "Taylor Swift" from "My name is Taylor Swift", write a regular expression that matches the text "My name is" followed by any text, and pass it to the NSRegularExpression class.

You can see the code below. Notice that we extract the range at index 1, because the range at index 0 is the entire matched string and the range at index 1 is the "Taylor Swift" part captured by the parentheses.


import Foundation

do {
 let input = "My name is Taylor Swift"
 let regex = try NSRegularExpression(pattern: "My name is (.*)", options: .caseInsensitive)
 let matches = regex.matches(in: input, options: [], range: NSRange(location: 0, length: input.utf16.count))

 if let match = matches.first {
  // Range 0 is the whole match; range 1 is the first capture group.
  let range = match.range(at: 1)
  if let swiftRange = Range(range, in: input) {
   let name = input[swiftRange]
   print(name) // Taylor Swift
  }
 }
} catch {
 // The pattern was not a valid regular expression.
}
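To make the numbering of ranges easier to see, here is a minimal sketch that collects the text of every range in the first match. The helper name captureGroups(of:in:) is hypothetical and exists only for this illustration; it uses nothing beyond the NSRegularExpression calls shown above.

```swift
import Foundation

// Hypothetical helper: returns the text of every range in the first match.
// Index 0 is the whole match; index 1 and up are the capture groups.
func captureGroups(of pattern: String, in input: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(location: 0, length: input.utf16.count)
    guard let match = regex.firstMatch(in: input, options: [], range: fullRange) else { return [] }

    var groups: [String] = []
    for index in 0..<match.numberOfRanges {
        // A group that took no part in the match has no valid range, so skip it.
        if let range = Range(match.range(at: index), in: input) {
            groups.append(String(input[range]))
        }
    }
    return groups
}

let groups = captureGroups(of: "My name is (.*)", in: "My name is Taylor Swift")
print(groups) // ["My name is Taylor Swift", "Taylor Swift"]
```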

A detailed explanation of regular expressions

Let's start with a few simple examples for those unfamiliar with regular expressions. Regular expressions, regex for short, let us perform fuzzy searches inside strings. For example, we know that "cat" contains "at", but what if we wanted to match any three-letter word ending in "at"?

Regular expressions are designed to solve this problem, although the syntax for using them in Swift is somewhat clumsy because of the API's Objective-C roots.

1. First, define the string you want to search:


let testString = "hat"

Then create an NSRange instance that covers the full length of the string:


let range = NSRange(location: 0, length: testString.utf16.count)

The utf16 count is used here because NSRange is measured in UTF-16 code units; this avoids problems with emoji and similar characters.

2. Then create an instance of NSRegularExpression using regular expression syntax:


let regex = try! NSRegularExpression(pattern: "[a-z]at")

[a-z] is used in regular expressions to specify any letter between a and z. The initializer throws because you might provide an invalid regular expression, but here we have a hard-coded pattern that we know is correct, so there is no need to catch errors.

3. Finally, call firstMatch(in:) on the regular expression you created, passing the string to search, any special options, and the range of the string to search. If the string matches the regular expression, a result is returned; otherwise you get nil. So if you just want to check whether the string matched at all, compare the result of firstMatch(in:) to nil:


regex.firstMatch(in: testString, options: [], range: range) != nil

Using NSRange is unfortunate but necessary: this API was designed for NSString and does not bridge well to Swift. The Swift String Manifesto mentions possible replacements, but that looks a long way off.
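The following sketch shows why the range should be built from utf16.count rather than count. The sample string is our own: its emoji is a single Swift Character but occupies two UTF-16 code units, so a range based on count stops one unit short and cuts off the final "t".

```swift
import Foundation

let text = "\u{1F44D} cat"   // thumbs-up emoji, a space, then "cat"
let regex = try! NSRegularExpression(pattern: "[a-z]at")

// NSRange counts UTF-16 code units, and this emoji takes two of them.
let characterCount = text.count        // 5 characters
let utf16Count = text.utf16.count      // 6 UTF-16 code units

let shortRange = NSRange(location: 0, length: characterCount)
let fullRange = NSRange(location: 0, length: utf16Count)

// The short range ends after "ca", so the pattern cannot match inside it.
let foundWithCharacterCount = regex.firstMatch(in: text, options: [], range: shortRange) != nil
let foundWithUTF16Count = regex.firstMatch(in: text, options: [], range: fullRange) != nil

print(foundWithCharacterCount) // false
print(foundWithUTF16Count)     // true
```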

The regular expression "[a-z]at" will successfully match "hat" as well as "cat", "sat", "mat", "bat", and so on. We just describe what we want to match and NSRegularExpression takes care of the rest.
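As a quick check, this sketch runs the same pattern against a handful of words of our own choosing. The containsMatch helper is hypothetical. Note that firstMatch(in:) looks for the pattern anywhere in the string, so a longer word such as "chat" also counts as a match because it contains "hat".

```swift
import Foundation

let regex = try! NSRegularExpression(pattern: "[a-z]at")

// Hypothetical helper: true if the pattern occurs anywhere in the word.
func containsMatch(_ word: String) -> Bool {
    let range = NSRange(location: 0, length: word.utf16.count)
    return regex.firstMatch(in: word, options: [], range: range) != nil
}

print(containsMatch("hat"))  // true
print(containsMatch("cat"))  // true
print(containsMatch("at"))   // false: no letter before "at"
print(containsMatch("Hat"))  // false: "H" is not in the range a-z
print(containsMatch("chat")) // true: the substring "hat" matches
```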

Making NSRegularExpression easier to use

We'll look at more regular expression syntax later, but first let's see how to make NSRegularExpression a little easier to use.

Right now we need three lines of Swift code to match a simple string:


let range = NSRange(location: 0, length: testString.utf16.count)
let regex = try! NSRegularExpression(pattern: "[a-z]at")
regex.firstMatch(in: testString, options: [], range: range) != nil

There are many ways to improve, but the most effective is to extend NSRegularExpression to make it easier to create and match expressions.

First, this line:


let regex = try! NSRegularExpression(pattern: "[a-z]at")

I mentioned that creating an instance of NSRegularExpression can throw an error, because you might provide an illegal regular expression. For example, [a-z is illegal because the closing ] has been forgotten.

As a result, it is common to create NSRegularExpression instances using try!. However, this can cause havoc with linting tools such as SwiftLint. A better idea is to add a convenience initializer that either creates the regular expression correctly or triggers a failure that you will notice during development.


extension NSRegularExpression {
 convenience init(_ pattern: String) {
  do {
   try self.init(pattern: pattern)
  } catch {
   preconditionFailure("Illegal regular expression: \(pattern).")
  }
 }
}

Note: If your app accepts regular expressions written by users, you should keep using the throwing NSRegularExpression(pattern:) initializer so that you can handle errors gracefully.
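For patterns that come from users, a minimal sketch of that error handling might look like the following. The function name makeRegex(fromUserInput:) is hypothetical, and printing the error is only a placeholder for whatever feedback your app gives.

```swift
import Foundation

// Hypothetical helper for patterns typed in by a user.
// Returns nil instead of crashing when the pattern is invalid.
func makeRegex(fromUserInput pattern: String) -> NSRegularExpression? {
    do {
        return try NSRegularExpression(pattern: pattern)
    } catch {
        // Placeholder: a real app would show this to the user.
        print("Invalid regular expression: \(error.localizedDescription)")
        return nil
    }
}

let valid = makeRegex(fromUserInput: "[a-z]at")   // a usable regex
let invalid = makeRegex(fromUserInput: "[a-z")    // nil: the ] is missing
```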

And then these lines:


let range = NSRange(location: 0, length: testString.utf16.count)
regex.firstMatch(in: testString, options: [], range: range) != nil

The first line creates an NSRange covering the entire string, and the second line looks for the first match in the text. This is clumsy: most of the time you want to search the entire string anyway, and comparing the result of firstMatch(in:) against nil obscures your intent.

So instead, add another extension that wraps this code in a simple matches() method.


extension NSRegularExpression {
 func matches(_ string: String) -> Bool {
  let range = NSRange(location: 0, length: string.utf16.count)
  return firstMatch(in: string, options: [], range: range) != nil
 }
} 

If you combine these two extensions, you can create and check regular expressions much more easily:


let regex = NSRegularExpression("[a-z]at")
regex.matches("hat")

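We can take this a step further by using operator overloading to make Swift's pattern-matching operator, ~=, work with regular expressions:


extension String {
 static func ~= (lhs: String, rhs: String) -> Bool {
  // An invalid pattern simply counts as "no match".
  guard let regex = try? NSRegularExpression(pattern: rhs) else { return false }
  let range = NSRange(location: 0, length: lhs.utf16.count)
  return regex.firstMatch(in: lhs, options: [], range: range) != nil
 }
}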

With the above code, we can put any string on the left side of the operator and a regular expression on the right, all in one expression:

"hat" ~= "[a-z]at"

Note: Creating an instance of NSRegularExpression has a cost, so if you want to use a regular expression over and over again, it's best to keep the NSRegularExpression instance around and reuse it.
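A minimal sketch of that advice, with a word list of our own: the regular expression is created once and the same instance is reused for every string, rather than being rebuilt inside the loop.

```swift
import Foundation

// Create the regular expression once...
let threeLetterAtWord = try! NSRegularExpression(pattern: "[a-z]at")

// ...then reuse the same instance for every string we check.
let candidates = ["hat", "dog", "mat", "sun", "bat"]
let hits = candidates.filter { word in
    let range = NSRange(location: 0, length: word.utf16.count)
    return threeLetterAtWord.firstMatch(in: word, options: [], range: range) != nil
}

print(hits) // ["hat", "mat", "bat"]
```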

A tour of regular expression syntax

We have used [a-z] to represent any letter between "a" and "z", which is a character class in regular expressions. It lets you specify a set of letters to match, either as an explicit list of letters or as a character range.

A range does not have to cover the whole alphabet. You can use [a-t] to exclude the letters between "u" and "z". Alternatively, if you want to be specific about the letters in the class, just list them individually like this:

[csm]at

By default, regular expressions are case-sensitive, meaning that "Cat" and "Mat" will not be matched by "[a-z]at". If you want to ignore case, either use "[a-zA-Z]at" or create your NSRegularExpression object with the .caseInsensitive option.
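The two approaches are not quite equivalent, as this sketch shows. The containsMatch helper and the sample words are our own. Widening the class to [a-zA-Z] only affects the first letter, whereas the .caseInsensitive option applies to the whole pattern, including the literal "at".

```swift
import Foundation

// Hypothetical helper: true if the regex matches anywhere in the text.
func containsMatch(_ regex: NSRegularExpression, in text: String) -> Bool {
    let range = NSRange(location: 0, length: text.utf16.count)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

let caseSensitive = try! NSRegularExpression(pattern: "[a-z]at")
let widerClass = try! NSRegularExpression(pattern: "[a-zA-Z]at")
let flagged = try! NSRegularExpression(pattern: "[a-z]at", options: [.caseInsensitive])

print(containsMatch(caseSensitive, in: "Cat")) // false
print(containsMatch(widerClass, in: "Cat"))    // true
print(containsMatch(flagged, in: "Cat"))       // true

// The literal "at" is still case-sensitive unless the option is set.
print(containsMatch(widerClass, in: "CAT"))    // false
print(containsMatch(flagged, in: "CAT"))       // true
```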

Character classes are not limited to letters: you can also specify ranges of digits. The most common is [0-9] for any digit, [A-Za-z0-9] for any alphanumeric character, or [A-Fa-f0-9] for hexadecimal digits.

If you want to match a sequence of characters, you also need a concept called a quantifier. It indicates how many times something may appear.

The most common is the asterisk quantifier, *, which means match 0 or more. Quantifiers appear after the characters they modify, as follows:


ca[a-z]*d

This looks for "ca", followed by zero or more letters from "a" to "z", and finally "d". It matches "cad", "card", "camped", and so on.

In addition to *, there are two similar quantifiers: + and ?. + means "one or more", which is subtly different from the "zero or more" of *. ? means "zero or one".

These quantifiers are fundamental to regular expressions, so it is worth making sure you really understand the difference. Consider the following three regular expressions:

ca[a-z]*d
ca[a-z]+d
ca[a-z]?d

Think about what each one matches when given the string "cd" or "camped".
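If you want to check your answers, the following sketch runs the three expressions against a few sample strings of our own ("cad" and "card" are added for comparison). The containsMatch helper is hypothetical.

```swift
import Foundation

// Hypothetical helper: true if the pattern is found anywhere in the text.
func containsMatch(_ pattern: String, in text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return false }
    let range = NSRange(location: 0, length: text.utf16.count)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

let samples = ["cd", "cad", "card", "camped"]

// * allows zero or more letters between "ca" and "d".
let star = samples.filter { containsMatch("ca[a-z]*d", in: $0) }
// + requires at least one letter, so "cad" no longer matches.
let plus = samples.filter { containsMatch("ca[a-z]+d", in: $0) }
// ? allows at most one letter, so "camped" no longer matches.
let optional = samples.filter { containsMatch("ca[a-z]?d", in: $0) }

print(star)     // ["cad", "card", "camped"]
print(plus)     // ["card", "camped"]
print(optional) // ["cad", "card"]
```

None of the three matches "cd", because each of them requires the literal "ca" at the start.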

You can specify the number of matches more precisely with curly braces, { and }, if desired. For example, [a-z]{3} means "match exactly three lowercase letters".

Consider a telephone number format such as 111-1111. If you want to match the format exactly, [0-9-]+ won't work, because it accepts any run of digits and hyphens. Instead we need the regular expression [0-9]{3}-[0-9]{4}, which means three digits, followed by a hyphen, followed by four digits.

You can also specify a range with braces, which can be bounded or unbounded. For example, [a-z]{1,3} matches one, two, or three lowercase letters, and [a-z]{3,} means "match three or more".
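Here is a small sketch of these brace quantifiers in action. Because firstMatch(in:) searches anywhere in a string, the hypothetical matchesWhole helper wraps each pattern in the ^ and $ anchors, which are not covered in this article and are an assumption added here only so that each test applies to the whole string.

```swift
import Foundation

// Hypothetical helper: true only if the whole string matches the pattern.
// The ^ and $ anchors are an addition for this sketch.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: "^(?:\(pattern))$") else { return false }
    let range = NSRange(location: 0, length: text.utf16.count)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// Exact counts: three digits, a hyphen, four digits.
print(matchesWhole("[0-9]{3}-[0-9]{4}", "111-1111")) // true
print(matchesWhole("[0-9]{3}-[0-9]{4}", "1111-111")) // false

// The looser class accepts any mix of digits and hyphens.
print(matchesWhole("[0-9-]+", "1111-111"))           // true

// Bounded range: one to three lowercase letters.
print(matchesWhole("[a-z]{1,3}", "ab"))              // true
print(matchesWhole("[a-z]{1,3}", "abcd"))            // false

// Unbounded range: three or more lowercase letters.
print(matchesWhole("[a-z]{3,}", "ab"))               // false
print(matchesWhole("[a-z]{3,}", "abcdef"))           // true
```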

Finally, metacharacters are characters that have a special meaning in regular expressions. Some of the most frequently used are described here.

The most used, and most abused, is the . character. It matches any single character except a line break. For example, the regular expression c.t matches "cat" but not "cart". If you combine . with the * quantifier, it means "match zero or more characters other than line breaks", which is probably the most common regular expression you will see.

The reason .* is used so often is obvious: without having to design a specific regular expression, .* matches almost anything. The problem, however, is that being specific is the whole point of regular expressions: they let you find exactly the text you are looking for and manipulate it. Too many people rely on .* without realizing that it can introduce subtle errors into their expressions.

Using the previous phone number example, we used [0-9]{3}-[0-9]{4} to match a phone number such as 555-5555. Given that some people might write "555 5555" or "5555555", we might relax the regular expression slightly to [0-9]{3}.*[0-9]{4}.

But this creates a problem: it matches "123-4567", but also "123-4567890" and even "123-456-789012345". The .* consumes as many characters as it needs for [0-9]{3} and [0-9]{4} to match on either side of it.
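To see how much .* swallows in each case, this sketch puts it inside a capture group and prints what it consumed. The helper name textConsumedByDotStar(in:) is hypothetical, and the parentheses are added to the pattern only so the middle part can be inspected.

```swift
import Foundation

// Hypothetical helper: returns the text consumed by .* in the loose
// phone-number pattern, or nil if the pattern does not match at all.
func textConsumedByDotStar(in input: String) -> String? {
    let regex = try! NSRegularExpression(pattern: "[0-9]{3}(.*)[0-9]{4}")
    let fullRange = NSRange(location: 0, length: input.utf16.count)
    guard let match = regex.firstMatch(in: input, options: [], range: fullRange),
          let range = Range(match.range(at: 1), in: input) else { return nil }
    return String(input[range])
}

print(textConsumedByDotStar(in: "123-4567") ?? "no match")          // -
print(textConsumedByDotStar(in: "123-4567890") ?? "no match")       // -456
print(textConsumedByDotStar(in: "123-456-789012345") ?? "no match") // -456-78901
```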

So use character classes and quantifiers instead, such as [0-9]{3}[ -]*[0-9]{4}, which means three digits, followed by zero or more spaces and hyphens, followed by four digits. Or use a negated character class that matches anything other than digits, such as [0-9]{3}[^0-9]+[0-9]{4}, which accepts spaces, hyphens, slashes, and so on between the two groups, but not digits.
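The sketch below compares the two stricter patterns. The separator class here contains a space and a hyphen, as described above. As an assumption of this sketch, the hypothetical matchesWhole helper adds the ^ and $ anchors (not covered in this article) so that each pattern has to account for the entire string; without them, firstMatch(in:) could still find a valid-looking number inside a longer run of digits.

```swift
import Foundation

// Hypothetical helper: true only if the whole string matches the pattern.
// The ^ and $ anchors are an addition for this sketch.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: "^(?:\(pattern))$") else { return false }
    let range = NSRange(location: 0, length: text.utf16.count)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// Zero or more spaces or hyphens between the two groups of digits.
let spacesOrHyphens = "[0-9]{3}[ -]*[0-9]{4}"
print(matchesWhole(spacesOrHyphens, "555-5555"))          // true
print(matchesWhole(spacesOrHyphens, "555 5555"))          // true
print(matchesWhole(spacesOrHyphens, "5555555"))           // true
print(matchesWhole(spacesOrHyphens, "123-4567890"))       // false
print(matchesWhole(spacesOrHyphens, "123-456-789012345")) // false

// One or more non-digit characters between the two groups.
let anyNonDigits = "[0-9]{3}[^0-9]+[0-9]{4}"
print(matchesWhole(anyNonDigits, "555/5555"))             // true
print(matchesWhole(anyNonDigits, "555 - 5555"))           // true
print(matchesWhole(anyNonDigits, "5555555"))              // false: + needs a separator
```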
