Detailed Explanation of Java String Regular Expressions

  • 2021-11-10 09:34:37
  • OfStack

Directory
1. Rule Table
    1. Characters
    2. Character Classes
    3. Boundary Matchers
    4. Logical Operators
    5. Quantifiers
2. Pattern Class
    1. Obtaining a Pattern instance: the compile method
    2. split method
    3. Match flag parameters in Pattern
3. Matcher Class
Summary

In day-to-day Java back-end development you inevitably have to parse data fields, which means working with strings, and that in turn means regular expressions. In Java these are handled by the Pattern and Matcher classes in the java.util.regex package. This post summarizes the rules and the common usage of those classes. The main reference is the 4th edition of "Thinking in Java".

Let's lead into the topic with a question: how do you check whether a string begins with an uppercase letter and ends with a period?


String len="^[A-Z].*[\\.]$";
System.out.println("Taaa.".matches(len));//true
System.out.println("taaa.".matches(len));//false
System.out.println("Taaa".matches(len));//false

^[A-Z].*[\\.]$ is the regular expression used to match the string. ^ stands for the start of a line, [A-Z] matches any single letter from A to Z, .* matches any sequence of characters (including none), [\\.] matches a literal period, and $ stands for the end of a line.

1. Rule Table

The rule table defines how regular expressions are written. There is no need to memorize it deliberately; look the constructs up when you need them, and they will stick naturally with use.

1. Characters

B        The specific character B
\xhh     The character with hexadecimal value 0xhh
\uhhhh   The Unicode character with hexadecimal value 0xhhhh
\t       Tab
\n       Newline
\r       Carriage return
\f       Form feed
\e       Escape
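
As a quick check of the character escapes above, here is a minimal sketch. The class name CharacterEscapes and the helper matches are made up for illustration. Note that inside a Java string literal the backslash itself has to be doubled.

```java
import java.util.regex.Pattern;

public class CharacterEscapes {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("B", "B"));         // literal character: true
        System.out.println(matches("\\x42", "B"));     // hex 42 is 'B': true
        System.out.println(matches("\\u0042", "B"));   // code point 0042 is 'B': true
        System.out.println(matches("a\\tb", "a\tb"));  // tab between a and b: true
        System.out.println(matches("\\e", "\u001B"));  // escape character: true
    }
}
```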

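2. Character classes

.              Any character
[abc]          Any of the characters a, b, or c (same as a|b|c)
[^abc]         Any character except a, b, and c (negation)
[a-zA-Z]       Any character from a to z or from A to Z (range)
[abc[hij]]     Any of a, b, c, h, i, or j (same as a|b|c|h|i|j) (union)
[a-z&&[hij]]   Any of h, i, or j (intersection)
\s             A whitespace character (space, tab, newline, form feed, carriage return)
\S             A non-whitespace character
\d             A digit [0-9]
\D             A non-digit
\w             A word character [a-zA-Z_0-9]
\W             A non-word character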
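
The following sketch exercises several of the character classes above. The class and helper names are illustrative only, and each expression is matched against the whole input.

```java
import java.util.regex.Pattern;

public class CharacterClasses {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[abc]", "b"));            // true
        System.out.println(matches("[^abc]", "b"));           // false (negation)
        System.out.println(matches("[abc[hij]]", "h"));       // true (union)
        System.out.println(matches("[a-z&&[hij]]", "k"));     // false (intersection)
        System.out.println(matches("\\w+", "user_01"));       // true, underscore is a word character
        System.out.println(matches("\\d{3}\\s\\D", "123 x")); // true
    }
}
```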

3. Boundary Matchers

^    Beginning of a line
$    End of a line
\b   Word boundary
\B   Non-word boundary
\G   End of the previous match
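
Boundary matchers consume no characters; they only constrain where a match may occur. A small sketch (the helper count is hypothetical) that counts occurrences of "cat" under different boundary conditions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryMatchers {
    // Hypothetical helper: how many times the regex is found in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "cat concat cat.";
        System.out.println(count("cat", input));         // 3: every occurrence
        System.out.println(count("\\bcat\\b", input));   // 2: whole words only
        System.out.println(count("\\Bcat", input));      // 1: only inside "concat"
        System.out.println(count("^cat", input));        // 1: at the start of the input
        System.out.println(count("cat$", input));        // 0: the input ends with a period
    }
}
```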

4. Logical Operators

XY    X followed by Y
X|Y   X or Y
(X)   A capturing group
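
The three operators can be combined in one expression. In this sketch (class and method names are made up for illustration) the pattern gr(a|e)y uses concatenation, alternation, and a capturing group, and group(1) returns whichever alternative matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogicalOperators {
    // Hypothetical helper: returns the vowel captured by group 1,
    // or null if the word is neither "gray" nor "grey".
    static String vowel(String word) {
        Matcher m = Pattern.compile("gr(a|e)y").matcher(word);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(vowel("gray")); // a
        System.out.println(vowel("grey")); // e
        System.out.println(vowel("groy")); // null
    }
}
```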

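5. Quantifiers

Greedy    Reluctant   Possessive   Matches
X?        X??         X?+          X, once or not at all
X*        X*?         X*+          X, zero or more times
X+        X+?         X++          X, one or more times
X{n}      X{n}?       X{n}+        X, exactly n times
X{n,}     X{n,}?      X{n,}+       X, at least n times
X{n,m}    X{n,m}?     X{n,m}+      X, at least n but not more than m times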
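
The three columns differ in how much input they take and whether they give any of it back. The sketch below (the helper firstMatch is hypothetical) runs the greedy, reluctant, and possessive forms of the same expression against the same input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierModes {
    // Hypothetical helper: the first match of the regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: takes as much as possible, then backs off to find the final '>'.
        System.out.println(firstMatch("<.+>", input));   // <a><b>
        // Reluctant: takes as little as possible.
        System.out.println(firstMatch("<.+?>", input));  // <a>
        // Possessive: takes everything and never backs off, so '>' cannot match.
        System.out.println(firstMatch("<.++>", input));  // null
    }
}
```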

2. Pattern Class

1. Obtaining a Pattern instance: the compile method

The following are some important methods from the source code of the Pattern class:


private Pattern(String p, int f) {
    pattern = p;
    flags = f;

    // to use UNICODE_CASE if UNICODE_CHARACTER_CLASS present
    if ((flags & UNICODE_CHARACTER_CLASS) != 0)
        flags |= UNICODE_CASE;

    // Reset group index count
    capturingGroupCount = 1;
    localCount = 0;

    if (!pattern.isEmpty()) {
        try {
            compile();
        } catch (StackOverflowError soe) {
            throw error("Stack overflow during pattern compilation");
        }
    } else {
        root = new Start(lastAccept);
        matchRoot = lastAccept;
    }
}

As you can see from the Pattern class code, the constructor is private, so you cannot use new to create a Pattern instance.

Reading further through the Pattern class shows that instances are obtained through the static method compile.


public static Pattern compile(String regex) {
    return new Pattern(regex, 0);
}

The compile method produces a Pattern instance. You then pass the string to be matched to that instance's matcher() method, which returns a Matcher instance with many methods of its own. In the same way, a Matcher is not created with new but through the matcher method:


public Matcher matcher(CharSequence input) {
    if (!compiled) {
        synchronized(this) {
            if (!compiled)
                compile();
        }
    }
    Matcher m = new Matcher(this, input);
    return m;
}

public static boolean matches(String regex, CharSequence input) {
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    return m.matches();
}

So the question at the beginning of this post can also be solved in two other ways:

Approach 1:


Pattern len = Pattern.compile("^[A-Z].*[\\.]$");
Matcher matcher = len.matcher("Taaa.");
boolean matches = matcher.matches();
System.out.println(matches);

As the source code shows, Pattern.matches is a static method, so there is no need to create the Pattern and Matcher instances yourself. You can simply write:

Approach 2:


System.out.println(Pattern.matches("^[A-Z].*[\\.]$","Taaa."));
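
One practical consequence of the source shown above: Pattern.matches calls Pattern.compile on every invocation. When the same expression is applied to many strings, it is usually preferable to compile it once and reuse it. A minimal sketch, assuming a hypothetical class SentenceCheck:

```java
import java.util.regex.Pattern;

public class SentenceCheck {
    // Compiled once and reused; Pattern instances are immutable and
    // safe to share between threads.
    private static final Pattern SENTENCE = Pattern.compile("^[A-Z].*[\\.]$");

    // Hypothetical helper: starts with an uppercase letter, ends with a period.
    static boolean isSentence(String s) {
        // A new Matcher is created per call; only the Pattern is shared.
        return SENTENCE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSentence("Taaa."));  // true
        System.out.println(isSentence("taaa."));  // false
        System.out.println(isSentence("Taaa"));   // false
    }
}
```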

2. split method

The following is the source code of the split method:


public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

Parameters:

input: the character sequence to be split.

limit: the result threshold.

The method splits the input sequence around matches of this pattern.

Effect of the limit parameter:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array.

If limit is greater than zero, the pattern is applied at most limit - 1 times, the length of the array is no greater than limit, and the last entry of the array contains all input beyond the last matched delimiter.

If limit is negative, the pattern is applied as many times as possible and the array can be of any length.

If limit is zero, the pattern is applied as many times as possible, the array can be of any length, and trailing empty strings are discarded.

Worked example: take input = "boo:and:foo" and the delimiter pattern "o". The pattern can be applied at most 4 times, so the array can have at most 5 elements.

1. When limit = -2, the number of times the pattern is applied is unlimited and the array can be of any length. In practice the pattern is applied 4 times, the array has length 5, and the result is {"b", "", ":and:f", "", ""}.

2. When limit = 2, the pattern is applied at most once, the length of the array is no greater than 2, and the second element contains all input beyond the last matched delimiter. In practice the pattern is applied once, the array has length 2, and the result is {"b", "o:and:foo"}.

3. When limit = 7, the pattern is applied at most 6 times and the length of the array is no greater than 7. In practice the pattern is applied 4 times, the array has length 5, and the result is {"b", "", ":and:f", "", ""}.

4. When limit = 0, the number of times the pattern is applied is unlimited, the array can be of any length, and trailing empty strings are discarded. In practice the pattern is applied 4 times, the array has length 3, and the result is {"b", "", ":and:f"}.

Here m.find() works like an iterator, stepping through the matches in the input string one at a time.
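
The four limit cases can be checked directly. This sketch assumes the input string is "boo:and:foo" (no spaces) and the delimiter pattern "o"; the class name SplitLimit and its helper are illustrative.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitLimit {
    // Hypothetical helper: thin wrapper around Pattern.split(input, limit).
    static String[] split(String input, String regex, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        String input = "boo:and:foo";
        for (int limit : new int[] {-2, 2, 7, 0}) {
            System.out.println(limit + " -> " + Arrays.toString(split(input, "o", limit)));
        }
        // -2 -> [b, , :and:f, , ]
        // 2 -> [b, o:and:foo]
        // 7 -> [b, , :and:f, , ]
        // 0 -> [b, , :and:f]
    }
}
```

String.split(regex, limit) follows the same limit rules, so the same results apply when splitting directly on a String.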

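3. Match flag parameters in Pattern

Compile flag   Effect
Pattern.CANON_EQ   Enables canonical equivalence. Two characters are considered to match if and only if their full canonical decompositions match. By default, canonical equivalence is not taken into account.
Pattern.CASE_INSENSITIVE   Enables case-insensitive matching. By default, case-insensitive matching applies only to characters in the US-ASCII character set. To match Unicode characters case-insensitively, combine this flag with UNICODE_CASE.
Pattern.COMMENTS   Permits whitespace and comments in the pattern. In this mode, whitespace in the regular expression is ignored (this means literal spaces, tabs, carriage returns, and so on in the expression, not "\s"). Comments start with # and run to the end of the line. This mode can also be enabled with the embedded flag (?x).
Pattern.DOTALL   Enables dotall mode. In this mode, the expression '.' matches any character, including a line terminator. By default, '.' does not match line terminators.
Pattern.MULTILINE   Enables multiline mode. In this mode, '^' and '$' match the beginning and end of each line respectively. In addition, '^' still matches the beginning of the string and '$' also matches the end of the string. By default, these two expressions match only the beginning and end of the entire string.
Pattern.UNICODE_CASE   Enables Unicode-aware case folding. In this mode, if CASE_INSENSITIVE is also set, case-insensitive matching is applied to Unicode characters. By default, case-insensitive matching applies only to the US-ASCII character set.
Pattern.UNIX_LINES   Enables Unix lines mode. In this mode, only '\n' is recognized as a line terminator in the behavior of '.', '^', and '$'.

Of these, Pattern.CASE_INSENSITIVE, Pattern.COMMENTS, and Pattern.MULTILINE are the three most commonly used.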
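
Flags are passed as the second argument of Pattern.compile(String, int) and are combined with the bitwise OR operator. The sketch below (class, constant, and helper names are made up for illustration) counts how many lines of a sample text start with "java" under different flag combinations.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagDemo {
    // Sample text chosen for this illustration: three lines, three spellings.
    static final String TEXT = "java is fun\nJava is typed\nJAVA is everywhere";

    // Hypothetical helper: number of matches of the regex compiled with the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        int both = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(count("^java", 0, TEXT));                 // 1: start of input only
        System.out.println(count("^java", Pattern.MULTILINE, TEXT)); // 1: other lines differ in case
        System.out.println(count("^java", both, TEXT));              // 3: every line
        System.out.println(count("(?im)^java", 0, TEXT));            // 3: same flags, embedded form
        // COMMENTS: the spaces and the trailing # comment are ignored.
        System.out.println(count("^ java   # word at line start",
                both | Pattern.COMMENTS, TEXT));                     // 3
    }
}
```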

3. Matcher Class

Looking at the source code of the Matcher constructor, you can see that it assigns the reference to the Pattern object to the field parentPattern and the target string to the field text, and that it creates the arrays groups and locals.

The groups array is the storage used by capturing groups. It holds the first and last index of each capture group in the current match.

groups[0] stores the first index of group 0, groups[1] stores the last index of group 0, groups[2] stores the first index of group 1, groups[3] stores the last index of group 1, and so on.

The Matcher class has a great many methods, and this post does not walk through the source of each one; I may study them in more depth later if time allows. The commonly used methods are summarized below:
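
The groups array is internal to Matcher, but the same pairs of values can be observed through the public methods start(int) and end(int). The sketch below (the class GroupSpans and the helper spans are hypothetical) lays them out in the same order as described above: first and last of group 0, then first and last of group 1, and so on.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupSpans {
    // Hypothetical helper: returns {start(0), end(0), start(1), end(1), ...}
    // for the first match, or an empty array if there is no match.
    static int[] spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new int[0];
        }
        int[] out = new int[(m.groupCount() + 1) * 2];
        for (int g = 0; g <= m.groupCount(); g++) {
            out[2 * g] = m.start(g);
            out[2 * g + 1] = m.end(g);
        }
        return out;
    }

    public static void main(String[] args) {
        // Group 0 is "12-345", group 1 is "12", group 2 is "345".
        System.out.println(Arrays.toString(spans("(\\d+)-(\\d+)", "id 12-345")));
        // [3, 9, 3, 5, 6, 9]
    }
}
```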

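Method   Description
public int groupCount()   Returns the number of capturing groups in this matcher's pattern.
public String group()   Equivalent to calling group(int group) with the argument 0.
public String group(int group)   Returns the substring captured by the given group during the current match.
public int start()   Returns the index in the target string of the first character of the current match.
public int start(int group)   Returns the index in the target string of the first character of the substring captured by the given group in the current match.
public int end()   Returns the index in the target string just past the last character of the current match.
public int end(int group)   Returns the index in the target string just past the last character of the substring captured by the given group in the current match.
public boolean find()   Looks for the next matching substring in the target string.
public boolean find(int start)   Resets this matcher and then looks for the next matching substring, starting at the given index.
public int regionStart()   Reports the start index of this matcher's region.
public int regionEnd()   Reports the end index (exclusive) of this matcher's region.
public Matcher region(int start,int end)   Sets the limits of this matcher's region. Resets the matcher and then sets the region to start at the index given by start and end at the index given by end (the character at index end is not included).
public boolean lookingAt()   Attempts to match starting at the beginning of the target string. Returns true only if a match is found and it begins at the first character of the target string; the match does not have to cover the whole string.
public boolean matches()   Returns true only if the entire target string matches.
public Matcher appendReplacement(StringBuffer sb, String replacement)   Appends to the StringBuffer the text between the end of the previous match and the start of the current match, followed by the replacement for the current match, and returns this matcher.
public StringBuffer appendTail(StringBuffer sb)   Appends the text remaining after the last match to the StringBuffer.
public String replaceAll(String replacement)   Replaces every matching substring with the given string.
public String replaceFirst(String replacement)   Replaces the first matching substring with the given string.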
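
A short sketch tying several of these methods together. The class name MatcherMethods and the helpers findAll and doubleNumbers are made up for illustration; every Matcher call used is one of the methods listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherMethods {
    // Collects every match of the regex using find() and group().
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Doubles every integer in the input using appendReplacement and appendTail.
    static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            m.appendReplacement(sb, String.valueOf(doubled));
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1 b22 c333"));          // [1, 22, 333]
        System.out.println(doubleNumbers("2 apples and 10 pears"));  // 4 apples and 20 pears

        Matcher m = Pattern.compile("\\d+").matcher("123abc");
        System.out.println(m.lookingAt()); // true: the input starts with digits
        System.out.println(m.matches());   // false: the whole input is not digits

        Pattern digit = Pattern.compile("\\d");
        System.out.println(digit.matcher("a1b2").replaceAll("#"));   // a#b#
        System.out.println(digit.matcher("a1b2").replaceFirst("#")); // a#b2
    }
}
```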

Summary

That is all for this article. I hope it is helpful to you.

