A summary of common pitfalls in Java regular expressions

  • 2020-04-01 04:30:40
  • OfStack

I. Overview
Regular expressions are an important tool in Java for processing strings and text.
Java's regular expression support centers on two classes:
java.util.regex.Pattern    The pattern class: represents a compiled regular expression.
java.util.regex.Matcher    The matcher class: matches a Pattern against a character sequence and holds the result of that match.
(Unfortunately, the Javadoc does not give a conceptual summary of what each of these two classes is responsible for.)
Here is a simple example:


import java.util.regex.Matcher; 
import java.util.regex.Pattern; 

 
public class TestRegx { 
    public static void main(String[] args) { 
        Pattern p = Pattern.compile("f(.+?)k"); 
        Matcher m = p.matcher("fckfkkfkf"); 
        while (m.find()) { 
            String s0 = m.group(); 
            String s1 = m.group(1); 
            System.out.println(s0 + "||" + s1); 
        } 
        System.out.println("---------"); 
        m.reset("fucking!"); 
        while (m.find()) { 
            System.out.println(m.group()); 
        } 

        Pattern p1 = Pattern.compile("f(.+?)i(.+?)h"); 
        Matcher m1 = p1.matcher("finishabigfishfrish"); 
        while (m1.find()) { 
            String s0 = m1.group(); 
            String s1 = m1.group(1); 
            String s2 = m1.group(2); 
            System.out.println(s0 + "||" + s1 + "||" + s2); 
        } 

        System.out.println("---------"); 
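        // In Java source each regex backslash is doubled: \\d is the regex \d, \\2 is a back-reference to group 2.
        Pattern p3 = Pattern.compile("(19|20)\\d\\d([- /.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])"); 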
        Matcher m3 = p3.matcher("1900-01-01 2007/08/13 1900.01.01 1900 01 01 1900-01.01 1900 13 01 1900 02 31"); 
        while (m3.find()) { 
            System.out.println(m3.group()); 
        } 
    } 
}
 

Output:
fck||c
fkk||k
---------
fuck
finish||in||s
fishfrish||ishfr||s
---------
1900-01-01
2007/08/13
1900.01.01
1900 01 01
1900 02 31

Process finished with exit code 0
II. Some easily confused points
1. Java's handling of backslashes
In other languages, \\ means "insert a plain, literal backslash into the regular expression and give it no special meaning".
In Java, \\ in a string literal means "insert a regular-expression backslash", so the character that follows it has special meaning.
A. Predefined character classes
.   Any character (may or may not match line terminators)
\d  A digit: [0-9]
\D  A non-digit: [^0-9]
\s  A whitespace character: [ \t\n\x0B\f\r]
\S  A non-whitespace character: [^\s]
\w  A word character: [a-zA-Z_0-9]
\W  A non-word character: [^\w]
However, looking at the program above, it is easy to see that \d is actually written as \\d in the source.
In a Java regular expression, to match a literal \ character you must write \\\\ in the string literal, because, per the API documentation quoted below, \\ is what represents a backslash in the regular expression.
For carriage returns, line feeds, and similar characters, no extra backslash is needed. For example, a carriage return \r is simply written \r.
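The escaping rules above can be checked with a small sketch. The class and method names here (BackslashDemo, hasDigits, and so on) are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // True if the text contains at least one digit.
    static boolean hasDigits(String text) {
        // Source "\\d+" is the string \d+, which the regex engine reads as "one or more digits".
        return Pattern.compile("\\d+").matcher(text).find();
    }

    // True if the text contains a literal backslash.
    static boolean hasBackslash(String text) {
        // Source "\\\\" is the two-character string \\, which the regex engine reads as one literal backslash.
        return Pattern.compile("\\\\").matcher(text).find();
    }

    // True if the text contains a carriage return.
    static boolean hasCarriageReturn(String text) {
        // "\r" is already a CR character in the string; the regex engine matches it literally.
        return Pattern.compile("\r").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasDigits("room 101"));         // true
        System.out.println(hasBackslash("C:\\temp"));      // true
        System.out.println(hasBackslash("C:/temp"));       // false
        System.out.println(hasCarriageReturn("line\r\n")); // true
    }
}
```

Note that the input string "C:\\temp" contains a single backslash at run time; the doubling there is ordinary string-literal escaping, separate from the regex escaping.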
B. Characters
x       The character x
\\      The backslash character
\0n     The character with octal value 0n (0 <= n <= 7)
\0nn    The character with octal value 0nn (0 <= n <= 7)
\0mnn   The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh    The character with hexadecimal value 0xhh
\uhhhh  The character with hexadecimal value 0xhhhh
\t      The tab character ('\u0009')
\n      The newline (line feed) character ('\u000A')
\r      The carriage-return character ('\u000D')
\f      The form-feed character ('\u000C')
\a      The alert (bell) character ('\u0007')
\e      The escape character ('\u001B')
\cx     The control character corresponding to x
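As a rough illustration of the table above, the following sketch matches characters through their hexadecimal, Unicode, and octal escapes. EscapeDemo and its method names are hypothetical; in each case the doubled backslash passes the escape through to the regex engine rather than to the Java compiler.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Matches exactly "ABC" using three different escape forms.
    static boolean matchesAbc(String text) {
        // Hex 41 is 'A', the Unicode escape 0042 is 'B', octal 0103 is 'C'.
        return Pattern.matches("\\x41\\u0042\\0103", text);
    }

    // Matches a single tab character via the regex escape for tab.
    static boolean isTab(String text) {
        return Pattern.matches("\\t", text);
    }

    // Matches a single ESC character (code point 27) via the regex escape for escape.
    static boolean isEscapeChar(String text) {
        return Pattern.matches("\\e", text);
    }

    public static void main(String[] args) {
        System.out.println(matchesAbc("ABC"));                        // true
        System.out.println(matchesAbc("abc"));                        // false
        System.out.println(isTab("\t"));                              // true
        System.out.println(isEscapeChar(String.valueOf((char) 27))); // true
    }
}
```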
2. Matcher.find(): Attempts to find the next subsequence of the input sequence that matches the pattern. The search starts at the beginning of the input sequence; however, if the previous call to find() succeeded and the matcher has not been reset since, it starts at the first character not matched by the previous match, that is, immediately after the previously matched subsequence.
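A minimal sketch of this behaviour: each call to find() resumes after the previous match, which the start and end offsets make visible. FindDemo and findAll are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Collects "start-end:text" for every match that find() reports, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) { // each call resumes right after the previous match
            hits.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [1-3:12, 4-7:345, 8-9:6]
        System.out.println(findAll("\\d+", "a12b345c6"));
    }
}
```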
3. Matcher.matches(): Determines whether the entire character sequence matches the pattern.
When checking several strings in a row with a single Matcher object, you can use:
Matcher.reset(): resets the matcher, discarding all of its explicit state information and setting its append position to zero.
Matcher.reset(CharSequence input): resets the matcher with a new input sequence so that it can be reused.
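One way to reuse a single Matcher across several inputs might look like the sketch below. ResetDemo, checkAll, and rewinds are hypothetical names, and the date-like pattern is only an example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // Checks every input with one Matcher, swapping the input via reset(CharSequence).
    static List<Boolean> checkAll(List<String> inputs) {
        List<Boolean> results = new ArrayList<>();
        Matcher m = DATE.matcher("");
        for (String s : inputs) {
            results.add(m.reset(s).matches()); // matches() requires the whole input to match
        }
        return results;
    }

    // Shows that reset() rewinds the matcher: after exhausting find(),
    // the next find() reports the first match again.
    static boolean rewinds(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.find()) {
            return false;
        }
        int firstStart = m.start();
        while (m.find()) {
            // consume the remaining matches
        }
        m.reset();
        return m.find() && m.start() == firstStart;
    }

    public static void main(String[] args) {
        // prints [true, false, false]
        System.out.println(checkAll(Arrays.asList("2020-04-01", "2020-04-01 04:30", "20-04-01")));
        System.out.println(rewinds("2020-04-01 and 2021-05-02")); // true
    }
}
```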
4. Groups
Groups are the parts of a regular expression delimited by parentheses, and they can be referenced by number. Group numbers start at 0; the number of pairs of parentheses gives the number of groups, and groups can be nested. Group 0 is the entire expression, group 1 is the first parenthesized group (counting opening parentheses from left to right), and so on.
For example, A(B)C(D)E contains three groups: group 0 is ABCDE, group 1 is B, and group 2 is D.
A((B)C)(D)E contains four groups: group 0 is ABCDE, group 1 is BC, group 2 is B, and group 3 is D.
int groupCount(): Returns the number of capturing groups in this matcher's pattern, not counting group 0.
String group(): Returns group 0 of the previous match operation (for example, find()).
String group(int group): Returns the subsequence captured by the given group during the previous match operation. If the match succeeded but the given group did not match any part of the input sequence, null is returned.
int start(int group): Returns the start index of the subsequence captured by the given group during the previous match operation.
int end(int group): Returns the index of the last character of the subsequence captured by the given group during the previous match operation, plus one.
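The group numbering and the methods above can be tried out with a short sketch. GroupDemo and its groups method are illustrative names; the output format "n=text@start-end" is just a convenience for printing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // For the first match, lists every group as "number=text@start-end".
    static List<String> groups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) {
            // groupCount() excludes group 0, so iterate up to and including it.
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(i + "=" + m.group(i) + "@" + m.start(i) + "-" + m.end(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Nested groups are numbered by their opening parenthesis.
        // prints [0=ABCDE@0-5, 1=BC@1-3, 2=B@1-2, 3=D@3-4]
        System.out.println(groups("A((B)C)(D)E", "ABCDE"));

        // A group that took no part in the match yields null, with start and end of -1.
        // prints [0=b@1-2, 1=null@-1--1, 2=b@1-2]
        System.out.println(groups("(a)|(b)", "xb"));
    }
}
```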
5. Controlling the matching range
The trickiest method is lookingAt(); its name is confusing, so read the API documentation carefully.
start(): Returns the start index of the previous match.
end(): Returns the offset after the last character matched.
public boolean lookingAt(): Attempts to match the input sequence against the pattern, starting at the beginning of the region.
Like the matches method, this method always starts at the beginning of the region; unlike matches, it does not require the entire region to match.
If the match succeeds, more information can be obtained through the start, end, and group methods.
Returns:
true if and only if a prefix of the input sequence matches this matcher's pattern.
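To make the difference concrete, the sketch below runs matches(), lookingAt(), and find() on the same input. LookingAtDemo, compare, and prefixEnd are made-up names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookingAtDemo {
    // Reports the result of the three matching methods for the same regex and input.
    static String compare(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        boolean matches = p.matcher(input).matches();     // whole input must match
        boolean lookingAt = p.matcher(input).lookingAt(); // a prefix must match
        boolean find = p.matcher(input).find();           // any subsequence may match
        return "matches=" + matches + " lookingAt=" + lookingAt + " find=" + find;
    }

    // Returns the end offset of the matched prefix, or -1 if no prefix matches.
    static int prefixEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(compare("\\d+", "123"));      // matches=true lookingAt=true find=true
        System.out.println(compare("\\d+", "123abc"));   // matches=false lookingAt=true find=true
        System.out.println(compare("\\d+", "abc123"));   // matches=false lookingAt=false find=true
        System.out.println(prefixEnd("\\d+", "123abc")); // 3
    }
}
```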
This site has collected these easily confused points for you, but the list is not comprehensive, and you will need to keep adding to it as you study. The hardest part of regular expressions is becoming fluent at writing them, so concentrate your practice on the difficult points above.

