Java regex example: matching Chinese characters inside <a> tags in HTML

2020-05-27 05:44:24
OfStack

This article shows how to use a Java regular expression to match the Chinese characters inside <a> tags in HTML. It is shared here for reference; the details follow.

Today a friend in a chat group asked me a question about a regular expression. The input looks like this:


<a href='www.baidu.comds=id32434#comment'rewr> , 432</a>
453543
<a guhll,,l>a1 , 123 hello 123 ? </a>
<a href=id=32434#comment'ewrer> , 2</a>
<a> Text in the tag </a>

The goal is to extract the Chinese characters from every <a> tag whose content contains Chinese and whose attributes do not contain the word comment.

The solution has two steps:

1. First, match the <a> tags whose content contains Chinese and whose attributes do not contain comment;

2. Then run a second regex over each match to extract the Chinese characters.

The code is as follows:


package com.mmq.regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 * @use Match the Chinese characters inside <a> tags in HTML
 * @ProjectName stuff
 * @Author mumaoqiang
 * @FullName com.mmq.regex.MatchChineseCharacters.java
 * @JDK 1.6.0
 * @Version 1.0
 */
public class MatchChineseCharacters {
  /**
   * Returns the Chinese characters found in <a> tags whose content contains Chinese
   * and whose attributes do not contain "comment".
   * @param source the content to match
   * @return the Chinese characters inside the matching <a> tags
   */
  public static String matchChineseCharacters(String source) {
    // Step 1: match <a> tags whose attributes do not contain "comment"
    // and whose content contains at least one Chinese character.
    String reg = "<a((?!comment).)*?>([^<>]*?[\\u4e00-\\u9fa5]+[^<>]*?)+(?=</a>)";
    Pattern pattern = Pattern.compile(reg);
    // Step 2: a second regex that extracts runs of Chinese characters.
    // It is compiled once here instead of on every loop iteration.
    Pattern p1 = Pattern.compile("[\\u4e00-\\u9fa5]+");
    Matcher matcher = pattern.matcher(source);
    StringBuilder character = new StringBuilder();
    while(matcher.find()){
      String result = matcher.group();
      System.out.println(result);
      // Apply the second regex to the matched tag
      Matcher m1 = p1.matcher(result);
      while(m1.find()){
        character.append(m1.group());
      }
      //System.out.println(character.toString());
    }
    return character.toString();
  }
  public static void main(String[] args) {
    String result = matchChineseCharacters("<a href='www.baidu.comds=id32434#comment'rewr> , 432</a>453543<a guhll,,l>a1 , 123 hello 123 ? </a><a href=id=32434#comment'ewrer> , 2</a><a> Text in the tag </a>");
    System.out.println(result);
  }
}

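The output is as follows. The first two lines are the matched tags printed inside the loop, and the last line is the returned string of concatenated Chinese characters (the Chinese text appears here in translation):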


<a guhll,,l>a1 , 123 hello 123 ? 
<a> Text in the tag 
 Special how are you the text in the label

Here is an explanation:


String reg = "<a((?!comment).)*?>([^<>]*?[\\u4e00-\\u9fa5]+[^<>]*?)+(?=</a>)";

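This regex matches <a> tags whose content contains Chinese and whose attributes do not contain comment. A lookbehind (?<=...) cannot be used to keep the opening tag out of the result, because in Java a lookbehind pattern must have a bounded length, and the length of the attributes inside the <a> tag is not known in advance. Instead, ((?!comment).)*? checks the negative lookahead (?!comment) before each character it consumes, so the opening tag is rejected as soon as comment appears in it. [\\u4e00-\\u9fa5]+ matches a run of Chinese characters, and the lookahead (?=</a>) requires the closing tag to follow without including </a> in the result.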
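To see the two parts of the pattern at work separately, here is a minimal sketch. The class name LookaheadDemo, the helper findAll, and the sample HTML are hypothetical and not part of the original code. The sample contains 你好 in a tag with comment in its attributes and 世界 in a tag without it; in the code they are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Opening <a ...> tag only: (?!comment) is tested before every
    // character consumed by the dot, so "comment" can never be crossed.
    static final Pattern OPEN_TAG = Pattern.compile("<a((?!comment).)*?>");

    // The full pattern from the article, ending with the lookahead (?=</a>).
    static final Pattern FULL = Pattern.compile(
        "<a((?!comment).)*?>([^<>]*?[\\u4e00-\\u9fa5]+[^<>]*?)+(?=</a>)");

    // Collects every match of the pattern in the source string.
    static List<String> findAll(Pattern pattern, String source) {
        List<String> hits = new ArrayList<String>();
        Matcher m = pattern.matcher(source);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not the article's original data.
        String html = "<a href='x#comment'>\u4f60\u597d</a>"
                    + "<a id='k'>\u4e16\u754c 42</a>";
        // Only the second opening tag survives the negative lookahead.
        System.out.println(findAll(OPEN_TAG, html));
        // The match keeps the opening tag and content but stops before </a>.
        System.out.println(findAll(FULL, html));
    }
}
```

With this input, the first call finds only the opening tag `<a id='k'>`, and the second finds that tag followed by its content up to, but not including, `</a>`. This is why the matched strings in the output above still begin with the opening tag and need a second regex to isolate the Chinese characters.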

That solves the problem. The same approach is easy to adapt if you want to match specific content inside a specific tag. If you know a better regex, please leave a comment so we can learn from each other.
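As one possible way to adapt the approach to other tags and other content, the sketch below takes the tag name, the excluded word, and the content regex as parameters. TagContentMatcher and extract are hypothetical names, not part of the original code. The sketch also makes two changes that the original does not: it captures the text between the tags in a group instead of using a trailing lookahead, and it adds \b after the tag name so that <a does not also match tags such as <abbr>. Like the original, it assumes the tag content contains no nested tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagContentMatcher {
    /**
     * Returns every piece of text matching contentRegex inside elements named
     * tag whose opening tag does not contain excludedWord (assumed non-empty).
     */
    static List<String> extract(String source, String tag,
                                String excludedWord, String contentRegex) {
        String t = Pattern.quote(tag);
        String w = Pattern.quote(excludedWord);
        // Group 1 captures the text between the opening and closing tag.
        Pattern outer = Pattern.compile(
            "<" + t + "\\b(?:(?!" + w + ")[^<>])*>([^<>]*)</" + t + ">");
        Pattern inner = Pattern.compile(contentRegex);

        List<String> found = new ArrayList<String>();
        Matcher m = outer.matcher(source);
        while (m.find()) {
            // Second pass: scan only the captured content.
            Matcher c = inner.matcher(m.group(1));
            while (c.find()) {
                found.add(c.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not the article's original data.
        String html = "<a href='x#comment'>\u4f60\u597d</a>"
                    + "<a id='k'>\u4e16\u754c 42</a>"
                    + "<abbr>\u4e2d</abbr>";
        // Chinese characters in <a> tags without "comment".
        System.out.println(extract(html, "a", "comment", "[\\u4e00-\\u9fa5]+"));
        // Digits in the same tags.
        System.out.println(extract(html, "a", "comment", "\\d+"));
    }
}
```

Because the captured group already excludes both tags, the second regex never sees the attributes, so letters or digits inside the opening tag cannot leak into the result.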

PS: here are two handy online regular expression tools for reference:

JavaScript regular expression online testing tool:
http://tools.ofstack.com/regex/javascript

Online regular expression generation tool:
http://tools.ofstack.com/regex/create_reg

I hope this article helps you with your Java programming.