Java regex example: matching Chinese characters in <a> tags in HTML
- 2020-05-27 05:44:24
- OfStack
This article shows how to use a Java regular expression to match the Chinese characters inside <a> tags in HTML. The details are shared below for reference:
Today a friend in a chat group asked me a regular expression question about the following input:
<a href='www.baidu.comds=id32434#comment'rewr> , 432</a>
453543
<a guhll,,l>a1 , 123 hello 123 ? </a>
<a href=id=32434#comment'ewrer> , 2</a>
<a> Text in the tag </a>
The task is to extract the Chinese characters from the <a> tags whose content contains Chinese and whose attributes do not contain comment.
The solution is as follows:
1. First, match the <a> tags whose attributes do not contain comment;
2. Then run a second match on each result to extract the Chinese characters.
The code is as follows:
package com.mmq.regex;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @use Match the Chinese characters inside <a> tags in HTML
 * @ProjectName stuff
 * @Author mumaoqiang
 * @FullName com.mmq.regex.MatchChineseCharacters.java
 * @JDK 1.6.0
 * @Version 1.0
 */
public class MatchChineseCharacters {
    /**
     * Finds the <a> tags whose content contains Chinese and whose attributes
     * do not contain "comment", and returns the Chinese characters in them.
     * @param source The content to match
     * @return The Chinese characters found in the matching <a> tags
     */
    public static String matchChineseCharacters(String source) {
        // Step 1: match the <a> tags that contain Chinese but not "comment"
        String reg = "<a((?!comment).)*?>([^<>]*?[\\u4e00-\\u9fa5]+[^<>]*?)+(?=</a>)";
        Pattern pattern = Pattern.compile(reg);
        Matcher matcher = pattern.matcher(source);
        // Step 2 pattern: a run of Chinese characters (compiled once, outside the loop)
        Pattern p1 = Pattern.compile("[\\u4e00-\\u9fa5]+");
        StringBuilder character = new StringBuilder();
        while (matcher.find()) {
            String result = matcher.group();
            System.out.println(result);
            // Run the second regex on each result to extract the Chinese characters
            Matcher m1 = p1.matcher(result);
            while (m1.find()) {
                character.append(m1.group());
            }
        }
        return character.toString();
    }

    public static void main(String[] args) {
        String result = matchChineseCharacters("<a href='www.baidu.comds=id32434#comment'rewr> , 432</a>453543<a guhll,,l>a1 , 123 hello 123 ? </a><a href=id=32434#comment'ewrer> , 2</a><a> Text in the tag </a>");
        System.out.println(result);
    }
}
The output results are as follows:
<a guhll,,l>a1 , 123 hello 123 ?
<a> Text in the tag
Special how are you the text in the label
Here is an explanation:
String reg = "<a((?!comment).)*?>([^<>]*?[\\u4e00-\\u9fa5]+[^<>]*?)+(?=</a>)";
This pattern matches the <a> tags whose content contains Chinese and whose attributes do not contain comment. A lookbehind (?<=...) cannot be used for the attribute check, because a lookbehind must have a bounded length and the attributes inside an <a> tag can be of any length; the negative lookahead (?!comment) is applied before each attribute character instead. [\\u4e00-\\u9fa5]+ matches a run of Chinese characters. The trailing (?=</a>) is a lookahead, so the closing </a> tag must be present but is not included in the result.
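To make the two lookaheads easier to see in isolation, here is a minimal sketch. It is not the code from the question: the class name LookaheadDemo, the method matchTags, and the sample HTML are made up for illustration, and the pattern is a slightly stricter variant of the one above. In this variant [^<>] replaces the dot in the attribute part and \b follows <a, on the assumption that the attribute scan should stay inside one opening tag. With the original dot, an input such as <a>english</a><a>中文</a> would be matched from the first <a onward. The Chinese text is written with \u escapes (\u4f60\u597d is 你好, \u4e2d\u6587 is 中文) so that the source file stays ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Assumed variant of the article's pattern:
    //   (?!comment)[^<>]  each attribute character must not start the word "comment"
    //                     and must not leave the opening tag
    //   (?=</a>)          the closing tag is required but not consumed
    static final Pattern TAG = Pattern.compile(
        "<a\\b((?!comment)[^<>])*>([^<>]*[\\u4e00-\\u9fa5][^<>]*)(?=</a>)");

    // Returns the matched text of every qualifying <a> tag (opening tag plus content).
    static List<String> matchTags(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input: the first tag has "comment" in its attributes,
        // the second contains Chinese, the third contains no Chinese.
        String html = "<a href='x#comment'>\u4f60\u597d</a>"
                    + "<a id='k'>abc\u4e2d\u6587 1</a>"
                    + "<a>plain</a>";
        // Only the second tag qualifies; its match stops before </a>.
        System.out.println(matchTags(html));
    }
}
```

Only the second tag is returned, and the matched text ends just before </a>, which is the effect of the (?=</a>) lookahead.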
That solves the problem. The approach is also easy to adapt if you want to match specific content inside a specific tag. If you know a better regular expression, please leave a comment so we can learn from each other.
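As one possible way to generalise the idea to "specified content in a specified tag", the sketch below takes the tag name, the word to exclude from the attributes, and the content regex as parameters. The class TagTextExtractor and the method extract are hypothetical names, not part of the original code, and the outer pattern again uses [^<>] in place of the dot as an assumption. It only handles tags whose content is plain text with no nested tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {
    /**
     * Hypothetical helper: collects every run matching contentRegex from the text
     * of <tag> elements whose attributes do not contain excludedWord.
     */
    static String extract(String html, String tag, String excludedWord, String contentRegex) {
        // Pattern.quote keeps the tag name and the excluded word literal.
        String t = Pattern.quote(tag);
        Pattern outer = Pattern.compile(
            "<" + t + "\\b((?!" + Pattern.quote(excludedWord) + ")[^<>])*>"
            + "([^<>]*)(?=</" + t + ">)");
        Pattern inner = Pattern.compile(contentRegex);

        StringBuilder sb = new StringBuilder();
        Matcher m = outer.matcher(html);
        while (m.find()) {
            // Group 2 is the text between the opening tag and the closing tag.
            Matcher m2 = inner.matcher(m.group(2));
            while (m2.find()) {
                sb.append(m2.group());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input; \u4f60\u597d and \u4e2d\u6587 are Chinese words.
        String html = "<a href='x#comment'>\u4f60\u597d</a>"
                    + "<a id='k'>abc\u4e2d\u6587 1</a>"
                    + "<span>12</span>";
        // Chinese runs from <a> tags without "comment" in their attributes
        System.out.println(extract(html, "a", "comment", "[\\u4e00-\\u9fa5]+"));
        // Digit runs from <span> tags
        System.out.println(extract(html, "span", "comment", "\\d+"));
    }
}
```

Splitting the work into an outer pattern (which tags qualify) and an inner pattern (what to pull out of them) mirrors the two steps of the solution above, so either half can be changed on its own.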
PS: here are two handy regular expression tools for reference:
JavaScript regular expression online testing tool:
http://tools.ofstack.com/regex/javascript
Online regular expression generation tool:
http://tools.ofstack.com/regex/create_reg
I hope this article helps you with Java programming.