A Java method for deleting comments from HTML in bulk

  • 2020-04-01 03:10:40
  • OfStack

There are many ways to delete the comments in HTML text. Here is a method I wrote myself, kept as a note; readers who need it can use it as a reference.

Comments in HTML text have several features:
1. They come in pairs: where there is a start tag, there must be an end tag.
2. Comment tags are not nested: the next comment tag after a start tag (hereinafter <!--) must be its corresponding end tag (hereinafter -->).
3. There may be multiple pairs of comment tags in a line.
4. Comments can also wrap across lines.
Roughly, the following situations occur:


<html>  
<!--This is a head-->  
<head>A Head</head>  
<!--This is   
   a div -->  
<div>A Div</div>  
<!--This is   
    a span--><!--span in   
    a div--><div>a div</div>  
<div><span>A span</span></div>  

<!--This is a   
        span--><div>A div</div><!--span in a div-->  
<div><span>A span</span></div>  
</html>  

Ideas:
1. Read one line of text at a time.
2. If the line contains both <!-- and -->, and <!-- comes before -->, delete the comment content between the two tags and keep the rest.
3. If the line contains both <!-- and -->, but <!-- comes after -->, keep the content between the two tags and record that a <!-- tag has been encountered.
4. If the line contains only <!--, keep the content before the tag and record that a <!-- tag has been encountered.
5. If the line contains only -->, keep the content after the tag and record that the --> tag has been encountered.
6. Repeat steps 2, 3, 4, and 5 on the rest of the line.
7. Save the remaining content.
8. Read the next line.
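
To make the four cases in steps 2 to 5 concrete, here is a minimal sketch that only classifies a line by which comment tags it contains and in what order. The class name CommentCaseDemo and the enum LineCase are hypothetical names chosen for illustration; the sketch looks only at the first occurrence of each tag and removes nothing.

```java
class CommentCaseDemo {
    // One value per case from the list of ideas, plus "no tag at all"
    enum LineCase { PAIR, END_BEFORE_START, START_ONLY, END_ONLY, NO_TAG }

    // Classifies a line by the position of the first "<!--" and the first "-->"
    static LineCase classify(String line) {
        int start = line.indexOf("<!--");
        int end = line.indexOf("-->");
        if (start == -1 && end == -1) return LineCase.NO_TAG;
        if (end == -1) return LineCase.START_ONLY;        // step 4: content <!--
        if (start == -1) return LineCase.END_ONLY;        // step 5: -->content
        return start < end ? LineCase.PAIR                // step 2: <!-- comment -->
                           : LineCase.END_BEFORE_START;   // step 3: -->content<!--
    }

    public static void main(String[] args) {
        String[] samples = {
            "<!--This is a head-->",
            "    a span--><!--span in",
            "<!--This is",
            "    a div--><div>a div</div>",
            "<div>A Div</div>"
        };
        for (String sample : samples) {
            System.out.println(classify(sample) + " : " + sample);
        }
    }
}
```

Because a line can hold several tags, one classification is not enough on its own: after the matching case is handled, the rest of the line has to be examined again, which is what step 6 describes.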

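    import java.io.BufferedReader;
    import java.io.IOException;
    //Assumption: StringUtils is taken to be the Apache Commons Lang 3 class
    import org.apache.commons.lang3.StringUtils;

    public class HtmlCommentHandler {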
        
        private static class HtmlCommentDetector {
            private static final String COMMENT_START = "<!--";
            private static final String COMMENT_END = "-->";
            //Whether the line contains a complete comment: a start tag "<!--" followed by an end tag "-->"
            private static boolean isCommentLine(String line) {
                return containsCommentStartTag(line) && containsCommentEndTag(line) 
                    && line.indexOf(COMMENT_START) < line.indexOf(COMMENT_END);
            }
            //Whether the line contains the start tag of a comment
            private static boolean containsCommentStartTag(String line) {
                return StringUtils.isNotEmpty(line) && line.indexOf(COMMENT_START) != -1;
            }
            //Whether the line contains the end tag of a comment
            private static boolean containsCommentEndTag(String line) {
                return StringUtils.isNotEmpty(line) && line.indexOf(COMMENT_END) != -1;
            }
            
            //Deletes every complete comment "<!-- ... -->" in the line and returns the rest
            private static String deleteCommentInLine(String line) {
                while (isCommentLine(line)) {
                    int start = line.indexOf(COMMENT_START);
                    int end = line.indexOf(COMMENT_END) + COMMENT_END.length();
                    //Keep the text before the start tag and the text after the end tag
                    line = line.substring(0, start) + line.substring(end);
                }
                return line;
            }
            //Gets the content before the comment start tag
            private static String getBeforeCommentContent(String line) {
                if (!containsCommentStartTag(line))
                    return line;
                return line.substring(0, line.indexOf(COMMENT_START));
            }
            //Gets the content after the comment end tag
            private static String getAfterCommentContent(String line) {
                if (!containsCommentEndTag(line))
                    return line;
                return line.substring(line.indexOf(COMMENT_END) + COMMENT_END.length());
            }
        }

        
        //Reads the HTML line by line and returns its content with the comments removed
        public static String readHtmlContentWithoutComment(BufferedReader reader) throws IOException {
            StringBuilder builder = new StringBuilder();
            String line;
            //Whether the text being read is currently inside a comment
            boolean inComment = false;
            while ((line = reader.readLine()) != null) {
                //Keep processing while the line still contains a comment tag
                while (HtmlCommentDetector.containsCommentStartTag(line) || 
                        HtmlCommentDetector.containsCommentEndTag(line)) {
                    //Inside a comment opened earlier, only the end tag matters
                    // xxx -->content
                    if (inComment) {
                        if (!HtmlCommentDetector.containsCommentEndTag(line)) {
                            //No end tag yet: the rest of the line is comment text
                            line = "";
                            break;
                        }
                        //Set inComment to false and continue with the content after the end tag
                        inComment = false;
                        line = HtmlCommentDetector.getAfterCommentContent(line);
                    }
                    //Remove the content between comment tags that appear in pairs
                    // <!-- comment -->
                    else if (HtmlCommentDetector.isCommentLine(line)) {
                        line = HtmlCommentDetector.deleteCommentInLine(line);
                    }
                    //Otherwise a start tag opens a comment that is closed later
                    // content <!--
                    else if (HtmlCommentDetector.containsCommentStartTag(line)) {
                        //Save the content before the start tag and set inComment to true
                        builder.append(HtmlCommentDetector.getBeforeCommentContent(line));
                        inComment = true;
                        //Continue with the text after the start tag
                        line = line.substring(line.indexOf(HtmlCommentDetector.COMMENT_START)
                                + HtmlCommentDetector.COMMENT_START.length());
                    }
                    //An end tag outside any comment is ordinary text
                    else {
                        break;
                    }
                }
                //Save what is left of the line if it lies outside a comment.
                //readLine() strips line terminators, so the lines are joined without line breaks
                if (StringUtils.isNotEmpty(line) && !inComment)
                    builder.append(line);
            }
            return builder.toString();
        }
    }

Of course, there are many other ways to do this, such as deleting the comments by regular-expression matching or tracking the start and end tags with a stack.
The code above has been tested and used; hopefully it is useful to those who need it.
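
As an illustration of the regular-expression approach mentioned above, here is a minimal sketch that assumes the whole document has already been read into a single String. The class name RegexCommentStripper and the method name strip are hypothetical; only java.util.regex from the standard library is used.

```java
import java.util.regex.Pattern;

class RegexCommentStripper {
    // DOTALL lets '.' match line breaks, so comments that wrap are covered.
    // The reluctant quantifier .*? stops at the first "-->", which matches
    // the rule that comment tags are not nested.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Returns the text with every complete "<!-- ... -->" comment removed
    static String strip(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html>\n"
                + "<!--This is\n"
                + "   a div -->\n"
                + "<div>A Div</div><!--first--><!--second-->\n"
                + "</html>";
        System.out.println(strip(html));
    }
}
```

Unlike the line-by-line method, this sketch keeps the line breaks outside comments and leaves an unterminated comment untouched, but it needs the whole text in memory, so it is better suited to small documents.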

