Using htmlparser in Java to extract the content you want from HTML: a code example

  • 2020-04-01 02:58:32
  • OfStack

Over the past couple of days I needed to grab some information from other people's web pages. In the end I used htmlparser to parse the HTML.

Let's go straight to the code:

The first thing to note is the imports: the parser classes used here come from the org.htmlparser package.


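// Classes used below; the parser types all live under org.htmlparser
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.util.NodeList;

List<Mp3> mp3List = new ArrayList<Mp3>();
        try{
            // Initialize the Parser (org.htmlparser.Parser). The constructor has several overloads; here htmlStr is HTML text fetched ahead of time. A URL string or a URLConnection can also be passed in.
            Parser parser = new Parser(htmlStr);
            parser.setEncoding("utf-8");// Set the character encoding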
            // Match the div whose id is songListWrapper
            AndFilter filter =
                new AndFilter(
                    new TagNameFilter("div"),
                    new HasAttributeFilter("id","songListWrapper")
                );
              NodeList nodes = parser.parse(filter);// Collect the nodes that pass the filter
              // The indexes below follow the structure of the target page and must be adapted for other markup
              Node node = nodes.elementAt(0);
              NodeList nodesChild = node.getChildren();
              Node[] nodesArr = nodesChild.toNodeArray();
              NodeList nodesChild2 = nodesArr[1].getChildren();
              Node[] nodesArr2 = nodesChild2.toNodeArray();
              Node nodeul = nodesArr2[1];
              Node[] nodesli = nodeul.getChildren().toNodeArray();// Children of the list node; these are the entries parsed below

            
              for(int i=2;i<nodesli.length;i++){
                  //System.out.println(nodesli[i].toHtml());
                  Node tempNode =  nodesli[i];
                  // Attributes are only available on a TagNode, so build one from this node's HTML
                  TagNode tagNode = new TagNode();
                  tagNode.setText(tempNode.toHtml());
                  // claStr looks like: bb-dotimg clearfix song-item-hook {'songItem': {'sid': '113275822', 'sname': 'my requirements are not high', 'author': 'huang bo'}}
                  String claStr = tagNode.getAttribute("class");
                  if(claStr == null){
                      continue;// No class attribute, nothing to extract
                  }
                  claStr = claStr.replaceAll(" ", "");
                  if(claStr.indexOf("?")==-1){// Skip entries that contain "?"
                      // Backslashes must be doubled inside a Java string literal
                      Pattern pattern = Pattern.compile("[\\s\\wa-z\\-]+\\{'songItem':\\{'sid':'([\\d]+)','sname':'([\\s\\S]*)','author':'([\\s\\S]*)'\\}\\}");
                      Matcher matcher = pattern.matcher(claStr);
                      if(matcher.find()){
                          Mp3 mp3 = new Mp3();
                          mp3.setSid(matcher.group(1));
                          mp3.setSname(matcher.group(2));
                          mp3.setAuthor(matcher.group(3));
                          mp3List.add(mp3);
                          //for(int j=1;j<=matcher.groupCount();j++){
                              //System.out.print("   "+j+"--->"+matcher.group(j));
                          //}
                      }
                  }
                  //System.out.println(matcher.find());
              }
            }catch(Exception e){
                e.printStackTrace();
            }
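
The snippet fills a list of Mp3 objects, but the post does not show that class. A minimal sketch of what it might look like follows, assuming only the three setters the parsing code calls; the getters and toString are additions for illustration.

```java
// Hypothetical data holder: the original post does not include this class.
// Only setSid, setSname and setAuthor are implied by the parsing code;
// the getters and toString are assumptions added for convenience.
class Mp3 {
    private String sid;     // song id, e.g. "113275822"
    private String sname;   // song name
    private String author;  // artist name

    public String getSid() { return sid; }
    public void setSid(String sid) { this.sid = sid; }

    public String getSname() { return sname; }
    public void setSname(String sname) { this.sname = sname; }

    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }

    @Override
    public String toString() {
        return "Mp3{sid=" + sid + ", sname=" + sname + ", author=" + author + "}";
    }
}
```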

The above is the parsing code from my project. htmlparser is fairly simple to use and easy to get started with.

The claStr value shown in the code comment, for example bb-dotimg clearfix song-item-hook {'songItem': {'sid': '113275822', 'sname': 'my requirements are not high', 'author': 'huang bo'}}, is the content parsed from the web page.
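
To see the regular-expression step in isolation, here is a small sketch that needs only the JDK. The class name SongItemRegexDemo, the extract helper and the sample string are made up for illustration; the pattern is the one from the parsing code, written with the backslashes doubled as a Java string literal requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original project.
class SongItemRegexDemo {
    // Same expression as in the parsing code, escaped for a Java string literal.
    private static final Pattern SONG_ITEM = Pattern.compile(
        "[\\s\\wa-z\\-]+\\{'songItem':\\{'sid':'([\\d]+)','sname':'([\\s\\S]*)','author':'([\\s\\S]*)'\\}\\}");

    /** Returns {sid, sname, author}, or null when the attribute does not match. */
    static String[] extract(String classAttr) {
        if (classAttr == null) {
            return null;
        }
        // The parsing code strips every space before matching
        String compact = classAttr.replaceAll(" ", "");
        if (compact.indexOf("?") != -1) {
            return null; // entries containing "?" are skipped
        }
        Matcher matcher = SONG_ITEM.matcher(compact);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        // Made-up sample modelled on the class value quoted in the post
        String sample = "bb-dotimg clearfix song-item-hook "
            + "{'songItem': {'sid': '113275822', 'sname': 'Song Title', 'author': 'Huang Bo'}}";
        String[] fields = extract(sample);
        System.out.println(fields[0] + " / " + fields[1] + " / " + fields[2]);
    }
}
```

Because all spaces are removed before matching, spaces inside a name are lost as well: in this sample "Song Title" comes back as "SongTitle". That is harmless for names without spaces, but worth keeping in mind for other data.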

