Crawler4j crawls pages using jsoup to parse HTML

  • 2020-04-01 03:15:14
  • OfStack

Crawler4j works well for page crawling out of the box, and because it relies on jsoup for HTML parsing, anyone comfortable with jQuery-style selectors can work with the parsed results easily. However, crawler4j does not decode the response using the page's declared encoding, so non-UTF-8 pages come back as garbled text, which is annoying. After some frustration I came across an older blog post showing that the problem can be solved by changing how contentData is decoded in Page.load(). With that change, the garbled-text problems were gone.


// In crawler4j's Page class (edu.uci.ics.crawler4j.crawler.Page); uses
// java.nio.charset.Charset, org.apache.http.Header, org.apache.http.HttpEntity,
// org.apache.http.entity.ContentType and org.apache.http.util.EntityUtils.
public void load(HttpEntity entity) throws Exception {
    contentType = null;
    Header type = entity.getContentType();
    if (type != null) {
        contentType = type.getValue();
    }

    contentEncoding = null;
    Header encoding = entity.getContentEncoding();
    if (encoding != null) {
        contentEncoding = encoding.getValue();
    }

    // Fall back to UTF-8 when the response does not declare a charset
    Charset charset = ContentType.getOrDefault(entity).getCharset();
    if (charset != null) {
        contentCharset = charset.displayName();
    } else {
        contentCharset = "utf-8";
    }

    // The original code stored the raw bytes without decoding:
    // contentData = EntityUtils.toByteArray(entity);
    // The modified code decodes the body as GBK before handing it on:
    contentData = EntityUtils.toString(entity, Charset.forName("gbk")).getBytes();
}
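
With the encoding fixed, the HTML that crawler4j delivers can be parsed with jsoup as usual. The following is a minimal sketch of such a crawler; the class name and the selectors are my own illustrative choices, not part of crawler4j, and the visit() override and HtmlParseData access are the standard crawler4j way of reaching the page HTML.

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class GbkPageCrawler extends WebCrawler {

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlData = (HtmlParseData) page.getParseData();
            // With the load() fix above, getHtml() no longer returns garbled text for GBK pages
            Document doc = Jsoup.parse(htmlData.getHtml(), page.getWebURL().getURL());
            System.out.println("Title: " + doc.title());
            // jQuery-style selectors are what make jsoup familiar to jQuery users
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("abs:href") + " -> " + link.text());
            }
        }
    }
}

Note that the fix above hardcodes GBK, which suits crawls of GBK-encoded sites; a crawl over pages with mixed encodings would need to choose the charset per response instead.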

