golang web page parsing tool goquery usage

  • 2020-06-07 04:39:13
  • OfStack

Preface

This article introduces goquery, a handy web page parsing library for golang, and shares its usage for your reference and study. Without further ado, let's take a detailed look.

Java has Jsoup and Node.js has cheerio for conveniently parsing web pages; golang also has a web page parsing tool that is quite easy to use, with jQuery-like selectors.

Installation


go get github.com/PuerkitoBio/goquery

Usage

The demo from README.md:


package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
	doc, err := goquery.NewDocument("http://metalsucks.net")
	if err != nil {
		log.Fatal(err)
	}

	// Find the review items
	doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
		// For each item found, get the band and title
		band := s.Find("a").Text()
		title := s.Find("i").Text()
		fmt.Printf("Review %d: %s - %s\n", i, band, title)
	})
}

func main() {
	ExampleScrape()
}

The encoding problem

Chinese web pages often come out garbled, because goquery assumes UTF-8 by default; this is when a transcoder is needed.

Install iconv-go


go get github.com/djimenez/iconv-go

Usage


package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
	iconv "github.com/djimenez/iconv-go"
)

const baseUrl = "http://example.com" // placeholder for a GB2312-encoded page

func ExampleScrape() {
	res, err := http.Get(baseUrl)
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	defer res.Body.Close()

	// Convert the response body from gb2312 to utf-8 before parsing
	utfBody, err := iconv.NewReader(res.Body, "gb2312", "utf-8")
	if err != nil {
		fmt.Println(err.Error())
		return
	}

	doc, err := goquery.NewDocumentFromReader(utfBody)
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	// Now you can use doc to extract structured data from the page, e.g.:
	doc.Find("li").Each(func(i int, s *goquery.Selection) {
		fmt.Println(i, s.Text())
	})
}

Advanced usage

Some sites require a Cookie, Referer, and so on. You can set the request headers before sending the http request.

This is not specific to goquery; see golang's net/http package for more information.


baseUrl := "http://baidu.com"
client := &http.Client{}
req, err := http.NewRequest("GET", baseUrl, nil)
if err != nil {
	log.Fatal(err)
}
req.Header.Add("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
req.Header.Add("Referer", baseUrl)
req.Header.Add("Cookie", "your cookie") // you can also set cookies via req.AddCookie()
res, err := client.Do(req)
if err != nil {
	log.Fatal(err)
}
defer res.Body.Close()
// Finally, pass res to goquery to parse the page
doc, err := goquery.NewDocumentFromResponse(res)
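The header setup above can be exercised as a self-contained sketch using only the standard library and a local httptest server; the User-Agent, Referer, and cookie values here are placeholders, not anything a real site requires:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// buildRequest creates a GET request with custom headers set before sending.
// All header values are made-up examples.
func buildRequest(baseUrl string) (*http.Request, error) {
	req, err := http.NewRequest("GET", baseUrl, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "my-crawler/1.0")
	req.Header.Set("Referer", baseUrl)
	req.AddCookie(&http.Cookie{Name: "session", Value: "abc123"})
	return req, nil
}

func main() {
	// A local server that echoes back which headers it actually received.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "ua=%s referer=%s", r.Header.Get("User-Agent"), r.Header.Get("Referer"))
	}))
	defer srv.Close()

	req, err := buildRequest(srv.URL)
	if err != nil {
		log.Fatal(err)
	}
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	// The server's echo confirms the headers went out with the request.
}
```

Once the response comes back, it can be handed to goquery exactly as in the snippet above.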

Conclusion

References

  • https://github.com/PuerkitoBio/goquery
  • https://github.com/PuerkitoBio/goquery/issues/185
  • https://github.com/PuerkitoBio/goquery/wiki/Tips-and-tricks#handle-non-utf8-html-pages
