Parsing web pages in Go with goquery
- 2020-06-07 04:39:13
- OfStack
Preface
This article introduces goquery, a Go library for parsing web pages, and shows how to use it.
Java has Jsoup and Node.js has cheerio, both of which make parsing web pages convenient. Go has a comparable tool, goquery, which is easy to use and offers jQuery-like selectors.
Installation
go get github.com/PuerkitoBio/goquery
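Usage
The demo from the project's README.md is shown here. Note that newer goquery releases mark goquery.NewDocument as deprecated in favor of fetching the page with net/http and calling goquery.NewDocumentFromReader.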
package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
	doc, err := goquery.NewDocument("http://metalsucks.net")
	if err != nil {
		log.Fatal(err)
	}
	// Find the review items
	doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
		// For each item found, get the band and title
		band := s.Find("a").Text()
		title := s.Find("i").Text()
		fmt.Printf("Review %d: %s - %s\n", i, band, title)
	})
}

func main() {
	ExampleScrape()
}
Encoding problems
goquery assumes UTF-8 input, so Chinese web pages served in another encoding, such as GB2312, often come out garbled. In that case the page has to be transcoded first.
Install iconv-go
go get github.com/djimenez/iconv-go
Usage
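import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
	iconv "github.com/djimenez/iconv-go"
)

// baseUrl is assumed to be declared elsewhere, e.g. the address of a GB2312-encoded page.
func ExampleScrape() {
	res, err := http.Get(baseUrl)
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	defer res.Body.Close()
	// Wrap the body in a reader that converts GB2312 to UTF-8
	utfBody, err := iconv.NewReader(res.Body, "gb2312", "utf-8")
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	doc, err := goquery.NewDocumentFromReader(utfBody)
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	// doc now gives access to the structured data in the page, for example:
	doc.Find("li").Each(func(i int, s *goquery.Selection) {
		fmt.Println(i, s.Text())
	})
}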
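Hard-coding "gb2312" only works when the page's encoding is known in advance. One possible way to decide whether transcoding is needed at all is to look at the charset the server declares in its Content-Type header. The sketch below uses only the standard library; charsetOf and needsTranscoding are hypothetical helper names, and the approach assumes the server actually sends a charset parameter (many pages declare it only in a meta tag, which this does not cover).

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetOf (hypothetical helper) extracts the charset parameter from a
// Content-Type header value. It returns "" when the header is missing,
// malformed, or carries no charset parameter.
func charsetOf(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

// needsTranscoding (hypothetical helper) reports whether a declared charset
// is something other than UTF-8. An empty charset means "unknown"; this
// sketch assumes no transcoding in that case.
func needsTranscoding(charset string) bool {
	return charset != "" && charset != "utf-8" && charset != "utf8"
}

func main() {
	headers := []string{
		"text/html; charset=GB2312",
		"text/html; charset=utf-8",
		"text/html",
	}
	for _, ct := range headers {
		cs := charsetOf(ct)
		fmt.Printf("%q -> charset %q, transcode: %v\n", ct, cs, needsTranscoding(cs))
	}
}
```

In a scraper, the value of res.Header.Get("Content-Type") would be passed to charsetOf, and the result could then be used as the source encoding for iconv.NewReader.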
Advanced usage
Some sites check the Cookie, Referer, and similar headers. You can set request headers before the HTTP request is sent.
This is not part of goquery; see Go's net/http package for details.
baseUrl := "http://baidu.com"
client := &http.Client{}
req, err := http.NewRequest("GET", baseUrl, nil)
if err != nil {
	log.Fatal(err)
}
req.Header.Add("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
req.Header.Add("Referer", baseUrl)
req.Header.Add("Cookie", "your cookie") // Cookies can also be set with req.AddCookie()
res, err := client.Do(req)
if err != nil {
	log.Fatal(err)
}
defer res.Body.Close() // defer only after the error check, since res is nil on failure
// Finally, pass the response body to goquery to parse the page
doc, err := goquery.NewDocumentFromReader(res.Body)
if err != nil {
	log.Fatal(err)
}
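To check that custom headers and cookies really reach the server, the request can be pointed at a local test server first. This is a rough sketch using only the standard library: fetchWithHeaders and echoServer are hypothetical helpers, and the User-Agent string and the "session" cookie are placeholder values, not anything a real site expects.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
)

// fetchWithHeaders (hypothetical helper) sends a GET request with a custom
// User-Agent, Referer, and cookie, and returns the response body as a string.
func fetchWithHeaders(url string) (string, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", "example-agent/1.0") // placeholder value
	req.Header.Set("Referer", url)
	req.AddCookie(&http.Cookie{Name: "session", Value: "abc123"}) // placeholder cookie

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", res.Status)
	}
	body, err := io.ReadAll(res.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// echoServer (hypothetical helper) starts a local test server that reports
// the User-Agent and session cookie it received.
func echoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		cookie := ""
		if c, err := r.Cookie("session"); err == nil {
			cookie = c.Value
		}
		fmt.Fprintf(w, "ua=%s cookie=%s", r.UserAgent(), cookie)
	}))
}

func main() {
	srv := echoServer()
	defer srv.Close()

	body, err := fetchWithHeaders(srv.URL)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(body)
}
```

Once the headers look right, the same request can target the real site, and res.Body can be handed to goquery.NewDocumentFromReader instead of being read into a string.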