Crawling a web page with Go and extracting the links it contains

  • 2020-07-21 08:25:24
  • OfStack

1. Download the package "golang.org/x/net/html". It is an extension package maintained by the Go project, not part of the standard library.

2. Install git first, then download the package with the git command:


git clone https://github.com/golang/net

3. Put the net package under the GOROOT path.

For example:

My GOROOT is E:\go\

So the final directory is: E:\go\src\golang.org\x\net

Note: If the golang.org and x folders do not exist, create them.

4. Create a fetch directory and, inside it, a main.go file with the following code:


package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
)

func main() {
	// Each command-line argument is treated as a URL to fetch.
	for _, url := range os.Args[1:] {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
			os.Exit(1) // resp is nil on error, so do not continue
		}
		b, err := ioutil.ReadAll(resp.Body)
		resp.Body.Close() // release the connection once the body is read
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: reading %s: %v\n", url, err)
			os.Exit(1)
		}
		fmt.Printf("%s", b)
	}
}
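The fetch program prints whatever the server returns, including error pages. As a minimal sketch of a stricter variant, the version below moves the download into a helper and treats any status other than 200 OK as an error. The helper name fetch and the decision to skip failed URLs instead of exiting are assumptions made for illustration, and io.ReadAll requires Go 1.16 or later.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetch downloads url and returns the response body.
// A status other than 200 OK is reported as an error
// instead of returning the server's error page.
func fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	for _, url := range os.Args[1:] {
		b, err := fetch(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: %s: %v\n", url, err)
			continue // report the failure and move on to the next URL
		}
		os.Stdout.Write(b)
	}
}
```

It is built and run the same way as the original fetch.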

5. Compile fetch


go build test.com\justin\demo\fetch

Note: test.com\justin\demo\ is my project path; use your own project path when compiling.

6. Execute the fetch.exe file

fetch.exe https://www.qq.com

Note: https://www.qq.com is the URL to crawl. If everything is configured correctly, the HTML content of that URL is printed. If not, check that the steps above were carried out correctly.

7. The web page has now been fetched, so what remains is to analyze the links it contains. Create a findlinks directory and, inside it, a main.go file with the following code:


package main

import (
	"fmt"
	"os"

	"golang.org/x/net/html"
)

func main() {
	// Parse the HTML document read from standard input.
	doc, err := html.Parse(os.Stdin)
	if err != nil {
		fmt.Fprintf(os.Stderr, "findlinks: %v\n", err)
		os.Exit(1)
	}
	for _, link := range visit(nil, doc) {
		fmt.Println(link)
	}
}

// visit appends to links the href value of every <a> element
// in the tree rooted at n and returns the result.
func visit(links []string, n *html.Node) []string {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				links = append(links, a.Val)
			}
		}
	}
	// Recurse into each child node in document order.
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = visit(links, c)
	}
	return links
}

8. Compile findlinks


go build test.com\justin\demo\findlinks

Note: test.com\justin\demo\ is my project path; use your own project path when compiling.

9. Execute the findlinks.exe file


fetch.exe https://www.qq.com | findlinks.exe

10. Result after execution: the hyperlinks found in the page are printed, in their various forms.
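The links are printed exactly as they appear in the page, so the output usually mixes absolute URLs with relative ones. A possible follow-up, sketched here with only the standard net/url package, is to resolve each href against the URL of the page. The helper name resolve and the sample hrefs are hypothetical; the base URL stands in for whichever page was fetched.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// resolve converts each href into an absolute URL relative to base.
// An href that cannot be parsed is skipped.
func resolve(base string, hrefs []string) ([]string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, h := range hrefs {
		u, err := b.Parse(h) // parses h in the context of the base URL
		if err != nil {
			continue
		}
		out = append(out, u.String())
	}
	return out, nil
}

func main() {
	// Hypothetical hrefs in the forms a page commonly contains.
	hrefs := []string{
		"https://example.com/a",
		"/index.html",
		"detail.html",
	}
	links, err := resolve("https://www.qq.com/news/", hrefs)
	if err != nil {
		fmt.Fprintf(os.Stderr, "resolve: %v\n", err)
		os.Exit(1)
	}
	for _, link := range links {
		fmt.Println(link)
	}
}
```

With the base URL above, /index.html becomes https://www.qq.com/index.html and detail.html becomes https://www.qq.com/news/detail.html, while the absolute URL is left unchanged.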

