Crawling a web page with Go and extracting the links it contains
- 2020-07-21 08:25:24
- OfStack
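1. Download the package "golang.org/x/net/html". It is maintained by the Go project but is not part of the standard library, so it has to be installed separately.
2. Install git first, then use the git command to download the package: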
git clone https://github.com/golang/net
3. Put the net package under the GOROOT path.
For example, mine is: GOROOT = E:\go\
So the final directory is: E:\go\src\golang.org\x\net
Note: If the golang.org and x folders do not exist, create them.
4. Create a fetch directory and create a main.go file in it. The code of main.go is as follows:
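package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
)

// main prints the content found at each URL given on the command line.
func main() {
	for _, url := range os.Args[1:] {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
			os.Exit(1) // resp is nil on error, so stop before using it
		}
		b, err := ioutil.ReadAll(resp.Body)
		resp.Body.Close() // release the connection once the body has been read
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: reading %s: %v\n", url, err)
			os.Exit(1)
		}
		fmt.Printf("%s", b)
	}
}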
5. Compile fetch
go build test.com\justin\demo\fetch
Note: test.com\justin\demo\ is my project path; use your own project path when compiling.
6. Execute the fetch.exe file
fetch.exe https://www.qq.com
Note: https://www.qq.com is the URL to crawl. If everything is configured correctly, the HTML content of that URL is printed. If not, check that the steps above were carried out correctly.
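One thing worth checking when nothing is printed is the URL itself: `http.Get` returns an error for an argument such as `www.qq.com` that has no scheme. The sketch below shows one possible way to supply a default. The helper name `ensureScheme` and the choice of `https://` as the default are assumptions, not part of the original program.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureScheme is a hypothetical helper: it returns raw unchanged when it
// already starts with http:// or https://, and prepends https:// otherwise.
func ensureScheme(raw string) string {
	if strings.HasPrefix(raw, "http://") || strings.HasPrefix(raw, "https://") {
		return raw
	}
	return "https://" + raw
}

func main() {
	// Both arguments end up as the same fetchable URL.
	for _, arg := range []string{"www.qq.com", "https://www.qq.com"} {
		fmt.Println(ensureScheme(arg))
	}
}
```

In the fetch program this could be applied to each element of `os.Args[1:]` before it is passed to `http.Get`.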
7. The web page has now been crawled, so what remains is to analyze the links it contains. Create a findlinks directory and create a main.go file in it. The code of main.go is as follows:
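package main

import (
	"fmt"
	"os"

	"golang.org/x/net/html"
)

// main parses the HTML document read from standard input and prints its links.
func main() {
	doc, err := html.Parse(os.Stdin)
	if err != nil {
		// Fprintf (not Fprint) is needed for the %v verb to be formatted.
		fmt.Fprintf(os.Stderr, "findlinks: %v\n", err)
		os.Exit(1)
	}
	for _, link := range visit(nil, doc) {
		fmt.Println(link)
	}
}

// visit appends to links each href found in an <a> element of the tree
// rooted at n, and returns the result.
func visit(links []string, n *html.Node) []string {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				links = append(links, a.Val)
			}
		}
	}
	// Recurse into every child node.
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = visit(links, c)
	}
	return links
}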
8. Compile findlinks
go build test.com\justin\demo\findlinks
Note: test.com\justin\demo\ is my project path; use your own project path when compiling.
9. Execute the findlinks.exe file
fetch.exe https://www.qq.com | findlinks.exe
10. Result after execution: the hyperlinks found in the page are printed, in their various forms.