🚜 A Simple Web Scraper in Go

In my previous job at Sendwithus, we’d been having trouble writing performant concurrent systems in Python. After many attempts, we came to the conclusion that Python just wasn’t suitable for some of our high throughput tasks, so we started experimenting with Go as a potential replacement.

After making it all the way through the Golang Interactive Tour, which I highly recommend if you haven’t done it already, I wanted to build something real. The last task in the Go tour is to build a concurrent web crawler, but it faked the fun parts like making HTTP requests and parsing HTML. That’s what motivated me to open my IDE and try it myself. This post will walk you through the steps I took to build a simple web scraper in Go.

We’ll go over three main topics:

  1. using the net/http package to fetch a web page
  2. using the golang.org/x/net/html package to parse an HTML document
  3. using Go concurrency with multi-channel communication

In order to keep this tutorial short, I won’t be accommodating those of you that haven’t yet finished the Go Tour. The tour will teach you everything you need to know to follow along.

Building a Web Scraper

As I mentioned in the introduction, we’ll be building a simple web scraper in Go. Note that I didn’t say web crawler because our scraper will only be going one level deep (maybe I’ll cover crawling in another post).

We’re going to be building a basic command line tool that takes an input of seed URLs, scrapes them, then prints the links it finds on those pages.

Here’s an example of it in action:

$ go run main.go https://schier.co https://insomnia.rest

Found 7 unique urls:

 - https://insomnia.rest
 - https://twitter.com/GregorySchier
 - https://support.insomnia.rest
 - https://chat.insomnia.rest
 - https://github.com/Kong/insomnia
 - https://twitter.com/GetInsomnia
 - https://konghq.com

Now that we know what we’re building, let’s get to the fun part—putting it together.

To make this tutorial easier to digest, I’ll be breaking it down into isolated components. After going over each component, I’ll put them all together to form the final product. The first component we’ll be going over is making an HTTP request to fetch some HTML.

1. Fetching a Web Page By URL

Go includes a really good HTTP library out of the box. The http package provides an http.Get(url) function that needs only a few lines of code to fetch a page.

Note that things like error handling are omitted to keep this example short.

//~~~~~~~~~~~~~~~~~~~~~~//
// Make an HTTP request //
//~~~~~~~~~~~~~~~~~~~~~~//

resp, _ := http.Get(url)
bytes, _ := ioutil.ReadAll(resp.Body)

fmt.Println("HTML:\n\n", string(bytes))

resp.Body.Close()
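
If you’d like to run this step on its own, here’s a minimal self-contained sketch with basic error handling added back in. The hard-coded URL is just a placeholder for this example, not part of the final program.

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
// Fetch a page (with error handling)  //
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
)

func main() {
	url := "https://schier.co" // placeholder URL, swap in your own

	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("ERROR: Failed to fetch:", url)
		os.Exit(1)
	}
	defer resp.Body.Close() // close the Body when main returns

	bytes, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("ERROR: Failed to read response body")
		os.Exit(1)
	}

	fmt.Println("HTML:\n\n", string(bytes))
}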

Making an HTTP request is the foundation of a web scraper so now that we know how to do that, we can move on to handling the HTML contents returned.

2. Finding <a> Tags in HTML

Go doesn’t have a core package for parsing HTML, but there is one included in the Go sub-repositories that we can import from golang.org/x/net/html.
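If you don’t already have it, you can typically fetch it with go get golang.org/x/net/html.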

If you’ve never interacted with an XML or HTML tokenizer before, this may take some time to grasp but I believe in you.

The package’s tokenizer splits the HTML document into “tokens” that can be iterated over. So, to find anchor tags (links), we can tokenize the HTML and iterate over the tokens until we hit an <a> tag. Here are the possible things that a token can represent (documentation):

Token Name            Token Description
ErrorToken            an error during tokenization (or the end of the document)
TextToken             a text node (the contents of an element)
StartTagToken         an opening tag, for example <a>
EndTagToken           a closing tag, for example </a>
SelfClosingTagToken   a self-closing tag, for example <br/>
CommentToken          a comment, for example <!-- Hello World -->
DoctypeToken          a doctype declaration, for example <!DOCTYPE html>

The code below demonstrates how to find all the opening anchor tags in an HTML document.

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
// Parse HTML for Anchor Tags //
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~//

z := html.NewTokenizer(resp.Body)

for {
    tt := z.Next()

    switch {
    case tt == html.ErrorToken:
        // End of the document, we're done
        return
    case tt == html.StartTagToken:
        t := z.Token()

        isAnchor := t.Data == "a"
        if isAnchor {
            fmt.Println("We found a link!")
        }
    }
}

Now that we have found the anchor tags, how do we get the href value? Unfortunately, it’s not as easy as you would expect. A token stores its attributes in a slice, so we have to perform a similar iteration to pull out the one we want.

//~~~~~~~~~~~~~~~~~~~~//
// Find Tag Attribute //
//~~~~~~~~~~~~~~~~~~~~//

for _, a := range t.Attr {
    if a.Key == "href" {
        fmt.Println("Found href:", a.Val)
        break
    }
}
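
If you want to try the last two pieces together before we add concurrency, here’s a small self-contained sketch that fetches a single page and prints every href it finds. Again, the hard-coded URL is just a placeholder for the example.

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
// Fetch a page and print its links   //
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//

package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

func main() {
	url := "https://schier.co" // placeholder URL, swap in your own

	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("ERROR: Failed to fetch:", url)
		return
	}
	defer resp.Body.Close()

	z := html.NewTokenizer(resp.Body)

	for {
		tt := z.Next()

		switch {
		case tt == html.ErrorToken:
			// End of the document, we're done
			return
		case tt == html.StartTagToken:
			t := z.Token()
			if t.Data != "a" {
				continue
			}

			// Look for an href attribute on the <a> tag
			for _, a := range t.Attr {
				if a.Key == "href" {
					fmt.Println("Found href:", a.Val)
					break
				}
			}
		}
	}
}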

At this point we know how to fetch HTML using an HTTP request, as well as extract the links from that HTML document. Now let’s put it all together.

3. Introducing Goroutines and Channels

In order to make our scraper performant, and to make this tutorial a bit more advanced, we’ll make use of goroutines and channels, Go’s utilities for executing concurrent tasks.

The trickiest part of this scraper is how it uses channels. For the scraper to run quickly, it needs to fetch all URLs concurrently. With concurrency, the total execution time should be roughly the time of the slowest request. Without it, the execution time would be the sum of all request times, since the requests would run one after the other. So how do we do this?

The approach I took is to create a goroutine for each request and have each one publish the URLs it finds to a shared channel. There’s one problem with this though. How do we know when the last URL is sent to the channel so we can close it? For this, we can use a second channel for communicating status.

The second channel is simply a notification channel. After a goroutine has published all of its URLs to the main channel, it publishes a done message to the notification channel. The main goroutine then subscribes to the notification channel and stops waiting once every goroutine has reported that it’s finished. Don’t worry, this will make much more sense when you see the finished code.
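
Still, before the full program, it can help to see the pattern in isolation. Here’s a stripped-down sketch of the two-channel idea; the worker function, its id, and the numWorkers count are hypothetical stand-ins, not part of the scraper itself.

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
// Two-channel pattern (a sketch) //
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~//

package main

import "fmt"

// worker publishes its result, then signals that it's finished
func worker(id int, chResults chan string, chFinished chan bool) {
	defer func() {
		// Notify that this worker is done
		chFinished <- true
	}()

	chResults <- fmt.Sprintf("result from worker %d", id)
}

func main() {
	chResults := make(chan string)
	chFinished := make(chan bool)

	numWorkers := 3
	for i := 0; i < numWorkers; i++ {
		go worker(i, chResults, chFinished)
	}

	// Keep reading results until every worker has signaled that it's done
	for done := 0; done < numWorkers; {
		select {
		case r := <-chResults:
			fmt.Println(r)
		case <-chFinished:
			done++
		}
	}
}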

Putting it All Together

If you’ve made it this far, you should know everything necessary to understand the full program, so here it is. I’ve also added a few comments to help explain some of the more complicated parts.

package main

import (
	"fmt"
	"golang.org/x/net/html"
	"net/http"
	"os"
	"strings"
)

// Helper function to pull the href attribute from a Token
func getHref(t html.Token) (ok bool, href string) {
	// Iterate over token attributes until we find an "href"
	for _, a := range t.Attr {
		if a.Key == "href" {
			href = a.Val
			ok = true
		}
	}
	
	// A "bare" return will return the variables (ok, href) as
	// defined in the function signature
	return
}

// Extract all http(s) links from a given webpage
func crawl(url string, ch chan string, chFinished chan bool) {
	resp, err := http.Get(url)

	defer func() {
		// Notify that we're done after this function
		chFinished <- true
	}()

	if err != nil {
		fmt.Println("ERROR: Failed to crawl:", url)
		return
	}

	b := resp.Body
	defer b.Close() // close Body when the function completes

	z := html.NewTokenizer(b)

	for {
		tt := z.Next()

		switch {
		case tt == html.ErrorToken:
			// End of the document, we're done
			return
		case tt == html.StartTagToken:
			t := z.Token()

			// Check if the token is an <a> tag
			isAnchor := t.Data == "a"
			if !isAnchor {
				continue
			}

			// Extract the href value, if there is one
			ok, url := getHref(t)
			if !ok {
				continue
			}

			// Make sure the url begins with http(s)
			hasProto := strings.Index(url, "http") == 0
			if hasProto {
				ch <- url
			}
		}
	}
}

func main() {
	foundUrls := make(map[string]bool)
	seedUrls := os.Args[1:]

	// Channels
	chUrls := make(chan string)
	chFinished := make(chan bool) 

	// Kick off the crawl process (concurrently)
	for _, url := range seedUrls {
		go crawl(url, chUrls, chFinished)
	}

	// Subscribe to both channels
	for c := 0; c < len(seedUrls); {
		select {
		case url := <-chUrls:
			foundUrls[url] = true
		case <-chFinished:
			c++
		}
	}

	// We're done! Print the results...

	fmt.Println("\nFound", len(foundUrls), "unique urls:\n")

	for url := range foundUrls {
		fmt.Println(" - " + url)
	}

	close(chUrls)
}

That wraps up this tutorial on a basic Go web scraper! We’ve covered making HTTP requests, parsing HTML, and even some complex concurrency patterns.

If you’d like to take it a step further, try turning this web scraper into a web crawler and feed the URLs it finds back in as inputs. Then, see how far your crawler gets. 🚀

As always, thanks for reading! :)

If you enjoyed this tutorial, please consider sponsoring my work on GitHub 🤗
