Why write a web crawler with Go and not use concurrency? Let's speed this bad boy up with some goroutines!
First, create a config struct to hold the crawler's shared state:

type config struct {
	pages              map[string]PageData
	baseURL            *url.URL
	mu                 *sync.Mutex
	concurrencyControl chan struct{}
	wg                 *sync.WaitGroup
}
Most of the crawling code will now be methods on the config struct, because it all needs access to the same shared state:

- pages: every goroutine records the pages it has crawled in the same map
- baseURL: the website we're crawling
- mu: a Mutex that keeps access to the pages map thread-safe
- concurrencyControl: a buffered channel of empty structs that limits how many goroutines crawl at once. When a new goroutine starts, we'll send an empty struct into the channel. When it's done, we'll receive an empty struct from the channel. This will cause new goroutines to block and wait until the buffer has space for their "send". (For example, a buffer size of 5 means at most 5 requests at once.) See the standalone sketch after this list.
- wg: a WaitGroup that lets the main function wait until all in-flight goroutines (HTTP requests) are done before exiting the program
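If the buffered-channel pattern is new to you, here is a small standalone sketch (separate from the crawler, with a made-up "worker" job) showing how a buffer of size 2 caps the number of goroutines running at any moment:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// A buffer size of 2 means at most 2 workers run at the same time.
	concurrencyControl := make(chan struct{}, 2)
	var wg sync.WaitGroup

	for i := 1; i <= 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()

			concurrencyControl <- struct{}{}        // blocks until the buffer has space
			defer func() { <-concurrencyControl }() // frees a slot when this worker returns

			fmt.Printf("worker %d running\n", id)
			time.Sleep(100 * time.Millisecond)
		}(i)
	}

	wg.Wait()
}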
Next, convert your crawlPage function into a method:

func (cfg *config) crawlPage(rawCurrentURL string)

We remove some parameters because they're available via the struct now.
I created this method to call as a helper inside of crawlPage:
func (cfg *config) addPageVisit(normalizedURL string) (isFirst bool)
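Here is a minimal sketch of that helper, assuming PageData is a plain struct and that it's acceptable to reserve the map entry with a zero value which the caller overwrites after extracting the page (both are assumptions for this sketch, not requirements):

func (cfg *config) addPageVisit(normalizedURL string) (isFirst bool) {
	cfg.mu.Lock()
	defer cfg.mu.Unlock()

	// If the page is already in the map, another goroutine got here first.
	if _, visited := cfg.pages[normalizedURL]; visited {
		return false
	}

	// Assumption: reserve the entry under the lock so concurrent callers see the
	// page as visited; the caller can overwrite it with the real PageData later.
	cfg.pages[normalizedURL] = PageData{}
	return true
}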
A few things to keep in mind:

- The addPageVisit method returns a boolean to indicate whether this is the first time we've seen the page, so crawlPage knows whether to keep crawling or return early.
- Use defer to decrement the wait group (wg.Done()) and to receive from the concurrencyControl channel (<-cfg.concurrencyControl). This ensures that the wg is decremented and the channel is emptied even when the goroutine errors and returns early.
- Block on the wait group (wg.Wait()) in the main function so the program doesn't exit while crawls are still in flight.
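Putting those pieces together, here is one way the concurrency scaffolding can be wired up. This is a sketch, not the official solution: maxConcurrency, the hard-coded base URL, and the elided crawl logic are assumptions, and it relies on the config struct above plus the net/url, sync, and log imports.

func (cfg *config) crawlPage(rawCurrentURL string) {
	cfg.concurrencyControl <- struct{}{} // take a slot; blocks while maxConcurrency crawls are in flight
	defer func() {
		<-cfg.concurrencyControl // free the slot...
		cfg.wg.Done()            // ...and decrement the wait group, even on an early return
	}()

	// ... your existing crawl logic goes here: normalize the URL, return early
	// if cfg.addPageVisit reports the page was already seen, fetch the HTML,
	// and for each discovered link spawn:
	//
	//	cfg.wg.Add(1)
	//	go cfg.crawlPage(nextURL)
}

func main() {
	const maxConcurrency = 5 // assumption: pick whatever limit you like

	baseURL, err := url.Parse("https://example.com") // assumption: hard-coded for the sketch
	if err != nil {
		log.Fatalf("couldn't parse base URL: %v", err)
	}

	cfg := &config{
		pages:              map[string]PageData{},
		baseURL:            baseURL,
		mu:                 &sync.Mutex{},
		concurrencyControl: make(chan struct{}, maxConcurrency),
		wg:                 &sync.WaitGroup{},
	}

	// Start the first crawl in a goroutine, then block until every
	// in-flight crawl has finished.
	cfg.wg.Add(1)
	go cfg.crawlPage(cfg.baseURL.String())
	cfg.wg.Wait()
}

Note that wg.Add(1) always happens before go cfg.crawlPage(...), so the wait group can never hit zero while a crawl is about to start.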
When you're satisfied with the results, you can move on.