Why write a web crawler with Go and not use concurrency? Let's speed this bad boy up with some goroutines!
First, create a config struct to hold the crawler's shared state:

type config struct {
	pages              map[string]PageData
	baseURL            *url.URL
	mu                 *sync.Mutex
	concurrencyControl chan struct{}
	wg                 *sync.WaitGroup
}
Most of the crawling code will now be methods on the config struct, because it all needs access to the same shared state:

- pages: every goroutine records the pages it has crawled in the same map
- baseURL: the website we're crawling
- mu: a Mutex that keeps access to the pages map thread-safe
- concurrencyControl: a buffered channel of empty structs that limits how many goroutines crawl at once. When a new goroutine starts, we'll send an empty struct into the channel. When it's done, we'll receive an empty struct from the channel. This will cause new goroutines to block and wait until the buffer has space for their "send". (For example, a buffer size of 5 means at most 5 requests at once.) See the standalone sketch after this list.
- wg: a WaitGroup that lets the main function wait until all in-flight goroutines (HTTP requests) are done before exiting the program
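If the buffered-channel pattern is new to you, here is a small standalone sketch (separate from the crawler, with a made-up "worker" job) showing how a buffer of size 2 caps the number of goroutines running at any moment:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// A buffer size of 2 means at most 2 workers run at the same time.
	concurrencyControl := make(chan struct{}, 2)
	var wg sync.WaitGroup

	for i := 1; i <= 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()

			concurrencyControl <- struct{}{}        // blocks until the buffer has space
			defer func() { <-concurrencyControl }() // frees a slot when this worker returns

			fmt.Printf("worker %d running\n", id)
			time.Sleep(100 * time.Millisecond)
		}(i)
	}

	wg.Wait()
}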
Next, convert your crawlPage function into a method:

func (cfg *config) crawlPage(rawCurrentURL string)

We remove some parameters because they're available via the struct now.
I created this method to call as a helper inside of crawlPage:
func (cfg *config) addPageVisit(normalizedURL string) (isFirst bool)
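Here is a minimal sketch of that helper, assuming PageData is a plain struct and that it's acceptable to reserve the map entry with a zero value which the caller overwrites after extracting the page (both are assumptions for this sketch, not requirements):

func (cfg *config) addPageVisit(normalizedURL string) (isFirst bool) {
	cfg.mu.Lock()
	defer cfg.mu.Unlock()

	// If the page is already in the map, another goroutine got here first.
	if _, visited := cfg.pages[normalizedURL]; visited {
		return false
	}

	// Assumption: reserve the entry under the lock so concurrent callers see the
	// page as visited; the caller can overwrite it with the real PageData later.
	cfg.pages[normalizedURL] = PageData{}
	return true
}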
A few things to keep in mind:

- The addPageVisit method returns a boolean to indicate whether this is the first time we've seen the page, so crawlPage knows whether to keep crawling or return early.
- Use defer to decrement the wait group (wg.Done()) and to receive from the concurrencyControl channel (<-cfg.concurrencyControl). This ensures that the wg is decremented and the channel is emptied even when the goroutine errors and returns early.
- Block on the wait group (wg.Wait()) in the main function so the program doesn't exit while crawls are still in flight.
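Putting those pieces together, here is one way the concurrency scaffolding can be wired up. This is a sketch, not the official solution: maxConcurrency, the hard-coded base URL, and the elided crawl logic are assumptions, and it relies on the config struct above plus the net/url, sync, and log imports.

func (cfg *config) crawlPage(rawCurrentURL string) {
	cfg.concurrencyControl <- struct{}{} // take a slot; blocks while maxConcurrency crawls are in flight
	defer func() {
		<-cfg.concurrencyControl // free the slot...
		cfg.wg.Done()            // ...and decrement the wait group, even on an early return
	}()

	// ... your existing crawl logic goes here: normalize the URL, return early
	// if cfg.addPageVisit reports the page was already seen, fetch the HTML,
	// and for each discovered link spawn:
	//
	//	cfg.wg.Add(1)
	//	go cfg.crawlPage(nextURL)
}

func main() {
	const maxConcurrency = 5 // assumption: pick whatever limit you like

	baseURL, err := url.Parse("https://example.com") // assumption: hard-coded for the sketch
	if err != nil {
		log.Fatalf("couldn't parse base URL: %v", err)
	}

	cfg := &config{
		pages:              map[string]PageData{},
		baseURL:            baseURL,
		mu:                 &sync.Mutex{},
		concurrencyControl: make(chan struct{}, maxConcurrency),
		wg:                 &sync.WaitGroup{},
	}

	// Start the first crawl in a goroutine, then block until every
	// in-flight crawl has finished.
	cfg.wg.Add(1)
	go cfg.crawlPage(cfg.baseURL.String())
	cfg.wg.Wait()
}

Note that wg.Add(1) always happens before go cfg.crawlPage(...), so the wait group can never hit zero while a crawl is about to start.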
When you're satisfied with the results, you can move on.