We need to extract links from the HTML both the links people can click AND the images that are displayed. This gives us a complete picture of what resources each page references.
For example, this HTML page has both a link and an image:
<html>
<body>
<a href="https://blog.boot.dev">Go to Boot.dev</a>
<img src="/logo.png" alt="Boot.dev Logo" />
</body>
</html>
We need to extract both https://blog.boot.dev (from the link) and /logo.png (from the image).
func getURLsFromHTML(htmlBody string, baseURL *url.URL) ([]string, error) {
htmlBody is an HTML stringbaseURL is the (already parsed) root URL of the website we're crawling. This will allow us to rewrite relative URLs into absolute URLs.In your tests, make sure that:
<a> tags in a body of HTMLHere's one example test case to give you an idea:
func TestGetURLsFromHTMLAbsolute(t *testing.T) {
inputURL := "https://blog.boot.dev"
inputBody := `<html><body><a href="https://blog.boot.dev"><span>Boot.dev</span></a></body></html>`
baseURL, err := url.Parse(inputURL)
if err != nil {
t.Errorf("couldn't parse input URL: %v", err)
return
}
actual, err := getURLsFromHTML(inputBody, baseURL)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
expected := []string{"https://blog.boot.dev"}
if !reflect.DeepEqual(actual, expected) {
t.Errorf("expected %v, got %v", expected, actual)
}
}
doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
// For each '<a href>' it finds, it will run this function.
}
href attribute containing the actual URL. e.g:<a href="https://www.boot.dev">Learn Backend Development</a>
func getImagesFromHTML(htmlBody string, baseURL *url.URL) ([]string, error) {
htmlBody is an HTML stringbaseURL is the (already parsed) root URL of the website we're crawling. This will allow us to rewrite relative URLs into absolute URLs.In your tests, make sure:
Here's one example test case to give you an idea:
func TestGetImagesFromHTMLRelative(t *testing.T) {
inputURL := "https://blog.boot.dev"
inputBody := `<html><body><img src="/logo.png" alt="Logo"></body></html>`
baseURL, err := url.Parse(inputURL)
if err != nil {
t.Errorf("couldn't parse input URL: %v", err)
return
}
actual, err := getImagesFromHTML(inputBody, baseURL)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
expected := []string{"https://blog.boot.dev/logo.png"}
if !reflect.DeepEqual(actual, expected) {
t.Errorf("expected %v, got %v", expected, actual)
}
}
Run and submit the CLI tests from the root of your module.