Our link tracker will need to know how to read a page of HTML text and extract links.
For example, the following HTML page has a single link to https://blog.boot.dev:
<html>
<body>
<a href="https://blog.boot.dev"><span>Go to Boot.dev</span></a>
</body>
</html>
We'll use a third-party HTML parsing library called JSDOM to find and extract links.
We want to write a new function called getURLsFromHTML in the crawl.ts file. It takes two arguments: the first is an HTML string, and the second is the root URL of the website we're crawling, which allows us to rewrite relative URLs into absolute URLs. It returns an un-normalized array of all the URLs found within the HTML.
function getURLsFromHTML(html: string, baseURL: string)
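To make the contract concrete, here's what a call might look like. This is a sketch, not the required behavior down to the character: the page and its relative href are hypothetical, and the output assumes relative hrefs are resolved against baseURL.

const html = '<html><body><a href="/learn"><span>Go to Boot.dev</span></a></body></html>'
getURLsFromHTML(html, 'https://blog.boot.dev')
// => [ 'https://blog.boot.dev/learn' ]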
Here are some ideas for writing your tests: extracting an absolute URL, converting a relative URL to an absolute one using the base URL, finding several links on a single page, and handling a page with no links at all.
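For instance, the first two ideas might look something like this in vitest. This is a sketch: it assumes you export getURLsFromHTML from crawl.ts, and the expected strings assume relative hrefs are resolved with the standard URL constructor.

import { expect, test } from 'vitest'
import { getURLsFromHTML } from './crawl'

test('getURLsFromHTML finds an absolute URL', () => {
  const html = '<html><body><a href="https://blog.boot.dev/path">Boot.dev Blog</a></body></html>'
  expect(getURLsFromHTML(html, 'https://blog.boot.dev')).toEqual(['https://blog.boot.dev/path'])
})

test('getURLsFromHTML converts a relative URL to an absolute one', () => {
  const html = '<html><body><a href="/path">Boot.dev Blog</a></body></html>'
  expect(getURLsFromHTML(html, 'https://blog.boot.dev')).toEqual(['https://blog.boot.dev/path'])
})

Before these tests can even compile, you'll need to install jsdom and its type definitions: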
npm install jsdom
npm install -D @types/jsdom
This will install jsdom as a "dependency" (as opposed to vitest, which is a "devDependency" and was installed with the -D flag). "Dev dependencies" are not required to run your application; they're only required for development (like testing). Regular dependencies are required to run the program itself.
I'll try not to give too many hints: you should go read the JSDOM docs! That said, here are a few:

- import { JSDOM } from 'jsdom'
- new JSDOM(htmlBody) creates a new "document object model"
- dom.window.document.querySelectorAll('a') returns a NodeList (an array-like collection) of <a> "anchor" elements

In HTML, "anchors" are links, e.g.:

<a href="https://boot.dev">Learn Backend Development</a>
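Putting those hints together, one possible shape for the function is sketched below. This is a sketch under assumptions, not the official solution: it assumes you resolve relative hrefs with the standard URL constructor, and it skips hrefs that can't be parsed.

import { JSDOM } from 'jsdom'

export function getURLsFromHTML(html: string, baseURL: string): string[] {
  const urls: string[] = []
  const dom = new JSDOM(html)

  // NodeList of every <a> element in the document
  const anchors = dom.window.document.querySelectorAll('a')

  for (const anchor of anchors) {
    // getAttribute returns the href exactly as written in the HTML,
    // which may be relative ("/path") or absolute
    const href = anchor.getAttribute('href')
    if (!href) continue

    try {
      // new URL(href, baseURL) resolves relative hrefs against baseURL
      // and leaves absolute hrefs as they are
      urls.push(new URL(href, baseURL).toString())
    } catch {
      // ignore hrefs that can't be parsed as URLs
    }
  }

  return urls
}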
Once you're satisfied that your function works as expected, move on to the next step!