How to Parse HTML Pages with Node and cheerio
In this post I’ll show a quick example of how to parse HTML content using Node and cheerio. Cheerio is an implementation of core jQuery designed for the server. So if you know jQuery (and I assume that you do), you will find cheerio easy to use.
To get the HTML content of the page I’ll use superagent, a simple HTTP request module.
For the demo I’ll get the home page of the cheerio site and then find all the h2 headers.
Here’s the complete code for the demo:
The third line binds everything together. First, we request the page using superagent, then get the HTML content of the page from the response, then extract the h2 headers, and finally print them to the console.
The getHeaders function returns all h2 headers from the HTML passed to it.
Parsing the content of a page
The parsing logic that uses cheerio is in the getHeaders function. First, we need to initialize cheerio with the HTML we’d like to parse. This is done using the load function. After that we can use the returned object in the same way as we use $ in jQuery: we can apply selectors and navigate the parsed document.
For the sake of the demo I’m getting all the h2 headers.
The cheerio site provides a great overview of the available features and more examples.