How to Parse HTML Pages with Node and cheerio
In this post I’ll show a quick example of how to parse HTML content using Node and cheerio. Cheerio is an implementation of core jQuery for the server, so if you know jQuery (and I assume that you do), you will find cheerio easy to use.
To get the HTML content of the page I’ll use superagent, a simple HTTP request module.
For the demo I’ll get the home page of the cheerio site and then find all the h2 headers on it.
Here’s the complete code for the demo:
The third line binds everything together. First, we request the page using superagent, then get the HTML content of the page from the response, then extract the h2 headers, and finally print them to the console.
The getHeaders function returns all h2 headers from the HTML passed to it.
Parsing the content of a page
The parsing logic that uses cheerio is in the getHeaders function. First, we need to initialize cheerio with the HTML we’d like to parse. This is done using the load function. After that we can use the returned object the same way we use $ in jQuery: we can apply selectors and navigate the HTML model.
For the sake of the demo, I’m getting all the h2 headers.
The cheerio site provides a great overview of the available features along with more examples.