Build a simple web scraper with node.js

Recently I released my first personal iOS app into the wild. The app is called Unit Guide for Starcraft 2 and it provides Starcraft 2 players with up-to-date and accurate information for every unit in the game. Instead of manually creating a huge JSON file, I wrote a web scraper in node.js that allows me to quickly extract all the data I need and output it in JSON format. In this post I will explain how you can build something similar using techniques that are familiar to most web developers.

Step 1: preparing

Before you get started, you'll want to install some dependencies. The ones I have used are request, cheerio and promise. Installing them works like this:

npm install --save request cheerio promise

If you don't have npm installed yet, follow the instructions here to install node and npm.

Once you have all the dependencies, you're going to need a webpage that you will scrape. I picked the Starcraft 2 units overview page as a starting point. You can pick any page you want, as long as it contains some data that you want to extract into a JSON file.

Step 2: loading the webpage

In order to start scraping the page, we're going to need to load it up. We'll be using request for this. Note that request simply pulls down the HTML; for my use case that was enough, but if you need the webpage to execute JavaScript in order to produce the content you're after, you might want to have a look at PhantomJS. It's a headless browser that allows JavaScript execution. I won't go into it here, as I didn't need it for my project.

Downloading the HTML using request is pretty straightforward. Here's how you can load a webpage:

var request = require('request');

request('http://eu.battle.net/sc2/en/game/unit/', function(error, result, html){
    if(error) {
        console.log('An error occurred');
        console.log(error);
        return;
    }

    console.log(html);
});

Getting the HTML was pretty easy, right? Now that we have the HTML, we can use cheerio to convert the HTML string into a DOM-like object that we can query with CSS-style selectors. All we have to do is include cheerio in our script and use it like this:

var request = require('request');
var cheerio = require('cheerio');

request('http://eu.battle.net/sc2/en/game/unit/', function(error, result, html){
    if(error) {
        console.log('An error occurred');
        console.log(error);
        return;
    }

    var $ = cheerio.load(html);
});

That's it. We now have an object that we can query for data pretty easily.
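
As a quick sanity check, you can run a simple query right after loading the document. This snippet is just an illustration that continues from the code above; the title selector works on any HTML page:

var $ = cheerio.load(html);

// print the page title to verify that the document was parsed
console.log($('title').text());

// cheerio supports most CSS selectors, just like jQuery
console.log('links on the page: ' + $('a').length);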

Step 3: finding and extracting some content

Now that we have the entire webpage loaded up and we can query it, it's time to look for content. In my case, I was looking for references to the pages that contain the actual content I wanted to extract. The easiest way to find out what to query the DOM for is the "inspect element" feature of your browser. It gives you an overview of all of the HTML elements on the page and where they sit in the page's hierarchy. Here's part of the hierarchy I was interested in:

[Screenshot: the element hierarchy in the browser's inspector, showing the table-lotv element and its unit-datatable children]

You can see an element with the class table-lotv in the hierarchy. This element has three children with the class unit-datatable. The contents of each unit-datatable are of interest to me, because somewhere in there I can find the names of the units I want to extract. To access these data tables and extract the relevant names, you could use a query selector like this:

$('.table-lotv .unit-datatable').each(function(i, dataTable){
    // the race name sits in a span inside the data table's title bar
    var race = $(dataTable).find('.title-bar span').text();
    // the unit names are links with the button-rollover class
    var $unitnames = $(dataTable).find('.databox table .button-rollover');
});

In the above snippet, $('.table-lotv .unit-datatable') selects all of the data tables. When I loop over these I have access to the individual dataTable objects. Inside each of them I can find the race name (Terran, Protoss or Zerg), which is contained in a span element inside an element with the class title-bar. Extracting the name isn't enough for my use case, though. I also want to scrape each unit's page, and after doing that I want to write all of the data to a JSON file at once. To do this I used promises. They are a great fit because I can easily create an array of promise objects and wait for all of them to be fulfilled. Let's see how that's done, shall we?

Step 4: build your list of promises

While we're looping over the dataTable objects we can create some promises that will need to be fulfilled before we output the big JSON file we're aiming for. Let's look at some code:

var request = require('request');
var cheerio = require('cheerio');
var Promise = require('promise');
var fs = require('fs');

request('http://eu.battle.net/sc2/en/game/unit/', function(error, result, html){
    //error handling code and cheerio loading

    var promises = [];

    $('.table-lotv .unit-datatable').each(function(i, dataTable){
        var race = $(dataTable).find('.title-bar span').text();
        var $unitnames = $(dataTable).find('.databox table .button-rollover');

        promises.push(scrapeUnits($unitnames));
    });

    Promise.all(promises).then(function(promiseResults){
        var data = {};

        // use the promiseResults to populate and build your data object...

        // write the JSON file to disk
        fs.writeFile('public/units.json', JSON.stringify(data), function(err){
            if(err) { console.log(err); }
        });
    });
});

Okay, so in this snippet I added Promise to the requirements. Inside of the request callback I created an empty array of promises. While looping over the data tables, I push a new promise that is returned by the scrapeUnits function (I'll get to that function in the next snippet). After looping over all of the data tables, I use Promise.all to wait until every promise in my promises array is fulfilled. When they are, I use the results of these promises to populate a data object (which is our JSON data). The function we pass to the then handler of Promise.all receives one argument: an array with the result of each promise we put in the promises array. If the promises array contains three elements, then so will promiseResults. Finally, I write the data to disk using fs, which is also added in the requirements. (fs is part of node.js, so you don't have to install it through npm.)
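
To make that last step concrete, here's a minimal sketch of one way to populate the data object. The races array is my own addition (it isn't in the stripped-down snippet above); it simply records the race name for each data table so the results can be paired up by index:

var races = [];

$('.table-lotv .unit-datatable').each(function(i, dataTable){
    races.push($(dataTable).find('.title-bar span').text());
    promises.push(scrapeUnits($(dataTable).find('.databox table .button-rollover')));
});

Promise.all(promises).then(function(promiseResults){
    var data = {};

    // promiseResults[i] holds the fulfilled value of promises[i],
    // so each unit array lines up with the race at the same index
    promiseResults.forEach(function(units, i){
        data[races[i]] = units;
    });

    fs.writeFile('public/units.json', JSON.stringify(data), function(err){
        if(err) { console.log(err); }
    });
});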

Step 5: nesting promises is cool

In the previous snippet I showed you this line of code:

promises.push(scrapeUnits($unitnames));

The function scrapeUnits returns a promise. Let's have a look at how that works, shall we?

function scrapeUnits(unitUrls) {
    return new Promise(function(fulfil, reject) {
        var units = [];

        // some code that loads a new page with request
        // some code that uses querySelectors and cheerio to extract data
        // some code that creates a unit object and eventually adds units to the array

        // eventually we're done with grabbing data for our units and we do this:        
        fulfil(units);
    });
}

This function is pretty straightforward. It returns a new Promise object. A Promise takes one function as a parameter, and that function receives two arguments: fulfil and reject. Both of these arguments are functions; we call fulfil when our operation was successful, or reject when we encountered an error. Once we call fulfil, the Promise is "done". When we use Promise.all, the then handler will only be called once all promises passed to all have been fulfilled.
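
To make the skeleton a little more tangible, here's one way its body could be filled in, assuming the argument is the cheerio selection of unit links from step 3 and that each link points at a unit detail page. The URL prefix and the .unit-title selector are hypothetical, so treat this as a sketch rather than a drop-in implementation:

function scrapeUnits($unitnames) {
    return new Promise(function(fulfil, reject) {
        var units = [];
        var pending = $unitnames.length;

        if(pending === 0) { return fulfil(units); }

        $unitnames.each(function(i, link){
            // hypothetical: prepend the host because the hrefs are relative
            var url = 'http://eu.battle.net' + link.attribs.href;

            request(url, function(error, result, html){
                if(error) { return reject(error); }

                var $unitPage = cheerio.load(html);

                // hypothetical selector for the unit's name on its detail page
                units.push({ name: $unitPage('.unit-title').text() });

                pending -= 1;
                if(pending === 0) { fulfil(units); }
            });
        });
    });
}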

Step 6: putting it all together

var request = require('request');
var cheerio = require('cheerio');
var Promise = require('promise');
var fs = require('fs');

request('http://eu.battle.net/sc2/en/game/unit/', function(error, result, html){
    if(error) {
        console.log('An error occurred');
        console.log(error);
        return;
    }

    var $ = cheerio.load(html);

    var promises = [];

    $('.table-lotv .unit-datatable').each(function(i, dataTable){
        var race = $(dataTable).find('.title-bar span').text();
        var $unitnames = $(dataTable).find('.databox table .button-rollover');

        promises.push(scrapeUnits($unitnames));
    });

    Promise.all(promises).then(function(promiseResults){
        var data = {};

        // use the promiseResults to populate and build your data object...

        // write the JSON file to disk
        fs.writeFile('public/units.json', JSON.stringify(data), function(err){
            if(err) { console.log(err); }
        });
    });
});

function scrapeUnits(unitUrls) {
    return new Promise(function(fulfil, reject) {
        var units = [];

        // some code that loads a new page with request
        // some code that uses querySelectors and cheerio to extract data
        // some code that creates a unit object and eventually adds units to the array

        // eventually we're done with grabbing data for our units and we do this:        
        fulfil(units);
    });
}

The above script is a stripped-down version of the code I wrote to scrape all of the unit information I needed. What you should take away from all this is that it's not very complex to build a scraper in node.js, especially if you're using promises. At first promises might seem a bit weird, but once you get used to them you'll realise that they are a great way to write maintainable and understandable asynchronous code. Promise.all in particular is a perfect fit for what we're trying to do when we scrape multiple webpages that should be merged into a single JSON file. The nice thing about node.js is that it's JavaScript, so we can use a lot of the technology we also use in the browser, such as the CSS / jQuery selectors that cheerio makes available to us.

Before you scrape a webpage, please remember that not every webpage owner appreciates it if you scrape their page to use their content, so make sure to only scrape what you need, when you need it. Especially if you start hitting somebody's website with hundreds of requests, you should ask yourself whether scraping that site is the right thing to do.
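
If you do end up fetching a series of pages from the same site, spacing your requests out is an easy way to be considerate. Here's a minimal sketch; the urls array and the 500ms spacing are hypothetical:

function delayedRequest(url, delayMs) {
    return new Promise(function(fulfil, reject) {
        // wait before firing the request so we don't hammer the server
        setTimeout(function(){
            request(url, function(error, result, html){
                if(error) { return reject(error); }
                fulfil(html);
            });
        }, delayMs);
    });
}

// fire each request half a second after the previous one
var pagePromises = urls.map(function(url, i){
    return delayedRequest(url, i * 500);
});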

If you have questions about this article, or would like to learn more about how I used the above techniques, you can let me know on Twitter.

Step up your async game with PromiseKit

Some of the most engaging apps we use today require network connectivity of some kind. They communicate with an API somewhere to fetch and store data, for example, or they use an API to search through a huge amount of data. The point is, you don't want your application to sit around and wait while an API call is happening. The same is true for a heavy computing task, such as resizing an image or storing it to disk. You want your UI to be snappy and fast. In other words, you don't want to do your heavy lifting on the main (UI) thread of a device.

Taking something off of the main thread

There are several ways to take work away from the main thread. An NSURLConnection automatically performs requests in the background and uses delegates to make callbacks about progress or completion, for example. You can also use the dispatch_async function to make something happen on a different thread. While this works perfectly fine, it's not ideal. When an NSURLConnection performs its delegate callbacks, it has to be coupled to that delegate. Also, imagine you want to chain together a couple of network requests; it starts to get messy to keep track of everything.

Now imagine an object that you can pass around, and that will automatically perform a task once it's done what it's supposed to do. These tasks can be chained, so if you want to string multiple things together it's very simple. Or maybe you want your callback to fire only once a couple of tasks are complete. Imagine implementing that with multiple NSURLConnections: you would probably have several instances, and whenever one completes you check the status of the others, and only when all are complete can you actually execute the code you've been meaning to execute. That sounds a lot more complicated than just writing:

when(request1, request2, request3).then { (result1: NSData, result2: NSData, result3: NSData) in }

The above snippet is actually really close to how PromiseKit works. Let's explore that a bit further, shall we?

Note: I am going to apply PromiseKit to network requests for this post. The concepts actually apply to anything that you want to do asynchronously.

A simple Promise example

To demonstrate a simple Promise we'll make a network request. First we create an NSURLRequest that we'll pass to an NSURLConnection. Then we kick off the loading with a Promise so we can use the result once the request is done:

let req = NSURLRequest(URL: NSURL(string: "http://example.com/api/feed/")!)
NSURLConnection.promise(req).then { (data: NSDictionary) in
  // use the data we just received
}

PromiseKit provides an extension on NSURLConnection that allows us to call promise(req: NSURLRequest) on it. After that we call then. The code inside this closure gets called once the Promise is fulfilled, which happens when the request completes successfully. If the request fails, we can add a report handler ('catch' in Swift 1.2) as well to make sure we catch that error:

let req = NSURLRequest(URL: NSURL(string: "http://example.com/api/feed/")!)
NSURLConnection.promise(req).then { (data: NSDictionary) in
  // use the data we just received
}.report { error in
  // do something with the error
}

And if there's code we want to execute regardless of error or success, we can use ensure (defer in Swift 1.2) like this:

let req = NSURLRequest(URL: NSURL(string: "http://example.com/api/feed/")!)
NSURLConnection.promise(req).then { (data: NSDictionary) in
  // use the data we just received
}.report { error in
  // do something with the error
}.ensure {
  // perform action regardless of result
}

If you understand this, you know everything you need to start using Promises in a basic way. But let's get a little more advanced and start returning our own Promise objects.

Returning Promises

Imagine this: we're building an application that uses an API. We want to ask the API for a user's feed, and we want to use PromiseKit for this. A nice implementation might be an API instance with a method on it called fetchUserFeed. That method returns a Promise, so we can easily use the result from the API in the class that actually wants the API to fetch data, a ViewModel for example. The fetchUserFeed function might look something like this:

func fetchUserFeed() -> Promise<Feed> {
  let req = NSURLRequest(URL: NSURL(string: "http://example.com/api/feed/")!)
  return NSURLConnection.promise(req).then { (data: NSDictionary) -> Feed in
    return Feed(dataDict: data)
  }
}

Note: Feed is just a data object, not included in the sample for brevity. It is used to illustrate how you would return something from a Promise.

The function above is very similar to what we had before, except that it now returns NSURLConnection.promise, which is a Promise. The then of that Promise returns a Feed, and the fetchUserFeed function therefore returns Promise<Feed>. In other words, fetchUserFeed now returns a Promise that will resolve with a Feed. Using this function looks like this:

let api = DWApi()
api.fetchUserFeed().then{ feed in 
  // use the feed
}

That's pretty clean, right? Now let's say we not only want to fetch the user's Feed but also their UserInfo, and we want to wait until both of these requests are done (and successful). We can use the when function for that:

let api = DWApi()
when(api.fetchUserFeed(), api.fetchUserInfo()).then {feed, info in 
  // both requests succeeded, time to use the feed and info
}.report { error in
  // one or both of the requests have failed
}

Let's make this a little bit more complicated, shall we? Currently we're able to use an API to fetch stuff, and we're able to do multiple requests and wait until they're all complete. Now we're going to wrap that in a function we can call from somewhere else. The function will return a single object that uses both a Feed and UserInfo to create itself. Let's call it ProfileModel.

func fetchProfileModel() -> Promise<ProfileModel> {
  let api = DWApi()
  return when(api.fetchUserFeed(), api.fetchUserInfo()).then {(feed: Feed, info: UserInfo) -> ProfileModel in 
    return ProfileModel(feed: feed, info: info)
  }
}

And when we want to use this function we would write something like this:

let viewModel = DWViewModel()
viewModel.fetchProfileModel().then{ profileModel in 
  // use profile model
}.report { error in
  // handle errors
}

That's pretty cool, isn't it? Pretty complicated logic, wrapped in promises to make it simple and enjoyable again.

Wrapping it up

In this post we've touched on the basics of PromiseKit, a library that makes asynchronous programming cleaner and easier. I've shown you how to use promises in a very simple, basic setting, and in a situation where you wait for multiple operations/promises and return a single promise with the combined results. Promises can help you build a very clean API for your async operations, and they help you keep your code readable. I highly suggest trying promises in your own projects; they're really cool and easy to use. To find out more about PromiseKit, check out their GitHub.

If you have questions or feedback for me on this subject, make sure to hit me up on Twitter or look for me in the ios-developers Slack community.