English: shows the demographic transition and ...

Here at TUMRA we are building a multi-faceted data processing platform capable of handling unstructured data (text, video, images) as easily as structured and relational data.  Throughout its development we’ve manufactured use-cases to solve that improve the platform with new features and capabilities.

So, a couple of weeks ago we decided to focus on improving our ‘real-time machine learning’ functionality.  The initial objective was to create a distributed real-time classification algorithm.

It had to be:

  • massively scalable: easily scale horizontally across many machines,

  • distributed: load had to be shared evenly with consistent results across the cluster,

  • online learning: the model can be continuously trained with new data while also being available to make predictions,

You encounter classification algorithms everyday, most common is the humble ‘spam filter’.  Trained with two buckets of email messages ‘Spam’ and ‘Ham’ the classifier algorithm can then be used to predict which bucket a new email message belongs to.

One of the simplest classifier algorithms is Naïve Bayes; at its core all we have to do is count occurrences of labels for each class (A or B).  Computing the probability for a given label belonging to class A or B is straightforward.  Out-of-the-box naïve bayes is prone to problems if a given label that has only been witnessed in one class and not the other (A not B), to avoid this we applied Laplace smoothing when computing probabilities.

Next we needed a readily available dataset to play with, preferably a real-time source that lends itself to supervised learning.  Social media was the first thing that sprang to mind having already built our own social data collection components a week earlier.  The only problem was Twitter data was missing anything useful for training the algorithm, and then we stumbled onto the Facebook Graph API.  For any given Facebook user id, we could retrieve their name (first, middle and last name) and for almost all users we also got their locale and gender. e.g.

   "id": "100000000000000",
   "name": "Michael Cutler",
   "first_name": "Michael",
   "last_name": "Cutler",
   "gender": "male",
   "locale": "en_GB"


With this in mind we manufactured a simple use case, “Predict the demographic group of a user based on a social media interaction in real-time”, or more simply “real-time demographics”.  The idea being that we can use supervised learning (from Facebook data) to train a model that can be used elsewhere.

There are various demographic features normally used in the advertising business:

  • – gender

  • – age

  • – marital status

  • – number of children

  • – location

  • – economic background

For the first phase of this work we settled on the simplest of them all ‘gender’. From the very start we wanted to make sure our models and systems handled unicode text as well as any other, it was vital this worked regardless of the character set the message was written in.

We have already been building on a lot of the great work in the Mahout machine learning library.  However in its current form Mahout is difficult to work with in a real-time setting.  A lot of the open-source community effort has gone into making the algorithms distributed in the batch-processing Map/Reduce framework, but solving the same problems in real-time stream processing is very different.

Solving our use case required quite a departure from the Mahout-way of doing things, so while we based it on the interfaces and API’s – the implementation was actually written from scratch.

Firstly we needed a distributed eventually-consistent ‘DataModel’ implementation to back the algorithm, where all the counts/weights are stored.  Thus the imaginatively named ‘DistributedEventuallyConsistentDataModel’ was born, it is a caching façade to a remote ‘DataModel’ running on one or more servers coordinated by Zookeeper.

Next, we needed a really really fast, really really simple naïve bayes classifier implementation.  After looking at adapting one of the existing Mahout implementations we just rolled our own, it was simple enough and saved some time.

With the building blocks in place, we pieced it all together based on the high-level architecture diagram sketched on the back of a Starbucks napkin.

Essentially we wanted an easily scalable solution, decoupled from our social media platform by a message queue – as long as there were enough ‘workers’ running, the message queue would always sit near empty.

This type of workload is ideally suited to use Amazon’s spot instances, our data platform already has our own concept of ‘auto scaling’ based on queue lengths and latencies so it made a lot of sense.  Just for fun, we hosted the whole thing on the spot instance market (even the distributed data model) making sure we took regular snapshots to back it all up.

After a few days of plumbing we got the first end-to-end real-time topology running, with our own auto-scale-like functionality enabled we were running anything between 10 and 150 micro instances to keep up with the incoming messages.  After being processed messages are written directly to HBase by each worker and dropped onto an output message queue from which the Dashboard displays samples.

For the demo at the Data Science London meetup – I wanted to make it interactive, allowing the audience to tweet-in and be subjected to our demographics algorithm. Switching from processing Facebook interactions to messages from Twitter was straightforward, both systems shared the same distributed data-model and without bothering to store the messages the end-to-end latency was just a handful of milliseconds.

With demo at the ready and armed with my laptop I set off to the Innovation Warehouse to give my presentation, about half-way there on the train I noticed that all my spot instances had been kicked off.  After a panic’d rummage around the source code I realised that I had unwittingly hard-coded a very dumb spot instance pricing strategy (current price + 10%), a tiny bump in the spot price and the fact I labeled all the instances with the same ‘launch group’ meant they all went down together.

Arriving at the venue a little early, I had just enough time to fix a better spot pricing strategy, re-launch everything, restore my model and unleash the workers on a backed up message queue.  I’m pleased to say the presentation went great, there was lots of audience participation with people tweeting-in to the #ds_ldn hashtag and the model only predicted one person’s gender incorrectly!

The presentation itself was driven entirely from the Dashboard system we created for our data processing platform, you can view the slides I presented and see the demo running here: