Big Data has fully entered the hype phase of the innovation cycle. There are many classic signs of an overinflated bubble about to burst: you can seemingly attend a Big Data conference every week, and vendors are spinning their existing products as Big Data, to name just two.

All this hype and overexposure agitates my contrarian nature, but the reality is that once the bubble bursts, Big Data innovation will climb out of the Trough of Disillusionment and very likely become a big part of running any large corporation in the future.


Background on Big Data

Big Data innovation can arguably be traced back to web-scale companies such as Yahoo! and Google. For them, adopting commercial relational database technology to store and process their search indexes was impractical due to scalability and cost constraints. Instead, they made a complete departure from traditional data warehousing, using large numbers of cheap commodity machines with resilience built into the software layer that manages data storage and processing.

Google described the detail of its approach in papers on the Google File System (2003) and MapReduce (2004). Nutch, the open source web-search software project, built a distributed filesystem and data processing framework based on Google’s approach. This later went on to become Apache Hadoop and gained wide adoption.

When you hear someone talking about Big Data, it won’t be long before you hear them talk about Hadoop. It has almost become synonymous with Big Data, but the reality is that it’s just one of many tools in the Big Data landscape, admittedly one which is fantastic at batch processing truly massive structured and semi-structured data sets. Hadoop’s weakness is handling short, low-latency requests, but tools such as Storm and HBase are starting to address this.

A recent fashion is for web-scale companies to open-source their projects: for example, we have Storm from Twitter, Cassandra from Facebook and S4 from Yahoo!, and even the NSA is getting in on the trend by open-sourcing Accumulo.


Difficulties in adopting Big Data Tools and Techniques

It’s fantastic to see the level of innovation in open-source projects, but these tools and the surrounding ecosystem are still extremely immature when compared to existing enterprise data management tools. This creates challenges for companies adopting them :-

  1. Finding the talent. Demand is outstripping supply, making it very challenging to hire experienced people. McKinsey is forecasting a US shortage of 140,000-190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
  2. Separating the hype from reality. It is hard to tell what will drive real business value versus a product where ‘Big Data’ has simply been sprinkled into the sales copy.


How to make money with Big Data

“More data usually beats better algorithms” is a truism when building predictive models. This means not only increasing the volume of data included in, say, your customer churn model, but also adding new types of data which improve the accuracy of the forecast. Let’s use clickstream data as an example: with most businesses having an online presence, it’s critical to understand how your customers are interacting with you online. This has progressed from counting how many people are on your site through to storing and analysing individual users’ clicks and page views.

However, cutting-edge companies such as Amazon and Netflix have taken this to the next level by tracking ‘What’ they offer to their users and ‘How’ users interact with it, and then inferring ‘Why’ things work or don’t work. They are sending back a stream of implicit user activity without users even clicking a mouse button! If you are just focusing on tracking clicks and transactions, you are missing a potential gold-mine of data.
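
To make the ‘What’ and ‘How’ idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the event types (impression, hover, play, scroll_past), the field names and the sample records are hypothetical and are not taken from how Amazon or Netflix actually do this. It simply compares what each user was offered with what they then engaged with.

```python
from collections import defaultdict

# Hypothetical event records; field names and event types are illustrative only.
events = [
    {"user": "u1", "type": "impression", "item": "film_a"},
    {"user": "u1", "type": "hover", "item": "film_a"},
    {"user": "u1", "type": "play", "item": "film_a"},
    {"user": "u2", "type": "impression", "item": "film_a"},
    {"user": "u2", "type": "impression", "item": "film_b"},
    {"user": "u2", "type": "scroll_past", "item": "film_b"},
    {"user": "u3", "type": "impression", "item": "film_b"},
]


def engagement_rates(events, engaged_types=("hover", "play")):
    """Return, per item, the share of users shown the item who engaged with it."""
    shown = defaultdict(set)    # item -> users who were offered it (the 'What')
    engaged = defaultdict(set)  # item -> users who interacted with it (the 'How')
    for event in events:
        if event["type"] == "impression":
            shown[event["item"]].add(event["user"])
        elif event["type"] in engaged_types:
            engaged[event["item"]].add(event["user"])
    # Only count engagement from users who were actually shown the item.
    return {
        item: len(engaged[item] & users) / len(users)
        for item, users in shown.items()
    }


rates = engagement_rates(events)
for item, rate in sorted(rates.items()):
    print(item, rate)
```

An item that is shown often but rarely engaged with is a starting point for asking ‘Why’. At real volumes the same aggregation would run as a batch job over the full event history, not over an in-memory list.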

The challenge comes in figuring out how to store, process and derive insight from the hundreds of terabytes of implicit user data you’ve just generated. Big Data tools and techniques make it possible to cost-effectively generate significant business benefits from this treasure trove of new data. You can :-

  1. identify which customers are exhibiting behaviour predictive of customer churn and then reach out with a retention message
  2. surface breakdowns in the customer experience and use them to inform data-driven product improvement projects
  3. identify products the customer is likely to purchase and improve revenue by providing personalised recommendations.
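
As a rough illustration of the first item, the sketch below flags users whose activity has dropped sharply from one week to the next. It is a deliberately naive rule and not a predictive model: the 50% threshold, the function name and the sample data are all assumptions made up for this example, and a real churn model would be trained on historical outcomes.

```python
from collections import Counter


def churn_risk(last_week, this_week, drop_threshold=0.5):
    """Flag users whose event count fell below a fraction of last week's.

    last_week and this_week are lists of user ids, one entry per event.
    drop_threshold is a hypothetical cut-off, not an established value.
    """
    before_counts = Counter(last_week)
    now_counts = Counter(this_week)
    at_risk = []
    for user, before in before_counts.items():
        now = now_counts.get(user, 0)  # users with no events this week count as 0
        if now < before * drop_threshold:
            at_risk.append(user)
    return sorted(at_risk)


# Made-up sample data: one user id per recorded event.
last_week = ["u1"] * 10 + ["u2"] * 4 + ["u3"] * 6
this_week = ["u1"] * 9 + ["u2"] * 1

at_risk = churn_risk(last_week, this_week)
print(at_risk)
```

The flagged users would be the candidates for a retention message. Adding richer event types to the counts is one simple way of putting more data to work before reaching for a cleverer algorithm.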


Clickstream no longer seems an appropriate name for this new type of data. Perhaps Eventstream better describes the finer-grained micro events that cutting-edge companies are generating, storing and using to gain new insights into customer behaviour.


Getting started with Big Data

Get started now! Act quickly and at low cost by experimenting and learning what works for you. The goal is to find and invest in successful innovations before your competition does. Favour a low-cost, agile, trial-and-error approach over the grand master plan.

A key factor in the success of any Big Data initiative is business sponsorship.

Part II of this post will elaborate on how to get started delivering business value with Big Data and Data Science in your organisation.