[Figure: term popularity based on Google results]

The recent surge of hype around the buzzword ‘Big Data’ has largely overshadowed the rise of ‘Data Science’ – which in my humble opinion is a far more important field of innovation and research. So what is data science and where did it come from?

 

Brief history of data science

The term ‘data science’ has been in circulation, in one form or another, for years. It originated at the intersection of computer science and statistics, and its meaning has evolved gradually over time. In its latest incarnation, data science is closely linked to large web-scale companies (Google, Facebook, LinkedIn) and the ‘big data’ movement.

The modern definition of data science arguably originates from a January 2009 McKinsey interview with Hal Varian (Chief Economist @ Google), which discussed the commoditisation of data.

Over the previous two decades Moore’s Law had dramatically reduced the cost of gathering, storing and processing data. Hal foresaw increasing demand for talented people who could derive insight and value from that data. One quote from the original article sums it up nicely:

“The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids.”
Hal Varian, Chief Economist @ Google

 

What is a Data Scientist?

There are a number of great quotes and definitions describing what a data scientist is. Here are just a couple:

“A Data Scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning.”
Hilary Mason, Chief Data Scientist @ Bit.ly

“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
Josh Wills, Data Scientist @ Cloudera

There are myriad definitions, but they all tend to converge on the same key requirements:

1. Good background in statistics and mathematics
2. Knowledge of a programming language (Python, Java, etc.)
3. Inquisitive nature to formulate questions and hypotheses
4. Ability to effectively communicate results to others
5. Domain knowledge in the relevant field

My own view is subtly different. Reporting results and findings in some swanky dashboard or infographic is all well and good, but one might even go as far as to say that this is more ‘data journalism’ or ‘business intelligence 2.0’ than data science. I firmly believe it is important to be able to implement models and test their predictions in the real world, creating a feedback loop that is more in line with the spirit of the ‘scientific method’.

 

My definition of a Data Scientist’s job

Data is increasingly being touted as the “new oil”. Cringe-worthy as this metaphor is, it is actually excellent for describing the difference between ‘big data’ and ‘data science’.

In a lot of ways, ‘big data’ could be thought of as civil engineering: the ability to create great structures and machines to do something, such as an offshore oil rig. As fantastic an accomplishment as this is, it still requires chemical engineers (data scientists) to extract value from the soup of oil, mud and water that the oil rig brings up.

The interesting part about this example is that the civil and chemical engineers have to work closely together to make sure that the science can be realised with state-of-the-art technology. There is no point in the chemical engineers coming up with a process whose cost negates the value that can be obtained from the oil well. This translates readily to the world of ‘big data’ and ‘data science’: going forward, the two will become increasingly dependent on one another.

As far-fetched as this example sounds, it has already happened and is widely known. The winning entry to Netflix’s $1 million competition was never actually put into production. Its complexity outweighed its benefits, and Netflix opted to use a much simpler algorithm with a marginally lower accuracy score.