On Saturday 29th April, teams of data scientists around the world sat down to begin a 24-hour ‘hackathon’. The challenge was to predict air quality given sample data from Cook County, Illinois. Over the duration of the contest, 212 people participated from their homes or from venues set up in one of ten cities. The contest was held on the competitive data science platform Kaggle and coordinated by Data Science London.
The provided training data included a plethora of variables, measured from a number of locations:
– weather (temperatures, pressures, wind speed & direction),
– solar radiation,
– month of the year, day of week, hour of day.
The data was partitioned into 210 ‘chunks’ of contiguous measurements, each covering a period of 192 hours (8 days). The task was to make predictions at specific points in time over the 72 hours following each chunk. Data quality was a significant issue: the training data contained many ‘NA’ values.
A number of different approaches were used. The competition organisers provided two ‘benchmark’ entries, which simply took the mean value for that hour across all the data and the mean value for that hour within the chunk.
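The two benchmark ideas can be sketched in a few lines of Python. This is only an illustration, not the organisers' code: the row layout (`chunk`, `hour`, `value`), the use of `None` for ‘NA’, and the fallback from the per-chunk mean to the overall mean are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def hourly_means(rows):
    """Average the target by hour of day, skipping missing (None) values."""
    by_hour = defaultdict(list)
    for row in rows:
        if row["value"] is not None:
            by_hour[row["hour"]].append(row["value"])
    return {hour: mean(vals) for hour, vals in by_hour.items()}


def chunk_benchmark(rows):
    """Per-chunk hourly means, with the overall hourly mean as a fallback."""
    overall = hourly_means(rows)
    by_chunk = defaultdict(list)
    for row in rows:
        by_chunk[row["chunk"]].append(row)
    per_chunk = {c: hourly_means(r) for c, r in by_chunk.items()}

    def predict(chunk, hour):
        # Prefer the chunk's own mean for this hour (assumed fallback:
        # the overall mean when the chunk has no usable value).
        return per_chunk.get(chunk, {}).get(hour, overall.get(hour))

    return predict


# Hypothetical toy rows; the real data had many more columns.
rows = [
    {"chunk": 1, "hour": 0, "value": 2.0},
    {"chunk": 1, "hour": 0, "value": 4.0},
    {"chunk": 1, "hour": 1, "value": None},  # an 'NA' reading
    {"chunk": 2, "hour": 0, "value": 6.0},
    {"chunk": 2, "hour": 1, "value": 8.0},
]
predict = chunk_benchmark(rows)
```

Here `predict(1, 0)` uses only chunk 1's readings for hour 0, while `predict(1, 1)` falls back to the overall mean for hour 1 because chunk 1 has no valid reading for that hour.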
Building on this, a number of competitors made good progress simply by refining the mean/median-based technique. Autoregressive integrated moving average (ARIMA) models made further gains. The most notable thing about these approaches is that they used only the history of the target variables and completely ignored the other available data.
Last but not least were the more exotic models: random forests, regression models, neural networks and more. The winning entry, from Ben Hamner, used random forests and was implemented in MATLAB. The surprising thing about Ben’s approach is that he didn’t even look at the data; instead he just generated 390 random forest models (39 target variables × 10 positions in chunk). You can read more detail in his blog post ‘Chucking everything into a Random Forest’.
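To make the ‘one model per target and position’ bookkeeping concrete, here is a rough Python sketch. It is not Ben’s MATLAB code: the data layout, the names `fit_model_grid` and `MeanModel`, and the toy sizes are all hypothetical. `MeanModel` is only a stand-in so the example runs without extra libraries; any regressor with scikit-learn-style `fit`/`predict` methods, such as `sklearn.ensemble.RandomForestRegressor`, could be passed as `make_model` instead.

```python
from itertools import product
from statistics import mean


class MeanModel:
    """Stand-in regressor: predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fit_model_grid(features, targets, make_model=MeanModel):
    """Fit one independent model per (target variable, position) pair.

    features: list of feature vectors, one per chunk (assumed layout)
    targets:  dict mapping (target, position) -> list of values per chunk
    """
    models = {}
    for key, y in targets.items():
        # Drop chunks where this particular target value is missing.
        pairs = [(x, v) for x, v in zip(features, y) if v is not None]
        if not pairs:
            continue
        X_fit, y_fit = zip(*pairs)
        models[key] = make_model().fit(list(X_fit), list(y_fit))
    return models


# Toy data: 2 targets x 3 positions here, versus 39 x 10 in the contest.
target_names = ["target_a", "target_b"]
positions = [1, 2, 3]
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
targets = {
    (t, p): [float(p), None, float(p) + 2.0]
    for t, p in product(target_names, positions)
}
models = fit_model_grid(features, targets)
```

Each key in `models` is a `(target, position)` pair, so with 39 targets and 10 positions the same loop would produce the 390 models mentioned above.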
This contest was apparently the first globally coordinated ‘data science hackathon’, and it was a great example of how the Kaggle platform can host competitions. The 24-hour deadline and the limit on the number of entries that could be submitted added an interesting dynamic that focused the minds of the competitors.
Some teams spent a significant amount of time working with the provided weather data after seeing correlations between the weather and the target variables. Of course, weather information was not available for the 72-hour period being predicted, so if your model relied on weather data you had to predict those variables first and then go on to predict the target variables. While it is widely accepted that weather plays a significant role in air quality, it seems that for this 24-hour competition it was a distraction.