The Cyzag team responded to a call by the kaggle data science community to help the world health organization improve its understanding of the Covid-19 pandemic. The team’s effort stood out amongst hundreds of datasets and won one of the three prizes on offer.
The team began the challenge with an Exploratory Data Analysis (EDA) of the existing data and literature looking particularly to answer questions related to case forecasting and transmission rates. Both questions were imperative to help governments to improve their understanding and their response to the pandemic.
The initial observation of the virus was that it followed an exponential trajectory and would continue on this trajectory unabated unless action was taken to stem the spread of the virus. At the time of submission, China had already gone through the exponential phase and, as a result of going into lockdown, had avoided unhindered contagion. This provided a useful starting point showing that it was possible to shift from exponential growth to one that follows a standard logistic curve with diminishing growth, leading to complete (or close to) zero transmission.
Through research three things would “flatten the curve”:
With the absence of the first two it became clear that the single biggest influence on transmission rates was community lockdown. In terms of forecasting, knowing when a country went into lockdown was the critical piece of information required to help predict final number of cases and necessary duration of a lockdown.
The team set about the arduous task of compiling a list of lockdown dates per country and making it available for public use and scrutiny. No official definition of “lockdown” exists and so the first task was to set a definition for what constitutes a country being in lockdown. Each country followed a different variation of lockdown, however the one consistent factor was a closure of schools and universities. Thus it was decided to choose the date of school closures as being the first day of ‘lockdown’. To make the dataset easy to use and understand it was necessary to compile detailed documentation and a usage notebook with sample code.
While seemingly simple, a dataset was not available listing the individual lockdown dates per country and region. This invalidates the common myth, that for data science to be successful it needs to be complicated. Instead, falling in line with Cyzag’s philosophy, the secret to successful data science projects is:
At the time of writing, the Cyzag dataset has been viewed over 9000 times and downloaded 1400 times for use in various machine learning competitions.