# DengAI: Predicting Disease Spread — Imputation and Stationary Problems

One of the biggest data challenge on DrivenData, with more than 9000 participants is the DengAI challenge. The objective of this challenge is predict the number of dengue fever cases in two different cities.

This blogpost series covers our journey of tackling this problem, starting from initial data analysis, imputation and stationarity problems up un to the different forecasting attempts. This first post covers the imputation and stationarity checks for both cities in the challenge, before moving on to trying different forecasting methdologies.

Throughout this post, code-snippets are shown in order to give an understanding of how the concepts discussed are implemented into code. The entire Github repository for the imputation and stationary adjustment can be found here.

Furthermore, in order to ensure readability we decided to show graphs only for the city San Jose instead showing it for both cities.

# Imputation

Imputation describes the process of filling missing values within a dataset. Given the wide range of possibilities for imputation and the severe amount of missing data within this project, it is worthwhile to go over some of the methods and empirically check which one to use.

Overall, we divide all imputation methods into the two categories: *basic *and *advanced. *With basic methods we mean off the shelf, quick imputation methods, which are oftentimes already build into Pandas. Advanced imputation methods deal with model-based approaches where the missing values are attempted to be predicted, using the remaining columns.

Given that the model-based imputation methods normally result in superior performance, the question might arise why we do not simply use the advanced method for all columns. The reason for that is that our dataset has several observations where all features are missing. The presence of these observations make multivariate imputation methods impossible.

We therefore divide the features into two categories. All features, or columns, which have fewer than 1 percent missing observations are imputed using more basic methods, whereas model-based approaches are used for features which exhibit more missing observations than this threshold.

The code snippet below counts the percentage of missing observations, divides all features into one of two aforementioned categories and creates a graph to visualize the results.

The resulting graph below shows four features which have more than 1 percent of observations missing. Especially the feature *ndvi_ne, *which describes satellite vegetation in the north-west of the city has a severe amount of missing data, with more around 20% of all observation missing.

All imputation methods applied are compared using the normalized root mean squared error (NRMSE). We use this quality estimation method because of its capability for making variables with different scales comparable. Given that the NRMSE is not directly implemented in Python, we use the following snippet to implement it.

## Basic imputation methods

Python, and in particular the library Pandas, has multiple off-the-shelf imputation methods available. Arguably the most basic ones are forward fill (*ffill*) and backward fill (*bfill*), where we simply set the missing valueequal to the prior value (*ffill*) or to the proceeding value (*bfill*).

Other methods include the linear or cubic (the Scipy package also includes higher power if wanted) interpolation around a missing observation.

Lastly, we can use the average of the k nearest neighbours of a missing observations. For this problem we took the preceding and proceeding four observations of a missing observation and imputed it with the average of these eight values. This is not a build in method and therefore defined by us in the following way:

Now it is time to apply and compare of these methods. **We do that by randomly dropping 50 observations of all columns**, which are afterwards imputed by all before mentioned methods. Afterwards we assess each method’s performance through their NRMSE score. All of that, and the graphing of the results is done through the following code snippet.

The resulting graph below clearly shows which method is to be favored, namely the k nearest neighbours approach. The linear method also performs well, even though not as well as the knn method. The more naive methods like ffill and bfill do not perform as strongly.

Afterwards, we impute all features which had fewer observations missing than our threshold of one percent. That means all features except the first four. The code below selects the best method for each column and afterwards imputes all actual missing values.

## The potential flaws of the knn approach

Unfortunately, the superior performance of the knn model comes with a price. For some features, we do not have a only one observation missing at a time, but multiple consecutively missing observations.

If for example we have 12 consecutive missing observations, the knn method cannot calculate any average out of the preceding and proceeding four observations, given that they are missing as well.

The image below, which was created with the beatiful missingno package, shows us that all four columns which were classified as being above our one percent threshold have at one point 15 consecutive missing observations. This makes it impossible to use the knn method for these columns and is the reason why we cannot use this imputation method for the heavily sparse columns.

## Model-based imputation methods

The model-based imputation methods use, as already described earlier, the column with the missing observations as the target and uses all other possible columns as the features. After imputing all columns with fewer than one percent missing observations, we can now use all of them as features.

The model we are using is a RandomForestRegressor because of its good handling of noisy data. The code snippet below which hyperparameters were gridsearched.

`imputation_model = {`

"model": RandomForestRegressor(random_state=28),

"param": {

"n_estimators": [100],

"max_depth": [int(x) for x in

np.linspace(10, 110, num=10)],

"min_samples_split": [2, 5, 10, 15, 100],

"min_samples_leaf": [1, 2, 5, 10]

}

}

We now run all four columns through the model-based approach and compare their performance to all aforementioned basic imputation methods. The following code snippet takes care of exactly that.

Below we can see that our work was worthwhile. For three out of four columns we find a superior performance of the model-based approach compared to the basic imputation methods. We are now left with a fully imputed dataset with which we can proceed.

# Stationarity Problems — Seasonality and Trend

In contrast to cross-sectional data, time series data comes with a whole bunch of different problems. Undoubtedly one of the biggest issues is the problem of stationarity. **Stationarity describes a measure of regularity**. It is this regularity which we depend on to exploit when building meaningful and powerful forecasting models. The absence of regularity makes it difficult at best to construct a model.

There are two types of stationarity, namely ** strict and covariance stationarity**. In order for a time series to be fulfil s

*trict*stationarity, the series needs to be time independent. That would imply that the relationship between two observations of a series is only driven by the timely gap between them, but not on the time itself. This assumption is difficult, if not impossible for most time series to meet and therefore more focus is drawn on

*covariance*stationarity.

For a time series to be *covariance* stationary, it is required that the unconditional first two moments, so the mean and variance, are finite and do not change with time. It is important to note that the time series is very much allowed to have a varying conditional mean. Additionally, it is required that the auto-covariance of a time series is only depending on the lag number, but not on the time itself. All these requirements are also stated below.

There are many potential reasons for a time series to be non-stationary, including seasonalities, unit roots, deterministic trends and structural breaks. In the following section we will check and adjust our exogenous variable for each of these criteria to ensure stationarity and superior forecasting behavior.

## Seasonality

Seasonality is technically a form of non-stationarity because the mean of the time series is dependent on time factor. An example would be the spiking sales of a gift-shop around Christmas. Here the mean of the time series is explicitly dependent on time.

In order to adjust for seasonality within our exogenous variables, we first have to find out which variables actually exhibits that kind of behavior. This is done by applying a Fourier Transform.** A Fourier transform disentangles a signal into its different frequencies and assesses the power of each individual frequency**. The resulting plot, which shows power as a function of frequency is called a power spectrum. The frequency with the strongest power could then be potentially the driving seasonality in our time series. More information about Fourier transform and signal processing in general can be read up on an earlier blogpost of ours here.

The following code allows us to take a look into the power-plots of our 20 exogenous variable.

The plot below shows the resulting 20 exogenous variables. Whether or not a predominant and significant threshold is met for a variable is indicated by a red dot on top of a spike. If a red dot is visible, that means that the time series has a significantly driving frequency and therefore a strong seasonality component.

One possibility to cross-check the results of the Fourier Transforms is to plot the Autocorrelation function. If we would try have a seasonality of order X, we would expect a significant correlation with lag X. The following snippet of code plots the autocorrelation function for all features and highlights those features which are found to have a seasonal affect according to the Fourier Transform.

From the ACF plots below, we can extract a lot of useful information. First of all, we can clearly see that for all columns where the Fourier transforms find a significant seasonality, we also find confirming picture. This is because we see a peaking and significant autocorrelation at the lag which was found by the power-plot.

Additionally, we find some variables (e.g.* ndvi_nw*) which exhibit a constant significant positive autocorrelation. This is a sign of non-stationarity, which will be addressed in the next section which will be dealing of stochastic and deterministic trends.

In order to get rid of the seasonal component, we decompose each seasonality-affected feature into its unaffected version its seasonality component and trend component. This is done by the STL decomposition which was developed by Cleveland, McRae & Terpenning (1990). STL is an acronym for “Seasonal and Trend decomposition using Loess”, while Loess is a method for estimating non-linear relationships.

The following code snippet decomposes the relevant time series, and subtracts (given that we face additive seasonalities) the seasonality and the trend from the time series.

## Deterministic Trends

One more obvious way to breach the assumptions of *covariance *stationarity is if the series has a deterministic trend. It is important to stress the difference between a *deterministic *and not a *stochastic *trend (unit root). Whereas it is possible to model and remove a deterministic trend, this is not possible with a stochastic trend, given its unpredictable and random behavior.

A deterministic trend is the simplest form of a non-stationary process and time series which exhibit such a trend can be decomposed into three components:

The most common type of trend is a linear trend. It is relatively straight forward to test for such a trend and remove it, if one is found. We apply the original Mann-Kendall test, which does not consider seasonal effects, which we already omitted in the part above. If a trend is found, it is simply subtracted from the time series. These steps are completed in the method shown below.

The result can be viewed here. As we can see, most time series exhibited a linear trend, which was then removed.

Even though we removed a deterministic trend, this did not ensure that our time series are actually stationary now. That is because what works for a deterministic trend does not work for a stochastic trend, meaning that the trend-removing we just did does not ensure stationary of unit-roots.

We therefore have to explicitly test for a unit-root in every time series.

## Stochastic Trends — Unit roots

A unit root process is the generalization of the classic random walk, which is defined as the succession of random steps. Given this definition, the problem of estimating such a time series are obvious. Furthermore, a unit root process violates the *covariance* stationarity assumptions of not being dependent on time.

To see why that is the case, we assume an autoregressive model where today’s value only depends on yesterday’s value and an error term.

If we parameter **a_1 would now be equal to one**, the process would simplify to

By repeated substitution we could also write this expression as:

When now calculating the variance of y_t, we face a variance which is positively and linearly dependent on time, which violates the second *covariance* stationarity rule.

This would have not been the case if a_1 would be smaller than one. That is also basically what is tested in an unit-root test. Arguably the most well-known test for an unit root is the **Augmented Dickey Fuller (ADF)** test. This test has the null hypothesis of having a unit root present in an autoregressive model. The alternative is normally that the series is stationary or trend-stationary. Given that we already removed a (linear) trend, we assume that the alternative is a stationary series.

In order to be technically correct, it is to be said that the ADF test is not directly testing that a_1 is equal to zero, but rather looks at the characteristic equation. The equation below illustrates what is meant by that:

We can see that the difference to the equation before is that we do not look at the level of y_t, but rather at the difference of y_t. Capital Delta represent here the difference operator. The ADF is now testing whether the small delta operator is equal to zero. If that would not be the case, then the difference between yesterday’s and tomorrow’s value would depend on yesterday’s value. That would mean if the today’s value is high, the difference between today’s and tomorrow’s value will also be large which is a self-enforcing and explosive process which clearly depends on time and therefore breaks the assumptions of *covariance *stationarity.

In case of a significant unit-root (meaning a pvalue *above* 5%), we difference the time series as often as necessary until we find a stationary series. All of that is done through the following two methods.

The following table shows that we do not find any significant ADF test, meaning that no differencing was needed and that no series exhibited a significant unit root.

# Finishing up

Last but not least we take a look at our processed time series. It is nicely visible that none of the time series are trending anymore and they do not exhibit significant seasonality anymore.

Additionally we take a look at how the distributions of all of the series look. It is important to note that there are **no distributional assumptions** of the feature variables when it comes to forecasting. That means that even if we find highly skewed variables, it is not necessary to apply any transformation.

After sufficiently transforming all exogenous variables, it is now time to shift our attention on the forecasting procedure of both cities.