One of the most powerful visualization tools for regional Panel data there is.
Much of the data that we interact with in our daily lives has a geographical component. Google maps records our favorite places, we calculate how many customers frequent each location for a given brand shop, and we compare eachother by regional difference. Next to bar-charts and line-charts, are there better ways that we can visualize geospatial data? A geographical heatmap can be a powerful tool for displaying temporal changes over geographies.
The question is how to build such a visualization. This tutorial explains how to build a usable GIF from panel data- time series for multiple entities, in this case German states. The visualization problem to which we apply this heatmap GIF is displaying data about the economic development (change in GDP per Capita) of Germany between 1991 and 2019. The data for this example can be found here on our GitHub.
Data Import and Basic Cleaning
To begin with, we import the relevant packages, define the relevant paths and import the needed data. Note that next to the GDP (referred to as BIP in german) per state, we also import the annual inflation rate for Germany using Quandl. The Quandl API provides freely usable financial and economic datasets. In order to use the API we have to create an Account and generate an API-Key, which then has to be specified in the data-generating command.
The reason we need the inflation rate of Germany over the years is that the GDP per Capita is currently in nominal terms, which means that it is not adjusted for inflation and therefore non-comparable between years.
After importing our data, it is now time for some basic data cleaning. This involves reshaping the data, creating the inflation adjusted GDP per Capita figures and creating an unique identifier for each state.
Why exactly reshaping the data is required can be better explained by looking at the raw dataframe, which contains the currently nominal GDP per Capita figures.
The data representation we face is referred to as a wide-format. A wide-format means that one variable (in that case the year information) is used as columns. Therefore, the entire dataset is much wider than if the variable year would be represented by only one column. This would then lead to a much longer dataset, which is why this format is referred to as the long-format. The change from a wide-format to a long-format is necessary because of our plotting tool ggplot, which could be seen as R’s equivalent to Python’s Seaborn.
After knowing why and how we have to reshape the GDP per Capita dataset, it is now time to elaborate on how to process the inflation data. For that we briefly cover what Inflation is and why it is needed.
Inflation describes the measure of the rate by which the average price level increases over time. A positive inflation rate would imply that the average cost of a certain basked of goods increased over time and that we can buy fewer things with the same amount of money. That implies that 100 Euros in 2019 can buy us less than the same amount in 1991. To make these number comparable nevertheless, we have to adjust them for inflation.
The way we measure inflation is through something which is referred to as the Consumer-Price-Index (CPI). This index represents the price of a certain basket of goods at different points in time. From the data-frame on the left side we can see that this basket cost 67.2 in 1991 and 105.8 in 2020. Given that our GDP per Capita data only ranges until 2019, and that a CPI index is normalized to a certain year, we divide all values of this data-frame by the CPI value in 2019.
The result of dividing all CPI values by the index-level of 2019 can be seen in the data-frame on the left. Now these values are easier to interpret. An average product in 1991 cost 63.5% of what it cost in 2019.
We can use these values now to adjust the GDP per Capita values over time in order to make them comparable over the years. This is done by simply dividing all GDP values by the respective inflation value of a given year.
Additionally we extract the year information from the Date column in order to match with the year information from the GDP per Capita sheet.
All of the described steps of reshaping and handling inflation are done through the code snippet below.
Importing Geospatial data for Germany
After cleaning our data and bringing it into a ggplot-friendly long-format, it is now time to import geospatial data of Germany. Geospatial or spatial data contains information needed to build a map of a location, in this case Germany.
Importing that information is done through the handy getData function. This function takes, amongst other things, the country name and the level of granularity as an input. In our case we specify “Germany” as the country and because we are interested not only in the country as a whole, but rather in the different states we specify a granularity level which also gives State information. The API gives us a so called Large SpatialPolygonsDataFrame. When opened this object looks like this:
We can see that we find 16 polygons. This makes sense given that we have 16 states in Germany. Each state is therefore represented by one Polygon.
It is important to note that the order of these 16 Polygons does not necessarily align with the alphabetical order of the German states. Therefore, we have to make sure that the information of GDP per Capita is matched up with the right Polygon. Otherwise it could be that we plot the information for e.g. Berlin in Hamburg or vice versa.
Lastly, we have to bring the information into a data frame into the long-format that ggplot prefers. This is done through the broom package, which includes the tidy function. This function, like the entire package, is for tidying up messy datatypes and bring them into a workable format.
The code for the aforementioned steps looks the following:
The resulting data frame below now shows us all the desired information. The longitude and latitude information (long and lat) is needed for ggplot to know the boundaries of a state and the id column tells us which state we are looking at.
The only remaining steps are to merge the GDP per Capita onto the mapping information and plotting it all.
Our final plotting code has to be fine-tuned to fit the purpose of the visualization. In our case, we would like to have one heatmap for every year between 1991 and 2019. To achieve that one could run the plotting code in a loop, iterating over all years, which is also what we will do later on. For better readability though, we start by showing how to plot a single year.
We start by subsetting the GDP per Capita data frame so that it only contains information for a certain year (e.g. 1991). Afterwards, we merge these 16 rows for this single year onto the mapping data frame we showed earlier. The merging variable is the id column we discussed earlier. Afterwards we can already call ggplot and fine-tune our plot.
Germany has two states (Bremen and Berlin) which are fully embedded in other, bigger states. It could therefore happen that the bigger state around them is simply plotted over the smaller state. Therefore it is important to tell ggplot that the bigger states should be plotted before these smaller states, which is also shown in the upcoming code snippet (line 7–11).
The code above will now generate and save the following image:
Putting all together into a GIF
If we would now even like to include a time component we could also create multiple heatmaps, one for every year, and play them one after another through a GIF. For that we make use of the beautiful ImageMick package which simply takes all available images in a specified folder and converts them into a GIF.
The following snippet of code shows the procedure to first create heatmaps for all years before turning them into a GIF like it can be seen below.
By tweaking these code snippets, you can create your own geographical heatmaps for other visualization challenges.