Air Pollution Forecasting with TFT

Marija Todosovska
Netcetera Tech Blog
11 min read · Mar 24, 2022


During their internship at Netcetera, Vasil Dedejski and Petar Tushev, under the mentorship of Marija Todosovska, our senior Data Scientist, developed a solution to the air pollution forecasting problem. In this blog, they present the entire process and give a short explanation of the tools and models used during the development.

Introduction

Time series data, which depends on a time element and consists of data points that occur in successive order, lends itself naturally to time series forecasting. Time series forecasting is generally used when past data points can be used to predict future ones.

While there are a wide variety of approaches to solving this problem, some of them clearly stand out. One such well-established method is the Temporal Fusion Transformer (TFT), developed by Google in 2019. TFT is an attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics.

In the remainder of this article we will present our entire process:

  1. Data collection
  2. Data preprocessing
  3. Data analysis
  4. Feature engineering
  5. Baseline models
  6. TFT for air pollution
  7. Integration with pulse.eco
Image 1: Project flow.

Data Collection

pulse.eco, powered by n-things by netcetera, is an online platform which supports a network of IoT sensors throughout various cities. It tracks multiple metrics for the quality of the air and environment and presents them in real time. Most of the data pulse.eco uses is provided by the community: anyone can attach a sensor to the network and have information about their environment in real time.

From the pulse.eco about section:

Pulse.eco is a crowdsourcing platform, which gathers and presents environmental data. Our network of sensor installations and other third-party sources gathers the data and translates them into visual and easy to understand information. You can learn about the pollution, humidity, temperature or noise in your surroundings with just a few clicks. Even better, you can participate in expanding our network and setup your own devices, to enrich the data sourcing.

Image 2: pulse.eco overview of the air pollution (PM 10 particles) over Skopje.

Thanks to the pulse.eco API service, we were able to gather the air pollution data for Skopje for the last four years, from 2018 to 2021. This resulted in a dataset of 1,863,485 entries from 54 unique sensors. Each entry contains the following fields: sensor_id, timestamp, and pollution value, with additional information on the coordinates of each sensor.

One of the main things we were interested in experimenting with was the effect weather data would have on the quality of the air pollution predictions. To that end, we collected historical weather data for the time period covered by the pulse.eco data. We merged it into our previous dataset and obtained a dataset with the following features (a short merging sketch follows the list):

  • timestamp: a datetime value for the time when the measuring was taken
  • sensor_id: unique identification of each pulse.eco sensor
  • value: the PM10 value measured by the sensor at the given timestamp
  • temp: temperature (with frequency of one hour) corresponding to the timestamp
  • feelslike: a function of temp, relying on multiple factors
  • dew: deposit of water droplets formed at night by the condensation of water vapour from the air onto the surfaces of exposed objects
  • precip: any product of the condensation of atmospheric water vapour that falls under gravitational pull from clouds.
  • windspeed: wind speed (in km/h)
  • winddir: wind direction (in degrees 0–360)
  • cloudcover: percentage of cloud cover
  • visibility: the greatest distance through the atmosphere toward the horizon at which prominent objects can be identified with the naked eye (in kilometres)
  • solarradiation: at the moment of observation (power in W/m²)
  • conditions: notable weather conditions reported at the location
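
As an illustration, merging the hourly weather data with the sensor readings comes down to a simple pandas join on the timestamp. The file names and column names below are illustrative; the project pulled the data from the pulse.eco API and a weather provider rather than from CSV files.

```python
import pandas as pd

# hypothetical file names, used here only for illustration
sensors = pd.read_csv("pulse_eco_pm10.csv", parse_dates=["timestamp"])
weather = pd.read_csv("skopje_weather_hourly.csv", parse_dates=["timestamp"])

# the weather data is hourly, so align sensor timestamps to the hour before joining
sensors["timestamp"] = sensors["timestamp"].dt.floor("H")

dataset = sensors.merge(weather, on="timestamp", how="left")
```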

Data Preprocessing

Once we had gathered the raw data, we had to preprocess it in order to be able to use it for training a model. The preprocessing included the following major steps:

  • Standardising the frequency of measurements
  • Interpolation of missing values
  • Outlier detection and removal

Frequency of measurements

In the raw dataset, the measurements were spaced at unequal intervals. In order to create a time series dataset, we had to make sure the distance between any two consecutive data points was the same. Given that the data was very granular (multiple values per hour), we resampled it to a granularity of one hour, so that there is one entry per sensor per hour.
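
A minimal sketch of this resampling with pandas, assuming the merged dataset from above (the exact aggregation used in the project may differ):

```python
# one row per sensor per hour, averaging all raw readings inside each hour
hourly_df = (
    dataset.set_index("timestamp")
    .groupby("sensor_id")["value"]
    .resample("H")
    .mean()
    .reset_index()
)
```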

Interpolation of missing values

Sensors can be offline for shorter or longer periods of time. During these times, they cannot record the environmental factors, which creates gaps in the dataset.

Image 3: missing data for every sensor. Sensors are on the x-axis and timestamps on the y-axis.

In order to fix that and have a proper time series dataset, we needed to fill these gaps. Because there was location data for every sensor, this became a spatial interpolation problem. We explored multiple methods, such as K-nearest neighbours (KNN), Triangulated Irregular Networks (TIN), geostatistical interpolation (Kriging), and Inverse Distance Weighting (IDW). Of these, the one most suitable for our needs was IDW.

IDW assigns each neighbouring sensor (among those for which we have readings) a weight that is inversely proportional to its distance from the sensor we are interpolating. The interpolated value is then the weighted mean of the neighbours' values. This is shown in Image 4.

Image 4: Inverse distance weighting (IDW)
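
A minimal sketch of IDW for a single missing reading (a simplified illustration, not the exact code used in the project):

```python
import numpy as np

def idw_interpolate(target_xy, neighbour_xy, neighbour_values, power=2):
    """Fill one missing reading as the inverse-distance-weighted mean of the
    neighbouring sensors that do have a reading for that hour."""
    distances = np.linalg.norm(np.asarray(neighbour_xy) - np.asarray(target_xy), axis=1)
    weights = 1.0 / np.power(distances, power)
    return float(np.sum(weights * np.asarray(neighbour_values)) / np.sum(weights))
```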

Outlier Removal

For a problem of this nature, extreme values are to be expected: either something extraordinary happens in the environment, or a sensor produces a faulty measurement. Either way, these values can bias our model and need to be removed.

Spectral Residual Anomaly Detection (sranodec) is an algorithm introduced by Microsoft which is particularly well suited to time series outlier detection. The algorithm is based on Spectral Residuals and Convolutional Neural Networks.
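
The core of the method is the spectral residual saliency map: points where the series behaves unexpectedly get a high saliency score and can be flagged as outliers. Below is a minimal numpy sketch of that idea only; in the project we used the sranodec package itself.

```python
import numpy as np

def spectral_residual_saliency(values, window=21):
    """Saliency map of the Spectral Residual method: high values mark points
    where the series behaves unexpectedly (candidate outliers)."""
    eps = 1e-8
    fft = np.fft.fft(values)
    amplitude = np.abs(fft)
    phase = np.angle(fft)
    log_amplitude = np.log(amplitude + eps)
    # the moving average of the log-spectrum plays the role of the "expected" spectrum
    averaged = np.convolve(log_amplitude, np.ones(window) / window, mode="same")
    residual = log_amplitude - averaged
    # back to the time domain: the saliency map
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))

# readings whose saliency is far above the mean (e.g. more than 3x) can be flagged as outliers
```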

Data Analysis

After performing the basic preprocessing and cleaning of the dataset, we wanted to gather more information about the relationships between the time series points themselves, so that we could model them during feature engineering later. One thing that can tell us a lot very quickly is seasonal decomposition. We used the seasonal_decompose function from statsmodels, which breaks the data into three parts: trend, seasonality, and residual.
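
An illustrative call, using daily means of one sensor's values (note that the parameter we refer to as freq below is called period in newer statsmodels versions):

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# aggregate the hourly values of one sensor to daily means for a cleaner picture
daily = (
    hourly_df[hourly_df["sensor_id"] == hourly_df["sensor_id"].iloc[0]]
    .set_index("timestamp")["value"]
    .resample("D")
    .mean()
)

# additive decomposition: observed = trend + seasonality + residual
decomposition = seasonal_decompose(daily, model="additive", period=30)
decomposition.plot()
```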

Image 5: The values for air pollution for the preprocessed data, during one year.

Trend

Trend is a pattern in the data that shows its general movement in an upwards or downwards direction. Our data showed a steady downwards trend during the first half of the year, and a very small upwards trend during the latter half.

Image 6: The trend for the preprocessed data, during one year.

Seasonality

Seasonality is a characteristic of a time series in which the data shows regular and predictable changes that recur every period. In this part of the analysis, we can confirm that the data shows a monthly seasonality (as shown in Image 7).

Image 7: The seasonality for the preprocessed data, for one year, with freq=30.

If we set the freq parameter of the seasonal_decompose function to 7 (to represent one week of data), we get the plot shown in Image 8.

Image 8: The seasonality for the preprocessed data, for one year, with freq=7.

We can see here that there is a weekly seasonality as well. Knowing this, we can later create features to reflect it.

Residual

Residual is the part of the time series left unexplained by the trend and seasonality. It should be randomly distributed and impossible to predict (if it isn’t, we likely have a problem with the decomposition).

Trend + Seasonality + Residual = Time Series

Image 9: The residual of the preprocessed data, for one year, with freq=30.

Autocorrelation

Autocorrelation is the correlation between a data point and previous data points within the time series. It can tell us which points in the past have the biggest effect on the current data point; we can then add those as features and use them to predict the current (or future) data points.

Image 10: autocorrelation of our dataset for 2021.

In Image 10, we can see that all previous values have some influence over the current data point. However, this could be the result of compounding influences, which is why we checked the partial autocorrelation as well. Partial autocorrelation builds upon autocorrelation, but removes the compounded influences that ripple through data points. For example, if point 1 influences point 3, and point 3 influences point 5, autocorrelation will report this as point 1 having influence over point 5, whereas partial autocorrelation will distinguish between the two and only report the influence of point 1 on point 3 and of point 3 on point 5.

Image 11: partial autocorrelation of our dataset for 2021.

From Image 11, we can see that around the 10th lag the partial autocorrelation becomes small enough to enter the zone of non-significance. Thus, it makes sense to use up to 10 lags as features.
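
Both plots can be produced directly with statsmodels. A sketch, assuming the hourly series from the preprocessing step:

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# hourly series of a single sensor (the first one, purely for illustration)
some_sensor_id = hourly_df["sensor_id"].iloc[0]
series = (
    hourly_df[hourly_df["sensor_id"] == some_sensor_id]
    .set_index("timestamp")["value"]
    .dropna()
)

plot_acf(series, lags=48)    # total (compounded) influence of each lag
plot_pacf(series, lags=48)   # direct influence only, compounding removed
```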

Feature Engineering

The next step in our process was engineering the features. In this respect, we wanted to experiment with two sets of features: temporal features and weather features.

Temporal Features

The temporal features depend on the timestamps themselves: the hour of the day, the day of the month, the month of the year, etc. These features, however, need to be encoded in a way that captures their cyclical nature (Sunday and Monday are very close to each other, but encoding them as 6 and 0 would not represent that). To solve this problem, we encoded them as cyclical features; more specifically, we used the cosine function, which maps each value onto the interval between -1 and 1 over a full 0 to 2π cycle.

Image 12: deconstruction of timestamps into temporal features.
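
A sketch of this encoding (the column names are illustrative):

```python
import numpy as np

def add_cyclical_features(df):
    ts = df["timestamp"]
    # cosine encoding: values exactly one period apart map to the same point
    df["hour_cos"] = np.cos(2 * np.pi * ts.dt.hour / 24)
    df["dayofweek_cos"] = np.cos(2 * np.pi * ts.dt.dayofweek / 7)
    df["month_cos"] = np.cos(2 * np.pi * (ts.dt.month - 1) / 12)
    return df
```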

Weather Features

One of our hypotheses when starting this project was that adding weather features would have a positive effect on the prediction capabilities of the model. To that end, as explained above, we gathered weather data and merged it into the dataset. We encoded the fields present in this part of the dataset and used them as features.

Baseline Models

Because TFT is a fairly complex model, we first needed to have baseline models, to make sure that whatever gains we had were worth the complexity of a model like TFT.

Lags as a baseline model

The simplest baseline we can employ is to take the real value of the previous hour as the predicted value.
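
In pandas this is a one-liner per sensor (a sketch on the hourly dataset from before, assuming it is sorted by timestamp):

```python
# naive (lag-1) baseline: the prediction for each hour is the previous hour's value
hourly_df["lag_prediction"] = hourly_df.groupby("sensor_id")["value"].shift(1)
```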

Random Forest Regressor

In addition to the simplest model possible, we trained a simple Random Forest Regressor as a sanity check.

Image 13: simple representation of the Random Forest baseline.
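
A sketch of such a baseline with scikit-learn; the feature list below is illustrative rather than the project's exact feature set:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

features = ["hour_cos", "dayofweek_cos", "month_cos", "temp", "windspeed", "lag_prediction"]

# simple time-based split: train on 2018-2020, evaluate on 2021
train = hourly_df[hourly_df["timestamp"] < "2021-01-01"].dropna(subset=features + ["value"])
test = hourly_df[hourly_df["timestamp"] >= "2021-01-01"].dropna(subset=features + ["value"])

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(train[features], train["value"])
print(mean_absolute_error(test["value"], rf.predict(test[features])))
```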

Temporal Fusion Transformer

The TFT model was introduced in 2019 by Google engineers. It provides both multi-horizon forecasting and interpretable predictions. Recurrent layers and interpretable self-attention layers are used to model temporal relationships at different scales. Relevant features are selected through specialised components, and a series of gating layers is used to suppress unnecessary components.

Below we explain the different kinds of components present in the architecture:

  • LSTM sequence-to-sequence encoders/decoders to summarise short patterns. The LSTM blocks are used to identify relationships of data points in their local environment.
  • Temporal multi-head attention block that identifies long-term dependencies. It allows for the model to prioritise them, selecting the most relevant patterns. Each of the attention heads can focus on a different temporal pattern.
  • GRN (Gated Residual Network) blocks are used to weed out unimportant and unused inputs. They also apply dropout to prevent overfitting.

The temporal fusion decoder combines these specialised layers to learn the relationships along the time axis. Also, TFT minimises a quantile loss function, which enables it to generate a probabilistic forecast.

Image 14: TFT Architecture (Bryan Lim et al., 2020)

Image 14 explained by the TFT paper:

TFT inputs static metadata, time-varying past inputs and time-varying a priori known future inputs. Variable Selection is used for judicious selection of the most salient features based on the input. Gated Residual Network blocks enable efficient information flow with skip connections and gating layers. Time-dependent processing is based on LSTMs for local processing, and multi-head attention for integrating information from any time step.

Model

Our model builds upon the original TFT architecture. To implement it, we used the PyTorch Forecasting package, a time series forecasting library built on top of PyTorch and PyTorch Lightning which implements state-of-the-art architectures and approaches.

We took the following steps to customise the model for our data (an illustrative sketch follows the list):

  • We used an encoder length of one week (7*24 hours) and a maximum prediction length of 48 hours.
  • Although the learning rate can be calculated automatically using the PyTorch Lightning learning rate finder, we discovered that the optimal learning rate tended to be slightly lower than the suggested one. Because of this, we set it manually.
  • For hyper-parameter tuning we used the optuna hyper-parameter optimisation framework and the optimize_hyperparameters function from PyTorch Forecasting.
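
A condensed sketch of that setup with PyTorch Forecasting; the column names and hyper-parameter values are illustrative, not the exact ones we shipped:

```python
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

max_encoder_length = 7 * 24    # one week of history
max_prediction_length = 48     # forecast horizon of 48 hours

training = TimeSeriesDataSet(
    train_df,                                   # preprocessed frame with an integer time index
    time_idx="time_idx",
    target="value",
    group_ids=["sensor_id"],
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    time_varying_known_reals=["hour_cos", "dayofweek_cos", "temp", "windspeed"],
    time_varying_unknown_reals=["value"],
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.01,        # set manually, slightly below the LR finder's suggestion
    hidden_size=16,
    attention_head_size=4,
    dropout=0.1,
    loss=QuantileLoss(),       # quantile loss enables probabilistic forecasts
)

# hyper-parameters were then tuned with optuna via
# pytorch_forecasting.models.temporal_fusion_transformer.tuning.optimize_hyperparameters
```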

Results

Image 15: comparison of the real values (blue) and predicted values (red).

From the results in Image 15, we can see that our model performs well. There is an obvious dependence on the lagged values, which is confirmed by the feature importance, where they show up as the most important features.

Integration

Our end goal for this phase of the project was to integrate the model into the already existing pulse.eco universe. To do this, we use API endpoints and simple inference.

Inference

To do inference with this model, it is necessary to prepare encoder and decoder data together with the covariates. To obtain the encoder data, we used the pulse.eco REST API in combination with the weather data. We used weather forecasts to fill in the covariates for the decoder data.
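
The usual PyTorch Forecasting pattern is to stack the most recent history (encoder data) with future rows carrying the known covariates (decoder data) and call predict. A sketch, where recent_df and future_covariates_df are hypothetical frames holding the observed history and the weather-forecast covariates:

```python
import pandas as pd

# encoder data: the last week of observed values and covariates
encoder_data = recent_df[
    recent_df["time_idx"] > recent_df["time_idx"].max() - max_encoder_length
]

# decoder data: 48 future rows per sensor with forecast covariates filled in
# and a placeholder target (the decoder target is not used for prediction)
decoder_data = future_covariates_df.assign(value=0.0)

new_prediction_data = pd.concat([encoder_data, decoder_data], ignore_index=True)
predictions = tft.predict(new_prediction_data)
```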

Image 16: Plot generated with encoder_data value (blue line) and predictions on decoder data (red line).

Integrating a prediction service into pulse.eco will allow its users to rely on it not only for historical and current information, but also for planning out their days. Figuring out the best time to go for a walk or do a sports activity can make people both healthier and less stressed.

Image 17: A pulse.eco mock-up version in which there are air pollution forecasts for the days ahead

Thank you for reading from our whole team (Vasil Dedejski, Petar Tushev, and Marija Todosovska). Thank you also to the pulse.eco team and to Netcetera’s AI/ML team for supporting us.
