
Data availability challenges for AI model training

A common claim about AI models is that the more data available, the better. That holds only if the model can actually use all of the data, and more often than not it cannot. One of the main reasons is data quality, and sensor availability in particular. In this post, we discuss three challenges that arise from sensor unavailability: permanently unavailable sensors, temporarily missing sensors, and new sensors and data sources.

Fig. 1 - Some of the most common causes that lead to missing sensor data. They represent tough challenges when it comes to training AI models that use all the available historical data.

We will also present the advantages of our most recent AI models, which are the culmination of four years of developing online predictive analytics pipelines for customers in different verticals, including the renewables and manufacturing industries!

At Jungle, we have seen challenges with sensor data availability time and time again, across many applications and settings. Below, we present some of the most common causes of missing data that we have encountered.

Permanently unavailable sensors

A variety of causes can lead to sensors becoming permanently unavailable. The simplest case is a sensor that stops working and is not replaced. Another case is a software upgrade, for example on a wind turbine, after which some sensors are no longer collected or are no longer made available to the owner by the OEM.

The figure below shows a real example of a sensor (booster converter 2 power sensor) becoming inaccessible from January onwards (the sensor A example from Figure 1). From that point on, its data can no longer be used. It is important to note that the physical quantity being measured still exists, so the missing values cannot simply be replaced with zeros.

Fig. 2 - At times, sensors become unavailable and no longer produce relevant and important information needed to feed AI models.
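To make the point about not filling gaps with a constant concrete, here is a minimal sketch in Python with pandas. The column name converter_power, the values, and the helper with_availability_mask are all hypothetical and chosen only for illustration. The sketch keeps missing readings as NaN, adds an explicit availability flag per sensor, and compares a zero-filled average with a NaN-aware one.

```python
import numpy as np
import pandas as pd


def with_availability_mask(df: pd.DataFrame) -> pd.DataFrame:
    """Keep NaN as NaN and add one 0/1 availability column per sensor."""
    mask = df.notna().astype(int).add_suffix("_available")
    return pd.concat([df, mask], axis=1)


# Hypothetical readings: the sensor disappears from January onwards.
index = pd.date_range("2021-12-30", periods=4, freq="D")
readings = pd.DataFrame(
    {"converter_power": [12.5, 11.8, np.nan, np.nan]}, index=index
)

marked = with_availability_mask(readings)

# Filling with zero invents measurements and biases every statistic.
zero_filled_mean = readings["converter_power"].fillna(0).mean()
# Skipping NaN uses only what was actually measured.
nan_aware_mean = readings["converter_power"].mean()
```

In this toy example the zero-filled average is half of the NaN-aware one, even though the underlying quantity never changed. The explicit flag lets downstream code tell "not measured" apart from "measured as zero".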

Temporarily missing sensors

This is perhaps the type of missing sensor data that we have seen most often: sensor measurements that are temporarily not stored. When wind turbines and heavy industry electromechanical machines are shut down, their data collection systems are also turned off.

Fig. 3 - Sensor data missing during a wind turbine shutdown. In this case, all sensors are missing because the entire data collection system was off.

Another case is malfunctioning sensors (represented by sensor B from Figure 1). The image below shows a sensor (top one) that is malfunctioning. It not only registers values with dynamics that are physically impossible but it also does so unreliably.

Fig. 4 - Malfunctioning sensors are more common than desired and deteriorate data quality.

Generally, AI models require a certain period of recent data (from a couple of minutes to a few hours) as context to be able to make reliable predictions. That is why such intermittent data gaps make it hard for AI models to predict and monitor asset startups, which are a very important part of an asset's operation. Many issues are introduced during maintenance interventions, e.g. cooling systems that are not turned back on, so machines should be monitored from the moment they are turned back on.
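The effect of a context requirement on startup monitoring can be sketched as follows. This is an illustration only: the helper has_full_context, the three-sample context length, and the availability series are hypothetical. It marks the timestamps at which a model that needs a full, gap-free context window could make a prediction.

```python
import pandas as pd


def has_full_context(available: pd.Series, context_steps: int) -> pd.Series:
    """True where the current sample and the previous context_steps - 1
    samples are all present, i.e. a full context window exists."""
    return (
        available.astype(int)
        .rolling(window=context_steps, min_periods=context_steps)
        .min()
        .eq(1)
    )


# Hypothetical availability: data collection is off for two samples
# during a shutdown and resumes at position 4.
available = pd.Series([True, True, False, False, True, True, True, True])
ready = has_full_context(available, context_steps=3)
```

Here data collection resumes at position 4, but a full context window only exists again from position 6. A model with this requirement is blind during the first samples after the restart, which is exactly when maintenance-related issues tend to show up.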

New sensors and data sources

At times, new sensors are installed or made available to the data collection system (this is common during wind turbine software upgrades, for example). A newly installed or newly accessible sensor will, of course, only provide measurements from that point in time onwards (represented by sensor C from Figure 1). Below we show an example of an ambient temperature sensor that was installed in 2021 in the factory line of one of our heavy industrial customers. This new sensor was important for our AI models because it reduced the prediction uncertainty for the temperatures of motor bearings and couplings.

In some cases, more than 90% of the historical data was rendered useless due to the installation of new ambient temperature sensors.

As we can see in the image below, we could not use all the historical data because the ambient temperature was not available before the installation. This meant that new AI models had to be trained on data from 2021 onwards only.

Fig. 5 - An ambient temperature sensor was installed in 2021 at one of our heavy industrial customers. The sensor is only available from that moment onwards, whilst all the other sensors go back several years.
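A quick way to see how much history survives a "complete rows only" requirement is to measure the fraction of timestamps at which every sensor has a value. The sketch below uses synthetic data. The column names, the ten-year monthly index, and the helper complete_row_fraction are assumptions made for illustration, not the customer data shown in the figure.

```python
import numpy as np
import pandas as pd


def complete_row_fraction(df: pd.DataFrame) -> float:
    """Fraction of timestamps at which every sensor has a value."""
    return float(df.notna().all(axis=1).mean())


# Synthetic history: ten years of monthly samples (120 rows).
index = pd.date_range("2012-01-01", "2021-12-01", freq="MS")
history = pd.DataFrame(
    {"bearing_temp": 60.0, "ambient_temp": np.nan}, index=index
)
# The ambient temperature sensor only exists from 2021 onwards.
history.loc["2021-01-01":, "ambient_temp"] = 15.0

usable = complete_row_fraction(history)  # 12 of the 120 rows are complete
```

Running the same check on a real dataset before training shows how much data a model that requires all inputs would silently discard.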

What this means for AI model training

Missing data challenges are very important when building AI-based predictive analytics solutions. Most common AI models require all input sensors to be present at all times to be able to make predictions of the modelled sensors.

An easy-to-overlook detail may lead to important data being discarded and, with it, important dynamics and rare examples of machine operation.

We cannot train an AI model on the entire historical dataset shown in Figure 6 if the model requires all sensors to be available at all times. We would need to split the time range in two, which would increase the complexity of the solution by requiring multiple AI models, and we would still make suboptimal use of the available historical data. A model trained only on the window in which all sensors are available would not be able to learn the machine dynamics for sensor C. Sensor C only reaches high values during the period when sensor B is absent, so it does not show its full operating range during the only window suitable for model training. Such a model could raise false-positive alarms and miss real issues (false negatives) on future data when sensor C reaches high values again, because it never learned to model those dynamics.

Fig. 6 - Example of the impact of missing sensor data in suitable time periods for model training and its impact on learning all the asset dynamic regions.
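The coverage problem illustrated in Figure 6 can be checked numerically. The following sketch uses made-up values, and the helper range_coverage is hypothetical. It compares the range of a sensor seen in the rows where all sensors are present with the range seen over its full history. It assumes the sensor's readings are not all identical, so the full range is non-zero.

```python
import numpy as np
import pandas as pd


def range_coverage(df: pd.DataFrame, sensor: str) -> float:
    """Share of a sensor's historical value range that is still visible
    when only rows with all sensors present are kept."""
    complete = df.dropna()          # the only window a strict model can use
    full = df[sensor].dropna()      # everything the sensor ever measured
    full_span = full.max() - full.min()
    seen_span = complete[sensor].max() - complete[sensor].min()
    return float(seen_span / full_span)


# Made-up data: sensor C reaches high values only while sensor B is absent.
data = pd.DataFrame(
    {
        "sensor_b": [np.nan, np.nan, np.nan, 1.0, 1.1, 0.9],
        "sensor_c": [2.0, 8.0, 10.0, 2.0, 3.0, 4.0],
    }
)
coverage = range_coverage(data, "sensor_c")  # (4 - 2) / (10 - 2)
```

A low coverage value is a warning sign that a model trained on complete rows alone has never seen part of the asset's normal operating range.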

Jungle’s approach

We felt the need to solve this challenge from the ground up, and we are very proud to say that our AI models handle all the situations described above seamlessly! We had to completely change the architecture of our models so that they can accept a dynamic number of input sensors and tolerate data gaps.

All the historical data is used to train our models and therefore we can make the most robust predictions with the available data.
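This post does not describe the architecture itself, so the following is not Jungle's method. It is a generic sketch of one well-known ingredient for training on data with gaps: a masked loss, which scores predictions only where a target was actually measured. The function name masked_mse and the values are hypothetical.

```python
import numpy as np


def masked_mse(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Mean squared error over the entries where the target was measured.

    Missing targets are encoded as NaN and contribute nothing to the loss.
    """
    observed = ~np.isnan(targets)
    if not observed.any():
        return 0.0
    errors = predictions[observed] - targets[observed]
    return float(np.mean(errors**2))


# Two timestamps, two sensors; the second sensor is missing at the first one.
targets = np.array([[1.0, np.nan], [2.0, 4.0]])
predictions = np.array([[1.5, 9.0], [2.0, 3.0]])
loss = masked_mse(predictions, targets)  # averages over 3 observed entries
```

With a loss like this, a timestamp with a missing sensor still contributes through the sensors that are present, instead of the whole row being dropped.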

Our models can learn the sensor dynamics both when sensor C shows high values and when sensor B is present. In this way, they can further reduce the prediction uncertainty for all time periods: before and after sensor B became available, and also for future data when sensor C displays high measured values. This is illustrated in the figure below.

Fig. 7 - Since our model learned the dynamics of the asset with all the historical data it is able to reduce the uncertainty bands for both periods when compared to standard AI models.

What’s in it for our customers?

A few important advantages directly come from having AI models that learn from large historical datasets.

False-positive and negative alarm reduction

Reducing the number of false alarms is of paramount importance to us. Our customers have very large asset portfolios, so their time should always be spent on the cases with the highest priority. We shared a dedicated post on why operators are ignoring SCADA alarms due to alarm fatigue here.

Our customers have, on average, several years’ worth of asset data. Because our AI models can learn from all of it, they also learn rare operating situations, such as wind turbines working under icing conditions. This greatly reduces the false-positive and false-negative alarms that arise when an AI model has not learned all the normal operating modes of the modelled asset. Since our models learn all the normal operating dynamics, even the rare ones, their predictions have a higher degree of confidence.

Earlier fault detection

Another advantage is narrower uncertainty bands, which translate into detecting failures and abnormal behaviour at much earlier stages. This can reduce downtime and repair costs, since catastrophic failures can be avoided (see more about this in our blog post here).

In the image below we show a concrete example of the impact of leveraging all historical data to reduce the AI model uncertainty bands. The model at the top cannot cope with missing data at the input and therefore its historical dataset was drastically reduced. The model has large confidence bands and was not able to detect a problem with a pump. In this case, our customer would not have been warned to fix the problem early.

The bottom row shows the predictions of our AI models. We can see that the confidence bands are narrower, and the model was therefore able to detect the increasing pump pressure that could have led to a catastrophic failure. In this case, our customer was warned and successfully fixed the issue during a planned maintenance intervention!

Fig. 8 - The model that could not overcome the sensor challenges discussed in the post did not manage to detect an important pump problem. Our model, however, was able to detect the ongoing issue months in advance.


We have seen in this post that sensor availability is really important for developing robust AI models that leverage all the historical data. We have also shown concrete cases where our models detected ongoing issues that would not have been detected otherwise.

But is it actually always good to use all the historical years of data to train an AI model? It depends on how you do it, of course! As they say, with great power come greater challenges! We will describe them in detail in a future blog post. Stay tuned!

Silvio Rodrigues

CTO & Co-founder
