In part 1 of this series, we explained why SCADA alarms are not helpful and end up creating alarm fatigue for operators. We also introduced the type of alarms that are desirable: alarms that are dynamic and adjust to your assets and their continuously changing operational conditions.
In this second part, we will demonstrate the earlier detection capabilities of advanced alarm systems built on top of the predictive uncertainty of our normality models. For example, sustained periods of abnormality and shorter periods of high abnormality could be used directly to trigger an alarm. Unfortunately, the available data, the sensors, the developed models, and the assets themselves are not perfect (it’s the real world after all). As a result, although this is a step beyond conventional SCADA alarms, the number of alarms can still be overwhelming.
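The two triggers mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not our production logic: it assumes the normality model returns a predictive mean and standard deviation per time step, and the function names, thresholds, and window lengths are hypothetical.

```python
def abnormality_scores(observed, predicted_mean, predicted_std):
    """Absolute z-score of each reading against the model's predictive distribution."""
    return [abs(o - m) / s for o, m, s in zip(observed, predicted_mean, predicted_std)]


def trigger_alarms(scores, sustained_level=2.0, sustained_steps=6,
                   high_level=4.0, high_steps=2):
    """Return one flag per time step; True when either rule fires at that step.

    Rule 1 (sustained abnormality): score above sustained_level for
    sustained_steps consecutive steps.
    Rule 2 (short, high abnormality): score above high_level for
    high_steps consecutive steps.
    All thresholds are illustrative assumptions.
    """
    flags = []
    sustained_run = 0
    high_run = 0
    for z in scores:
        # Count consecutive exceedances; any in-range reading resets the run.
        sustained_run = sustained_run + 1 if z > sustained_level else 0
        high_run = high_run + 1 if z > high_level else 0
        flags.append(sustained_run >= sustained_steps or high_run >= high_steps)
    return flags


scores = [0.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 0.3, 4.5, 4.6, 0.2]
flags = trigger_alarms(scores)
# The sustained rule fires at index 6, the high rule at index 9.
```

Even with such rules tuned per sensor, every sensor still raises its own alarms, which is why the alarm count can remain high.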
The correlations between the different sensors are one of the most overlooked aspects of the data when digital twins are developed. Assuming that the sensors operate independently is an important decision that people usually make without even realizing it. This simple decision, typically taken at the early stages of ML model development, has major implications later on.
Let us consider the case in which we are modelling two different sensors (A and B) as shown in the image below. If we model them independently we will not be aware of the normal joint region of operation (shown in orange). This would lead us to define univariate alarms for both sensors.
In this univariate situation, point A would generate an alarm for both sensors, since it sits at a high value for each of them. However, this would be a false positive, one that can only be avoided by looking at the joint operation of the two sensors.
It is important to note that we are using only two sensors for ease of visualization and to have a simple example of sensor interactions. In reality, these components produce high-dimensional multivariate data (dozens to hundreds of sensors) and show many complex dynamics that are not intuitive for operators.
Relying solely on the output of the normal behaviour model for each sensor individually to trigger an alarm is not enough as certain fault modes change the component dynamics while individual sensors behave within expectations. But the good news is… we can do better!
Let us investigate a different situation: an alarm that is only detectable under a multivariate strategy. Looking at each sensor individually, point B represents readings that are well within the expected range. However, if we take the correlation between the sensors into account, we can see that point B is highly unlikely to occur, i.e. the pair of readings from the two sensors is unexpected. This would have been a false negative under the univariate alarm mechanism.
Multi-sensor alarms produce fewer false positives and fewer false negatives by taking sensor correlations into account.
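The two-sensor example can be reproduced numerically. The sketch below assumes standardized residuals that are jointly Gaussian with a known correlation, and uses the squared Mahalanobis distance as the joint abnormality measure. That choice, the correlation of 0.95, and the coordinates of the two points are illustrative assumptions, not values taken from the figure.

```python
import math


def univariate_alarm(z, limit=1.96):
    """Per-sensor alarm: standardized reading outside a ~95% band."""
    return abs(z) > limit


def mahalanobis_sq_2d(x, y, rho):
    """Squared Mahalanobis distance of (x, y) for unit variances and correlation rho."""
    return (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)


def joint_alarm(x, y, rho, coverage=0.95):
    """Joint alarm: the pair falls outside the 95% ellipse of normal operation."""
    # For 2 degrees of freedom the chi-square quantile is exactly -2 * ln(1 - p).
    threshold = -2.0 * math.log(1.0 - coverage)
    return mahalanobis_sq_2d(x, y, rho) > threshold


RHO = 0.95              # assumed strong positive correlation between the sensors
point_a = (2.2, 2.2)    # high on both sensors, but consistent with the correlation
point_b = (1.0, -1.0)   # each reading is in range, but the pair breaks the correlation

for name, (x, y) in (("A", point_a), ("B", point_b)):
    print(name,
          "univariate:", univariate_alarm(x) or univariate_alarm(y),
          "joint:", joint_alarm(x, y, RHO))
```

Under these assumptions, point A alarms on both sensors individually but not jointly, while point B alarms only jointly.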
With the above challenges in mind, we have devised a component Normality KPI that, besides evaluating how long individual sensors are (or are not) within the expected values, considers different signal dynamics (How much is it diverging? How often is it diverging? How many sensors are diverging?) and the relationships between them, as shown in Figure 3.
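To make the three questions concrete, here is one hypothetical way to quantify them from per-sensor z-scores over a time window. It is a sketch only: the function name, the band width, and the aggregation are assumptions, it does not cover the relationships between sensors, and it is not the actual KPI formula.

```python
def divergence_summary(z_by_sensor, band=2.0):
    """Summarize a window of z-scores per sensor.

    z_by_sensor maps a sensor name to its list of z-scores in the window.
    band is the (assumed) half-width of the expected range in z-score units.
    """
    per_sensor = {}
    for name, zs in z_by_sensor.items():
        # Distance outside the expected band; 0.0 when the reading is inside it.
        excess = [max(abs(z) - band, 0.0) for z in zs]
        per_sensor[name] = {
            "how_much": sum(excess) / len(zs),                   # mean excursion size
            "how_often": sum(e > 0.0 for e in excess) / len(zs), # share of time outside
        }
    diverging = [n for n, s in per_sensor.items() if s["how_often"] > 0.0]
    return {
        "per_sensor": per_sensor,
        "how_many": len(diverging) / len(per_sensor),            # share of sensors diverging
    }


window = {
    "oil_pressure": [-2.5, -3.0, -1.0, -2.5],
    "bearing_temp": [0.5, 1.0, 0.2, -0.3],
}
summary = divergence_summary(window)
```

In this toy window only the oil pressure diverges, three steps out of four, by half a band unit on average.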
With the component Normality KPI in place, imperfections in the data, models, and assets are overcome, and fewer but more meaningful and actionable alarms are triggered, reducing alarm fatigue and focusing attention on important problems. Look, for a moment, at the right side of Figure 4, where alarms based on the component Normality KPI are shown. Now compare it to the traditional SCADA alarms for the same (yes, the same!) wind turbines on the left side. Looks much more insightful and less overwhelming, right?
Not only are the produced alarms fewer and more meaningful, they are also more actionable, because we are able to give our users extra context, such as the operating conditions in which a certain asset typically alarms (e.g. temperature is higher than expected only when the turbine is producing at maximum capacity).
Curious about the insights our actionable alarms are able to distil? Using these alarms, we have successfully identified many failures of different natures in various applications, such as heavy industry machinery and wind turbines. Below we show an example of such findings.
Figure 5 represents the component KPI described above, i.e. an alarm based on the correlations between all the sensors belonging to a turbine gearbox. Light red means a medium alarm level, while dark red represents a high alarm level. All of the other rows display univariate alarms defined on each individual sensor. Blue means that the sensor readings are below what the model expected, and red means that they are above it.
Figure 6 shows the sensor measurements and the respective expected bands (univariate alarms) for different moments in time for the oil pressure and one of the gearbox bearing temperatures. There are four distinct time periods shown as A to D letters in Figures 5 and 6. We also show these time periods in terms of alarms in Figure 7.
During period A, all sensors behave as expected and closely follow our model prediction bands. Furthermore, there is no component-level alarm, meaning that the relationships among the sensors are normal (see point A in Figure 7).
Despite having no individual sensor alarms and both sensors falling inside their prediction bands, we have a continuous high multi-sensor alarm at the component level during period B. This is the most challenging stage for an operator to detect nascent anomalies. Everything looks normal, but our model is sure that that’s not the case. Only if we squint can we see that the oil pressure sensor readings are starting to flirt with the bottom of its prediction band!
This subtle period of abnormal behaviour is only identified by our component-level alarm, since the deviation is mostly driven by the change in sensor correlations and not by an individual sensor. This gives operators an extra predictive window of 1.5 months to assess the situation further and prepare service maintenance. At this stage of failure development, there is not yet any serious impact on the turbine components.
Period C is the first time we see one sensor deviating from its bands (see Figure 6). The pressure sensor is below the expected range, hence the corresponding blue alarm in Figure 5. At this point no other sensor is alarming. Period C also lasts 1.5 months.
The failure developed further during period D, since no intervention was made on the turbine. The oil pressure was substantially lower than our model predictions, and multiple gearbox temperatures were above normal levels. This deviation pattern, or deviation signature as we call it, clearly paints the picture for the operator: there is a leakage in the oil system, which has led to the drop in oil pressure and eventually to an increase in the gearbox temperatures. Period D lasted approximately 2 months, until a maintenance intervention solved the ongoing issue.
Surprisingly (or not!), the SCADA events are unable to detect the (seemingly obvious, according to our model) failure development. Despite the thousands of SCADA logs during these 5 months, no relevant or informative alarms were generated regarding low oil pressure or high gearbox temperatures.
We have detected, on a consistent basis, a gearbox issue 5 months before any maintenance intervention. This early warning would not have been possible when looking at any, or all, of the sensor readings independently. Based solely on single-sensor alarms, one would not have been able to generate a global alarm with a high degree of confidence.
The component KPI alarm enables us to detect real-world failures in earlier stages of development and with higher confidence, increasing the time to prepare for a maintenance intervention. This allows service teams to have shorter, better-informed, and targeted maintenance interventions, leading to significant cost savings.
Main advantages for our users
Do you think your operators could benefit from lighter and more meaningful alarms and notifications? We can help them sharpen their gaze on the most critical issues. Don’t hesitate to reach out via firstname.lastname@example.org!