Data-driven approach for labelling process plant event data

Impact factor: 1.0; JCR quartile: Q2; Category: Engineering, Multidisciplinary

International Journal of Prognostics and Health Management. Pub Date: 2022-01-24. DOI: 10.36001/ijphm.2022.v13i1.3045

Débora C. Corrêa, A. Polpo, Michael Small, Shreyas Srikanth, Kylie Hollins, M. Hodkiewicz


Citations: 5

Abstract

An essential requirement in any data analysis is a response variable that represents the aim of the analysis. Much academic work is based on laboratory or simulated data, where the experiment is controlled and the ground truth is clearly defined. This is seldom the reality for equipment performance in an industrial environment, where issues with the response variable are common. We discuss this matter using a case study in which the problem is to detect an asset event (failure) from the available data, but for which no ground truth is available from historical records. Our data frame contains measurements of 14 sensors recorded every minute from a process control system and 4 current motors on the asset of interest over a three-year period. In this situation, the question of how to label the event of interest is of fundamental importance. Different labelling strategies generate different models, with direct impact on the in-service fault detection efficacy of the resulting model. We discuss a data-driven approach to labelling a binary response variable (fault/anomaly detection) and compare it to a rule-based approach. Labelling of the time series was performed using dynamic time warping followed by agglomerative hierarchical clustering to group events with similar dynamics. Both data sets are significantly imbalanced, with 1,200,000 non-event data points but only 150 events in the rule-based data set and 64 events in the data-driven data set. We study the performance of the models based on these two labelling strategies, treating each data set independently. We describe the decisions made in window-size selection, managing imbalance, hyper-parameter tuning, and training and test selection, and use two models, logistic regression and random forest, for event detection. We estimate useful models for both data sets. By useful, we mean that we could detect events for the first four months in the test set. However, as the months progressed, the performance of both models deteriorated, with an increasing number of false positives, reflecting possible changes in the dynamics of the system. This work raises questions such as "what are we detecting?" and "is there a right way to label?" and presents a data-driven approach to support labelling of historical events in process plant data for event detection in the absence of ground truth data.
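The labelling step named in the abstract (dynamic time warping followed by agglomerative hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the linkage criterion, the number of clusters, or which signals were compared, so the average linkage, the choice of two clusters, and the toy `windows` below are all assumptions made for illustration.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def agglomerative_clusters(series, n_clusters):
    """Average-linkage agglomerative clustering on pairwise DTW distances.

    Average linkage is an assumption; the abstract does not name one.
    """
    dist = {
        (i, j): dtw_distance(series[i], series[j])
        for i, j in combinations(range(len(series)), 2)
    }
    clusters = [[i] for i in range(len(series))]

    def linkage(c1, c2):
        # Mean DTW distance over all cross-cluster pairs.
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: combinations guarantees x < y
    return [sorted(c) for c in clusters]


# Toy candidate-event windows (made-up values, not data from the paper).
windows = [
    [0, 0, 1, 5, 9, 5, 1, 0],  # spike
    [0, 1, 5, 9, 5, 1, 0, 0],  # same spike, shifted earlier
    [0, 0, 0, 1, 5, 9, 5, 1],  # same spike, shifted later
    [5, 5, 5, 5, 5, 5, 5, 5],  # flat signal
    [5, 5, 5, 6, 5, 5, 5, 5],  # flat signal with a small bump
]
groups = agglomerative_clusters(windows, n_clusters=2)
print(groups)
```

In this toy example the three time-shifted spikes end up in one group and the two flat windows in the other, because DTW tolerates the shifts that a pointwise distance would penalise. In a labelling workflow of the kind the abstract describes, each resulting group would then be inspected to decide whether it corresponds to the event of interest.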
