Evaluating robustly standardized explainable anomaly detection of implausible variables in cancer data.

IF 4.6 · Zone 2, Medicine · Q1 Computer Science, Information Systems

Journal of the American Medical Informatics Association · Pub Date: 2025-04-01 · DOI: 10.1093/jamia/ocaf011

Philipp Röchner, Franz Rothlauf

Pages: 724-735 · Publication type: Journal Article · PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12005620/pdf/

Citations: 0

Abstract

Objectives: Explanations help to understand why anomaly detection algorithms identify data as anomalous. This study evaluates whether robustly standardized explanation scores correctly identify the implausible variables that make cancer data anomalous.

Materials and methods: The dataset analyzed consists of 18 587 truncated real-world cancer registry records containing 8 categorical variables describing patients diagnosed with bladder and lung tumors. We identified 800 anomalous records using an autoencoder's per-record reconstruction error, which is a common neural network-based anomaly detection approach. For each variable of a record, we determined a robust explanation score, which indicates how anomalous the variable is. A variable's robust explanation score is the autoencoder's per-variable reconstruction error measured by cross-entropy and robustly standardized across records; that is, large reconstruction errors have a small effect on standardization. To evaluate the explanation scores, medical coders identified the implausible variables of the anomalous records. We then compared the explanation scores to the medical coders' validation in a classification and ranking setting. As baselines, we identified anomalous variables using the autoencoder's raw per-variable reconstruction error, the non-robustly standardized per-variable reconstruction error, the empirical frequency of implausible variables according to the medical coders' validation, and random selection or ranking of variables.
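The robust explanation score described above can be illustrated with a minimal sketch. The abstract does not say which robust statistics are used, so centering by the median and scaling by the median absolute deviation (MAD) is an assumption here, as are the function names `cross_entropy_per_variable` and `robust_standardize` and the input layout (one probability array per categorical variable).

```python
import numpy as np

def cross_entropy_per_variable(probs, codes):
    """Per-variable reconstruction error for categorical data.

    probs: list with one array per variable, each of shape
           (n_records, n_categories_j), holding the autoencoder's
           predicted category probabilities (hypothetical layout).
    codes: integer array of shape (n_records, n_vars) with the
           observed category index of each variable.
    Returns an (n_records, n_vars) array of cross-entropy errors.
    """
    n_records, n_vars = codes.shape
    rows = np.arange(n_records)
    errors = np.empty((n_records, n_vars))
    for j in range(n_vars):
        p_true = probs[j][rows, codes[:, j]]
        # Clip to avoid log(0) for categories predicted with zero probability.
        errors[:, j] = -np.log(np.clip(p_true, 1e-12, 1.0))
    return errors

def robust_standardize(errors):
    """Standardize each variable (column) across records.

    Assumption: median and MAD stand in for mean and standard
    deviation, so a few very large errors barely move the scale.
    """
    median = np.median(errors, axis=0)
    mad = np.median(np.abs(errors - median), axis=0)
    # 1.4826 makes the MAD comparable to a standard deviation under
    # normality; fall back to 1.0 where the MAD is zero.
    scale = np.where(mad > 0, 1.4826 * mad, 1.0)
    return (errors - median) / scale
```

With a non-robust z-score, a single extreme reconstruction error inflates the standard deviation and thereby shrinks its own standardized value; with the median and MAD, the same error stays clearly separated from the bulk of the records.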

Results: When we sort the variables by their robust explanation scores, on average, the 2.37 highest-ranked variables contain all implausible variables. For the baselines, on average, the 2.84, 2.98, 3.27, and 4.91 highest-ranked variables contain all the variables that made a record implausible.
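The ranking measure reported above (how many of the highest-ranked variables are needed to contain all implausible variables) could be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, ties are broken by variable order, and every record is assumed to have at least one variable marked implausible by the medical coders.

```python
import numpy as np

def variables_needed(scores, implausible):
    """Number of top-ranked variables needed to cover all implausible ones.

    scores:      (n_vars,) explanation scores for one record.
    implausible: (n_vars,) boolean mask from the coders' validation,
                 assumed to contain at least one True.
    """
    scores = np.asarray(scores, dtype=float)
    implausible = np.asarray(implausible, dtype=bool)
    # Sort variables from highest to lowest score (stable for ties).
    order = np.argsort(-scores, kind="stable")
    # Rank positions (0-based) at which implausible variables appear.
    positions = np.flatnonzero(implausible[order])
    return int(positions.max()) + 1

def mean_variables_needed(score_matrix, label_matrix):
    """Average of variables_needed over all anomalous records."""
    counts = [variables_needed(s, l) for s, l in zip(score_matrix, label_matrix)]
    return float(np.mean(counts))
```

A lower average means that a reviewer who inspects variables in ranked order reaches all implausible variables sooner.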

Discussion: We found that explanations based on robust explanation scores were better than or as good as the baseline explanations examined in the classification and ranking settings. Due to the international standardization of cancer data coding, we expect our results to generalize to other cancer types and registries. As we anticipate different magnitudes of per-variable autoencoder reconstruction errors in data from other medical registries and domains, these may also benefit from robustly standardizing the reconstruction errors per variable. Future work could explore methods to identify subsets of anomalous variables, addressing whether individual variables or their combinations contribute to anomalies. This direction aims to improve the interpretability and utility of anomaly detection systems.

Conclusions: Robust explanation scores can improve explanations for identifying implausible variables in cancer data.


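Source journal

Journal of the American Medical Informatics Association (Medicine - Computer Science: Interdisciplinary Applications)

CiteScore: 14.50

Self-citation rate: 7.80%

Articles published: 230

Review time: 3-8 weeks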

About the journal: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.