Application of weighted low rank approximations: outlier detection in a data matrix.

Impact factor: 1.6, JCR Q2 (Multidisciplinary Sciences)
Marisol García-Peña, Sergio Arciniegas-Alarcón, Kaye E Basford
BMC Research Notes, 18(1), 234. Published 2025-05-27.
DOI: https://doi.org/10.1186/s13104-025-07284-2
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12107823/pdf/
Citations: 0

Abstract

Objective: A mandatory step in the exploratory analysis of any rectangular database is the identification of possible outliers. The presence of these defines what type of explanatory and/or predictive modeling should be used subsequently. This paper presents strategies to identify outliers in any data set using weighted approximations of a matrix. The strategies are evaluated through artificial contamination in sixteen real data sets, of which two have multivariate characteristics and fourteen come from multi-environment trials. As an evaluation criterion, a statistic is proposed such that its value is small when the detection method is good and it is large when false positives or false negatives appear.
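
The evaluation setup described above can be illustrated with a small sketch. It is not the statistic proposed in the paper, which the abstract does not define. The helper names `contaminate` and `detection_error`, the additive shift used to create outliers, and the scoring rule (false negatives plus false positives) are all assumptions made here for illustration.

```python
import random

def contaminate(X, n_outliers, shift, seed=0):
    """Return a copy of X with `shift` added to `n_outliers` random cells,
    plus the set of (row, col) positions that were contaminated.
    Hypothetical helper: the additive shift is an illustrative choice."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    all_cells = [(i, j) for i in range(n) for j in range(p)]
    cells = rng.sample(all_cells, n_outliers)
    Y = [row[:] for row in X]  # copy so the original matrix is untouched
    for i, j in cells:
        Y[i][j] += shift
    return Y, set(cells)

def detection_error(true_cells, flagged_cells):
    """Stand-in score (an assumption, not the paper's statistic):
    zero for perfect detection, growing with every missed outlier
    (false negative) or wrongly flagged cell (false positive)."""
    false_neg = len(true_cells - flagged_cells)
    false_pos = len(flagged_cells - true_cells)
    return false_neg + false_pos
```

Because the contaminated positions are known, any detector that returns a set of flagged cells can be scored against them with `detection_error`.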

Results: Six criteria for identifying outliers from weighted approximations were considered, including simple residuals, squared residuals with differential weights, Jackknife and their corresponding iterative versions, and they were compared with the gold standard one based on limits from a bias-adjusted boxplot. All methods are applicable to any numerical data set written in matrix form, e.g. experiments with genotype-by-environment interaction. It was found that in the presence of random outliers in a matrix with numerical entries, the identification of outliers using weighted approximations is more effective than detection based on limits from a bias-adjusted boxplot.
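
A minimal sketch of the residual-based idea, under stated assumptions: it fits a rank-1 weighted approximation by alternating weighted least squares and flags cells whose simple residuals fall outside ordinary Tukey boxplot fences. The rank, the fence multiplier `k=1.5`, and the function names are illustrative choices; the paper's six criteria and the bias-adjusted boxplot are not reproduced here.

```python
from statistics import quantiles

def weighted_rank1(X, W, iters=200):
    """Rank-1 fit minimising sum_ij W[i][j] * (X[i][j] - a[i]*b[j])**2
    by alternating weighted least squares over a and b."""
    n, p = len(X), len(X[0])
    a, b = [1.0] * n, [1.0] * p
    for _ in range(iters):
        for i in range(n):  # update row scores with column scores fixed
            num = sum(W[i][j] * X[i][j] * b[j] for j in range(p))
            den = sum(W[i][j] * b[j] ** 2 for j in range(p))
            if den > 0:
                a[i] = num / den
        for j in range(p):  # update column scores with row scores fixed
            num = sum(W[i][j] * X[i][j] * a[i] for i in range(n))
            den = sum(W[i][j] * a[i] ** 2 for i in range(n))
            if den > 0:
                b[j] = num / den
    return a, b

def flag_by_residuals(X, W, k=1.5):
    """Flag cells whose residual from the weighted rank-1 fit lies outside
    Tukey fences computed from the residuals of positively weighted cells."""
    n, p = len(X), len(X[0])
    a, b = weighted_rank1(X, W)
    resid = {(i, j): X[i][j] - a[i] * b[j] for i in range(n) for j in range(p)}
    used = [r for (i, j), r in resid.items() if W[i][j] > 0]
    q1, _, q3 = quantiles(used, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return {cell for cell, r in resid.items() if r < lo or r > hi}
```

An iterative variant could set the weight of every flagged cell to zero and call `flag_by_residuals` again, so that suspected outliers no longer pull the fit. Note that a single pass may also flag clean cells in the same row or column as a large outlier.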

Source journal
BMC Research Notes
Subject area: Biochemistry, Genetics and Molecular Biology (all)
CiteScore: 3.60
Self-citation rate: 0.00%
Articles published: 363
Review time: 15 weeks
About the journal: BMC Research Notes publishes scientifically valid research outputs that cannot be considered as full research or methodology articles. We support the research community across all scientific and clinical disciplines by providing an open access forum for sharing data and useful information; this includes, but is not limited to, updates to previous work, additions to established methods, short publications, null results, research proposals and data management plans.