一种提高表格数据质量的自注意归算技术

IF 2.7 3区 物理与天体物理 Q2 PHYSICS, ATOMIC, MOLECULAR & CHEMICAL
Do-Hoon Lee, Han-joon Kim
{"title":"一种提高表格数据质量的自注意归算技术","authors":"Do-Hoon Lee, Han-joon Kim","doi":"10.3390/data8060102","DOIUrl":null,"url":null,"abstract":"Recently, data-driven decision-making has attracted great interest; this requires high-quality datasets. However, real-world datasets often feature missing values for unknown or intentional reasons, rendering data-driven decision-making inaccurate. If a machine learning model is trained using incomplete datasets with missing values, the inferred results may be biased. In this case, a commonly used technique is the missing value imputation (MVI), which fills missing data with possible values estimated based on observed values. Various data imputation methods using machine learning, statistical inference, and relational database theories have been developed. Among them, conventional machine learning based imputation methods that handle tabular data can deal with only numerical columns or are time-consuming and cumbersome because they create an individualized predictive model for each column. Therefore, we have developed a novel imputational neural network that we term the Denoising Self-Attention Network (DSAN). Our proposed DSAN can deal with tabular datasets containing both numerical and categorical columns; it considers discretized numerical values as categorical values for embedding and self-attention layers. Furthermore, the DSAN learns robust feature expression vectors by combining self-attention and denoising techniques, and can predict multiple, appropriate substituted values simultaneously (via multi-task learning). To verify the validity of the method, we performed data imputation experiments after arbitrarily generating missing values for several real-world tabular datasets. We evaluated both imputational and downstream task performances, and we have seen that the DSAN outperformed the other models, especially in terms of category variable imputation.","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"48 1","pages":"102"},"PeriodicalIF":2.7000,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A Self-Attention-Based Imputation Technique for Enhancing Tabular Data Quality\",\"authors\":\"Do-Hoon Lee, Han-joon Kim\",\"doi\":\"10.3390/data8060102\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, data-driven decision-making has attracted great interest; this requires high-quality datasets. However, real-world datasets often feature missing values for unknown or intentional reasons, rendering data-driven decision-making inaccurate. If a machine learning model is trained using incomplete datasets with missing values, the inferred results may be biased. In this case, a commonly used technique is the missing value imputation (MVI), which fills missing data with possible values estimated based on observed values. Various data imputation methods using machine learning, statistical inference, and relational database theories have been developed. Among them, conventional machine learning based imputation methods that handle tabular data can deal with only numerical columns or are time-consuming and cumbersome because they create an individualized predictive model for each column. Therefore, we have developed a novel imputational neural network that we term the Denoising Self-Attention Network (DSAN). Our proposed DSAN can deal with tabular datasets containing both numerical and categorical columns; it considers discretized numerical values as categorical values for embedding and self-attention layers. Furthermore, the DSAN learns robust feature expression vectors by combining self-attention and denoising techniques, and can predict multiple, appropriate substituted values simultaneously (via multi-task learning). To verify the validity of the method, we performed data imputation experiments after arbitrarily generating missing values for several real-world tabular datasets. We evaluated both imputational and downstream task performances, and we have seen that the DSAN outperformed the other models, especially in terms of category variable imputation.\",\"PeriodicalId\":55580,\"journal\":{\"name\":\"Atomic Data and Nuclear Data Tables\",\"volume\":\"48 1\",\"pages\":\"102\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2023-06-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Atomic Data and Nuclear Data Tables\",\"FirstCategoryId\":\"101\",\"ListUrlMain\":\"https://doi.org/10.3390/data8060102\",\"RegionNum\":3,\"RegionCategory\":\"物理与天体物理\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PHYSICS, ATOMIC, MOLECULAR & CHEMICAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Atomic Data and Nuclear Data Tables","FirstCategoryId":"101","ListUrlMain":"https://doi.org/10.3390/data8060102","RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PHYSICS, ATOMIC, MOLECULAR & CHEMICAL","Score":null,"Total":0}
引用次数: 1

摘要

最近,数据驱动的决策引起了人们的极大兴趣;这需要高质量的数据集。然而,现实世界的数据集经常由于未知或故意的原因而缺失值,从而导致数据驱动的决策不准确。如果机器学习模型是使用缺失值的不完整数据集训练的,那么推断的结果可能是有偏差的。在这种情况下,一种常用的技术是缺失值插入(MVI),它用基于观测值估计的可能值填充缺失数据。使用机器学习、统计推断和关系数据库理论的各种数据输入方法已经开发出来。其中,处理表格数据的传统基于机器学习的imputation方法只能处理数值列,或者耗时且繁琐,因为它们为每个列创建了个性化的预测模型。因此,我们开发了一种新的神经网络,我们称之为去噪自注意网络(DSAN)。我们提出的DSAN可以处理包含数值列和分类列的表格数据集;它将离散的数值作为嵌入层和自关注层的分类值。此外,DSAN通过结合自关注和去噪技术来学习鲁棒特征表达向量,并可以同时预测多个适当的替代值(通过多任务学习)。为了验证该方法的有效性,我们对几个真实世界的表格数据集任意生成缺失值后进行了数据插入实验。我们评估了归因和下游任务的性能,我们已经看到DSAN优于其他模型,特别是在类别变量归因方面。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
A Self-Attention-Based Imputation Technique for Enhancing Tabular Data Quality
Recently, data-driven decision-making has attracted great interest; this requires high-quality datasets. However, real-world datasets often feature missing values for unknown or intentional reasons, rendering data-driven decision-making inaccurate. If a machine learning model is trained using incomplete datasets with missing values, the inferred results may be biased. In this case, a commonly used technique is the missing value imputation (MVI), which fills missing data with possible values estimated based on observed values. Various data imputation methods using machine learning, statistical inference, and relational database theories have been developed. Among them, conventional machine learning based imputation methods that handle tabular data can deal with only numerical columns or are time-consuming and cumbersome because they create an individualized predictive model for each column. Therefore, we have developed a novel imputational neural network that we term the Denoising Self-Attention Network (DSAN). Our proposed DSAN can deal with tabular datasets containing both numerical and categorical columns; it considers discretized numerical values as categorical values for embedding and self-attention layers. Furthermore, the DSAN learns robust feature expression vectors by combining self-attention and denoising techniques, and can predict multiple, appropriate substituted values simultaneously (via multi-task learning). To verify the validity of the method, we performed data imputation experiments after arbitrarily generating missing values for several real-world tabular datasets. We evaluated both imputational and downstream task performances, and we have seen that the DSAN outperformed the other models, especially in terms of category variable imputation.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Atomic Data and Nuclear Data Tables
Atomic Data and Nuclear Data Tables 物理-物理:核物理
CiteScore
4.50
自引率
11.10%
发文量
27
审稿时长
47 days
期刊介绍: Atomic Data and Nuclear Data Tables presents compilations of experimental and theoretical information in atomic physics, nuclear physics, and closely related fields. The journal is devoted to the publication of tables and graphs of general usefulness to researchers in both basic and applied areas. Extensive ... click here for full Aims & Scope Atomic Data and Nuclear Data Tables presents compilations of experimental and theoretical information in atomic physics, nuclear physics, and closely related fields. The journal is devoted to the publication of tables and graphs of general usefulness to researchers in both basic and applied areas. Extensive and comprehensive compilations of experimental and theoretical results are featured.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信