Automated Rule-Based Data Cleaning Using NLP

Konstantinos Mavrogiorgos, Argyro Mavrogiorgou, Athanasios Kiourtis, N. Zafeiropoulos, S. Kleftakis, D. Kyriazis
Published in: 2022 32nd Conference of Open Innovations Association (FRUCT)
Publication date: 2022-11-09
DOI: https://doi.org/10.23919/FRUCT56874.2022.9953810
Citations: 3

Abstract

Automated Rule-Based Data Cleaning Using NLP
Data cleaning is a subfield of data mining that has thrived in recent years. Ensuring the reliability of data, whether generated or received, is vital to providing the best possible services to users. This is easier said than done, since data are complex, generated at an extremely high rate, and enormous in size. Techniques and methods from other subfields of computer science have been used to make data cleaning as efficient and effective as possible. Those subfields include, among others, Natural Language Processing (NLP), which in essence refers to the interaction between computers and human language and seeks ways to program computers to process and analyze huge volumes of human language data. NLP has existed for a long time, but it is increasingly proposed that it can be applied to a variety of problems that are not solely NLP-related. In this paper, a rule-based data cleaning mechanism is proposed that utilizes NLP to ensure data reliability. Using NLP made the mechanism not only highly effective but also considerably more efficient than corresponding mechanisms that do not utilize NLP. The mechanism was evaluated on diverse healthcare datasets; it is not, however, limited to the healthcare domain, but supports a generalized data cleaning concept.
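The abstract does not describe the internals of the proposed mechanism, so the following is only a minimal illustrative sketch of what rule-based cleaning driven by text matching could look like, not the authors' implementation. It assumes a hypothetical rule set (`RULES`) keyed by canonical field names, and it uses simple header normalization plus string similarity from Python's standard `difflib` as a crude stand-in for the NLP step. The names `normalize`, `match_rule`, and `clean`, the threshold, and the value ranges are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical rule set (an assumption, not taken from the paper):
# canonical field name -> predicate that says whether a value is valid.
RULES = {
    "age": lambda v: isinstance(v, (int, float)) and 0 <= v <= 120,
    "heart rate": lambda v: isinstance(v, (int, float)) and 20 <= v <= 250,
    "gender": lambda v: isinstance(v, str)
    and v.strip().lower() in {"male", "female", "other"},
}


def normalize(name):
    """Split camelCase, lowercase, and keep only alphabetic word tokens."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    return " ".join(re.findall(r"[a-z]+", spaced.lower()))


def match_rule(column, threshold=0.8):
    """Return the canonical rule name most similar to the column header,
    or None when no rule is similar enough."""
    norm = normalize(column)
    best, best_score = None, 0.0
    for key in RULES:
        score = SequenceMatcher(None, norm, key).ratio()
        if score > best_score:
            best, best_score = key, score
    return best if best_score >= threshold else None


def clean(records):
    """Replace values that violate their matched rule with None.
    Columns without a matching rule are passed through unchanged."""
    cleaned = []
    for row in records:
        out = {}
        for col, val in row.items():
            key = match_rule(col)
            out[col] = val if key is None or RULES[key](val) else None
        cleaned.append(out)
    return cleaned


if __name__ == "__main__":
    rows = [
        {"Age": 34, "HeartRate": 72, "Gender": "Female", "Notes": "ok"},
        {"Age": -5, "heart_rate": 900, "Gender": "F?", "Notes": ""},
    ]
    for row in clean(rows):
        print(row)
```

In this sketch the rules are attached to field meanings rather than to exact column names, so differently spelled headers such as `HeartRate` and `heart_rate` resolve to the same rule. A real system would presumably use stronger language processing than string similarity for that matching step.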