Automated Rule-Based Data Cleaning Using NLP

2022 32nd Conference of Open Innovations Association (FRUCT). Pub Date: 2022-11-09. DOI: 10.23919/FRUCT56874.2022.9953810

Konstantinos Mavrogiorgos, Argyro Mavrogiorgou, Athanasios Kiourtis, N. Zafeiropoulos, S. Kleftakis, D. Kyriazis


Citations: 3

Abstract


Data cleaning is a subfield of data mining that has thrived in recent years. Ensuring the reliability of data, whether at generation or on receipt, is vital to providing users with the best possible services. This is easier said than done, since data are complex, generated at an extremely high rate, and enormous in volume. A variety of techniques and methods from other subfields of computer science have been employed to make data cleaning as efficient and effective as possible. Those subfields include, among others, Natural Language Processing (NLP), which in essence concerns the interaction between computers and human language and seeks ways to program computers to process and analyze large volumes of human language data. NLP has existed for a long time, but it has increasingly been proposed for tasks that are not solely NLP-related. In this paper, a rule-based data cleaning mechanism is proposed that utilizes NLP to ensure data reliability. The use of NLP made the mechanism not only highly effective but also considerably more efficient than corresponding mechanisms that do not utilize NLP. The mechanism was evaluated on diverse healthcare datasets; it is not, however, limited to the healthcare domain, and supports a generalized data cleaning concept.
