Tiago Luis Andrade, Rogéria Cristiane Gratão de Souza, Maurizio Babini, C. R. Valêncio
Optimization of Algorithm to Identification of Duplicate Tuples through Similarity Phonetic Based on Multithreading
2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies, 2011-10-20
DOI: https://doi.org/10.1109/PDCAT.2011.58
Citations: 5
Abstract
The data cleaning stage comes early in the Knowledge Discovery in Databases (KDD) process and aims to make the stored data more reliable and consistent. It is responsible for eliminating problems and adjusting the data for the later stages, especially data mining. Such problems occur at both the instance level and the schema level and include missing values, null values, duplicate tuples, and out-of-domain values, among others. Several algorithms have been developed to perform the cleaning step in databases; some were developed specifically to work with the phonetics of words, since a word can be written in different ways. Within this perspective, the original contribution of this work is a multithreaded optimization of an algorithm that detects duplicate tuples in databases through phonetic similarity, without the need for training data, together with a language-independent environment to support it.
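As a rough illustration of the idea summarized above, the following minimal sketch groups tuples whose text field shares a phonetic key, computing the keys on a thread pool. It is not the authors' algorithm: the abstract does not say which phonetic rules are used, so standard American Soundex is swapped in as an assumed stand-in, and the names `soundex`, `find_duplicate_groups`, `field`, and `workers` are hypothetical. In CPython, threads mainly show how the work is partitioned; the global interpreter lock limits any speedup for CPU-bound key computation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Standard American Soundex digit classes. This is an assumed stand-in:
# the abstract does not specify the paper's phonetic rules.
_CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items() for c in letters}


def soundex(word: str) -> str:
    """Return the 4-character Soundex key of a word ("" if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            key += code
        if c not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (key + "000")[:4]


def find_duplicate_groups(rows, field, workers=4):
    """Group row indices whose `field` values share a phonetic key."""
    def key_of(row):
        return " ".join(soundex(w) for w in row[field].split())

    # Keys are computed concurrently; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        keys = list(pool.map(key_of, rows))

    groups = defaultdict(list)
    for index, key in enumerate(keys):
        groups[key].append(index)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical sample data, not taken from the paper.
rows = [
    {"name": "Robert Smith"},
    {"name": "Rupert Smyth"},
    {"name": "Maria Souza"},
    {"name": "Robert Smith"},
]
print(find_duplicate_groups(rows, "name"))  # [[0, 1, 3]]
```

Matching on a phonetic key needs no training data, which is consistent with the abstract's claim, but the key is deliberately coarse: it flags candidate duplicates, and a real cleaning step would likely confirm them with a stricter comparison.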