A Scheme for Data Deduplication Using Advance Machine Learning Architecture in Distributed Systems

S. Tarun, Ranbir Singh Batth, Sukhpreet Kaur
{"title":"分布式系统中基于高级机器学习架构的重复数据删除方案","authors":"S. Tarun, Ranbir Singh Batth, Sukhpreet Kaur","doi":"10.1109/ICCS54944.2021.00019","DOIUrl":null,"url":null,"abstract":"In a distributed architecture, data as a resource has its own value, but continuous integration of large amounts of data across several locations without cross-verification to preserve a single instance data pattern appears impossible. As a result, systems have encountered hurdles that have a direct influence on the efficiency and performance of distributed workforces. Users need high-quality data or information in order to continue working as improved data services in order to find future trends. However, duplicate data entries in storage repositories are considered a major flaw or stumbling block in the data analysis and query operations processes. As a result, businesses have invested significant resources in detecting duplicate data throughout the duplicate entry detection process. We've introduced a cutting-edge machine learning framework for detecting duplicate data on both current and new data entries. Textual data inputs or queries are imported into memory, preprocessed, and transformed to a vector space model using this technique. To arrange data in groups with equal capacity, a clustering K-means approach is used. To save time and money during the detection phase, similarity computations were done cluster-by-cluster rather than on a huge dataset. The suggested technique performs better than existing deduplication algorithms, with an optimal accuracy of 99.7%. If the result-test and gt-test outcomes are determined to be same during comparison, the accuracy performance parameter of the deduplication process is greater.","PeriodicalId":340594,"journal":{"name":"2021 International Conference on Computing Sciences (ICCS)","volume":"129 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A Scheme for Data Deduplication Using Advance Machine Learning Architecture in Distributed Systems\",\"authors\":\"S. Tarun, Ranbir Singh Batth, Sukhpreet Kaur\",\"doi\":\"10.1109/ICCS54944.2021.00019\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In a distributed architecture, data as a resource has its own value, but continuous integration of large amounts of data across several locations without cross-verification to preserve a single instance data pattern appears impossible. As a result, systems have encountered hurdles that have a direct influence on the efficiency and performance of distributed workforces. Users need high-quality data or information in order to continue working as improved data services in order to find future trends. However, duplicate data entries in storage repositories are considered a major flaw or stumbling block in the data analysis and query operations processes. As a result, businesses have invested significant resources in detecting duplicate data throughout the duplicate entry detection process. We've introduced a cutting-edge machine learning framework for detecting duplicate data on both current and new data entries. Textual data inputs or queries are imported into memory, preprocessed, and transformed to a vector space model using this technique. To arrange data in groups with equal capacity, a clustering K-means approach is used. 
To save time and money during the detection phase, similarity computations were done cluster-by-cluster rather than on a huge dataset. The suggested technique performs better than existing deduplication algorithms, with an optimal accuracy of 99.7%. If the result-test and gt-test outcomes are determined to be same during comparison, the accuracy performance parameter of the deduplication process is greater.\",\"PeriodicalId\":340594,\"journal\":{\"name\":\"2021 International Conference on Computing Sciences (ICCS)\",\"volume\":\"129 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 International Conference on Computing Sciences (ICCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCS54944.2021.00019\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Computing Sciences (ICCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCS54944.2021.00019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

In a distributed architecture, data as a resource has value of its own, but continuously integrating large amounts of data across several locations without cross-verification makes it practically impossible to preserve a single-instance data pattern. As a result, systems encounter hurdles that directly affect the efficiency and performance of distributed workloads. Users need high-quality data and information to keep delivering improved data services and to identify future trends. Duplicate entries in storage repositories, however, are a major flaw and stumbling block for data analysis and query operations, so businesses have invested significant resources in detecting duplicate data during entry. We introduce a machine learning framework that detects duplicate data across both existing and newly arriving entries. With this technique, textual data inputs or queries are imported into memory, preprocessed, and transformed into a vector space model. A K-means clustering step then arranges the data into groups of equal capacity, and, to save time and cost during the detection phase, similarity computations are performed cluster by cluster rather than over the entire dataset. The proposed technique outperforms existing deduplication algorithms, reaching an optimal accuracy of 99.7%. Accuracy is assessed by comparing the result-test outcomes with the ground-truth (gt-test) outcomes: the more closely they agree, the higher the accuracy of the deduplication process.
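To make the pipeline concrete, below is a minimal Python sketch of the cluster-then-compare approach the abstract describes: textual records are mapped into a vector space model (TF-IDF here), partitioned with K-means, and then compared pairwise within each cluster only. The sample records, the vectorizer settings, the cluster count, and the 0.8 similarity threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the cluster-then-compare deduplication pipeline.
# Vectorizer settings, cluster count, and threshold are assumed values.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "John Smith, 42 Oak Street, Springfield",
    "Jon Smith, 42 Oak St., Springfield",   # near-duplicate of record 0
    "Mary Jones, 7 Elm Avenue, Rivertown",
    "Mary Jones, 7 Elm Ave, Rivertown",     # near-duplicate of record 2
]

# Step 1: preprocess and transform the text into a vector space model.
# Character n-grams make the vectors robust to small spelling variations.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3), lowercase=True)
vectors = vectorizer.fit_transform(records)

# Step 2: partition the records with K-means so that the expensive
# pairwise similarity checks run per cluster, not across the whole set.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# Step 3: within each cluster, flag pairs whose cosine similarity
# exceeds a threshold (0.8 is an assumed cut-off, not from the paper).
THRESHOLD = 0.8
for cluster_id in set(labels):
    idx = [i for i, label in enumerate(labels) if label == cluster_id]
    sims = cosine_similarity(vectors[idx])
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            if sims[a, b] >= THRESHOLD:
                print(f"Possible duplicates: {records[idx[a]]!r} / {records[idx[b]]!r}")
```

Comparing only within clusters is what yields the claimed savings: instead of the O(n²) comparisons a full pairwise scan requires, the work drops to the sum of squared cluster sizes, which is far smaller when records spread evenly across clusters.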