A Scheme for Data Deduplication Using Advance Machine Learning Architecture in Distributed Systems

S. Tarun, Ranbir Singh Batth, Sukhpreet Kaur
{"title":"分布式系统中基于高级机器学习架构的重复数据删除方案","authors":"S. Tarun, Ranbir Singh Batth, Sukhpreet Kaur","doi":"10.1109/ICCS54944.2021.00019","DOIUrl":null,"url":null,"abstract":"In a distributed architecture, data as a resource has its own value, but continuous integration of large amounts of data across several locations without cross-verification to preserve a single instance data pattern appears impossible. As a result, systems have encountered hurdles that have a direct influence on the efficiency and performance of distributed workforces. Users need high-quality data or information in order to continue working as improved data services in order to find future trends. However, duplicate data entries in storage repositories are considered a major flaw or stumbling block in the data analysis and query operations processes. As a result, businesses have invested significant resources in detecting duplicate data throughout the duplicate entry detection process. We've introduced a cutting-edge machine learning framework for detecting duplicate data on both current and new data entries. Textual data inputs or queries are imported into memory, preprocessed, and transformed to a vector space model using this technique. To arrange data in groups with equal capacity, a clustering K-means approach is used. To save time and money during the detection phase, similarity computations were done cluster-by-cluster rather than on a huge dataset. The suggested technique performs better than existing deduplication algorithms, with an optimal accuracy of 99.7%. If the result-test and gt-test outcomes are determined to be same during comparison, the accuracy performance parameter of the deduplication process is greater.","PeriodicalId":340594,"journal":{"name":"2021 International Conference on Computing Sciences (ICCS)","volume":"129 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A Scheme for Data Deduplication Using Advance Machine Learning Architecture in Distributed Systems\",\"authors\":\"S. Tarun, Ranbir Singh Batth, Sukhpreet Kaur\",\"doi\":\"10.1109/ICCS54944.2021.00019\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In a distributed architecture, data as a resource has its own value, but continuous integration of large amounts of data across several locations without cross-verification to preserve a single instance data pattern appears impossible. As a result, systems have encountered hurdles that have a direct influence on the efficiency and performance of distributed workforces. Users need high-quality data or information in order to continue working as improved data services in order to find future trends. However, duplicate data entries in storage repositories are considered a major flaw or stumbling block in the data analysis and query operations processes. As a result, businesses have invested significant resources in detecting duplicate data throughout the duplicate entry detection process. We've introduced a cutting-edge machine learning framework for detecting duplicate data on both current and new data entries. Textual data inputs or queries are imported into memory, preprocessed, and transformed to a vector space model using this technique. To arrange data in groups with equal capacity, a clustering K-means approach is used. 
To save time and money during the detection phase, similarity computations were done cluster-by-cluster rather than on a huge dataset. The suggested technique performs better than existing deduplication algorithms, with an optimal accuracy of 99.7%. If the result-test and gt-test outcomes are determined to be same during comparison, the accuracy performance parameter of the deduplication process is greater.\",\"PeriodicalId\":340594,\"journal\":{\"name\":\"2021 International Conference on Computing Sciences (ICCS)\",\"volume\":\"129 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 International Conference on Computing Sciences (ICCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCS54944.2021.00019\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Computing Sciences (ICCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCS54944.2021.00019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

In a distributed architecture, data as a resource has value of its own, but continuously integrating large amounts of data across several locations without cross-verification makes it practically impossible to preserve a single-instance data pattern. As a result, systems encounter hurdles that directly affect the efficiency and performance of distributed workloads. Users need high-quality data and information to keep delivering improved data services and to identify future trends. Duplicate entries in storage repositories, however, are a major flaw and stumbling block for data analysis and query operations, so businesses have invested significant resources in detecting duplicate data during entry. We introduce a machine learning framework that detects duplicate data across both existing and newly arriving entries. With this technique, textual data inputs or queries are imported into memory, preprocessed, and transformed into a vector space model. A K-means clustering step then arranges the data into groups of equal capacity, and, to save time and cost during the detection phase, similarity computations are performed cluster by cluster rather than over the entire dataset. The proposed technique outperforms existing deduplication algorithms, reaching an optimal accuracy of 99.7%. Accuracy is assessed by comparing the result-test outcomes with the ground-truth (gt-test) outcomes: the more closely they agree, the higher the accuracy of the deduplication process.
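To make the pipeline concrete, below is a minimal Python sketch of the cluster-then-compare approach the abstract describes: textual records are mapped into a vector space model (TF-IDF here), partitioned with K-means, and then compared pairwise within each cluster only. The sample records, the vectorizer settings, the cluster count, and the 0.8 similarity threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the cluster-then-compare deduplication pipeline.
# Vectorizer settings, cluster count, and threshold are assumed values.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "John Smith, 42 Oak Street, Springfield",
    "Jon Smith, 42 Oak St., Springfield",   # near-duplicate of record 0
    "Mary Jones, 7 Elm Avenue, Rivertown",
    "Mary Jones, 7 Elm Ave, Rivertown",     # near-duplicate of record 2
]

# Step 1: preprocess and transform the text into a vector space model.
# Character n-grams make the vectors robust to small spelling variations.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3), lowercase=True)
vectors = vectorizer.fit_transform(records)

# Step 2: partition the records with K-means so that the expensive
# pairwise similarity checks run per cluster, not across the whole set.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# Step 3: within each cluster, flag pairs whose cosine similarity
# exceeds a threshold (0.8 is an assumed cut-off, not from the paper).
THRESHOLD = 0.8
for cluster_id in set(labels):
    idx = [i for i, label in enumerate(labels) if label == cluster_id]
    sims = cosine_similarity(vectors[idx])
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            if sims[a, b] >= THRESHOLD:
                print(f"Possible duplicates: {records[idx[a]]!r} / {records[idx[b]]!r}")
```

Comparing only within clusters is what yields the claimed savings: instead of the O(n²) comparisons a full pairwise scan requires, the work drops to the sum of squared cluster sizes, which is far smaller when records spread evenly across clusters.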