{"title":"分布式系统中基于高级机器学习架构的重复数据删除方案","authors":"S. Tarun, Ranbir Singh Batth, Sukhpreet Kaur","doi":"10.1109/ICCS54944.2021.00019","DOIUrl":null,"url":null,"abstract":"In a distributed architecture, data as a resource has its own value, but continuous integration of large amounts of data across several locations without cross-verification to preserve a single instance data pattern appears impossible. As a result, systems have encountered hurdles that have a direct influence on the efficiency and performance of distributed workforces. Users need high-quality data or information in order to continue working as improved data services in order to find future trends. However, duplicate data entries in storage repositories are considered a major flaw or stumbling block in the data analysis and query operations processes. As a result, businesses have invested significant resources in detecting duplicate data throughout the duplicate entry detection process. We've introduced a cutting-edge machine learning framework for detecting duplicate data on both current and new data entries. Textual data inputs or queries are imported into memory, preprocessed, and transformed to a vector space model using this technique. To arrange data in groups with equal capacity, a clustering K-means approach is used. To save time and money during the detection phase, similarity computations were done cluster-by-cluster rather than on a huge dataset. The suggested technique performs better than existing deduplication algorithms, with an optimal accuracy of 99.7%. If the result-test and gt-test outcomes are determined to be same during comparison, the accuracy performance parameter of the deduplication process is greater.","PeriodicalId":340594,"journal":{"name":"2021 International Conference on Computing Sciences (ICCS)","volume":"129 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A Scheme for Data Deduplication Using Advance Machine Learning Architecture in Distributed Systems\",\"authors\":\"S. Tarun, Ranbir Singh Batth, Sukhpreet Kaur\",\"doi\":\"10.1109/ICCS54944.2021.00019\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In a distributed architecture, data as a resource has its own value, but continuous integration of large amounts of data across several locations without cross-verification to preserve a single instance data pattern appears impossible. As a result, systems have encountered hurdles that have a direct influence on the efficiency and performance of distributed workforces. Users need high-quality data or information in order to continue working as improved data services in order to find future trends. However, duplicate data entries in storage repositories are considered a major flaw or stumbling block in the data analysis and query operations processes. As a result, businesses have invested significant resources in detecting duplicate data throughout the duplicate entry detection process. We've introduced a cutting-edge machine learning framework for detecting duplicate data on both current and new data entries. Textual data inputs or queries are imported into memory, preprocessed, and transformed to a vector space model using this technique. To arrange data in groups with equal capacity, a clustering K-means approach is used. 
To save time and money during the detection phase, similarity computations were done cluster-by-cluster rather than on a huge dataset. The suggested technique performs better than existing deduplication algorithms, with an optimal accuracy of 99.7%. If the result-test and gt-test outcomes are determined to be same during comparison, the accuracy performance parameter of the deduplication process is greater.\",\"PeriodicalId\":340594,\"journal\":{\"name\":\"2021 International Conference on Computing Sciences (ICCS)\",\"volume\":\"129 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 International Conference on Computing Sciences (ICCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCS54944.2021.00019\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Computing Sciences (ICCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCS54944.2021.00019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Scheme for Data Deduplication Using Advance Machine Learning Architecture in Distributed Systems
In a distributed architecture, data as a resource has its own value, but continuously integrating large volumes of data across several locations without cross-verification makes it nearly impossible to preserve a single-instance data pattern. As a result, systems face hurdles that directly affect the efficiency and performance of distributed workloads. Users need high-quality data, delivered through improved data services, to identify future trends. Duplicate entries in storage repositories, however, are a major flaw and a stumbling block for data analysis and query operations, so businesses have invested significant resources in duplicate-entry detection. We introduce a machine learning framework for detecting duplicate data in both existing and newly arriving entries. Textual data inputs or queries are imported into memory, preprocessed, and transformed into a vector space model. K-means clustering then arranges the data into groups of roughly equal capacity, and similarity computations are performed cluster by cluster rather than across the entire dataset, saving time and cost during the detection phase. The proposed technique outperforms existing deduplication algorithms, reaching a peak accuracy of 99.7%. Accuracy is assessed by comparing the result-test output against the ground-truth (gt-test) outcomes: the more closely they agree, the higher the accuracy of the deduplication process.
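As a concrete illustration of the pipeline the abstract describes, the minimal Python sketch below vectorizes a few toy records, partitions them with K-means, and compares records only within their own cluster. TF-IDF weighting, cosine similarity, the two-cluster setting, and the 0.5 cut-off are illustrative assumptions on our part; the abstract itself only specifies a vector space model, K-means clustering, and cluster-by-cluster similarity computation.

# Sketch of the described pipeline: vectorize text, cluster with K-means,
# then run similarity checks cluster by cluster instead of over the full set.
# TF-IDF, cosine similarity, n_clusters=2, and the 0.5 threshold are assumed
# for illustration; the paper's abstract does not fix these choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "John Smith, 42 Oak Street, Springfield",
    "Jon Smith, 42 Oak St., Springfield",
    "Mary Jones, 7 Elm Road, Shelbyville",
    "Mary Jones, 7 Elm Rd, Shelbyville",
]

# Step 1: preprocess the text and transform it into a vector space model.
vectors = TfidfVectorizer(lowercase=True, stop_words="english").fit_transform(records)

# Step 2: arrange the records into groups so similarity checks stay local.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Step 3: compute pairwise similarity cluster by cluster, not on the whole dataset.
THRESHOLD = 0.5  # assumed cut-off for flagging a pair as duplicates
for cluster_id in set(labels):
    idx = [i for i, lab in enumerate(labels) if lab == cluster_id]
    sims = cosine_similarity(vectors[idx])
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            if sims[a, b] >= THRESHOLD:
                print(f"possible duplicates: {records[idx[a]]!r} / {records[idx[b]]!r}")

Because each similarity matrix is computed over a single cluster, the pairwise comparison cost drops from quadratic in the full dataset size to quadratic in the (much smaller) cluster size, which is the time and cost saving the abstract claims for the detection phase.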