Large Scale Data Using K-Means

Omaima Essaad Belhaj, Raheela Zaib, Ourlis Ourabah
{"title":"Large Scale Data Using K-Means","authors":"Omaima Essaad Belhaj, Raheela zaib, Ourlis Ourabah","doi":"10.58496/mjbd/2023/006","DOIUrl":null,"url":null,"abstract":"Regular data base questioning tactics are insufficient to extract meaningful data due to the exponential expansion of high layered datasets; therefore, analysts nowadays are forced to build new processes to satisfy the increased needs. Because of the development in the number of data protests as well as the expansion in the number of elements/ascribes, such vast articulation data leads to numerous new computational triggers. To increase the effectiveness and accuracy of mining activities on highly layered data, the data should be preprocessed using a successful dimensionality decrease technique. So we have collected ideas of different researchers. In several fields, cluster analysis has recently gained popularity as a method for data analysis. A popular parceling-based clustering method called K-means searches for a certain number of clusters that may be found by their centroids. However, the results are quite dependent on the original cluster focus sites. Once more, the number of distance calculations significantly grows as the complexity of the data increases. This is because building a high-precision model frequently necessitates a sizable and dispersed preparatory set. A large preparation set could also need a significant amount of preparation time. There is a trade-off between speed and accuracy when creating orders, especially for large data sets. Vector data are frequently clustered, packed, and summed using the k-means approach. We provide No Concurrent Specific Clumped K-means, a rapid and memory-effective GPU-based approach for cautious k-means (ASB K-means). In contrast to previous GPU-based k-means methods, which require stacking the entire dataset onto the GPU for clustering, our methodology may be tailored to consume far less GPU RAM than the size of the complete dataset. As a result, we may cluster datasets that are bigger than the available RAM. In order to effectively handle large datasets, the method employs a clustered architecture and applies the triangle disparity in each k-means focus to eliminate a data point on the off chance that its enrollment task, or the cluster it is a member of, remains unchanged. As a result, fewer data guides have to be sent between the Slam of the computer processor and the global memory of the GPU.","PeriodicalId":325612,"journal":{"name":"Mesopotamian Journal of Big Data","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Mesopotamian Journal of Big Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.58496/mjbd/2023/006","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Conventional database querying techniques are insufficient for extracting meaningful information from today's exponentially growing high-dimensional datasets; analysts are therefore forced to build new processes to satisfy the increased demands. Growth in the number of data objects, as well as in the number of features/attributes, means that such vast expression data raises many new computational challenges. To increase the effectiveness and accuracy of mining operations on high-dimensional data, the data should first be preprocessed with an effective dimensionality reduction technique; to that end, we have collected the ideas of different researchers. Cluster analysis has recently gained popularity as a data analysis method in several fields. K-means, a popular partitioning-based clustering method, searches for a fixed number of clusters identified by their centroids. However, its results depend heavily on the initial cluster center locations, and the number of distance calculations grows significantly as the size of the data increases. Building a high-precision model frequently requires a sizable and diverse training set, and a large training set may in turn demand substantial training time, so there is a trade-off between speed and accuracy when building classifiers, especially for large datasets. Vector data are frequently clustered, compressed, and summarized using the k-means approach. We present Asynchronous Selective Batched K-means (ASB K-means), a fast and memory-efficient GPU-based approach to exact k-means. In contrast to previous GPU-based k-means methods, which require loading the entire dataset onto the GPU for clustering, our method can be tuned to consume far less GPU RAM than the size of the complete dataset, allowing us to cluster datasets larger than the available GPU memory. To handle large datasets efficiently, the method employs a batched architecture and applies the triangle inequality in each k-means iteration to skip a data point whenever its membership assignment, i.e., the cluster it belongs to, remains unchanged. As a result, fewer data points have to be transferred between the CPU's RAM and the GPU's global memory.
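As a concrete illustration of the two ideas above, the sketch below runs k-means over the data in fixed-size batches (a stand-in for CPU-to-GPU transfers) and uses triangle-inequality bounds to skip points whose cluster membership provably cannot have changed. This is a minimal NumPy sketch, not the authors' GPU implementation: it assumes Hamerly-style upper/lower bounds as one concrete form of triangle-inequality pruning, and the function name, parameters, and batch mechanics are illustrative only.

```python
import numpy as np

def batched_pruned_kmeans(X, k, batch_size=4096, n_iters=20, seed=0):
    """Batched k-means with triangle-inequality pruning (hypothetical sketch).

    A CPU stand-in for the GPU scheme described in the abstract: data is
    visited in batches (mimicking CPU-RAM -> GPU-global-memory transfers),
    and a point is skipped when its bounds prove its membership is unchanged.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    labels = np.zeros(n, dtype=np.int64)
    upper = np.full(n, np.inf)  # upper bound on distance to assigned center
    lower = np.zeros(n)         # lower bound on distance to any other center

    for _ in range(n_iters):
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            # Triangle-inequality test: if upper <= lower, the assignment is
            # provably unchanged, so the point needs no distance computation
            # (and, on a GPU, no transfer into device memory).
            stale = np.flatnonzero(upper[start:stop] > lower[start:stop]) + start
            if stale.size == 0:
                continue
            d = np.linalg.norm(X[stale, None, :] - centers[None, :, :], axis=2)
            nearest = np.argsort(d, axis=1)[:, :2]  # closest and second-closest
            labels[stale] = nearest[:, 0]
            rows = np.arange(stale.size)
            upper[stale] = d[rows, nearest[:, 0]]  # bounds tight after recompute
            lower[stale] = d[rows, nearest[:, 1]]
        # Standard k-means update: move each center to the mean of its members.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        shift = np.linalg.norm(new_centers - centers, axis=1)
        centers = new_centers
        # Bounds remain valid when loosened by how far the centers moved.
        upper += shift[labels]
        lower -= shift.max()
    return labels, centers

# Example: cluster more points than one batch (the "device memory") holds.
X = np.random.default_rng(1).normal(size=(50_000, 8)).astype(np.float32)
labels, centers = batched_pruned_kmeans(X, k=10, batch_size=8192)
```

After the first few iterations most points pass the bound test, so most batches require little or no distance computation, which mirrors the reduction in CPU-GPU traffic that the abstract describes.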