Research on clustering algorithm based on spark
Kun Lang, Xiaoli Chai
2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE), 2023-01-06
https://doi.org/10.1109/ICCECE58074.2023.10135496
Citations: 0
Abstract
With the rapid development of sensors and positioning technology, a huge amount of GPS data is generated around the clock. Taking cabs as an example, their GPS track records contain a large amount of information that can be mined and that is crucial for urban governance and consumer behavior analysis. In this paper, we analyze cab GPS point data with a clustering algorithm, optimize K-means by using the Canopy algorithm for pre-clustering, and parallelize the implementation of the algorithm on the Spark framework. Experiments show that the improved clustering algorithm works well and that both computational efficiency and speedup improve.
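
The abstract does not give implementation details, so the following is only a minimal single-machine sketch of the general idea of Canopy pre-clustering followed by K-means. It is not the authors' code and it does not use Spark. The function names, the sample coordinates, and the tight threshold `t2` are all assumptions made for illustration, and plain Euclidean distance on planar coordinates stands in for whatever distance the paper uses on GPS points. Only the tight Canopy threshold is used here, since seeding needs the canopy centers and not the looser canopy membership.

```python
import math


def dist(a, b):
    """Euclidean distance between two 2-D points (assumed planar coordinates)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def canopy_centers(points, t2):
    """Canopy pre-clustering, reduced to center selection.

    Repeatedly take the first remaining point as a canopy center and drop
    every point within the tight threshold t2 of it (including the center).
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        for p in points
    ]


def kmeans(points, centers, max_iter=100):
    """Lloyd's K-means, seeded with the given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centers.append((mean_x, mean_y))
            else:
                # Keep the old center if no point was assigned to it.
                new_centers.append(old)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical sample points forming two well-separated groups.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]

# The canopy centers determine both k and the initial K-means centers.
seeds = canopy_centers(points, t2=3.0)
centers, labels = kmeans(points, seeds)
print(len(seeds), centers, labels)
```

In this sketch the number of canopy centers fixes k, so K-means no longer needs a hand-picked cluster count or random initial centers. In a distributed setting, the assignment step is typically the part that is spread across workers, with per-cluster sums and counts combined afterwards to update the centers; how the paper maps this onto Spark is not stated in the abstract.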