Lijie Xu, Xingtong Ye, Kai Kang, Tian Guo, Wensheng Dou, Wei Wang, Jun Wei
{"title":"DistStream:在线-离线流聚类算法的顺序感知分布式框架","authors":"Lijie Xu, Xingtong Ye, Kai Kang, Tian Guo, Wensheng Dou, Wei Wang, Jun Wei","doi":"10.1109/ICDCS47774.2020.00075","DOIUrl":null,"url":null,"abstract":"Stream clustering is an important data mining technique to capture the evolving patterns in real-time data streams. Today’s data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically-changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed for running in a single machine. These stream clustering algorithms, with this sequential update model, cannot be efficiently parallelized and fail to deliver the required high throughput for stream clustering.In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the computation steps during parallel execution, which can reflect the recent changes for dynamically-changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and comparable (99%) clustering quality with their single-machine counterparts.","PeriodicalId":158630,"journal":{"name":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DistStream: An Order-Aware Distributed Framework for Online-Offline Stream Clustering Algorithms\",\"authors\":\"Lijie Xu, Xingtong Ye, Kai Kang, Tian Guo, Wensheng Dou, Wei Wang, Jun Wei\",\"doi\":\"10.1109/ICDCS47774.2020.00075\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Stream clustering is an important data mining technique to capture the evolving patterns in real-time data streams. Today’s data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically-changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed for running in a single machine. These stream clustering algorithms, with this sequential update model, cannot be efficiently parallelized and fail to deliver the required high throughput for stream clustering.In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the computation steps during parallel execution, which can reflect the recent changes for dynamically-changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and comparable (99%) clustering quality with their single-machine counterparts.\",\"PeriodicalId\":158630,\"journal\":{\"name\":\"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)\",\"volume\":\"69 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDCS47774.2020.00075\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS47774.2020.00075","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
DistStream: An Order-Aware Distributed Framework for Online-Offline Stream Clustering Algorithms
Stream clustering is an important data mining technique to capture the evolving patterns in real-time data streams. Today’s data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically-changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed for running in a single machine. These stream clustering algorithms, with this sequential update model, cannot be efficiently parallelized and fail to deliver the required high throughput for stream clustering.In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the computation steps during parallel execution, which can reflect the recent changes for dynamically-changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and comparable (99%) clustering quality with their single-machine counterparts.