Dajung Lee, Alric Althoff, D. Richmond, R. Kastner
{"title":"A streaming clustering approach using a heterogeneous system for big data analysis","authors":"Dajung Lee, Alric Althoff, D. Richmond, R. Kastner","doi":"10.1109/ICCAD.2017.8203845","DOIUrl":null,"url":null,"abstract":"Data clustering is a fundamental challenge in data analytics. It is the main task in exploratory data mining and a core technique in machine learning. As the volume, variety, velocity, and variability of data grows, we need more efficient data analysis methods that can scale towards increasingly large and high dimensional data sets. We develop a streaming clustering algorithm that is highly amenable to hardware acceleration. Our algorithm eliminates the need to store the data objects, which removes limits on the size of the data that we can analyze. Our algorithm is highly parameterizable, which allows it to fit to the characteristics of the data set, and scale towards the available hardware resources. Our streaming hardware core can handle more than 40 Msamples/s when processing 3-dimensional streaming data and up to 1.78 Msamples/s for 70-dimensional data. To validate the accuracy and performance of our algorithms we compare it with several common clustering techniques on several different applications. The experimental result shows that it outperforms other prior hardware accelerated clustering systems.","PeriodicalId":126686,"journal":{"name":"2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"519 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCAD.2017.8203845","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7
Abstract
Data clustering is a fundamental challenge in data analytics. It is the main task in exploratory data mining and a core technique in machine learning. As the volume, variety, velocity, and variability of data grows, we need more efficient data analysis methods that can scale towards increasingly large and high dimensional data sets. We develop a streaming clustering algorithm that is highly amenable to hardware acceleration. Our algorithm eliminates the need to store the data objects, which removes limits on the size of the data that we can analyze. Our algorithm is highly parameterizable, which allows it to fit to the characteristics of the data set, and scale towards the available hardware resources. Our streaming hardware core can handle more than 40 Msamples/s when processing 3-dimensional streaming data and up to 1.78 Msamples/s for 70-dimensional data. To validate the accuracy and performance of our algorithms we compare it with several common clustering techniques on several different applications. The experimental result shows that it outperforms other prior hardware accelerated clustering systems.