Ahsanul Haque, Brandon Parker, L. Khan, B. Thuraisingham
{"title":"用MapReduce发展大数据流分类","authors":"Ahsanul Haque, Brandon Parker, L. Khan, B. Thuraisingham","doi":"10.1109/CLOUD.2014.82","DOIUrl":null,"url":null,"abstract":"Big Data Stream mining has some inherent challenges which are not present in traditional data mining. Not only Big Data Stream receives large volume of data continuously, but also it may have different types of features. Moreover, the concepts and features tend to evolve throughout the stream. Traditional data mining techniques are not sufficient to address these challenges. In our current work, we have designed a multi-tiered ensemble based method HSMiner to address aforementioned challenges to label instances in an evolving Big Data Stream. However, this method requires building large number of AdaBoost ensembles for each of the numeric features after receiving each new data chunk which is very costly. Thus, HSMiner may face scalability issue in case of classifying Big Data Stream. To address this problem, we propose three approaches to build these large number of AdaBoost ensembles using MapReduce based parallelism. We compare each of these approaches from different aspects of design. We also empirically show that, these approaches are very useful for our base method to achieve significant scalability and speedup.","PeriodicalId":288542,"journal":{"name":"2014 IEEE 7th International Conference on Cloud Computing","volume":"99 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Evolving Big Data Stream Classification with MapReduce\",\"authors\":\"Ahsanul Haque, Brandon Parker, L. Khan, B. Thuraisingham\",\"doi\":\"10.1109/CLOUD.2014.82\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Big Data Stream mining has some inherent challenges which are not present in traditional data mining. Not only Big Data Stream receives large volume of data continuously, but also it may have different types of features. Moreover, the concepts and features tend to evolve throughout the stream. Traditional data mining techniques are not sufficient to address these challenges. In our current work, we have designed a multi-tiered ensemble based method HSMiner to address aforementioned challenges to label instances in an evolving Big Data Stream. However, this method requires building large number of AdaBoost ensembles for each of the numeric features after receiving each new data chunk which is very costly. Thus, HSMiner may face scalability issue in case of classifying Big Data Stream. To address this problem, we propose three approaches to build these large number of AdaBoost ensembles using MapReduce based parallelism. We compare each of these approaches from different aspects of design. We also empirically show that, these approaches are very useful for our base method to achieve significant scalability and speedup.\",\"PeriodicalId\":288542,\"journal\":{\"name\":\"2014 IEEE 7th International Conference on Cloud Computing\",\"volume\":\"99 2\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-06-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 IEEE 7th International Conference on Cloud Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CLOUD.2014.82\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 7th International Conference on Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLOUD.2014.82","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Evolving Big Data Stream Classification with MapReduce
Big Data Stream mining has some inherent challenges which are not present in traditional data mining. Not only Big Data Stream receives large volume of data continuously, but also it may have different types of features. Moreover, the concepts and features tend to evolve throughout the stream. Traditional data mining techniques are not sufficient to address these challenges. In our current work, we have designed a multi-tiered ensemble based method HSMiner to address aforementioned challenges to label instances in an evolving Big Data Stream. However, this method requires building large number of AdaBoost ensembles for each of the numeric features after receiving each new data chunk which is very costly. Thus, HSMiner may face scalability issue in case of classifying Big Data Stream. To address this problem, we propose three approaches to build these large number of AdaBoost ensembles using MapReduce based parallelism. We compare each of these approaches from different aspects of design. We also empirically show that, these approaches are very useful for our base method to achieve significant scalability and speedup.