Parallel frequent itemset mining with spark RDD framework for disease prediction
Rini Joy
2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT)
Publication date: 2016-03-18
DOI: 10.1109/ICCPCT.2016.7530360 (https://doi.org/10.1109/ICCPCT.2016.7530360)
Citations: 14
Abstract
Frequent itemset mining aims to find all itemsets whose support meets a minimum threshold. Well-known algorithms for the task include Apriori, Eclat, RElim, SaM, and FP-Growth. Although each is well designed and suited to different scenarios, they share a main drawback: they were designed for small datasets, a limitation that reflects the time at which they were developed, before the notion of big data had taken hold. As a result, they do not perform well on the data volumes common today. We therefore propose implementing these well-known algorithms in a parallelized manner so that they can handle such data. The proposed work parallelizes a dynamic frequent itemset mining algorithm, Faster-IAPI, with the Spark RDD framework. Apache Spark was selected because it overcomes a limitation of the Hadoop architecture, which was designed for parallel big data processing but does not handle iterative algorithms well; Spark handles them well. In this approach, the algorithm is applied to find correlations between patients' symptoms quickly and efficiently, and it supports predicting the occurrence of disease based on those symptoms.
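To make the notion of minimum support and partition-wise counting concrete, the following is a minimal sketch in plain Python. It is not Faster-IAPI, whose details the abstract does not give, and it does not use Spark. It assumes a generic Apriori-style level-wise search, with Python lists standing in for RDD partitions. The symptom names, the partition layout, and the function names `count_partition` and `frequent_itemsets` are hypothetical and chosen for illustration only.

```python
from collections import Counter
from functools import reduce
from itertools import combinations


def count_partition(partition, candidates):
    """Map step: count, within one partition, the transactions containing each candidate."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:  # candidate is a subset of the transaction
                counts[candidate] += 1
    return counts


def frequent_itemsets(partitions, min_support):
    """Apriori-style level-wise mining; returns {itemset: absolute support count}."""
    items = {item for part in partitions for t in part for item in t}
    candidates = {frozenset([item]) for item in items}
    result = {}
    k = 1
    while candidates:
        # Each partition is counted independently (this is the parallelizable part).
        partial = [count_partition(p, candidates) for p in partitions]
        # Reduce step: merge the per-partition counts into global counts.
        totals = reduce(lambda a, b: a + b, partial, Counter())
        level = {s: n for s, n in totals.items() if n >= min_support}
        result.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return result


# Hypothetical patient symptom records, split into two partitions.
partitions = [
    [frozenset({"fever", "cough", "fatigue"}), frozenset({"fever", "cough"})],
    [frozenset({"cough", "fatigue"}), frozenset({"fever", "cough", "headache"})],
]

result = frequent_itemsets(partitions, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Each pass of the loop rescans the data with a new candidate set. That repeated scanning is the kind of iterative workload the abstract says Hadoop handles poorly and Spark handles well.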