Revamped Market-Basket Analysis using In-Memory Computation framework

Thanmayee, H. Prasad
Published in: 2017 11th International Conference on Intelligent Systems and Control (ISCO)
DOI: 10.1109/ISCO.2017.7855955 (https://doi.org/10.1109/ISCO.2017.7855955)
Citations: 2

Abstract

Data sets grow day by day as they are captured by information-sensing devices such as mobile phones, computers, wireless sensor networks, cameras, software logs, weblogs, and remote sensing, in fields such as medicine, engineering, and science. These large data sets are now called Big Data. Working with Big Data is not a routine task. Because these large data sets contain hidden information, researchers cannot ignore them, and have not. Data mining is an interdisciplinary field of computer science that extracts information or hidden patterns from data. Association rule mining and frequent itemset mining are popular data mining techniques that require the entire data set to be in main memory, but large data sets do not fit into main memory. To handle this drawback, the Hadoop MapReduce approach is used, which offers the scalability and robustness needed to handle large data sets. Apriori, Eclat, and FP-Growth are well-known frequent itemset mining algorithms, and they have been revised to work with Big Data using Hadoop MapReduce. However, the MapReduce framework stores intermediate data on local disk, so that data must be read back from disk, which results in high latency. To address this issue, Spark follows a general execution model that supports in-memory computing and optimization of arbitrary operator graphs, so that querying data becomes much faster than with disk-based engines like MapReduce. This paper therefore focuses on enhancing the performance of frequent itemset mining using the Apache Spark architecture, and studies the performance of this Revamped Market Basket Analysis, based on FP-Growth, by comparing it with BigFIM, a Hadoop MapReduce implementation of the frequent itemset mining task, across different datasets.
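The abstract names frequent itemset mining and association rule mining without showing what they compute. As a rough illustration only, the sketch below mines frequent itemsets from a handful of baskets with a level-wise, Apriori-style search in plain Python and then derives the confidence of one rule. It is not the paper's method, which is based on FP-Growth on Apache Spark; the function names, the `min_count` threshold, and the toy baskets are all assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Level-wise (Apriori-style) mining.

    Returns a dict mapping each frequent itemset (a frozenset) to the
    number of transactions that contain it.
    """
    baskets = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = Counter(frozenset([item]) for b in baskets for item in b)
    current = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(current)

    k = 2
    while current:
        # Candidate k-itemsets: every (k-1)-subset must already be
        # frequent (the Apriori pruning rule).
        items = sorted({i for s in current for i in s})
        candidates = [
            frozenset(c)
            for c in combinations(items, k)
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        ]
        # One pass over the in-memory baskets to count each candidate.
        counts = Counter(c for b in baskets for c in candidates if c <= b)
        current = {s: c for s, c in counts.items() if c >= min_count}
        result.update(current)
        k += 1
    return result


def confidence(itemsets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return itemsets[a | frozenset(consequent)] / itemsets[a]


# Toy baskets, invented for illustration only (not from the paper).
BASKETS = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

if __name__ == "__main__":
    found = frequent_itemsets(BASKETS, min_count=3)
    ordered = sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0])))
    for itemset, count in ordered:
        print(sorted(itemset), count)
    print(confidence(found, ["diapers"], ["beer"]))  # 0.75
```

On this toy data with `min_count=3`, eight itemsets qualify and the rule diapers -> beer has confidence 0.75. Everything here fits in one process's memory, which is exactly the assumption the abstract says breaks down for large data sets.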