Revamped Market-Basket Analysis using In-Memory Computation framework

2017 11th International Conference on Intelligent Systems and Control (ISCO) DOI: 10.1109/ISCO.2017.7855955

Thanmayee, H. Prasad


Citations: 2

Abstract

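Data sets grow daily as information-sensing sources such as mobile devices, computers, wireless sensor networks, cameras, software logs, weblogs, and remote sensing capture data in fields such as medicine, engineering, and science. These large data sets are now called Big Data. Working with Big Data is not a routine task, but because such data sets contain hidden information, researchers cannot ignore them and have not done so. Data mining is an interdisciplinary field of computer science that extracts information or hidden patterns from data. Association rule mining and frequent itemset mining are popular data mining techniques that require the entire data set to be in main memory, but large data sets do not fit into main memory. To address this drawback, the Hadoop MapReduce approach is used, which offers the scalability and robustness needed to handle large data sets. Apriori, Eclat, and FP-Growth are well-known frequent itemset mining algorithms, and they have been revised to work with Big Data using Hadoop MapReduce. However, the MapReduce framework stores intermediate data on local disk, so the data must be read back from disk, which results in high latency. To address this issue, Spark follows a general execution model that supports in-memory computing and the optimization of arbitrary operator graphs, so that querying data is much faster than with disk-based engines such as MapReduce. The paper therefore focuses on enhancing the performance of frequent itemset mining using the Apache Spark architecture, and studies the performance of this revamped market-basket analysis based on FP-Growth by comparing it with the Hadoop MapReduce implementation of the frequent itemset mining task, BigFIM, and by using different datasets.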
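As a rough illustration of what frequent itemset mining computes, the sketch below counts the support of small itemsets over a handful of baskets held entirely in memory. It is a brute-force counter written for clarity, not FP-Growth and not the paper's Spark implementation; the function name `frequent_itemsets`, the `max_size` cap, and the toy baskets are all assumptions introduced for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for itemsets of up to max_size items.

    Brute-force counting for illustration only: this is neither
    FP-Growth nor the paper's Spark implementation.
    """
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # de-duplicate and fix an item order
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    # Support is the fraction of baskets that contain the itemset.
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Hypothetical toy baskets, not data from the paper.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
result = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

Enumerating every combination grows quickly with basket size, which is one reason algorithms such as Apriori, Eclat, and FP-Growth, and distributed engines such as MapReduce or Spark, are used on large data sets.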




Source journal

2017 11th International Conference on Intelligent Systems and Control (ISCO)

Self-citation rate

0.00%

Number of publications