Big Data Analytics: Performance Evaluation for High Availability and Fault Tolerance using MapReduce Framework with HDFS

J. P. Verma, Sapan H. Mankad, Sanjay Garg
2018 Fifth International Conference on Parallel, Distributed and Grid Computing (PDGC)
DOI: 10.1109/PDGC.2018.8745770
Published: December 2018
Citations: 5

Abstract

Big data analytics examines structured transactional data as well as semi-structured and unstructured data. Internet clickstream data, mobile-phone call detail records, and server logs are typical examples of big data. Such datasets do not fit in a traditional relational data warehouse, since they are updated frequently and large volumes of data are generated in real time. Many open-source solutions are available for handling data at this scale. The Hadoop Distributed File System (HDFS) is one such solution, supporting the storage, management, and analysis of big data. Hadoop has become a de facto standard for distributed storage and computing in big data analytics applications: it manages distributed nodes so that data can be stored and processed in a distributed manner. The Hadoop architecture is often summarized as "store everything now and decide how to process it later." This paper discusses the challenges and issues of setting up and configuring a multi-node Hadoop cluster. Troubleshooting for high availability of nodes is evaluated experimentally under different Hadoop cluster failure scenarios and with datasets of different sizes. The experimental analysis helps practitioners use a Hadoop cluster more effectively for research and analysis, and it offers suggestions for sizing a Hadoop cluster according to data volume and generation speed.
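To make the processing model concrete, the following is a minimal sketch of the MapReduce pattern that Hadoop implements, written in plain Python rather than Hadoop's Java API. It is an illustration of the map → shuffle → reduce stages only, not the paper's experimental setup; the word-count task and all names here are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(records):
    """Map stage: emit an intermediate (word, 1) pair for each word."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle stage: group intermediate values by key, as the framework
    does between the map and reduce tasks."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    """Reduce stage: aggregate (here, sum) the values for each key."""
    for key, values in grouped:
        yield (key, sum(values))

# Toy input standing in for lines of an HDFS file split across nodes.
lines = ["hadoop stores big data", "hadoop processes big data"]
counts = dict(reduce_phase(shuffle(map_phase(lines))))
print(counts)
```

In a real Hadoop cluster the map and reduce functions run in parallel on the nodes that hold the HDFS blocks, and HDFS block replication is what provides the fault tolerance the paper evaluates: if a node fails mid-job, its tasks are rescheduled on a node holding a replica.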