Big Data Analytics: Performance Evaluation for High Availability and Fault Tolerance using MapReduce Framework with HDFS

J. P. Verma, Sapan H. Mankad, Sanjay Garg
2018 Fifth International Conference on Parallel, Distributed and Grid Computing (PDGC)
DOI: 10.1109/PDGC.2018.8745770
Published: December 2018
Citations: 5

Abstract

Big data analytics examines structured transactional data as well as semi-structured and unstructured data. Internet clickstream data, mobile-phone call detail records, and server logs are typical examples of big data. Such datasets do not fit in a traditional relational data warehouse, since they are updated frequently and large volumes of data are generated in real time. Many open-source solutions are available for handling data at this scale. The Hadoop Distributed File System (HDFS) is one such solution, supporting the storage, management, and analysis of big data. Hadoop has become a de facto standard for distributed storage and computing in big data analytics applications: it manages distributed nodes so that data can be stored and processed in a distributed manner. The Hadoop architecture is often summarized as "store everything now and decide how to process it later." This paper discusses the challenges and issues of setting up and configuring a multi-node Hadoop cluster. Troubleshooting for high availability of nodes is evaluated experimentally under different Hadoop cluster failure scenarios and with datasets of different sizes. The experimental analysis helps practitioners use a Hadoop cluster more effectively for research and analysis, and it offers suggestions for sizing a Hadoop cluster according to data volume and generation speed.
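To make the processing model concrete, the following is a minimal sketch of the MapReduce pattern that Hadoop implements, written in plain Python rather than Hadoop's Java API. It is an illustration of the map → shuffle → reduce stages only, not the paper's experimental setup; the word-count task and all names here are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(records):
    """Map stage: emit an intermediate (word, 1) pair for each word."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle stage: group intermediate values by key, as the framework
    does between the map and reduce tasks."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    """Reduce stage: aggregate (here, sum) the values for each key."""
    for key, values in grouped:
        yield (key, sum(values))

# Toy input standing in for lines of an HDFS file split across nodes.
lines = ["hadoop stores big data", "hadoop processes big data"]
counts = dict(reduce_phase(shuffle(map_phase(lines))))
print(counts)
```

In a real Hadoop cluster the map and reduce functions run in parallel on the nodes that hold the HDFS blocks, and HDFS block replication is what provides the fault tolerance the paper evaluates: if a node fails mid-job, its tasks are rescheduled on a node holding a replica.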