Distributed pattern matching and document analysis in big data using Hadoop MapReduce model

2014 International Conference on Parallel, Distributed and Grid Computing. Pub Date: 2014-12-01. DOI: 10.1109/PDGC.2014.7030762

A. Ramya, E. Sivasankar

Citations: 5

Abstract

Sequential pattern mining and document analysis are important data mining problems in Big Data with broad applications. This paper investigates a framework for managing distributed processing in the context of dataset pattern matching and document analysis. The MapReduce programming model on a Hadoop cluster is highly scalable and runs on commodity machines with integrated mechanisms for fault tolerance. We propose Knuth-Morris-Pratt (KMP) based sequential pattern matching in a distributed environment, using the Hadoop Distributed File System (HDFS), for efficient mining of sequential patterns. The paper also investigates the feasibility of partitioning and clustering text document datasets for document comparison. This simplifies the search space and yields higher mining efficiency. Data mining tasks are decomposed into many map tasks and distributed to many TaskTrackers. The map tasks compute intermediate results and send them to the reduce task, which consolidates the final result. Both theoretical analysis and experimental results, obtained with data and clusters of varying size, show the effectiveness of the MapReduce model, primarily in terms of time requirements.
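The map/reduce decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the Hadoop API: it is plain Python that simulates one map task per input split and a single reduce task. The names `build_failure`, `kmp_search`, `map_task`, `reduce_task`, and `run_job` are hypothetical, and the sketch assumes each split holds a whole document, so matches that span a split boundary are not handled.

```python
from collections import defaultdict


def build_failure(pattern):
    """KMP failure table: fail[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start offsets of all (possibly overlapping) matches."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep scanning for overlapping matches
    return hits


def map_task(doc_id, text, pattern):
    """Map step: emit one (doc_id, offset) pair per match in a split."""
    for offset in kmp_search(text, pattern):
        yield doc_id, offset


def reduce_task(pairs):
    """Reduce step: consolidate the intermediate pairs per document."""
    result = defaultdict(list)
    for doc_id, offset in pairs:
        result[doc_id].append(offset)
    return {doc_id: sorted(offsets) for doc_id, offsets in result.items()}


def run_job(splits, pattern):
    """Simulate the job locally: run every map task, then one reduce."""
    intermediate = []
    for doc_id, text in splits.items():
        intermediate.extend(map_task(doc_id, text, pattern))
    return reduce_task(intermediate)
```

On a real cluster the map tasks would run in parallel on separate nodes and the framework would shuffle the intermediate pairs to the reducer; here the loop in `run_job` stands in for both steps.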
