Scalable and Distributed Processing of Scientific XML Data

2011 IEEE/ACM 12th International Conference on Grid Computing. Pub Date: 2011-09-21. DOI: 10.1109/Grid.2011.24

Elif Dede, Zacharia Fadika, Chaitali Gupta, M. Govindaraju


Citations: 8

Abstract

A seamless and intuitive search capability for the vast number of datasets generated by scientific experiments is critical to ensuring effective use of such data by domain-specific scientists. Currently, searches on enormous XML datasets are done manually via custom scripts or by using hard-to-customize queries developed by experts in complex and disparate XML query languages. Such approaches, however, do not provide acceptable performance for large-scale data because they are not based on a scalable distributed solution. Furthermore, it has been shown that databases are not optimized for queries on XML data generated by scientific experiments, as term kinship, range-based queries, and constraints such as conjunction and negation need to be taken into account. There is a critical need for an easy-to-use and scalable framework, specialized for scientific data, that provides natural-language-like syntax along with accurate results. As most existing search tools are designed for exact string matching, which is not adequate for scientific needs, we believe that such a framework will enhance the productivity and quality of scientific research through the data reduction capabilities it can provide. This paper shows how the MapReduce model should be used in XML metadata indexing for scientific datasets, specifically TeraGrid Information Services and the NeXus datasets generated by scientists at the Spallation Neutron Source (SNS). We present an indexing structure that scales well for large-scale MapReduce processing. We also present performance results using two MapReduce implementations, Apache Hadoop and LEMO-MR, to emphasize the flexibility and adaptability of our framework in different MapReduce environments.
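
To make the idea of MapReduce-based XML metadata indexing concrete, the following is a minimal single-process sketch in Python. It is not the indexing structure described in the paper, which the abstract does not specify. It only illustrates the general pattern of a map step that emits (term, location) pairs from XML elements and a reduce step that merges them into posting lists. The sample records, element names, and function names (`map_doc`, `shuffle`, `reduce_term`, `build_index`) are all hypothetical, and the in-memory `shuffle` stands in for the grouping that a runtime such as Hadoop would perform across nodes.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical metadata records; element names are invented for illustration
# and do not follow the NeXus or TeraGrid schemas.
DOCS = {
    "run_001": "<entry><title>Neutron scattering run</title><temperature>300</temperature></entry>",
    "run_002": "<entry><title>Neutron diffraction run</title><temperature>77</temperature></entry>",
}


def map_doc(doc_id, xml_text):
    """Map step: emit (term, (doc_id, element_tag)) for every token in the document."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        for term in re.findall(r"\w+", (elem.text or "").lower()):
            yield term, (doc_id, elem.tag)


def shuffle(pairs):
    """Group emitted values by key, standing in for the runtime's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_term(term, postings):
    """Reduce step: collapse duplicates into a sorted posting list for one term."""
    return term, sorted(set(postings))


def build_index(docs):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc_id, text in docs.items() for pair in map_doc(doc_id, text))
    return dict(reduce_term(term, postings) for term, postings in shuffle(pairs).items())


index = build_index(DOCS)
```

Because each posting records the element tag as well as the document, a lookup such as `index["neutron"]` shows where in each record the term occurred. Range-based queries and term kinship, which the abstract names as requirements, would need additional structure that this sketch does not attempt.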
