Implementing MapReduce over language and literature data over the UK National Grid Service

M. Sarwar, M. Alexander, J. Anderson, J. Green, R. Sinnott
Published in: 2011 7th International Conference on Emerging Technologies, 2011-10-20
DOI: 10.1109/ICET.2011.6048475 (https://doi.org/10.1109/ICET.2011.6048475)
Citation count: 2

Abstract

Humanities researchers are producing large and heterogeneous collections of language and literature data in digital format. These collections include dictionaries, thesauri, corpora, images, audio and video resources. The increased availability of these datasets, brought about by advances in and adaptations of the Internet and by the increased digitisation of humanities data resources, poses new challenges for humanities researchers. Many of these challenges relate to data access and usage, and include security, integrity, interoperability, information retrieval, sharing, licensing and copyright. The JISC-funded project Enhancing Repositories for Language and Literature Research (ENROLLER; https://www.enroller.org.uk) is addressing these issues through the development of a targeted e-Research environment. A key component of this effort is support for large-scale analysis of diverse language and literature datasets. To this end, this paper presents an application of the MapReduce algorithm that supports information retrieval and linguistic analysis on those datasets. In particular, we describe how MapReduce is used to provide advanced bulk search capabilities that exploit a range of high performance computing resources, including the UK National Grid Service (www.ngs.ac.uk) and ScotGrid (www.scotgrid.ac.uk), to offer a step change in the kinds of research that this community can undertake. We also present performance analysis results based on the application of these systems.
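
To make the idea of a MapReduce-based bulk search concrete, the following is a minimal single-process sketch in Python. It is not the ENROLLER implementation, which the abstract does not detail; the function names (map_search, shuffle, reduce_hits, bulk_search), the whitespace tokenisation and the in-memory corpus are all assumptions chosen for illustration. The map step emits a (term, document) pair for every query term found in a text, and the reduce step counts hits per document for each term.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

PUNCTUATION = ".,;:!?\"'()"


def map_search(doc_id: str, text: str, terms: Set[str]) -> Iterator[Tuple[str, str]]:
    """Map step (hypothetical): emit (term, doc_id) for each query term occurrence."""
    for token in text.lower().split():
        word = token.strip(PUNCTUATION)
        if word in terms:
            yield word, doc_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group intermediate values by key, as a MapReduce framework would between steps."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_hits(term: str, doc_ids: List[str]) -> Tuple[str, Dict[str, int]]:
    """Reduce step (hypothetical): count occurrences of one term per document."""
    counts: Dict[str, int] = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return term, dict(counts)


def bulk_search(corpus: Dict[str, str], terms: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Run map, shuffle and reduce in sequence over an in-memory corpus."""
    wanted = {t.lower() for t in terms}
    pairs = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_search(doc_id, text, wanted)
    )
    return dict(reduce_hits(term, ids) for term, ids in shuffle(pairs).items())


if __name__ == "__main__":
    # Toy corpus standing in for real dictionary or corpus data.
    corpus = {"a": "The kirk and the glen.", "b": "A glen, a Glen!"}
    print(bulk_search(corpus, ["glen", "kirk"]))
```

In a real deployment the map calls would run in parallel across grid nodes, each over its own partition of the corpus, and the shuffle would be handled by the framework; here everything runs in one process so that only the division of work between the two steps is visible.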