Implementing MapReduce over language and literature data over the UK National Grid Service
M. Sarwar, M. Alexander, J. Anderson, J. Green, R. Sinnott
2011 7th International Conference on Emerging Technologies, published 2011-10-20
DOI: 10.1109/ICET.2011.6048475 (https://doi.org/10.1109/ICET.2011.6048475)
Citations: 2
Abstract
Humanities researchers are producing large volumes of heterogeneous language and literature data collections in digital form. These collections include dictionaries, thesauri, corpora, images, and audio and video resources. The increased availability of these datasets, brought about by advances in and adaptations of the Internet and by increased digitisation of humanities data resources, poses new challenges for humanities researchers. Many of these challenges relate to data access and usage, and include security, integrity, interoperability, information retrieval, sharing, licensing and copyright. The JISC-funded project Enhancing Repositories for Language and Literature Research (ENROLLER; https://www.enroller.org.uk) is addressing these issues by developing a targeted e-Research environment. A key component of this effort is support for large-scale analysis of diverse language and literature datasets. To this end, this paper presents the application of the MapReduce algorithm to support information retrieval and linguistic analysis on those datasets. In particular, we describe how MapReduce is used to provide advanced bulk search capabilities that exploit a range of high-performance computing resources, including the UK National Grid Service (www.ngs.ac.uk) and ScotGrid (www.scotgrid.ac.uk), to offer a step change in the kinds of research that this community can undertake. We also present performance analysis results based on the application of these systems.
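
As a rough illustration of the bulk search idea described in the abstract, the following minimal Python sketch expresses a multi-term search as a map step, a grouping step, and a reduce step. It is not the ENROLLER implementation: the function names (`map_search`, `shuffle`, `reduce_search`, `bulk_search`), the tokenisation rule, and the tiny in-memory corpus are all assumptions made for illustration, and everything runs in a single process rather than on grid resources.

```python
import re
from collections import defaultdict
from itertools import chain


def map_search(doc_id, text, terms):
    """Map step: for one document, emit (term, (doc_id, count)) per matching term."""
    # Simplistic tokenisation (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    for term in terms:
        count = tokens.count(term)
        if count:
            yield term, (doc_id, count)


def shuffle(pairs):
    """Group the mapped values by key, as a MapReduce framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_search(term, hits):
    """Reduce step: combine per-document hits for one term into a single summary."""
    return term, {
        "total": sum(count for _, count in hits),
        "documents": sorted(doc_id for doc_id, _ in hits),
    }


def bulk_search(corpus, terms):
    """Run the map, shuffle, and reduce steps sequentially over an in-memory corpus."""
    terms = {t.lower() for t in terms}
    mapped = chain.from_iterable(
        map_search(doc_id, text, terms) for doc_id, text in corpus.items()
    )
    return dict(reduce_search(term, hits) for term, hits in shuffle(mapped).items())


# Hypothetical three-document corpus, invented for this example.
corpus = {
    "doc1": "The thesaurus lists the word hame as a Scots form of home.",
    "doc2": "Home is where the corpus is; hame appears twice: hame.",
    "doc3": "No query terms occur in this text.",
}
results = bulk_search(corpus, ["hame", "home"])
print(results)
```

Because each call to `map_search` depends only on its own document, the map step could in principle be distributed across many worker nodes, which is the property that makes this style of search a natural fit for high-performance computing resources.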