大数据对IR、文本挖掘和NLP的机遇和挑战

Beth Plale
{"title":"大数据对IR、文本挖掘和NLP的机遇和挑战","authors":"Beth Plale","doi":"10.1145/2513549.2514739","DOIUrl":null,"url":null,"abstract":"Big Data poses challenges for text analysis and natural language processing due to its characteristics of volume, veracity, and velocity of the data. The sheer volume in terms of numbers of documents challenges traditional local repository and index systems for large-scale analysis and mining. Computation, storage and data representation must work together to provide rapid access, search, and mining of the deep knowledge in the large text collection. Text under copyright poses additional barriers to computational access, where analysis has to be separated from human consumption of the original text. Data preprocessing, in most cases, remains a daunting task for big textual data particularly data veracity is questionable due to age of original materials. Data velocity is rate of change of the data but can also be the rate at which changes and corrections are made. The HathiTrust Research Center (HTRC) provides new opportunities for IR, NLP and text mining research. HTRC is the research arm of HathiTrust, a consortium that stewards the digital library of content from research libraries around the country. With close to 11 million volumes in HathiTrust collection, HTRC aims to provide large-scale computational access and analytics to these text resources. With the goal of facilitating scholar's work, HTRC establishes a cyberinfrastructure of software, staff, and services to assist researchers and developers more easily process and mine large scale textual data effectively and efficiently. The primary users of HTRC are digital humanities, informatics, and librarians. They are of different research backgrounds and expertise and thus a variety of tools are made available to them. In the HTRC model of computing, computation moves to the data, and services grow up around the corpus to serve the research community. In this manner, the architecture is cloud-based. Moving algorithms to the data is important because the copyrighted content must be protected, however, a side benefit is that the paradigm frees scholars from worrying about managing a large corpus of data. The text analytics currently supported in HTRC is the SEASR suite of analytical algorithms (www.seasr.org). SEASR algorithms, which are written as workflows, include entity extraction, tag cloud, topic modeling, NaiveBayes, Date Entities to Similie Timeline. In this talk, I introduce the collections, architecture, and text analytics of HTRC, with a focus on the challenges of a BigData corpus and what that means for data storage, access, and large-scale computation. HTRC is building a user community to better understand and support researcher needs. It opens many exciting possibilities for the NLP, text mining, IR types of research: with so large an amount of textual data and many candidate algorithms, with support for researcher contributed algorithms, many interesting research questions emerge and many interesting results are to follow.","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Big data opportunities and challenges for IR, text mining and NLP\",\"authors\":\"Beth Plale\",\"doi\":\"10.1145/2513549.2514739\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Big Data poses challenges for text analysis and natural language processing due to its characteristics of volume, veracity, and velocity of the data. The sheer volume in terms of numbers of documents challenges traditional local repository and index systems for large-scale analysis and mining. Computation, storage and data representation must work together to provide rapid access, search, and mining of the deep knowledge in the large text collection. Text under copyright poses additional barriers to computational access, where analysis has to be separated from human consumption of the original text. Data preprocessing, in most cases, remains a daunting task for big textual data particularly data veracity is questionable due to age of original materials. Data velocity is rate of change of the data but can also be the rate at which changes and corrections are made. The HathiTrust Research Center (HTRC) provides new opportunities for IR, NLP and text mining research. HTRC is the research arm of HathiTrust, a consortium that stewards the digital library of content from research libraries around the country. With close to 11 million volumes in HathiTrust collection, HTRC aims to provide large-scale computational access and analytics to these text resources. With the goal of facilitating scholar's work, HTRC establishes a cyberinfrastructure of software, staff, and services to assist researchers and developers more easily process and mine large scale textual data effectively and efficiently. The primary users of HTRC are digital humanities, informatics, and librarians. They are of different research backgrounds and expertise and thus a variety of tools are made available to them. In the HTRC model of computing, computation moves to the data, and services grow up around the corpus to serve the research community. In this manner, the architecture is cloud-based. Moving algorithms to the data is important because the copyrighted content must be protected, however, a side benefit is that the paradigm frees scholars from worrying about managing a large corpus of data. The text analytics currently supported in HTRC is the SEASR suite of analytical algorithms (www.seasr.org). SEASR algorithms, which are written as workflows, include entity extraction, tag cloud, topic modeling, NaiveBayes, Date Entities to Similie Timeline. In this talk, I introduce the collections, architecture, and text analytics of HTRC, with a focus on the challenges of a BigData corpus and what that means for data storage, access, and large-scale computation. HTRC is building a user community to better understand and support researcher needs. It opens many exciting possibilities for the NLP, text mining, IR types of research: with so large an amount of textual data and many candidate algorithms, with support for researcher contributed algorithms, many interesting research questions emerge and many interesting results are to follow.\",\"PeriodicalId\":126426,\"journal\":{\"name\":\"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-10-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2513549.2514739\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2513549.2514739","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9

摘要

大数据由于其数据量、准确性和速度的特点,对文本分析和自然语言处理提出了挑战。庞大的文档数量对传统的本地存储库和索引系统进行大规模分析和挖掘提出了挑战。计算、存储和数据表示必须协同工作,以提供对大型文本集合中深度知识的快速访问、搜索和挖掘。受版权保护的文本对计算访问构成了额外的障碍,其中分析必须与原始文本的人类消费分开。在大多数情况下,数据预处理仍然是大文本数据的一项艰巨任务,特别是由于原始材料的年代久远,数据的准确性受到质疑。数据速度是数据的变化率,但也可以是进行更改和更正的速率。HathiTrust研究中心(HTRC)为IR, NLP和文本挖掘研究提供了新的机会。HTRC是HathiTrust的研究机构,HathiTrust是一个管理来自全国各地研究图书馆的数字内容图书馆的联盟。HathiTrust拥有近1100万册藏书,HTRC旨在为这些文本资源提供大规模的计算访问和分析。以促进学者的工作为目标,HTRC建立了一个由软件、人员和服务组成的网络基础设施,以帮助研究人员和开发人员更轻松、有效地处理和挖掘大规模文本数据。HTRC的主要用户是数字人文、信息学和图书馆员。他们具有不同的研究背景和专业知识,因此可以使用各种工具。在HTRC计算模型中,计算转移到数据上,服务围绕语料库发展起来,为研究社区服务。通过这种方式,架构是基于云的。将算法移到数据上是很重要的,因为受版权保护的内容必须受到保护,然而,这种范式的一个附带好处是使学者不必担心管理大量数据的问题。HTRC目前支持的文本分析是分析算法的SEASR套件(www.seasr.org)。以工作流形式编写的SEASR算法包括实体提取、标签云、主题建模、朴素贝叶斯、日期实体到相似时间轴。在这次演讲中,我将介绍HTRC的集合、架构和文本分析,重点关注大数据语料库的挑战,以及这对数据存储、访问和大规模计算意味着什么。HTRC正在建立一个用户社区,以便更好地理解和支持研究人员的需求。它为NLP、文本挖掘、IR类型的研究打开了许多令人兴奋的可能性:有了如此大量的文本数据和许多候选算法,在研究者贡献算法的支持下,许多有趣的研究问题出现了,许多有趣的结果也随之而来。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Big data opportunities and challenges for IR, text mining and NLP
Big Data poses challenges for text analysis and natural language processing due to its characteristics of volume, veracity, and velocity of the data. The sheer volume in terms of numbers of documents challenges traditional local repository and index systems for large-scale analysis and mining. Computation, storage and data representation must work together to provide rapid access, search, and mining of the deep knowledge in the large text collection. Text under copyright poses additional barriers to computational access, where analysis has to be separated from human consumption of the original text. Data preprocessing, in most cases, remains a daunting task for big textual data particularly data veracity is questionable due to age of original materials. Data velocity is rate of change of the data but can also be the rate at which changes and corrections are made. The HathiTrust Research Center (HTRC) provides new opportunities for IR, NLP and text mining research. HTRC is the research arm of HathiTrust, a consortium that stewards the digital library of content from research libraries around the country. With close to 11 million volumes in HathiTrust collection, HTRC aims to provide large-scale computational access and analytics to these text resources. With the goal of facilitating scholar's work, HTRC establishes a cyberinfrastructure of software, staff, and services to assist researchers and developers more easily process and mine large scale textual data effectively and efficiently. The primary users of HTRC are digital humanities, informatics, and librarians. They are of different research backgrounds and expertise and thus a variety of tools are made available to them. In the HTRC model of computing, computation moves to the data, and services grow up around the corpus to serve the research community. In this manner, the architecture is cloud-based. Moving algorithms to the data is important because the copyrighted content must be protected, however, a side benefit is that the paradigm frees scholars from worrying about managing a large corpus of data. The text analytics currently supported in HTRC is the SEASR suite of analytical algorithms (www.seasr.org). SEASR algorithms, which are written as workflows, include entity extraction, tag cloud, topic modeling, NaiveBayes, Date Entities to Similie Timeline. In this talk, I introduce the collections, architecture, and text analytics of HTRC, with a focus on the challenges of a BigData corpus and what that means for data storage, access, and large-scale computation. HTRC is building a user community to better understand and support researcher needs. It opens many exciting possibilities for the NLP, text mining, IR types of research: with so large an amount of textual data and many candidate algorithms, with support for researcher contributed algorithms, many interesting research questions emerge and many interesting results are to follow.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信