Salton Award Lecture - Information retrieval and computer science: an evolving relationship

W. Bruce Croft
{"title":"Salton Award Lecture - Information retrieval and computer science: an evolving relationship","authors":"W. Bruce Croft","doi":"10.1145/860435.860437","DOIUrl":null,"url":null,"abstract":"Following the tradition of these acceptance talks, I will be giving my thoughts on where our field is going. Any discussion of the future of information retrieval (IR) research, however, needs to be placed in the context of its history and relationship to other fields. Although IR has had a very strong relationship with library and information science, its relationship to computer science (CS) and its relative standing as a sub-discipline of CS has been more dynamic. IR is quite an old field, and when a number of CS departments were forming in the 60s, it was not uncommon for a faculty member to be pursuing research related to IR. Early ACM curriculum recommendations for CS contained courses on information retrieval, and encyclopedias described IR and database systems as different aspects of the same field. By the 70s, there were only a few IR researchers in CS departments in the U.S., database systems was a separate (and thriving) field, and many felt that IR had stagnated and was largely irrelevant. The truth, in fact, was far from that. The IR research community was a small, but dedicated, group of researchers in the U.S. and Europe who were motivated by a desire to understand the process of information retrieval and to build systems that would help people find the right information in text databases. This was (and is) a hard goal and led to different evaluation metrics and methodologies than the database community. Progress in the field was hampered by a lack of large-scale testbeds and tests were limited to databases containing at most a few hundred document abstracts. In the 80s AI boom, IR was still not a mainstream area, despite its focus on a human task involving natural language. 
IR focused on a statistical approach to language rather than the much more popular knowledge-based approach. The fact that IR conferences mix papers on effectiveness as measured by human judgments with papers measuring performance of file organizations for large-scale systems has meant that IR has always been difficult to classify into simple categories such as \"systems\" or \"AI\" that are often used in CS departments. Since the early 90s, just about everything has changed. Large, full-text databases were finally made available for experimentation through DARPA funding and TREC. This has had an enormous positive impact on the quantity and quality of IR research. The advent of the Web search engine has validated the longstanding claims made by IR researchers that simple queries and ranking were the right techniques for information access in a largely unstructured information world. What has not changed is that there are still relatively few IR researchers in CS departments. There are, however, many more people in CS departments doing IR-related research, which is just about the same thing. Conferences in databases, machine learning, computational linguistics, and data mining publish a number of IR papers done by people who would not primarily consider themselves as IR researchers. Given that there is an increasing diffusion of IR ideas into the CS community, it is worth stating what IR, as a field of CS, has accomplished: Search engines have become the infrastructure for much of information access in our society. IR has provided the basic research on the algorithms and data structures for these engines, and continues to develop new capabilities such as cross-lingual search, distributed search, question answering, and topic detection and tracking. IR championed the statistical approach to language long before it was accepted by other researchers working on language technologies. 
Statistical NLP is now mainstream and results from that field are being used to improve IR systems (in question answering, for example). IR focused on evaluation as a research area, and developed an evaluation methodology based on large, standardized testbeds and comparison with human judgments that has been adopted by researchers in a number of other language technology areas. IR, because of its focus on measuring success based on human judgments, has always acknowledged the importance of the user and interaction as a part of information access. This led to a number of contributions to the design of query and search interfaces and learning techniques based on user feedback. Although these achievements are important, the long-term goals of the IR field have not yet been met. What are those goals? One possibility that is often mentioned is the MEMEX of Vannevar Bush [1]. Another, more recent, statement of long-term challenges was made in the report of the IR Challenges Workshop [2]: Global Information Access: Satisfy human information needs through natural, efficient interaction with an automated system that leverages world-wide structured and unstructured data in any language. Contextual Retrieval; Combine search technologies and knowledge about query and user context into a single framework in order to provide the most appropriate answer for a user's information need. These goals are, in fact, very similar to long-term challenges coming out of other CS fields. For example, Jim Gray, a Turing Award winner from the database area, mentioned in his address a personal and world MEMEX as long-term goals for his field and CS in general [3]. IR's long-term goals are clearly important long-term goals for the whole of CS, and achieving those goals will involve everyone interested in the general area of information management and retrieval. 
Rather than talking about what IR can do in isolation to progress towards its goals, I would prefer to talk about what IR can do in collaboration with other areas. There are many examples of potential collaborative research areas. Collaborations with researchers from the NLP and information extraction communities have been developing for some time in order to study topics such as advanced question answering. On the other hand, not enough has been done to work with the database community to develop probabilistic retrieval models for unstructured, semi-structured, and structured data. There have been a number of attempts to combine IR and database functionality, none of which has been particularly successful. Most recently, some groups have been working on combining IR search with XML documents, but what is needed is a comprehensive examination of the issues and problems by teams from both areas working together, and the creation of new testbeds that can be used to evaluate proposed models. The time is right for such collaborations. Another example of where database, IR, and networking people can work together is in the development of distributed, heterogeneous information systems. This requires significant new research in areas like peer-to-peer architectures, semantic heterogeneity, automatic metadata generation, and retrieval models. If the information systems described above are extended to include new data types such as video, images, sound, and the whole range of scientific data (such as from the biosciences, geoscience, and astronomy), then a broad range of new challenges are added that need to be tackled in collaboration with people who know about these types of data. There should also be more cooperation between the data mining, IR, and summarization communities to tackle the core problem of defining what is new and interesting in streams of data. These and other similar collaborations will the basis for the future development of the IR field. 
We will continue to work on research problems that specifically interest us, but this research will increasingly be in the context of larger efforts. IR concepts and IR research will be an important part of the evolving mix of CS expertise that will be used to solve the \"grand\" challenges.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/860435.860437","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14

Abstract

Following the tradition of these acceptance talks, I will be giving my thoughts on where our field is going. Any discussion of the future of information retrieval (IR) research, however, needs to be placed in the context of its history and relationship to other fields. Although IR has had a very strong relationship with library and information science, its relationship to computer science (CS) and its relative standing as a sub-discipline of CS have been more dynamic. IR is quite an old field, and when a number of CS departments were forming in the 60s, it was not uncommon for a faculty member to be pursuing research related to IR. Early ACM curriculum recommendations for CS contained courses on information retrieval, and encyclopedias described IR and database systems as different aspects of the same field. By the 70s, there were only a few IR researchers in CS departments in the U.S., database systems was a separate (and thriving) field, and many felt that IR had stagnated and was largely irrelevant. The truth, in fact, was far from that. The IR research community was a small, but dedicated, group of researchers in the U.S. and Europe who were motivated by a desire to understand the process of information retrieval and to build systems that would help people find the right information in text databases. This was (and is) a hard goal, and it led to evaluation metrics and methodologies different from those of the database community. Progress in the field was hampered by a lack of large-scale testbeds, and tests were limited to databases containing at most a few hundred document abstracts. During the AI boom of the 80s, IR was still not a mainstream area, despite its focus on a human task involving natural language. IR focused on a statistical approach to language rather than the much more popular knowledge-based approach.
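The statistical approach mentioned above can be illustrated with a minimal sketch: rank documents against a free-text query purely by term statistics (term frequency weighted by inverse document frequency), with no hand-coded knowledge. This is an illustrative toy, not any specific system from the lecture; the documents and weighting details are assumptions.

```python
import math
from collections import Counter

def tf_idf_rank(query, docs):
    """Rank documents against a bag-of-words query by a simple TF-IDF score."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        # Sum TF * IDF over query terms that occur in the document.
        score = sum(
            tf[term] * math.log(n / df[term])
            for term in query.lower().split()
            if term in tf
        )
        scores.append((score, i))
    # Highest-scoring documents first.
    return [i for score, i in sorted(scores, reverse=True)]

# Toy collection (illustrative only).
docs = [
    "information retrieval ranks documents by statistical evidence",
    "knowledge based systems encode rules by hand",
    "retrieval models estimate relevance from term statistics",
]
print(tf_idf_rank("retrieval statistics", docs))  # → [2, 0, 1]
```

Note that ranking emerges entirely from counting: no grammar, rules, or ontology, which is what distinguished this line of work from the knowledge-based approaches dominant in the 80s.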
The fact that IR conferences mix papers on effectiveness as measured by human judgments with papers measuring performance of file organizations for large-scale systems has meant that IR has always been difficult to classify into simple categories such as "systems" or "AI" that are often used in CS departments. Since the early 90s, just about everything has changed. Large, full-text databases were finally made available for experimentation through DARPA funding and TREC. This has had an enormous positive impact on the quantity and quality of IR research. The advent of the Web search engine has validated the longstanding claims made by IR researchers that simple queries and ranking were the right techniques for information access in a largely unstructured information world. What has not changed is that there are still relatively few IR researchers in CS departments. There are, however, many more people in CS departments doing IR-related research, which is just about the same thing. Conferences in databases, machine learning, computational linguistics, and data mining publish a number of IR papers done by people who would not primarily consider themselves as IR researchers. Given that there is an increasing diffusion of IR ideas into the CS community, it is worth stating what IR, as a field of CS, has accomplished: Search engines have become the infrastructure for much of information access in our society. IR has provided the basic research on the algorithms and data structures for these engines, and continues to develop new capabilities such as cross-lingual search, distributed search, question answering, and topic detection and tracking. IR championed the statistical approach to language long before it was accepted by other researchers working on language technologies. Statistical NLP is now mainstream and results from that field are being used to improve IR systems (in question answering, for example). 
IR focused on evaluation as a research area, and developed an evaluation methodology based on large, standardized testbeds and comparison with human judgments that has been adopted by researchers in a number of other language technology areas. IR, because of its focus on measuring success based on human judgments, has always acknowledged the importance of the user and interaction as a part of information access. This led to a number of contributions to the design of query and search interfaces and learning techniques based on user feedback. Although these achievements are important, the long-term goals of the IR field have not yet been met. What are those goals? One possibility that is often mentioned is the MEMEX of Vannevar Bush [1]. Another, more recent, statement of long-term challenges was made in the report of the IR Challenges Workshop [2]: Global Information Access: Satisfy human information needs through natural, efficient interaction with an automated system that leverages world-wide structured and unstructured data in any language. Contextual Retrieval: Combine search technologies and knowledge about query and user context into a single framework in order to provide the most appropriate answer for a user's information need. These goals are, in fact, very similar to long-term challenges coming out of other CS fields. For example, Jim Gray, a Turing Award winner from the database area, mentioned in his address a personal and world MEMEX as long-term goals for his field and CS in general [3]. IR's long-term goals are clearly important long-term goals for the whole of CS, and achieving those goals will involve everyone interested in the general area of information management and retrieval. Rather than talking about what IR can do in isolation to progress towards its goals, I would prefer to talk about what IR can do in collaboration with other areas. There are many examples of potential collaborative research areas.
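The testbed-based evaluation methodology described above — scoring a system's ranked output against human relevance judgments — can be sketched with average precision, a standard TREC-style metric. The ranking and judgment sets here are made up for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of a ranked list against human relevance judgments."""
    relevant_ids = set(relevant_ids)
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)  # precision at each relevant rank
    # Unretrieved relevant documents count as zero precision.
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

# A system's ranking for one query, and the testbed's judged-relevant documents.
ranking = ["d3", "d1", "d7", "d2", "d5"]
judged_relevant = ["d1", "d2"]
print(average_precision(ranking, judged_relevant))  # → 0.5
```

Averaging this value over a set of standard queries gives mean average precision, the kind of comparable, repeatable measurement that large shared testbeds made possible.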
Collaborations with researchers from the NLP and information extraction communities have been developing for some time in order to study topics such as advanced question answering. On the other hand, not enough has been done to work with the database community to develop probabilistic retrieval models for unstructured, semi-structured, and structured data. There have been a number of attempts to combine IR and database functionality, none of which has been particularly successful. Most recently, some groups have been working on combining IR search with XML documents, but what is needed is a comprehensive examination of the issues and problems by teams from both areas working together, and the creation of new testbeds that can be used to evaluate proposed models. The time is right for such collaborations. Another example of where database, IR, and networking people can work together is in the development of distributed, heterogeneous information systems. This requires significant new research in areas like peer-to-peer architectures, semantic heterogeneity, automatic metadata generation, and retrieval models. If the information systems described above are extended to include new data types such as video, images, sound, and the whole range of scientific data (such as from the biosciences, geoscience, and astronomy), then a broad range of new challenges are added that need to be tackled in collaboration with people who know about these types of data. There should also be more cooperation between the data mining, IR, and summarization communities to tackle the core problem of defining what is new and interesting in streams of data. These and other similar collaborations will form the basis for the future development of the IR field. We will continue to work on research problems that specifically interest us, but this research will increasingly be in the context of larger efforts.
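As a flavor of the probabilistic retrieval models mentioned above, here is a minimal query-likelihood sketch with Dirichlet smoothing, one common instance of the statistical language-modeling family; the smoothing parameter and toy data are illustrative assumptions, not from the lecture.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=2000.0):
    """Log probability of the query under a Dirichlet-smoothed document language model."""
    doc_tf = Counter(doc.lower().split())
    doc_len = sum(doc_tf.values())
    coll_tf = Counter(collection.lower().split())
    coll_len = sum(coll_tf.values())
    log_p = 0.0
    for term in query.lower().split():
        p_coll = coll_tf[term] / coll_len  # background (collection) model
        # Dirichlet smoothing: mix document counts with the background model.
        p = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        if p == 0:
            return float("-inf")  # term unseen in document and collection
        log_p += math.log(p)
    return log_p

# Toy data: rank two candidate documents for a query (illustrative only).
collection = "xml retrieval combines structure and text retrieval models"
candidates = {
    "a": "probabilistic models for xml retrieval",
    "b": "relational query optimization",
}
best = max(candidates, key=lambda d: query_likelihood("xml retrieval", candidates[d], collection))
print(best)  # → "a", since it contains the query terms
```

The smoothing term is what makes the model probabilistically well-behaved on sparse data; extending such models cleanly to semi-structured and structured data is exactly the open problem the collaboration with the database community would address.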
IR concepts and IR research will be an important part of the evolving mix of CS expertise that will be used to solve the "grand" challenges.