Information organization and retrieval with collaboratively generated content

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval Pub Date : 2011-07-24 DOI:10.1145/2009916.2010173

Eugene Agichtein, E. Gabrilovich

{"title":"Information organization and retrieval with collaboratively generated content","authors":"Eugene Agichtein, E. Gabrilovich","doi":"10.1145/2009916.2010173","DOIUrl":null,"url":null,"abstract":"Proliferation of ubiquitous access to the Internet enables millions of Web users to collaborate online on a variety of activities. Many of these activities result in the construction of large repositories of knowledge, either as their primary aim (e.g., Wikipedia) or as a by-product (e.g., Yahoo! Answers). In this tutorial, we will discuss organizing and exploiting Collaboratively Generated Content (CGC) for information organization and retrieval. Specifically, we intend to cover two complementary areas of the problem: (1) using such content as a powerful enabling resource for knowledge-enriched, intelligent representations and new information retrieval algorithms, and (2) development of supporting technologies for extracting, filtering, and organizing collaboratively created content. The unprecedented amounts of information in CGC enable new, knowledge-rich approaches to information access, which are significantly more powerful than the conventional word-based methods. Considerable progress has been made in this direction over the last few years. Examples include explicit manipulation of human-defined concepts and their use to augment the bag of words (cf. Explicit Semantic Analysis), using large-scale taxonomies of topics from Wikipedia or the Open Directory Project to construct additional class-based features, or using Wikipedia for better word sense disambiguation. However, the quality and comprehensiveness of collaboratively created content vary widely, and in order for this resource to be useful, a significant amount of preprocessing, filtering, and organization is necessary. Consequently, new methods for analyzing CGC and corresponding user interactions are required to effectively harness the resulting knowledge. Thus, not only the content repositories can be used to improve IR methods, but the reverse pollination is also possible, as better information extraction methods can be used for automatically collecting more knowledge, or verifying the contributed content. This natural connection between modeling the generation process of CGC and effectively using the accumulated knowledge suggests covering both areas together in a single tutorial. The intended audience of the tutorial includes IR researchers and graduate students, who would like to learn about the recent advances and research opportunities in working with collaboratively generated content. The emphasis of the tutorial is on comparing the existing approaches and presenting practical techniques that IR practitioners can use in their research. We also cover open research challenges, as well as survey available resources (software tools and data) for getting started in this research field.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2009916.2010173","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

Abstract

Proliferation of ubiquitous access to the Internet enables millions of Web users to collaborate online on a variety of activities. Many of these activities result in the construction of large repositories of knowledge, either as their primary aim (e.g., Wikipedia) or as a by-product (e.g., Yahoo! Answers). In this tutorial, we will discuss organizing and exploiting Collaboratively Generated Content (CGC) for information organization and retrieval. Specifically, we intend to cover two complementary areas of the problem: (1) using such content as a powerful enabling resource for knowledge-enriched, intelligent representations and new information retrieval algorithms, and (2) development of supporting technologies for extracting, filtering, and organizing collaboratively created content. The unprecedented amounts of information in CGC enable new, knowledge-rich approaches to information access, which are significantly more powerful than the conventional word-based methods. Considerable progress has been made in this direction over the last few years. Examples include explicit manipulation of human-defined concepts and their use to augment the bag of words (cf. Explicit Semantic Analysis), using large-scale taxonomies of topics from Wikipedia or the Open Directory Project to construct additional class-based features, or using Wikipedia for better word sense disambiguation. However, the quality and comprehensiveness of collaboratively created content vary widely, and in order for this resource to be useful, a significant amount of preprocessing, filtering, and organization is necessary. Consequently, new methods for analyzing CGC and corresponding user interactions are required to effectively harness the resulting knowledge. Thus, not only the content repositories can be used to improve IR methods, but the reverse pollination is also possible, as better information extraction methods can be used for automatically collecting more knowledge, or verifying the contributed content. This natural connection between modeling the generation process of CGC and effectively using the accumulated knowledge suggests covering both areas together in a single tutorial. The intended audience of the tutorial includes IR researchers and graduate students, who would like to learn about the recent advances and research opportunities in working with collaboratively generated content. The emphasis of the tutorial is on comparing the existing approaches and presenting practical techniques that IR practitioners can use in their research. We also cover open research challenges, as well as survey available resources (software tools and data) for getting started in this research field.

查看原文本刊更多论文

使用协作生成的内容进行信息组织和检索

无处不在的Internet访问的激增使数百万Web用户能够在各种活动上进行在线协作。这些活动中的许多都导致了大型知识库的构建，这些知识库要么作为它们的主要目标(例如Wikipedia)，要么作为副产品(例如Yahoo!答案)。在本教程中，我们将讨论组织和利用协作生成内容(CGC)进行信息组织和检索。具体来说，我们打算涵盖这个问题的两个互补领域:(1)使用这样的内容作为一种强大的支持资源，用于丰富知识、智能表示和新的信息检索算法，以及(2)开发用于提取、过滤和组织协作创建内容的支持技术。CGC中前所未有的信息量使新的、知识丰富的信息访问方法成为可能，这些方法比传统的基于单词的方法要强大得多。在过去几年中，这方面已经取得了相当大的进展。例子包括对人类定义的概念进行显式操作，并使用它们来增加单词包(参见显式语义分析)，使用来自维基百科或开放目录项目的大规模主题分类法来构建额外的基于类的特征，或者使用维基百科来更好地消除词义歧义。但是，协作创建的内容的质量和全面性差别很大，为了使这些资源有用，需要进行大量的预处理、过滤和组织。因此，需要新的分析CGC和相应的用户交互的方法来有效地利用所得到的知识。因此，不仅可以使用内容存储库来改进IR方法，而且还可以使用反向授粉，因为可以使用更好的信息提取方法来自动收集更多的知识或验证所贡献的内容。建模CGC生成过程和有效地使用积累的知识之间的这种自然联系建议在一个教程中同时涵盖这两个领域。本教程的目标受众包括IR研究人员和研究生，他们希望了解协作生成内容的最新进展和研究机会。本教程的重点是比较现有的方法，并提出IR从业者可以在他们的研究中使用的实用技术。我们还涵盖了开放的研究挑战，以及调查可用资源(软件工具和数据)，以便开始这个研究领域。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

自引率

0.00%

发文量