Towards realistic known-item topics for the ClueWeb

C. Hauff, Matthias Hagen, Anna Beyer, Benno Stein
Venue: International Conference on Information Interaction in Context
Published: 2012-08-21
DOI: 10.1145/2362724.2362773 (https://doi.org/10.1145/2362724.2362773)
Citations: 11

Abstract

Known-item finding is the task of re-finding and re-accessing a previously seen item. Typical examples of known items include visited Web sites, received emails, and documents on one's personal desktop. Current research on known-item finding relies heavily on corpora of known-item queries and the respective known items. However, many existing corpora are proprietary and not available to the public (in particular those derived from Web query logs), which precludes repeatable research. The existing publicly available corpora contain either automatically generated queries or queries that were written manually while the known item itself was in view. Hence, we consider these public corpora to be rather artificial in nature.

In this paper, we propose a methodology for creating a known-item topic set that is much more realistic and that is built on top of a large-scale public test corpus. From known-item questions posted on the popular Yahoo! Answers platform, we extract queries for known items in a crowdsourcing setup. Since we ensure that all the known items correspond to Web pages in the publicly available ClueWeb09 corpus (a large static Web crawl), we provide an environment for repeatable, realistic, Web-scale known-item searches.