Exploring Web Archives Through Temporal Anchor Texts

Proceedings of the 2017 ACM on Web Science Conference. Pub Date: 2017-06-25. DOI: 10.1145/3091478.3091500

Helge Holzmann, W. Nejdl, Avishek Anand


Citations: 25

Abstract



Exploring Web Archives Through Temporal Anchor Texts

Web archives have been instrumental in the digital preservation of the Web and provide a great opportunity to study the societal past and its evolution. These archives are massive collections, typically on the order of terabytes to petabytes. As a result, search and exploration of archives has been limited, because full-text indexing is expensive in both resources and computation. We observe that typical access methods to archives, which are navigational and temporal in nature, do not always require full-text indexing. Instead, text surrogates such as anchor texts already go a long way toward providing meaningful solutions and can act as reasonable entry points for exploring Web archives. In this paper, we present a new approach to searching Web archives based on temporal link graphs and corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, like finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated into a data processing workflow to scale up to thousands of pages. In this analysis, we were able to replicate results reported by an offline study, showing that restaurant prices indeed increased in Germany when the Euro was introduced as Europe's currency.
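The abstract does not spell out the indexing methods or the retrieval model, so the following is only a toy sketch of the general idea of searching by anchor text within a time window. It is not the authors' method. The `Link` record, the `build_index` and `search` functions, the example URLs, and the ranking rule (counting distinct linking pages) are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Link:
    """One hyperlink seen in an archived capture (hypothetical schema)."""
    source: str     # URL of the page that contains the link
    target: str     # URL the link points to
    anchor: str     # visible anchor text of the link
    captured: date  # capture date of the source page


def build_index(links):
    """Map each lowercased anchor term to the links whose anchor contains it."""
    index = defaultdict(list)
    for link in links:
        for term in set(link.anchor.lower().split()):
            index[term].append(link)
    return index


def search(index, query, start, end):
    """Rank target URLs by how many distinct pages linked to them between
    start and end (inclusive) with an anchor containing every query term.
    This count is a stand-in for a real temporal retrieval model."""
    terms = query.lower().split()
    if not terms:
        return []
    sources = defaultdict(set)
    # Only links containing the first term can match all terms.
    for link in index.get(terms[0], []):
        anchor_terms = set(link.anchor.lower().split())
        in_window = start <= link.captured <= end
        if in_window and all(t in anchor_terms for t in terms):
            sources[link.target].add(link.source)
    ranked = sorted(sources.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(target, len(srcs)) for target, srcs in ranked]


# Made-up example data; none of these URLs come from the paper.
links = [
    Link("a.example/news", "party.example/", "Example Party", date(2002, 3, 1)),
    Link("b.example/blog", "party.example/", "example party homepage", date(2002, 9, 12)),
    Link("c.example/list", "party.example/old", "Example Party", date(1999, 5, 4)),
    Link("a.example/food", "menu.example/", "restaurant menu", date(2002, 6, 30)),
]
index = build_index(links)
results = search(index, "example party", date(2002, 1, 1), date(2002, 12, 31))
print(results)
```

Restricting the same query to 1999 would instead surface `party.example/old`, which illustrates how a time window can change which page counts as the entry point for an entity. A list of target URLs like this could then be passed to a downstream processing job, in the spirit of the workflow experiment the abstract mentions.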

