To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages

IF 4.1; Zone 4, Computer Science; Q2, Computer Science, Information Systems

ACM Transactions on the Web. Pub Date: 2023-07-11. DOI: https://dl.acm.org/doi/10.1145/3589206

John Berlin, Mat Kelly, Michael L. Nelson, Michele C. Weigle


Citations: 0

Abstract

When replaying an archived web page, or memento, the fundamental expectation is that the page should be viewable and function exactly as it did at the archival time. However, this expectation requires web archives, upon replay, to modify the page and its embedded resources so that all resources and links reference the archive rather than the original server. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. The process of replaying mementos and the modifications made to the representations vary between web archives. Because of this, there is no standard terminology for describing the replay and the needed modifications. In this article, we propose terminology for describing the existing styles of replay and the modifications that web archives make to mementos to facilitate replay. Because of issues discovered with server-side-only modifications, we propose a general framework for the auto-generation of client-side rewriting libraries. Finally, we evaluate the effectiveness of using a generated client-side rewriting library to augment the existing replay systems of web archives by crawling mementos replayed from the Internet Archive’s Wayback Machine with and without the generated client-side rewriter. By using the generated client-side rewriter, we were able to decrease the cumulative number of requests blocked by the content security policy of the Wayback Machine for 577 mementos by 87.5% and increase the cumulative number of requests made by 32.8%. We were also able to replay mementos that were previously not replayable from the Internet Archive. Many of the client-side rewriting ideas described in this work have been implemented in Wombat, a client-side URL rewriting system that is used by the Webrecorder, Pywb, and Wayback Machine playback systems.
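
To make concrete the modification described above, in which links and embedded resources are changed to reference the archive rather than the original server, here is a minimal Python sketch. It is not the authors' framework and not Wombat's API. The function names, the regular expression, and the choice to handle only `href` and `src` attributes are assumptions made for illustration. The only external fact it relies on is the Wayback Machine's `https://web.archive.org/web/<timestamp>/<original URL>` URL form.

```python
import re
from urllib.parse import urljoin

# Wayback Machine replay URLs have the form <prefix><timestamp>/<original URL>.
ARCHIVE_PREFIX = "https://web.archive.org/web/"

# Schemes and fragments that should never be routed through the archive.
SKIP_PREFIXES = ("#", "data:", "javascript:", "mailto:")

# Hypothetical, deliberately simple matcher for quoted href/src attributes.
# The lookbehind avoids matching attributes such as data-src.
ATTR_RE = re.compile(
    r'(?<![\w-])(?P<attr>href|src)=(?P<q>["\'])(?P<url>.*?)(?P=q)',
    re.IGNORECASE,
)


def to_archive_url(url: str, base_url: str, timestamp: str) -> str:
    """Resolve `url` against the page's original URL and prefix the archive."""
    if url.startswith(ARCHIVE_PREFIX) or url.startswith(SKIP_PREFIXES):
        return url  # already rewritten, or not a fetchable resource
    absolute = urljoin(base_url, url)
    return f"{ARCHIVE_PREFIX}{timestamp}/{absolute}"


def rewrite_html(html: str, base_url: str, timestamp: str) -> str:
    """Rewrite href/src attribute values in static markup (illustrative only)."""
    def repl(match: re.Match) -> str:
        new_url = to_archive_url(match.group("url"), base_url, timestamp)
        q = match.group("q")
        return f"{match.group('attr')}={q}{new_url}{q}"

    return ATTR_RE.sub(repl, html)


if __name__ == "__main__":
    page = '<a href="/about">About</a><img src="http://example.com/a.png">'
    print(rewrite_html(page, "http://example.com/index.html", "20230711000000"))
```

A rewrite of static markup like this one cannot see URLs that JavaScript constructs at runtime, which is one general reason a replay system may also rewrite on the client side. The abstract does not enumerate the specific server-side issues the authors found.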




Source journal

ACM Transactions on the Web (Engineering and Technology, Computer Science: Software Engineering)

CiteScore: 4.90

Self-citation rate: 0.00%

Review time: 7.5 months

Journal description: Transactions on the Web (TWEB) is a journal publishing refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies. Topics in the scope of TWEB include but are not limited to the following: Browsers and Web Interfaces; Electronic Commerce; Electronic Publishing; Hypertext and Hypermedia; Semantic Web; Web Engineering; Web Services and Service-Oriented Computing; and XML. In addition, papers addressing the intersection of the following broader technologies with the Web are also in scope: Accessibility; Business Services; Education; Knowledge Management and Representation; Mobility and Pervasive Computing; Performance and Scalability; Recommender Systems; Searching, Indexing, Classification, Retrieval and Querying, Data Mining and Analysis; Security and Privacy; and User Interfaces. Papers discussing specific Web technologies, applications, content generation, management, and use are within scope. Papers describing novel applications of the Web, as well as papers on the underlying technologies, are also welcome.