To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages

Impact factor 2.6 | Zone 4 (Computer Science) | JCR Q2 (Computer Science, Information Systems)
John A. Berlin, Mat Kelly, Michael L. Nelson, M. Weigle
Journal: ACM Transactions on the Web
Published: 2023-03-27
DOI: 10.1145/3589206 (https://doi.org/10.1145/3589206)
Citations: 1

Abstract

When replaying an archived web page, or memento, the fundamental expectation is that the page should be viewable and function exactly as it did at archival time. Meeting this expectation, however, requires web archives to modify the page and its embedded resources upon replay so that all resources and links reference the archive rather than the original server. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. The process of replaying mementos and the modifications made to the representations vary between archives, and as a result there is no standard terminology for describing replay and the needed modifications. In this paper, we propose terminology for describing the existing styles of replay and the modifications that web archives make to mementos to facilitate replay. Because of issues discovered with server-side-only modifications, we propose a general framework for the auto-generation of client-side rewriting libraries. Finally, we evaluate the effectiveness of using a generated client-side rewriting library to augment the existing replay systems of web archives by crawling mementos replayed from the Internet Archive’s Wayback Machine with and without the generated client-side rewriter. Using the generated client-side rewriter, we decreased the cumulative number of requests blocked by the content security policy of the Wayback Machine for 577 mementos by 87.5% and increased the cumulative number of requests made by 32.8%. We were also able to replay mementos that were previously not replayable from the Internet Archive. Many of the client-side rewriting ideas described in this work have been implemented in Wombat, a client-side URL rewriting system that is used by the Webrecorder, Pywb, and Wayback Machine playback systems.
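The core modification the abstract describes, making every resource and link reference the archive rather than the original server, can be illustrated with a minimal sketch. This is not Wombat's code or the paper's generated library; the function name `rewrite_url`, the constant `ARCHIVE_PREFIX`, and the example timestamp are assumptions made for illustration. The sketch uses the Wayback Machine's replay URL layout, in which the archive prefix and a 14-digit capture timestamp are followed by the original URL. A client-side rewriter would apply a transformation of this kind to URLs that scripts construct at runtime, which a server-side rewriter never sees.

```python
from urllib.parse import urljoin, urlsplit

# Replay prefix used by the Internet Archive's Wayback Machine.
ARCHIVE_PREFIX = "https://web.archive.org/web/"

# Schemes and fragments that never reach the network and need no rewriting.
SKIP_PREFIXES = ("#", "data:", "javascript:", "mailto:", "blob:", "about:")


def rewrite_url(url: str, page_url: str, timestamp: str) -> str:
    """Return `url`, found in a memento of `page_url`, pointed at the archive.

    Hypothetical helper for illustration only.
    """
    if not url or url.startswith(SKIP_PREFIXES):
        return url
    if url.startswith(ARCHIVE_PREFIX):
        return url  # already rewritten, keep the operation idempotent
    # Resolve relative and protocol-relative URLs against the original page.
    absolute = urljoin(page_url, url)
    if urlsplit(absolute).scheme not in ("http", "https"):
        return url
    return f"{ARCHIVE_PREFIX}{timestamp}/{absolute}"


if __name__ == "__main__":
    page = "http://example.com/a/index.html"
    for link in ("/img/logo.png", "style.css", "//cdn.example.org/x.js", "#top"):
        print(link, "->", rewrite_url(link, page, "20230327000000"))
```

Keeping the function idempotent matters when server-side and client-side rewriting are combined, since a URL already rewritten by the server may pass through the client-side rewriter again.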
Source journal: ACM Transactions on the Web
Category: Engineering and Technology, Computer Science: Software Engineering
CiteScore: 4.90
Self-citation rate: 0.00%
Articles published: 26
Review time: 7.5 months
Journal description: Transactions on the Web (TWEB) is a journal publishing refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies. Topics in the scope of TWEB include but are not limited to the following: Browsers and Web Interfaces; Electronic Commerce; Electronic Publishing; Hypertext and Hypermedia; Semantic Web; Web Engineering; Web Services; and Service-Oriented Computing XML. In addition, papers addressing the intersection of the following broader technologies with the Web are also in scope: Accessibility; Business Services Education; Knowledge Management and Representation; Mobility and pervasive computing; Performance and scalability; Recommender systems; Searching, Indexing, Classification, Retrieval and Querying, Data Mining and Analysis; Security and Privacy; and User Interfaces. Papers discussing specific Web technologies, applications, content generation and management and use are within scope. Also, papers describing novel applications of the web as well as papers on the underlying technologies are welcome.