{"title":"Sorting URLs out: seeing the web through infrastructural inversion of archival crawling","authors":"Emily Maemura","doi":"10.1080/24701475.2023.2258697","DOIUrl":null,"url":null,"abstract":"AbstractWeb archives collections have become important sources for Internet scholars by documenting the past versions of web resources. Understanding how these collections are created and curated is of increasing concern and recent web archives scholarship has studied how the artefacts stored in archives represent specific curatorial choices and collecting practices. This paper takes a novel approach in studying web archiving practice, by focusing on the challenges encountered in archival web crawling and what they reveal about the web itself. Inspired by foundational work in infrastructure studies, infrastructural inversion is applied to study how crawler interactions surface otherwise invisible, background or taken-for-granted aspects of the web. This framework is applied to study three examples selected from interviews and ethnographic fieldwork observations of web archiving practices at the Danish Royal Library, with findings demonstrating how the challenges of archival crawling illuminate the web’s varied actors, as well as their changing relationships, power differentials and politics. Ultimately, analysis through infrastructural inversion reveals how collection via crawling positions archives as active participants in web infrastructure, both shaping and shaped by the needs and motivations of other web actors.Keywords: Web archivesweb crawlerscrawler trapsinfrastructural inversioninfrastructure studiessocio-technical systems AcknowledgementsMany thanks to all the participants at the Netarchive for their time, to Zoe LeBlanc, Katie Mackinnon and Karen Wickett for their feedback on an early draft of this article, and to the anonymous reviewers for their helpful comments and suggestions throughout the review process.Disclosure statementNo potential conflict of interest was reported by the author.Notes1 For a more thorough account of the Netarchive’s processes and collecting history, see Schostag and Fønss-Jørgensen (Citation2012), and Laursen and & Møldrup-Dalum (Citation2017).2 An average of two to three event harvests are conducted each year, including both predictable events like regional and national elections, national celebrations or sporting events, as well as unpredictable events such as the financial crisis of 2008, the swine flu outbreak in 2009, a national teacher lockout in 2013, and terrorist attacks in Copenhagen in 2015.3 See W3C’s historic document on HTTP status codes (https://www.w3.org/Protocols/http/HTRESP.html) and RFC 1945 HTTP/1.0 (https://www.ietf.org/rfc/rfc1945.txt).4 IANA maintains a registry of current codes and their descriptions https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml5 CAPTCHA stands for “Completely Automated Public Turing test to tell Computers and Humans Apart,” and Justie (Citation2021) presents an in-depth history of various CAPTCHA technologies and their implementation.Additional informationFundingSocial Sciences and Humanities Research Council of Canada, Canada Graduate Scholarship 767-2015-2217 and Michael Smith Foreign Study Supplement.Notes on contributorsEmily MaemuraEmily Maemura is Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. 
Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data. She is interested in approaches and methods for working with archived web data in the form of large-scale research collections, considering diverse perspectives of the internet as an object and site of study.","PeriodicalId":52252,"journal":{"name":"Internet Histories","volume":null,"pages":null},"PeriodicalIF":1.0000,"publicationDate":"2023-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Internet Histories","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1080/24701475.2023.2258697","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMMUNICATION","Score":null,"Total":0}
引用次数: 0
Abstract
Web archive collections have become important sources for Internet scholars by documenting past versions of web resources. Understanding how these collections are created and curated is of increasing concern, and recent web archives scholarship has studied how the artefacts stored in archives represent specific curatorial choices and collecting practices. This paper takes a novel approach to studying web archiving practice by focusing on the challenges encountered in archival web crawling and what they reveal about the web itself. Inspired by foundational work in infrastructure studies, infrastructural inversion is applied to study how crawler interactions surface otherwise invisible, background, or taken-for-granted aspects of the web. This framework is applied to three examples selected from interviews and ethnographic fieldwork observations of web archiving practices at the Danish Royal Library, with findings demonstrating how the challenges of archival crawling illuminate the web's varied actors, as well as their changing relationships, power differentials and politics. Ultimately, analysis through infrastructural inversion reveals how collection via crawling positions archives as active participants in web infrastructure, both shaping and shaped by the needs and motivations of other web actors.

Keywords: web archives; web crawlers; crawler traps; infrastructural inversion; infrastructure studies; socio-technical systems

Acknowledgements
Many thanks to all the participants at the Netarchive for their time, to Zoe LeBlanc, Katie Mackinnon and Karen Wickett for their feedback on an early draft of this article, and to the anonymous reviewers for their helpful comments and suggestions throughout the review process.

Disclosure statement
No potential conflict of interest was reported by the author.

Notes
1. For a more thorough account of the Netarchive's processes and collecting history, see Schostag and Fønss-Jørgensen (2012) and Laursen and Møldrup-Dalum (2017).
2. An average of two to three event harvests are conducted each year, covering both predictable events, such as regional and national elections, national celebrations or sporting events, and unpredictable events, such as the financial crisis of 2008, the swine flu outbreak in 2009, a national teacher lockout in 2013, and the terrorist attacks in Copenhagen in 2015.
3. See W3C's historic document on HTTP status codes (https://www.w3.org/Protocols/http/HTRESP.html) and RFC 1945, HTTP/1.0 (https://www.ietf.org/rfc/rfc1945.txt).
4. IANA maintains a registry of current codes and their descriptions: https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml
5. CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart"; Justie (2021) presents an in-depth history of various CAPTCHA technologies and their implementation.

Funding
Social Sciences and Humanities Research Council of Canada, Canada Graduate Scholarship 767-2015-2217 and Michael Smith Foreign Study Supplement.

Notes on contributors
Emily Maemura is Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data. She is interested in approaches and methods for working with archived web data in the form of large-scale research collections, considering diverse perspectives of the internet as an object and site of study.
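
As a side illustration of the status-code machinery referenced in notes 3 and 4, the following minimal Python sketch shows how a crawler might record the HTTP status codes it encounters during a harvest, including the full redirect chain and any codes that fall outside the IANA registry. It is not drawn from the article or from the Netarchive's actual tooling; the `requests` client and the helper names `describe_status` and `probe` are illustrative assumptions.

```python
# Illustrative sketch (not from the article): logging the HTTP status
# codes a crawler encounters, including unregistered or custom codes.
from http import HTTPStatus

import requests  # assumed available; any HTTP client would do


def describe_status(code: int) -> str:
    """Map a numeric status code to its registered description, if any."""
    try:
        return HTTPStatus(code).phrase  # e.g. 404 -> "Not Found"
    except ValueError:
        # Servers sometimes return codes absent from the IANA registry;
        # an archival crawler still has to record and act on them.
        return "unregistered status code"


def probe(url: str) -> None:
    """Fetch a URL and print the status code of every hop."""
    response = requests.get(url, timeout=10, allow_redirects=True)
    # response.history holds the redirect responses that preceded the
    # final one, so the whole chain of codes can be archived.
    for hop in list(response.history) + [response]:
        print(f"{hop.status_code} {describe_status(hop.status_code)} <- {hop.url}")


if __name__ == "__main__":
    probe("https://www.w3.org/Protocols/http/HTRESP.html")
```

Recording the full redirect chain, not just the final response, matters for the kind of analysis the article describes: intermediate codes are part of how servers signal their relationship to automated clients.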