{"title":"Sorting URLs out: seeing the web through infrastructural inversion of archival crawling","authors":"Emily Maemura","doi":"10.1080/24701475.2023.2258697","DOIUrl":null,"url":null,"abstract":"AbstractWeb archives collections have become important sources for Internet scholars by documenting the past versions of web resources. Understanding how these collections are created and curated is of increasing concern and recent web archives scholarship has studied how the artefacts stored in archives represent specific curatorial choices and collecting practices. This paper takes a novel approach in studying web archiving practice, by focusing on the challenges encountered in archival web crawling and what they reveal about the web itself. Inspired by foundational work in infrastructure studies, infrastructural inversion is applied to study how crawler interactions surface otherwise invisible, background or taken-for-granted aspects of the web. This framework is applied to study three examples selected from interviews and ethnographic fieldwork observations of web archiving practices at the Danish Royal Library, with findings demonstrating how the challenges of archival crawling illuminate the web’s varied actors, as well as their changing relationships, power differentials and politics. Ultimately, analysis through infrastructural inversion reveals how collection via crawling positions archives as active participants in web infrastructure, both shaping and shaped by the needs and motivations of other web actors.Keywords: Web archivesweb crawlerscrawler trapsinfrastructural inversioninfrastructure studiessocio-technical systems AcknowledgementsMany thanks to all the participants at the Netarchive for their time, to Zoe LeBlanc, Katie Mackinnon and Karen Wickett for their feedback on an early draft of this article, and to the anonymous reviewers for their helpful comments and suggestions throughout the review process.Disclosure statementNo potential conflict of interest was reported by the author.Notes1 For a more thorough account of the Netarchive’s processes and collecting history, see Schostag and Fønss-Jørgensen (Citation2012), and Laursen and & Møldrup-Dalum (Citation2017).2 An average of two to three event harvests are conducted each year, including both predictable events like regional and national elections, national celebrations or sporting events, as well as unpredictable events such as the financial crisis of 2008, the swine flu outbreak in 2009, a national teacher lockout in 2013, and terrorist attacks in Copenhagen in 2015.3 See W3C’s historic document on HTTP status codes (https://www.w3.org/Protocols/http/HTRESP.html) and RFC 1945 HTTP/1.0 (https://www.ietf.org/rfc/rfc1945.txt).4 IANA maintains a registry of current codes and their descriptions https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml5 CAPTCHA stands for “Completely Automated Public Turing test to tell Computers and Humans Apart,” and Justie (Citation2021) presents an in-depth history of various CAPTCHA technologies and their implementation.Additional informationFundingSocial Sciences and Humanities Research Council of Canada, Canada Graduate Scholarship 767-2015-2217 and Michael Smith Foreign Study Supplement.Notes on contributorsEmily MaemuraEmily Maemura is Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. 
Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data. She is interested in approaches and methods for working with archived web data in the form of large-scale research collections, considering diverse perspectives of the internet as an object and site of study.","PeriodicalId":52252,"journal":{"name":"Internet Histories","volume":null,"pages":null},"PeriodicalIF":1.0000,"publicationDate":"2023-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Internet Histories","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1080/24701475.2023.2258697","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMMUNICATION","Score":null,"Total":0}
引用次数: 0
Abstract
Web archive collections have become important sources for Internet scholars by documenting past versions of web resources. Understanding how these collections are created and curated is of increasing concern, and recent web archives scholarship has studied how the artefacts stored in archives represent specific curatorial choices and collecting practices. This paper takes a novel approach to studying web archiving practice by focusing on the challenges encountered in archival web crawling and what they reveal about the web itself. Inspired by foundational work in infrastructure studies, infrastructural inversion is applied to study how crawler interactions surface otherwise invisible, background, or taken-for-granted aspects of the web. This framework is applied to three examples selected from interviews and ethnographic fieldwork observations of web archiving practices at the Danish Royal Library, with findings demonstrating how the challenges of archival crawling illuminate the web's varied actors, as well as their changing relationships, power differentials and politics. Ultimately, analysis through infrastructural inversion reveals how collection via crawling positions archives as active participants in web infrastructure, both shaping and shaped by the needs and motivations of other web actors.

Keywords: web archives; web crawlers; crawler traps; infrastructural inversion; infrastructure studies; socio-technical systems

Acknowledgements
Many thanks to all the participants at the Netarchive for their time, to Zoe LeBlanc, Katie Mackinnon and Karen Wickett for their feedback on an early draft of this article, and to the anonymous reviewers for their helpful comments and suggestions throughout the review process.

Disclosure statement
No potential conflict of interest was reported by the author.

Notes
1. For a more thorough account of the Netarchive's processes and collecting history, see Schostag and Fønss-Jørgensen (2012) and Laursen and Møldrup-Dalum (2017).
2. An average of two to three event harvests are conducted each year, covering both predictable events, such as regional and national elections, national celebrations or sporting events, and unpredictable events, such as the financial crisis of 2008, the swine flu outbreak in 2009, a national teacher lockout in 2013, and the terrorist attacks in Copenhagen in 2015.
3. See W3C's historic document on HTTP status codes (https://www.w3.org/Protocols/http/HTRESP.html) and RFC 1945, HTTP/1.0 (https://www.ietf.org/rfc/rfc1945.txt).
4. IANA maintains a registry of current codes and their descriptions: https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml
5. CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart"; Justie (2021) presents an in-depth history of various CAPTCHA technologies and their implementation.

Funding
Social Sciences and Humanities Research Council of Canada, Canada Graduate Scholarship 767-2015-2217 and Michael Smith Foreign Study Supplement.

Notes on contributors
Emily Maemura is Assistant Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. Her research focuses on data practices and the activities of curation, description, characterization, and re-use of archived web data. She is interested in approaches and methods for working with archived web data in the form of large-scale research collections, considering diverse perspectives of the internet as an object and site of study.
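
As a side illustration of the status-code machinery referenced in notes 3 and 4, the following minimal Python sketch shows how a crawler might record the HTTP status codes it encounters during a harvest, including the full redirect chain and any codes that fall outside the IANA registry. It is not drawn from the article or from the Netarchive's actual tooling; the `requests` client and the helper names `describe_status` and `probe` are illustrative assumptions.

```python
# Illustrative sketch (not from the article): logging the HTTP status
# codes a crawler encounters, including unregistered or custom codes.
from http import HTTPStatus

import requests  # assumed available; any HTTP client would do


def describe_status(code: int) -> str:
    """Map a numeric status code to its registered description, if any."""
    try:
        return HTTPStatus(code).phrase  # e.g. 404 -> "Not Found"
    except ValueError:
        # Servers sometimes return codes absent from the IANA registry;
        # an archival crawler still has to record and act on them.
        return "unregistered status code"


def probe(url: str) -> None:
    """Fetch a URL and print the status code of every hop."""
    response = requests.get(url, timeout=10, allow_redirects=True)
    # response.history holds the redirect responses that preceded the
    # final one, so the whole chain of codes can be archived.
    for hop in list(response.history) + [response]:
        print(f"{hop.status_code} {describe_status(hop.status_code)} <- {hop.url}")


if __name__ == "__main__":
    probe("https://www.w3.org/Protocols/http/HTRESP.html")
```

Recording the full redirect chain, not just the final response, matters for the kind of analysis the article describes: intermediate codes are part of how servers signal their relationship to automated clients.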