All WARC and no playback: The materialities of data-centered web archives research

IF 5.9, Zone 1 (Sociology), Q1 SOCIAL SCIENCES, INTERDISCIPLINARY

Big Data & Society Pub Date : 2023-01-01 DOI:10.1177/20539517231163172

Emily Maemura


Citations: 2

Abstract

This paper examines the Web ARChive (WARC) file format, revealing how the format has come to play a central role in the development and standardization of interoperable tools and methods for the international web archiving community. In the context of emerging big data approaches, I consider the sociotechnical relationships between material construction of data and information infrastructures for collecting and research. Analysis is inspired by Star and Griesemer's historical case of the Museum of Vertebrate Zoology which reveals how boundary objects and methods standardization are used to enroll actors in the work of collecting for natural history. I extend these concepts by pairing them with frameworks for studying digital materiality and the representational qualities of data artifacts. Through examples drawn from fieldwork observations studying two data-centered research projects, I consider how the materiality of the WARC format influences research methods and approaches to data extraction, selection, and transformation. Findings identify three modalities researchers use to configure WARC data for researcher needs: using indexes to support search queries, constructing derivative formats designed for certain types of analysis, and generating custom-designed datasets tailored for specific research purposes. Findings additionally reveal similarities in how these distinct methods approach automated data extraction by relying upon the WARC's standardized metadata elements. By interrogating whose information needs are being met and taken into account in the design of the WARC's underlying information representation, I reveal effects on the emerging field of web history, and consider alternative approaches to knowledge production with archived web data.
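The reliance on standardized metadata elements described in the abstract can be illustrated with a minimal sketch. It assumes an uncompressed WARC/1.0 stream in which each record is a version line, named header fields, a blank line, a content block of `Content-Length` bytes, and a trailing blank-line separator. The helper names (`make_record`, `read_records`, `build_index`) and the sample records are hypothetical and are not taken from the paper or from any web archiving tool; the resulting index is a simplified stand-in for the indexes and derivative formats the abstract mentions, not a real CDX file.

```python
import io
import uuid


def make_record(warc_type, uri, date, payload):
    """Build one minimal WARC/1.0 record as bytes (illustrative sample data only)."""
    header_lines = [
        "WARC/1.0",
        f"WARC-Type: {warc_type}",
        f"WARC-Target-URI: {uri}",
        f"WARC-Date: {date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLFs after the content block.
    return head + payload + b"\r\n\r\n"


def read_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines between records
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read the content block by declared length, so payloads may contain newlines.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def build_index(stream):
    """Collect a simplified index of captured responses from record metadata alone."""
    return [
        {"uri": h["WARC-Target-URI"], "date": h["WARC-Date"], "length": len(p)}
        for h, p in read_records(stream)
        if h.get("WARC-Type") == "response"
    ]


sample = (
    make_record("response", "http://example.org/", "2023-01-01T00:00:00Z", b"hello\r\n\r\nworld")
    + make_record("request", "http://example.org/", "2023-01-01T00:00:00Z", b"GET / HTTP/1.1")
    + make_record("response", "http://example.org/about", "2023-01-02T00:00:00Z", b"about page")
)
index = build_index(io.BytesIO(sample))
for entry in index:
    print(entry["date"], entry["uri"], entry["length"])
```

Note that the selection step never inspects the captured content itself: only the `WARC-Type`, `WARC-Target-URI`, and `WARC-Date` fields decide what enters the index. Archives in practice usually store records gzip-compressed, which this sketch does not handle.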


Source journal

Big Data & Society (SOCIAL SCIENCES, INTERDISCIPLINARY)

CiteScore

10.90

Self-citation rate

10.60%

Number of articles published

Review time

11 weeks

Journal description: Big Data & Society (BD&S) is an open access, peer-reviewed scholarly journal that publishes interdisciplinary work principally in the social sciences, humanities, and computing and their intersections with the arts and natural sciences. The journal focuses on the implications of Big Data for societies and aims to connect debates about Big Data practices and their effects on various sectors such as academia, social life, industry, business, and government. BD&S considers Big Data as an emerging field of practices, not solely defined by but generative of unique data qualities such as high volume, granularity, data linking, and mining. The journal pays attention to digital content generated both online and offline, encompassing social media, search engines, closed networks (e.g., commercial or government transactions), and open networks like digital archives, open government, and crowdsourced data. Rather than providing a fixed definition of Big Data, BD&S encourages interdisciplinary inquiries, debates, and studies on various topics and themes related to Big Data practices. BD&S seeks contributions that analyze Big Data practices, involve empirical engagements and experiments with innovative methods, and reflect on the consequences of these practices for the representation, realization, and governance of societies. As a digital-only journal, BD&S's platform can accommodate multimedia formats such as complex images, dynamic visualizations, videos, and audio content. The contents of the journal encompass peer-reviewed research articles, colloquia, bookcasts, think pieces, state-of-the-art methods, and work by early career researchers.