Are Large-Scale Data From Private Companies Reliable? An Analysis of Machine-Generated Business Location Data in a Popular Dataset
Authors: Nikolitsa Grigoropoulou, Mario L. Small
Journal: Social Science Computer Review (impact factor 3.0; JCR Q2, Computer Science, Interdisciplinary Applications)
Published: 2024-04-15
DOI: 10.1177/08944393241245390 (https://doi.org/10.1177/08944393241245390)
Citation count: 0
Abstract
Large-scale data from private companies offer new opportunities to examine topics of scientific and social significance, such as racial inequality, partisan polarization, and activity-based segregation. However, because such data are often generated through automated processes, their accuracy and reliability for social science research remain unclear. The present study examines how quality issues in large-scale data from private companies can afflict the reporting of even ostensibly uncomplicated values. We assess the reliability with which an often-used device tracking data source, SafeGraph, sorted data it acquired on financial institutions into categories, such as banks and payday lenders, based on a standard classification system. We find major classification problems that vary by type of institution, and remarkably high rates of unidentified closures and duplicate records. We suggest that classification problems can affect research based on large-scale private data in four ways: detection, efficiency, validity, and bias. We discuss the implications of our findings, and list a set of problems researchers should consider when using large-scale data from companies.
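The three quality problems named in the abstract (misclassification by institution type, duplicate records, and unidentified closures) can be illustrated with a minimal sketch. It assumes a small hand-verified sample; it is not the authors' procedure. The field names (`vendor_cat`, `verified_cat`, `verified_open`), the `audit` helper, and all record values are hypothetical and invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical hand-verified sample (all values invented). Each record pairs
# the vendor-assigned category with the category a human coder confirmed.
records = [
    {"name": "First Bank", "address": "1 Main St",
     "vendor_cat": "bank", "verified_cat": "bank", "verified_open": True},
    {"name": "first bank ", "address": "1 main st",
     "vendor_cat": "bank", "verified_cat": "bank", "verified_open": True},
    {"name": "Quick Cash", "address": "9 Elm St",
     "vendor_cat": "bank", "verified_cat": "payday_lender", "verified_open": True},
    {"name": "Payday Plus", "address": "4 Oak Ave",
     "vendor_cat": "payday_lender", "verified_cat": "payday_lender", "verified_open": False},
    {"name": "City Credit Union", "address": "7 Pine Rd",
     "vendor_cat": "bank", "verified_cat": "credit_union", "verified_open": True},
]


def audit(records):
    """Return duplicate, closure, and per-category misclassification rates."""
    # Duplicates: extra records sharing a normalized (name, address) key.
    keys = Counter(
        (r["name"].strip().lower(), r["address"].strip().lower()) for r in records
    )
    duplicates = sum(count - 1 for count in keys.values())

    # Misclassification: share of records in each vendor-assigned category
    # whose verified category differs.
    totals, wrong = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["vendor_cat"]] += 1
        if r["vendor_cat"] != r["verified_cat"]:
            wrong[r["vendor_cat"]] += 1

    # Closures: listed records that verification found to be closed.
    closed = sum(1 for r in records if not r["verified_open"])

    return {
        "duplicate_rate": duplicates / len(records),
        "closure_rate": closed / len(records),
        "misclassification": {cat: wrong[cat] / totals[cat] for cat in totals},
    }


result = audit(records)
print(result)
```

Reporting the misclassification rate separately for each category reflects the abstract's point that classification problems vary by type of institution; a single pooled error rate would hide that variation.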
About the journal:
Unique Scope
Social Science Computer Review is an interdisciplinary journal covering social science instructional and research applications of computing, as well as societal impacts of information technology. Topics include: artificial intelligence, business, computational social science theory, computer-assisted survey research, computer-based qualitative analysis, computer simulation, economic modeling, electronic modeling, electronic publishing, geographic information systems, instrumentation and research tools, public administration, social impacts of computing and telecommunications, software evaluation, and world-wide web resources for social scientists.
Interdisciplinary Nature
Because the uses and impacts of computing are interdisciplinary, so is Social Science Computer Review. The journal is of direct relevance to scholars and scientists in a wide variety of disciplines. In its pages you'll find work in the following areas: sociology, anthropology, political science, economics, psychology, computer literacy, computer applications, and methodology.