Text as Data: A New Framework for Machine Learning and the Social Sciences

IF 0.5 | Zone 4, Sociology | JCR Q4 SOCIOLOGY

Contemporary Sociology-A Journal of Reviews Pub Date : 2023-07-01 DOI:10.1177/00943061231181317p

K. Freeman

Volume 52, pages 347-348

Citations: 24

Abstract

At its most fundamental, “social science is the process of creating generalizable knowledge that explains or predicts societal patterns” (p. 264). Text as Data: A New Framework for Machine Learning and the Social Sciences seeks to provide readers with a model to do just this, but with a relatively untapped form of data, at least for the social sciences. Using text as data happens frequently in the computer science world, and Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart, the authors of this text, seek to extend known computer science methodology to align with social science methodological principles. The authors bridge this gap by applying our methodological models (some of them, at least) to this novel, time-relevant, and expanding form of data. This is an ambitious text that, at different stages, provides critical insight for undergraduates, graduate students across the social sciences, and practitioners. Text as Data systematically walks readers through the research process, from selection and representation to discovery to measurement and, finally, to inference and prediction. In the first section of the text, they concisely detail this model of research and the justifications behind it for more novice scholars. The text then introduces each stage of this research process, laying out the assumptions and best practices informing this specific approach with text as data. Common to all of these introductory chapters is the emphasis on the crucial role of the human researcher. The authors do not shy away from a common fear in analyses with “big data,” that human work is becoming obsolete and theory is disappearing.
Instead, they make a compelling case that although the analytic processes necessitated by “big data” may seem (and sometimes even be named) as if computers are operating independently of theory and of humans, the social science project will only succeed with the continued and constant engagement of the human-generated ideas behind the projects. Following each of these introductory chapters that adeptly frame the overall endeavor and lay out the novel application of research methods to text data, the authors present a thorough overview of the many ways in which practitioners can pursue research with text data. Here, the authors present work that has already been done in the social sciences (e.g., authorship of the Federalist Papers, identifying a model of Congressional ideology from press releases, authorship and tone of tweets from former President Trump) and also work through one or more basic algorithms to link the reader to the algebraic and mathematical progressions that provide the foundation for machine learning (or other similarly opaque procedures). Concluding these detailed presentations of possible steps through the research process, the text progresses to the next step in the research process (i.e., from measurement to inference), clearly linking and overlapping these processes where appropriate. Often, methodological training in the social sciences bends in the direction of either inductive or deductive research. Researchers seek, often going to extreme measures, to justify their conceptualization, operationalization, modeling, and interpretation choices prior to embarking on analytical procedures in order to avoid questions of over-fitting, p-hacking, and the like. Alternatively, researchers embark on scholarly pursuits to build theory emerging from their research sites and informants, often utilizing only qualitative techniques to do so.
Especially in elementary methodological training, these two tracks are distinct and, sometimes, juxtaposed as opposites. Not so in this text, where the authors use the emergent and exciting field of text data to emphasize the importance of iterative and sequential scholarship. The authors showcase across these four stages of the research process the opportunities for building a comprehensive research agenda that celebrates multiple approaches and …

Source journal

Contemporary Sociology-A Journal of Reviews (SOCIOLOGY)

CiteScore: 0.20

Self-citation rate: 0.00%

Articles published: 202