{"title":"Text as Data: A New Framework for Machine Learning and the Social Sciences","authors":"K. Freeman","doi":"10.1177/00943061231181317p","DOIUrl":null,"url":null,"abstract":"At its most fundamental, “social science is the process of creating generalizable knowledge that explains or predicts societal patterns” (p. 264). Text as Data: A New Framework for Machine Learning and the Social Sciences seeks to provide readers with a model to do just this, but with a relatively untapped form of data, at least for the social sciences. Using text as data happens frequently in the computer science world, and Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart, the authors of this text, seek to extend known computer science methodology to align with social science methodological principles. The authors bridge this gap by applying our methodological models (some of them, at least) to this novel, time-relevant, and expanding form of data. This is an ambitious text that, at different stages, provides critical insight for undergraduates, graduate students across the social sciences, and practitioners. Text as Data systematically walks readers through the research process, from selection and representation to discovery to measurement and, finally, to inference and prediction. In the first section of the text, they concisely detail this model of research and the justifications behind it for the more novice scholars. The text then introduces each stage of this research process, laying out the assumptions and best practices informing this specific approach with text as data. Common to all of these introductory chapters is the emphasis on the crucial role of the human researcher. The authors do not shy away from a common fear in analyses with “big data,” that human work is becoming obsolete and theory is disappearing. 
Instead, they make a compelling case that although the analytic processes necessitated by “big data” may seem (and sometimes even be named) as if computers are operating independently of theory and of humans, the social science project will only succeed with the continued and constant engagement of the human-generated ideas behind the projects. Following each of these introductory chapters that adeptly frame the overall endeavor and lay out the novel application of research methods to text data, the authors present a thorough overview of the many ways in which practitioners can pursue research with text data. Here, the authors present work that has already been done in the social sciences (e.g., authorship of the Federalist Papers, identifying a model of Congressional ideology from press releases, authorship and tone of tweets from former President Trump) and also work through one or more basic algorithms to link the reader to the algebraic and mathematical progressions that provide the foundation for machine learning (or other similarly opaque procedures). Concluding these detailed presentations of possible steps through the research process, the text progresses to the next step in the research process (i.e., from measurement to inference), clearly linking and overlapping these processes where appropriate. Often methodological training in the social sciences bends in the direction of either inductive or deductive research. Researchers seek, often going to extreme measures, to justify their conceptualization, operationalization, modeling, and interpretation choices prior to embarking on analytical procedures in order to avoid questions of over-fitting, p-hacking, and the like. Alternatively, researchers embark on scholarly pursuits to build theory emerging from their research sites and informants, often utilizing only qualitative techniques to do so. 
Especially in elementary methodological training, these two tracks are distinct and, sometimes, juxtaposed as opposites. Not so in this text, where the authors use the emergent and exciting field of text data to emphasize the importance of iterative and sequential scholarship. The authors showcase across these four stages of the research process the opportunities for building a comprehensive research agenda that celebrates multiple approaches and","PeriodicalId":46889,"journal":{"name":"Contemporary Sociology-A Journal of Reviews","volume":"52 1","pages":"347 - 348"},"PeriodicalIF":0.3000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Contemporary Sociology-A Journal of Reviews","FirstCategoryId":"90","ListUrlMain":"https://doi.org/10.1177/00943061231181317p","RegionNum":4,"RegionCategory":"Sociology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"SOCIOLOGY","Score":null,"Total":0}
Citations: 24