Dror K. Markus, Guy Mor-Lan, Tamir Sheafer, Shaul R. Shenhav
{"title":"利用研究人员领域专业知识注释不平衡数据中的概念","authors":"Dror K. Markus, Guy Mor-Lan, Tamir Sheafer, Shaul R. Shenhav","doi":"10.1080/19312458.2023.2182278","DOIUrl":null,"url":null,"abstract":"ABSTRACT As more computational communication researchers turn to supervised machine learning methods for text classification, we note the challenge in implementing such techniques within an imbalanced dataset. Such issues are critical in our domain, where, in many cases, researchers attempt to identify and study theoretically interesting categories that can be rare in a target corpus. Specifically, imbalanced distributions, with a skewed distribution of texts among the categories, can lead to a lengthy and expensive annotation stage, forcing practitioners to sample and label large numbers of texts to train a classification model. In this paper, we provide an overview of the issue, and describe existing strategies for mitigating such challenges. Noting the pitfalls of previous solutions, we then provide a semi-supervised method – Expert Initiated Latent Space Sampling – that complements researcher domain expertise with a systematic, unsupervised exploration of the latent semantic space to overcome such limitations. Utilizing simulations to systematically evaluate our method and compare it to existing approaches, we show that our procedure offers significant advantages in terms of efficiency and accuracy in many classification tasks.","PeriodicalId":47552,"journal":{"name":"Communication Methods and Measures","volume":"17 1","pages":"250 - 271"},"PeriodicalIF":3.7000,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Leveraging Researcher Domain Expertise to Annotate Concepts Within Imbalanced Data\",\"authors\":\"Dror K. Markus, Guy Mor-Lan, Tamir Sheafer, Shaul R. Shenhav\",\"doi\":\"10.1080/19312458.2023.2182278\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"ABSTRACT As more computational communication researchers turn to supervised machine learning methods for text classification, we note the challenge in implementing such techniques within an imbalanced dataset. Such issues are critical in our domain, where, in many cases, researchers attempt to identify and study theoretically interesting categories that can be rare in a target corpus. Specifically, imbalanced distributions, with a skewed distribution of texts among the categories, can lead to a lengthy and expensive annotation stage, forcing practitioners to sample and label large numbers of texts to train a classification model. In this paper, we provide an overview of the issue, and describe existing strategies for mitigating such challenges. Noting the pitfalls of previous solutions, we then provide a semi-supervised method – Expert Initiated Latent Space Sampling – that complements researcher domain expertise with a systematic, unsupervised exploration of the latent semantic space to overcome such limitations. Utilizing simulations to systematically evaluate our method and compare it to existing approaches, we show that our procedure offers significant advantages in terms of efficiency and accuracy in many classification tasks.\",\"PeriodicalId\":47552,\"journal\":{\"name\":\"Communication Methods and Measures\",\"volume\":\"17 1\",\"pages\":\"250 - 271\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2023-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Communication Methods and Measures\",\"FirstCategoryId\":\"98\",\"ListUrlMain\":\"https://doi.org/10.1080/19312458.2023.2182278\",\"RegionNum\":1,\"RegionCategory\":\"文学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMMUNICATION\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Communication Methods and Measures","FirstCategoryId":"98","ListUrlMain":"https://doi.org/10.1080/19312458.2023.2182278","RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMMUNICATION","Score":null,"Total":0}
Leveraging Researcher Domain Expertise to Annotate Concepts Within Imbalanced Data
ABSTRACT As more computational communication researchers turn to supervised machine learning methods for text classification, we note the challenge in implementing such techniques within an imbalanced dataset. Such issues are critical in our domain, where, in many cases, researchers attempt to identify and study theoretically interesting categories that can be rare in a target corpus. Specifically, imbalanced distributions, with a skewed distribution of texts among the categories, can lead to a lengthy and expensive annotation stage, forcing practitioners to sample and label large numbers of texts to train a classification model. In this paper, we provide an overview of the issue, and describe existing strategies for mitigating such challenges. Noting the pitfalls of previous solutions, we then provide a semi-supervised method – Expert Initiated Latent Space Sampling – that complements researcher domain expertise with a systematic, unsupervised exploration of the latent semantic space to overcome such limitations. Utilizing simulations to systematically evaluate our method and compare it to existing approaches, we show that our procedure offers significant advantages in terms of efficiency and accuracy in many classification tasks.
期刊介绍:
Communication Methods and Measures aims to achieve several goals in the field of communication research. Firstly, it aims to bring attention to and showcase developments in both qualitative and quantitative research methodologies to communication scholars. This journal serves as a platform for researchers across the field to discuss and disseminate methodological tools and approaches.
Additionally, Communication Methods and Measures seeks to improve research design and analysis practices by offering suggestions for improvement. It aims to introduce new methods of measurement that are valuable to communication scientists or enhance existing methods. The journal encourages submissions that focus on methods for enhancing research design and theory testing, employing both quantitative and qualitative approaches.
Furthermore, the journal is open to articles devoted to exploring the epistemological aspects relevant to communication research methodologies. It welcomes well-written manuscripts that demonstrate the use of methods and articles that highlight the advantages of lesser-known or newer methods over those traditionally used in communication.
In summary, Communication Methods and Measures strives to advance the field of communication research by showcasing and discussing innovative methodologies, improving research practices, and introducing new measurement methods.