Automatic Coding of Open-ended Questions into Multiple Classes: Whether and How to Use Double Coded Data
Zhoushanyue He, Matthias Schonlau
Survey Research Methods, vol. 14, pp. 267-287. Published 2020-08-10.
DOI: https://doi.org/10.18148/SRM/2020.V14I3.7639
Citations: 6
Abstract
Responses to open-ended questions in surveys are usually coded into pre-specified classes, either manually or automatically using a statistical learning algorithm. Automatic coding of open-ended responses relies on a set of manually coded responses, on which a statistical learning model is fitted. In this paper, we investigate whether and how double coding can help improve the automatic classification of open-ended responses. We evaluate four strategies for training the statistical algorithm on double-coded data, using experiments on simulated and real data. We find that, when the data are already double-coded (i.e., double coding does not incur additional costs), double coding where an expert resolves intercoder disagreement leads to the greatest classification accuracy. However, when we have a fixed budget for manual coding, single coding is preferable if the coding error rate is anticipated to be less than about 35% to 45%.
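
The fixed-budget trade-off described above can be illustrated with a small back-of-the-envelope calculation. The sketch below is not the authors' simulation. It assumes that the two coders err independently at the same rate, that a wrong code is equally likely to be any of the other classes, and that the expert always resolves a disagreement correctly at no extra cost. The function names and parameters are hypothetical. Under these assumptions, double coding halves the number of training responses, but a label ends up wrong only when both coders pick the same wrong class.

```python
def single_coding(budget, error_rate):
    """Each response is coded once.

    Returns (number of training responses, probability a label is wrong).
    """
    return budget, error_rate


def double_coding_expert(budget, error_rate, n_classes):
    """Each response is coded twice; an expert resolves disagreements.

    Assumptions (illustrative, not taken from the paper):
    - the two coders err independently with the same error_rate,
    - a wrong code is uniformly distributed over the n_classes - 1 other classes,
    - the expert is always right and costs nothing.
    Returns (number of training responses, probability a label is wrong).
    """
    n_responses = budget // 2  # two codes are paid for per response
    # The label stays wrong only if both coders agree on the same wrong class,
    # because any disagreement goes to the (assumed perfect) expert.
    p_wrong = error_rate ** 2 / (n_classes - 1)
    return n_responses, p_wrong


if __name__ == "__main__":
    budget, n_classes = 1000, 5
    for e in (0.1, 0.2, 0.4):
        n1, p1 = single_coding(budget, e)
        n2, p2 = double_coding_expert(budget, e, n_classes)
        print(f"error rate {e:.1f}: single n={n1}, noise={p1:.3f} | "
              f"double+expert n={n2}, noise={p2:.3f}")
```

The calculation only exposes the two quantities being traded against each other (training-set size versus label noise). Which one matters more for classification accuracy depends on the learning algorithm and the data, which is what the experiments in the paper address.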