用BERT实现开放式问题的自动分类

IF 1.6 4区数学 Q2 SOCIAL SCIENCES, MATHEMATICAL METHODS

Journal of Survey Statistics and Methodology Pub Date : 2022-09-13 DOI:10.1093/jssam/smad015

Hyukjun Gweon, Matthias Schonlau

{"title":"用BERT实现开放式问题的自动分类","authors":"Hyukjun Gweon, Matthias Schonlau","doi":"10.1093/jssam/smad015","DOIUrl":null,"url":null,"abstract":"\n Manual coding of text data from open-ended questions into different categories is time consuming and expensive. Automated coding uses statistical/machine learning to train on a small subset of manually-coded text answers. Recently, pretraining a general language model on vast amounts of unrelated data and then adapting the model to the specific application has proven effective in natural language processing. Using two data sets, we empirically investigate whether BERT, the currently dominant pretrained language model, is more effective at automated coding of answers to open-ended questions than other non-pretrained statistical learning approaches. We found fine-tuning the pretrained BERT parameters is essential as otherwise BERT is not competitive. Second, we found fine-tuned BERT barely beats the non-pretrained statistical learning approaches in terms of classification accuracy when trained on 100 manually coded observations. However, BERT’s relative advantage increases rapidly when more manually coded observations (e.g., 200–400) are available for training. We conclude that for automatically coding answers to open-ended questions BERT is preferable to non-pretrained models such as support vector machines and boosting.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":" ","pages":""},"PeriodicalIF":1.6000,"publicationDate":"2022-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Automated Classification for Open-Ended Questions with BERT\",\"authors\":\"Hyukjun Gweon, Matthias Schonlau\",\"doi\":\"10.1093/jssam/smad015\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"\\n Manual coding of text data from open-ended questions into different categories is time consuming and expensive. Automated coding uses statistical/machine learning to train on a small subset of manually-coded text answers. Recently, pretraining a general language model on vast amounts of unrelated data and then adapting the model to the specific application has proven effective in natural language processing. Using two data sets, we empirically investigate whether BERT, the currently dominant pretrained language model, is more effective at automated coding of answers to open-ended questions than other non-pretrained statistical learning approaches. We found fine-tuning the pretrained BERT parameters is essential as otherwise BERT is not competitive. Second, we found fine-tuned BERT barely beats the non-pretrained statistical learning approaches in terms of classification accuracy when trained on 100 manually coded observations. However, BERT’s relative advantage increases rapidly when more manually coded observations (e.g., 200–400) are available for training. We conclude that for automatically coding answers to open-ended questions BERT is preferable to non-pretrained models such as support vector machines and boosting.\",\"PeriodicalId\":17146,\"journal\":{\"name\":\"Journal of Survey Statistics and Methodology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.6000,\"publicationDate\":\"2022-09-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Survey Statistics and Methodology\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1093/jssam/smad015\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"SOCIAL SCIENCES, MATHEMATICAL METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Survey Statistics and Methodology","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1093/jssam/smad015","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"SOCIAL SCIENCES, MATHEMATICAL METHODS","Score":null,"Total":0}

引用次数: 2

摘要

将开放式问题的文本数据手动编码为不同类别既耗时又昂贵。自动编码使用统计/机器学习对手动编码的文本答案的一小部分进行训练。最近，在大量不相关的数据上预训练通用语言模型，然后将模型适应特定的应用程序，在自然语言处理中被证明是有效的。使用两个数据集，我们实证研究了目前占主导地位的预训练语言模型BERT在开放式问题答案的自动编码方面是否比其他未经预训练的统计学习方法更有效。我们发现微调预训练的BERT参数是至关重要的，否则BERT就没有竞争力。其次，我们发现，当对100个手动编码的观测值进行训练时，微调后的BERT在分类精度方面几乎没有超过未经预训练的统计学习方法。然而，当有更多手动编码的观测值（例如200–400）可用于训练时，BERT的相对优势迅速增加。我们得出的结论是，对于自动编码开放式问题的答案，BERT比支持向量机和boosting等非预训练模型更可取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Automated Classification for Open-Ended Questions with BERT

Manual coding of text data from open-ended questions into different categories is time consuming and expensive. Automated coding uses statistical/machine learning to train on a small subset of manually-coded text answers. Recently, pretraining a general language model on vast amounts of unrelated data and then adapting the model to the specific application has proven effective in natural language processing. Using two data sets, we empirically investigate whether BERT, the currently dominant pretrained language model, is more effective at automated coding of answers to open-ended questions than other non-pretrained statistical learning approaches. We found fine-tuning the pretrained BERT parameters is essential as otherwise BERT is not competitive. Second, we found fine-tuned BERT barely beats the non-pretrained statistical learning approaches in terms of classification accuracy when trained on 100 manually coded observations. However, BERT’s relative advantage increases rapidly when more manually coded observations (e.g., 200–400) are available for training. We conclude that for automatically coding answers to open-ended questions BERT is preferable to non-pretrained models such as support vector machines and boosting.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Survey Statistics and Methodology Multiple-

CiteScore

4.30

自引率

9.50%

发文量

期刊介绍： The Journal of Survey Statistics and Methodology, sponsored by AAPOR and the American Statistical Association, began publishing in 2013. Its objective is to publish cutting edge scholarly articles on statistical and methodological issues for sample surveys, censuses, administrative record systems, and other related data. It aims to be the flagship journal for research on survey statistics and methodology. Topics of interest include survey sample design, statistical inference, nonresponse, measurement error, the effects of modes of data collection, paradata and responsive survey design, combining data from multiple sources, record linkage, disclosure limitation, and other issues in survey statistics and methodology. The journal publishes both theoretical and applied papers, provided the theory is motivated by an important applied problem and the applied papers report on research that contributes generalizable knowledge to the field. Review papers are also welcomed. Papers on a broad range of surveys are encouraged, including (but not limited to) surveys concerning business, economics, marketing research, social science, environment, epidemiology, biostatistics and official statistics. The journal has three sections. The Survey Statistics section presents papers on innovative sampling procedures, imputation, weighting, measures of uncertainty, small area inference, new methods of analysis, and other statistical issues related to surveys. The Survey Methodology section presents papers that focus on methodological research, including methodological experiments, methods of data collection and use of paradata. The Applications section contains papers involving innovative applications of methods and providing practical contributions and guidance, and/or significant new findings.