How to do human evaluation: A brief introduction to user studies in NLP
Hendrik Schuff, Lindsey Vanderlyn, Heike Adel, Ngoc Thang Vu
Natural Language Engineering, 29(1), 1199-1222. Published 2023-02-06. DOI: 10.1017/S1351324922000535
Citations: 3
Abstract
Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling, or machine translation, require evaluation that goes beyond standard metrics like accuracy or F1 score toward a more human-centered approach. Understanding how to design user studies therefore becomes increasingly important. However, few comprehensive resources exist on planning, conducting, and evaluating user studies for NLP, making it hard for researchers without prior experience in human evaluation to get started. In this paper, we summarize the most important aspects of user studies and their design and evaluation, providing direct links to NLP tasks and NLP-specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, and (ii) discuss the particularities of user studies in NLP and provide starting points for selecting questionnaires, experimental designs, and evaluation methods tailored to specific NLP tasks. Additionally, we offer examples with accompanying statistical evaluation code to bridge the gap between theoretical guidelines and practical applications.
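To make the abstract's mention of "statistical evaluation code" concrete, the sketch below illustrates the kind of analysis such guidelines typically cover; it is not the authors' actual code. It compares hypothetical 5-point Likert ratings of two NLP systems with a Mann-Whitney U test, a common non-parametric choice because Likert data is ordinal rather than interval-scaled; the ratings and system names are invented for illustration.

```python
# A minimal sketch of a statistical evaluation for a two-system user study.
# The data is hypothetical; only the test choice reflects common practice.
from scipy.stats import mannwhitneyu

# Hypothetical 5-point Likert ratings from two independent participant groups,
# one per system (a between-subjects design).
ratings_system_a = [4, 5, 3, 4, 4, 5, 2, 4, 3, 5]
ratings_system_b = [3, 2, 3, 4, 2, 3, 3, 2, 4, 3]

# Likert responses are ordinal, so a rank-based test is preferred over a
# t-test, which assumes interval-scaled, normally distributed data.
statistic, p_value = mannwhitneyu(
    ratings_system_a, ratings_system_b, alternative="two-sided"
)
print(f"U = {statistic}, p = {p_value:.4f}")
```

For a within-subjects design, where the same participants rate both systems, a paired rank-based test such as `scipy.stats.wilcoxon` would be the analogous choice.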
Journal Introduction:
Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science, or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics - from text analysis, machine translation, information retrieval, and speech analysis and generation to integrated systems and multimodal interfaces - it also publishes special issues on specific areas and technologies within these topics, an industry watch column, and book reviews.