How to do human evaluation: A brief introduction to user studies in NLP

Impact factor: 2.3 · CAS Tier 3 (Computer Science) · JCR Q3: Computer Science, Artificial Intelligence
Hendrik Schuff, Lindsey Vanderlyn, Heike Adel, Ngoc Thang Vu
{"title":"How to do human evaluation: A brief introduction to user studies in NLP","authors":"Hendrik Schuff, Lindsey Vanderlyn, Heike Adel, Ngoc Thang Vu","doi":"10.1017/S1351324922000535","DOIUrl":null,"url":null,"abstract":"Abstract Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling, or machine translation, require evaluation that goes beyond standard metrics like accuracy or F1 score toward a more human-centered approach. Therefore, understanding how to design user studies becomes increasingly important. However, few comprehensive resources exist on planning, conducting, and evaluating user studies for NLP, making it hard to get started for researchers without prior experience in the field of human evaluation. In this paper, we summarize the most important aspects of user studies and their design and evaluation, providing direct links to NLP tasks and NLP-specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, (ii) discuss the particularities of user studies in NLP, and provide starting points to select questionnaires, experimental designs, and evaluation methods that are tailored to the specific NLP tasks. Additionally, we offer examples with accompanying statistical evaluation code, to bridge the gap between theoretical guidelines and practical applications.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1199 - 1222"},"PeriodicalIF":2.3000,"publicationDate":"2023-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1017/S1351324922000535","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 3

Abstract

Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling, or machine translation, require evaluation that goes beyond standard metrics like accuracy or F1 score toward a more human-centered approach. Therefore, understanding how to design user studies becomes increasingly important. However, few comprehensive resources exist on planning, conducting, and evaluating user studies for NLP, making it hard for researchers without prior experience in human evaluation to get started. In this paper, we summarize the most important aspects of user studies and their design and evaluation, providing direct links to NLP tasks and NLP-specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, and (ii) discuss the particularities of user studies in NLP and provide starting points for selecting questionnaires, experimental designs, and evaluation methods tailored to specific NLP tasks. Additionally, we offer examples with accompanying statistical evaluation code to bridge the gap between theoretical guidelines and practical applications.
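As a rough, hypothetical illustration of the kind of statistical evaluation the abstract refers to (this is not the authors' released code), the sketch below compares paired Likert ratings of two invented NLP systems in a within-subjects study using a Wilcoxon signed-rank test; all participant counts, system names, and ratings are made up for the example.

```python
# Hypothetical sketch (not the paper's code): within-subjects comparison of
# two NLP systems, where each of 10 participants rates the output of both
# systems on a 1-7 Likert scale.
from scipy import stats

# Invented example ratings, one value per participant and system.
ratings_system_a = [5, 6, 4, 7, 5, 6, 5, 4, 6, 7]
ratings_system_b = [3, 5, 3, 5, 4, 7, 3, 3, 5, 6]

# Likert ratings are ordinal and often non-normal, so a paired
# non-parametric test (Wilcoxon signed-rank) is a common choice.
statistic, p_value = stats.wilcoxon(ratings_system_a, ratings_system_b)
print(f"Wilcoxon T = {statistic:.1f}, p = {p_value:.4f}")

# Reporting an effect size alongside the p-value is good practice; here the
# magnitude of the matched-pairs rank-biserial correlation (valid when no
# paired differences are zero, as in this toy data).
n = len(ratings_system_a)
rank_biserial = 1 - (4 * statistic) / (n * (n + 1))
print(f"Rank-biserial correlation = {rank_biserial:.2f}")
```

Depending on the study design (between-subjects comparisons, more than two systems, or repeated measures per item), other analyses such as a Mann-Whitney U test or mixed-effects models may be more appropriate; the paper provides guidance on choosing evaluation methods for specific NLP tasks.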
Source Journal
Natural Language Engineering (Computer Science, Artificial Intelligence)
CiteScore: 5.90
Self-citation rate: 12.00%
Articles per year: 60
Review time: >12 weeks
Journal description: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science, or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics - from text analysis, machine translation, information retrieval, and speech analysis and generation to integrated systems and multimodal interfaces - it also publishes special issues on specific areas and technologies within these topics, an industry watch column, and book reviews.