调整人类和计算的一致性评估

IF 5.3 2区计算机科学

Computational Linguistics Pub Date : 2024-05-02 DOI:10.1162/coli_a_00518

Jia Peng Lim, Hady W. Lauw

{"title":"调整人类和计算的一致性评估","authors":"Jia Peng Lim, Hady W. Lauw","doi":"10.1162/coli_a_00518","DOIUrl":null,"url":null,"abstract":"Automated coherence metrics constitute an efficient and popular way to evaluate topic models. Previous works present a mixed picture of their presumed correlation with human judgment. This work proposes a novel sampling approach to mine topic representations at a large-scale while seeking to mitigate bias from sampling, enabling the investigation of widely-used automated coherence metrics via large corpora. Additionally, this article proposes a novel user study design, an amalgamation of different proxy tasks, to derive a finer insight into the human decision-making processes. This design subsumes the purpose of simple rating and outlier-detection user studies. Similar to the sampling approach, the user study conducted is very extensive, comprising forty study participants split into eight different study groups tasked with evaluating their respective set of one hundred topic representations. Usually, when substantiating the use of these metrics, human responses are treated as the golden standard. This article further investigates the reliability of human judgment by flipping the comparison and conducting a novel extended analysis of human response at the group and individual level against a generic corpus. The investigation results show a moderate to good correlation between these metrics and human judgment, especially for generic corpora, and derive further insights into the human perception of coherence. Analysing inter-metric correlations across corpora shows moderate to good correlation amongst these metrics. As these metrics depend on corpus statistics, this article further investigates the topical differences between corpora revealing nuances in applications of these metrics.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"64 1","pages":""},"PeriodicalIF":5.3000,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Aligning Human and Computational Coherence Evaluations\",\"authors\":\"Jia Peng Lim, Hady W. Lauw\",\"doi\":\"10.1162/coli_a_00518\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automated coherence metrics constitute an efficient and popular way to evaluate topic models. Previous works present a mixed picture of their presumed correlation with human judgment. This work proposes a novel sampling approach to mine topic representations at a large-scale while seeking to mitigate bias from sampling, enabling the investigation of widely-used automated coherence metrics via large corpora. Additionally, this article proposes a novel user study design, an amalgamation of different proxy tasks, to derive a finer insight into the human decision-making processes. This design subsumes the purpose of simple rating and outlier-detection user studies. Similar to the sampling approach, the user study conducted is very extensive, comprising forty study participants split into eight different study groups tasked with evaluating their respective set of one hundred topic representations. Usually, when substantiating the use of these metrics, human responses are treated as the golden standard. This article further investigates the reliability of human judgment by flipping the comparison and conducting a novel extended analysis of human response at the group and individual level against a generic corpus. The investigation results show a moderate to good correlation between these metrics and human judgment, especially for generic corpora, and derive further insights into the human perception of coherence. Analysing inter-metric correlations across corpora shows moderate to good correlation amongst these metrics. As these metrics depend on corpus statistics, this article further investigates the topical differences between corpora revealing nuances in applications of these metrics.\",\"PeriodicalId\":49089,\"journal\":{\"name\":\"Computational Linguistics\",\"volume\":\"64 1\",\"pages\":\"\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2024-05-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computational Linguistics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1162/coli_a_00518\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00518","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

自动一致性度量是评估主题模型的一种有效且流行的方法。以往的研究对这些指标与人类判断的假定相关性的描述不一。本研究提出了一种新颖的抽样方法来大规模挖掘话题表征，同时设法减轻抽样带来的偏差，从而能够通过大型语料库对广泛使用的自动一致性指标进行研究。此外，本文还提出了一种新颖的用户研究设计，即不同代理任务的组合，以便更深入地了解人类的决策过程。这种设计包含了简单的评级和离群值检测用户研究的目的。与抽样方法类似，用户研究的范围也非常广泛，包括 40 名研究参与者，他们被分成 8 个不同的研究小组，分别负责评估各自的 100 个主题表征集。通常，在证实这些指标的使用时，人的反应被视为黄金标准。本文通过翻转比较的方式进一步研究了人类判断的可靠性，并针对通用语料库对人类在群体和个体层面的反应进行了新颖的扩展分析。研究结果表明，这些指标与人类判断之间存在适度到良好的相关性，尤其是在通用语料库中，并进一步揭示了人类对一致性的感知。对不同语料的指标间相关性的分析表明，这些指标之间存在中度到良好的相关性。由于这些指标依赖于语料统计，本文进一步研究了不同语料之间的主题差异，揭示了这些指标在应用中的细微差别。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Aligning Human and Computational Coherence Evaluations

Automated coherence metrics constitute an efficient and popular way to evaluate topic models. Previous works present a mixed picture of their presumed correlation with human judgment. This work proposes a novel sampling approach to mine topic representations at a large-scale while seeking to mitigate bias from sampling, enabling the investigation of widely-used automated coherence metrics via large corpora. Additionally, this article proposes a novel user study design, an amalgamation of different proxy tasks, to derive a finer insight into the human decision-making processes. This design subsumes the purpose of simple rating and outlier-detection user studies. Similar to the sampling approach, the user study conducted is very extensive, comprising forty study participants split into eight different study groups tasked with evaluating their respective set of one hundred topic representations. Usually, when substantiating the use of these metrics, human responses are treated as the golden standard. This article further investigates the reliability of human judgment by flipping the comparison and conducting a novel extended analysis of human response at the group and individual level against a generic corpus. The investigation results show a moderate to good correlation between these metrics and human judgment, especially for generic corpora, and derive further insights into the human perception of coherence. Analysing inter-metric correlations across corpora shows moderate to good correlation amongst these metrics. As these metrics depend on corpus statistics, this article further investigates the topical differences between corpora revealing nuances in applications of these metrics.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computational Linguistics Computer Science-Artificial Intelligence

自引率

0.00%

发文量

期刊介绍： Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.