Do Multi-Document Summarization Models Synthesize?

IF 4.2 · CAS Tier 1 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Jay DeYoung, Stephanie C Martinez, Iain J Marshall, Byron C Wallace
{"title":"多文档摘要模型能合成吗?","authors":"Jay DeYoung, Stephanie C Martinez, Iain J Marshall, Byron C Wallace","doi":"10.1162/tacl_a_00687","DOIUrl":null,"url":null,"abstract":"<p><p>Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately <i>synthesize</i> inputs with respect to a key aspect, e.g., a synopsis of film reviews written about a particular movie should reflect the average critic consensus. As a more consequential example, narrative summaries that accompany biomedical <i>systematic reviews</i> of clinical trial results should accurately summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this sort of synthesis? We run experiments over opinion and evidence synthesis datasets using a suite of summarization models, from fine-tuned transformers to GPT-4. We find that existing models partially perform synthesis, but imperfectly: Even the best performing models are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., ratio of positive to negative reviews). We propose a simple, general, effective method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or <i>abstaining</i> when the model produces no good candidate.</p>","PeriodicalId":33559,"journal":{"name":"Transactions of the Association for Computational Linguistics","volume":"12 ","pages":"1043-1062"},"PeriodicalIF":4.2000,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12308705/pdf/","citationCount":"0","resultStr":"{\"title\":\"Do Multi-Document Summarization Models <i>Synthesize</i>?\",\"authors\":\"Jay DeYoung, Stephanie C Martinez, Iain J Marshall, Byron C Wallace\",\"doi\":\"10.1162/tacl_a_00687\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately <i>synthesize</i> inputs with respect to a key aspect, e.g., a synopsis of film reviews written about a particular movie should reflect the average critic consensus. As a more consequential example, narrative summaries that accompany biomedical <i>systematic reviews</i> of clinical trial results should accurately summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this sort of synthesis? We run experiments over opinion and evidence synthesis datasets using a suite of summarization models, from fine-tuned transformers to GPT-4. We find that existing models partially perform synthesis, but imperfectly: Even the best performing models are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., ratio of positive to negative reviews). 
We propose a simple, general, effective method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or <i>abstaining</i> when the model produces no good candidate.</p>\",\"PeriodicalId\":33559,\"journal\":{\"name\":\"Transactions of the Association for Computational Linguistics\",\"volume\":\"12 \",\"pages\":\"1043-1062\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2024-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12308705/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Transactions of the Association for Computational Linguistics\",\"FirstCategoryId\":\"98\",\"ListUrlMain\":\"https://doi.org/10.1162/tacl_a_00687\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/9/4 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Transactions of the Association for Computational Linguistics","FirstCategoryId":"98","ListUrlMain":"https://doi.org/10.1162/tacl_a_00687","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/9/4 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract



Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key aspect, e.g., a synopsis of film reviews written about a particular movie should reflect the average critic consensus. As a more consequential example, narrative summaries that accompany biomedical systematic reviews of clinical trial results should accurately summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this sort of synthesis? We run experiments over opinion and evidence synthesis datasets using a suite of summarization models, from fine-tuned transformers to GPT-4. We find that existing models partially perform synthesis, but imperfectly: Even the best performing models are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., ratio of positive to negative reviews). We propose a simple, general, effective method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate.
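The proposed decode-then-select recipe lends itself to a short sketch. The following is a minimal illustration only, under assumptions not spelled out on this page: generate_candidates, aggregate_measure, candidate_measure, and THRESHOLD are hypothetical placeholders for the paper's actual generation procedure, aggregate measure, and abstention rule.

```python
# Minimal sketch of the abstract's method: decode a diverse candidate
# set, pick the candidate best aligned with the expected aggregate
# measure of the inputs, or abstain. All names and the threshold are
# hypothetical stand-ins, not the authors' implementation.

from typing import Callable, List, Optional

THRESHOLD = 0.15  # hypothetical: largest tolerated gap before abstaining


def select_or_abstain(
    inputs: List[str],
    generate_candidates: Callable[[List[str]], List[str]],
    aggregate_measure: Callable[[List[str]], float],
    candidate_measure: Callable[[str], float],
) -> Optional[str]:
    """Return the candidate summary whose measured key aspect best
    matches the expected aggregate over the inputs, or None (abstain)
    when no candidate comes close enough."""
    # 1. Decode an explicitly diverse candidate set, e.g., via
    #    temperature sampling or diverse beam search.
    candidates = generate_candidates(inputs)
    if not candidates:
        return None

    # 2. Compute the target aggregate for the inputs, e.g., the mean
    #    sentiment a faithful synopsis of these reviews should carry.
    target = aggregate_measure(inputs)

    # 3. Keep the candidate whose own measure lands closest to target.
    gaps = [(abs(candidate_measure(c) - target), c) for c in candidates]
    best_gap, best = min(gaps, key=lambda pair: pair[0])

    # 4. Abstain when even the best candidate misses the target badly.
    return best if best_gap <= THRESHOLD else None
```

In the film-review setting, for instance, aggregate_measure could average a sentiment classifier's scores over the input reviews and candidate_measure could apply the same classifier to a candidate synopsis, so selection favors summaries that reflect the critic consensus.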

Source journal: Transactions of the Association for Computational Linguistics
CiteScore: 32.60 · Self-citation rate: 4.60% · Articles published: 58 · Review time: 8 weeks
Journal description: Transactions of the Association for Computational Linguistics is the open-access companion journal to the highly regarded quarterly Computational Linguistics. It publishes articles in all areas of natural language processing and is an important resource for academic and industry computational linguists, natural language processing experts, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and linguists and philosophers alike, disseminating work of vital relevance to these professionals each year.