{"title":"跨语言跨时态摘要:数据集、模型、评估","authors":"Ran Zhang, Jihed Ouni, Steffen Eger","doi":"10.1162/coli_a_00519","DOIUrl":null,"url":null,"abstract":"While summarization has been extensively researched in natural language processing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding. This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation. We (1) build the first CLCTS corpus with 328 (+127) instances for hDe-En and 289 (+212) for hEn-De, leveraging historical fiction texts and Wikipedia summaries in English and German; (2) examine the effectiveness of popular transformer end-to-end models with different intermediate finetuning tasks; (3) explore the potential of GPT-3.5 as a summarizer; (4) report evaluations from humans, GPT-4, and several recent automatic evaluation metrics. Our results indicate that intermediate task finetuned end-to-end models generate bad to moderate quality summaries while GPT-3.5, as a zero-shot summarizer, provides moderate to good quality outputs. GPT-3.5 also seems very adept at normalizing historical text. To assess data contamination in GPT-3.5, we design an adversarial attack scheme in which we find that GPT-3.5 performs slightly worse for unseen source documents compared to seen documents. Moreover, it sometimes hallucinates when the source sentences are inverted against its prior knowledge with a summarization accuracy of 0.67 for plot omission, 0.71 for entity swap, and 0.53 for plot negation. Overall, our regression results of model performances suggest that longer, older, and more complex source texts (all of which are more characteristic for historical language variants) are harder to summarize for all models, indicating the difficulty of the CLCTS task. Regarding evaluation, we observe that both GPT-4 and BERTScore correlate moderately with human evaluations but GPT-4 is prone to giving lower scores.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"44 1","pages":""},"PeriodicalIF":9.3000,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation\",\"authors\":\"Ran Zhang, Jihed Ouni, Steffen Eger\",\"doi\":\"10.1162/coli_a_00519\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While summarization has been extensively researched in natural language processing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding. This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation. We (1) build the first CLCTS corpus with 328 (+127) instances for hDe-En and 289 (+212) for hEn-De, leveraging historical fiction texts and Wikipedia summaries in English and German; (2) examine the effectiveness of popular transformer end-to-end models with different intermediate finetuning tasks; (3) explore the potential of GPT-3.5 as a summarizer; (4) report evaluations from humans, GPT-4, and several recent automatic evaluation metrics. Our results indicate that intermediate task finetuned end-to-end models generate bad to moderate quality summaries while GPT-3.5, as a zero-shot summarizer, provides moderate to good quality outputs. 
GPT-3.5 also seems very adept at normalizing historical text. To assess data contamination in GPT-3.5, we design an adversarial attack scheme in which we find that GPT-3.5 performs slightly worse for unseen source documents compared to seen documents. Moreover, it sometimes hallucinates when the source sentences are inverted against its prior knowledge with a summarization accuracy of 0.67 for plot omission, 0.71 for entity swap, and 0.53 for plot negation. Overall, our regression results of model performances suggest that longer, older, and more complex source texts (all of which are more characteristic for historical language variants) are harder to summarize for all models, indicating the difficulty of the CLCTS task. Regarding evaluation, we observe that both GPT-4 and BERTScore correlate moderately with human evaluations but GPT-4 is prone to giving lower scores.\",\"PeriodicalId\":49089,\"journal\":{\"name\":\"Computational Linguistics\",\"volume\":\"44 1\",\"pages\":\"\"},\"PeriodicalIF\":9.3000,\"publicationDate\":\"2024-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computational Linguistics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1162/coli_a_00519\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00519","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
While summarization has been extensively researched in natural language processing (NLP), cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding. This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation. We (1) build the first CLCTS corpus, with 328 (+127) instances for hDe-En (historical German to English) and 289 (+212) instances for hEn-De (historical English to German), leveraging historical fiction texts and Wikipedia summaries in English and German; (2) examine the effectiveness of popular transformer end-to-end models with different intermediate fine-tuning tasks; (3) explore the potential of GPT-3.5 as a summarizer; and (4) report evaluations from humans, GPT-4, and several recent automatic evaluation metrics. Our results indicate that end-to-end models fine-tuned on intermediate tasks generate summaries of bad to moderate quality, while GPT-3.5, as a zero-shot summarizer, provides outputs of moderate to good quality. GPT-3.5 also appears very adept at normalizing historical text. To assess data contamination in GPT-3.5, we design an adversarial attack scheme and find that GPT-3.5 performs slightly worse on unseen source documents than on seen ones. Moreover, it sometimes hallucinates when source sentences are inverted against its prior knowledge, achieving summarization accuracies of 0.67 for plot omission, 0.71 for entity swap, and 0.53 for plot negation. Overall, our regression analysis of model performance suggests that longer, older, and more complex source texts (all of which are more characteristic of historical language variants) are harder to summarize for all models, underscoring the difficulty of the CLCTS task. Regarding evaluation, we observe that both GPT-4 and BERTScore correlate moderately with human judgments, but GPT-4 is prone to giving lower scores.
Journal Introduction:
Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.