{"title":"Revisiting Cross-Lingual Summarization: A Corpus-based Study and A New Benchmark with Improved Annotation","authors":"Yulong Chen, Huajian Zhang, Yijie Zhou, Xuefeng Bai, Yueguan Wang, Ming Zhong, Jianhao Yan, Yafu Li, Judy Li, Xianchao Zhu, Yue Zhang","doi":"10.48550/arXiv.2307.04018","DOIUrl":null,"url":null,"abstract":"Most existing cross-lingual summarization (CLS) work constructs CLS corpora by simply and directly translating pre-annotated summaries from one language to another, which can contain errors from both summarization and translation processes.To address this issue, we propose ConvSumX, a cross-lingual conversation summarization benchmark, through a new annotation schema that explicitly considers source input context.ConvSumX consists of 2 sub-tasks under different real-world scenarios, with each covering 3 language directions.We conduct thorough analysis on ConvSumX and 3 widely-used manually annotated CLS corpora and empirically find that ConvSumX is more faithful towards input text.Additionally, based on the same intuition, we propose a 2-Step method, which takes both conversation and summary as input to simulate human annotation process.Experimental results show that 2-Step method surpasses strong baselines on ConvSumX under both automatic and human evaluation.Analysis shows that both source input text and summary are crucial for modeling cross-lingual summaries.","PeriodicalId":352845,"journal":{"name":"Annual Meeting of the Association for Computational Linguistics","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annual Meeting of the Association for Computational Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2307.04018","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Most existing cross-lingual summarization (CLS) work constructs CLS corpora by simply and directly translating pre-annotated summaries from one language to another, which can introduce errors from both the summarization and translation processes. To address this issue, we propose ConvSumX, a cross-lingual conversation summarization benchmark, built with a new annotation schema that explicitly considers the source input context. ConvSumX consists of 2 sub-tasks under different real-world scenarios, each covering 3 language directions. We conduct a thorough analysis of ConvSumX and 3 widely used manually annotated CLS corpora and empirically find that ConvSumX is more faithful to the input text. Additionally, based on the same intuition, we propose a 2-Step method, which takes both the conversation and the summary as input to simulate the human annotation process. Experimental results show that the 2-Step method surpasses strong baselines on ConvSumX under both automatic and human evaluation. Analysis shows that both the source input text and the summary are crucial for modeling cross-lingual summaries.
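The 2-Step idea described in the abstract can be pictured as a pipeline: first produce a monolingual summary of the conversation, then generate the target-language summary conditioned on both the conversation and that intermediate summary. The following is a minimal sketch under stated assumptions: it uses Hugging Face transformers pipelines, the model names are hypothetical placeholders rather than the paper's released checkpoints, and the simple " </s> " concatenation is an illustrative input format, not the authors' exact one.

```python
# Minimal sketch of a two-step cross-lingual conversation summarizer.
# Both model names are hypothetical placeholders; the concatenation format
# is illustrative only and does not reproduce the paper's training setup.
from transformers import pipeline

# Step 1: a monolingual summarizer for the source-language conversation.
summarizer = pipeline(
    "summarization",
    model="path/to/monolingual-dialogue-summarizer",  # placeholder checkpoint
)

# Step 2: a cross-lingual generator that reads BOTH the conversation and the
# intermediate summary, mirroring how annotators consult the source context.
cross_lingual = pipeline(
    "text2text-generation",
    model="path/to/cross-lingual-summary-generator",  # placeholder checkpoint
)

def two_step_summarize(conversation: str) -> str:
    """Return a target-language summary of a source-language conversation."""
    # Monolingual summary of the source conversation.
    mono_summary = summarizer(
        conversation, max_length=64, min_length=8
    )[0]["summary_text"]
    # Conditioning on the full conversation (not only the summary) is what the
    # paper argues keeps the cross-lingual summary faithful to the input.
    joint_input = conversation + " </s> " + mono_summary
    return cross_lingual(joint_input, max_length=64)[0]["generated_text"]

if __name__ == "__main__":
    dialogue = (
        "Amanda: I baked cookies. Do you want some?\n"
        "Jerry: Sure!\n"
        "Amanda: I'll bring them tomorrow :-)"
    )
    print(two_step_summarize(dialogue))
```

The design choice the sketch highlights is the second step's joint input: translating the monolingual summary alone would reintroduce the translation-only errors the benchmark is meant to avoid, whereas conditioning on the source conversation lets the model ground the target-language summary in the original context.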