ToNy: Contextual embeddings for accurate multilingual discourse segmentation of full documents

Philippe Muller, Chloé Braud, Mathieu Morey
Published in: Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019
Publication date: 2019-06-01
DOI: 10.18653/v1/W19-2715 (https://doi.org/10.18653/v1/W19-2715)
Citations: 30

Abstract

Segmentation is the first step in building practical discourse parsers, and is often neglected in discourse parsing studies. The goal is to identify the minimal spans of text to be linked by discourse relations, or to isolate explicit marking of discourse relations. Existing systems on English report F1 scores as high as 95%, but they generally assume gold sentence boundaries and are restricted to English newswire texts annotated within the RST framework. This article presents a generic approach and a system, ToNy, a discourse segmenter developed for the DisRPT shared task where multiple discourse representation schemes, languages and domains are represented. In our experiments, we found that a straightforward sequence prediction architecture with pretrained contextual embeddings is sufficient to reach performance levels comparable to existing systems, when separately trained on each corpus. We report performance between 81% and 96% in F1 score. We also observed that discourse segmentation models only display a moderate generalization capability, even within the same language and discourse representation scheme.
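
As an illustration of the task framing described in the abstract, segmentation can be treated as labeling each token as either opening a new segment or not, and scored by F1 over the predicted boundaries. The sketch below is a minimal example under that assumption; the 0/1 label encoding, the function names `boundary_f1` and `segments`, and the toy sentence are hypothetical and are not taken from the paper or from the shared task's official scorer.

```python
from typing import List, Sequence, Tuple


def boundary_f1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over tokens labeled 1 (segment start).

    Assumed encoding: 1 = token opens a new segment, 0 = any other token.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # correct boundaries
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # spurious boundaries
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # missed boundaries
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def segments(tokens: Sequence[str], labels: Sequence[int]) -> List[List[str]]:
    """Group tokens into segments, opening a new segment at each label 1."""
    out: List[List[str]] = []
    for tok, lab in zip(tokens, labels):
        if lab == 1 or not out:
            out.append([])
        out[-1].append(tok)
    return out


# Toy example (hypothetical data): one boundary is misplaced by one token.
tokens = ["He", "left", "because", "it", "rained", "."]
gold = [1, 0, 1, 0, 0, 0]
pred = [1, 0, 0, 1, 0, 0]

print(segments(tokens, gold))
print(boundary_f1(gold, pred))
```

In this toy case one of the two predicted boundaries is correct and one of the two gold boundaries is found, so precision, recall and F1 are all 0.5. A boundary that is off by a single token counts as both a false positive and a false negative under this scoring.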