The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts

Q3 Social Sciences
Marcio Lima Inácio, Marco Antonio Sobrevilla Cabezudo, Renata Ramisch, Ariani Di Felippo, Thiago Alexandre Salgueiro Pardo
DOI: 10.1590/1678-460x202339355159 (https://doi.org/10.1590/1678-460x202339355159)
Journal: DELTA Documentacao de Estudos em Linguistica Teorica e Aplicada
Published: 2023-01-01
Publication type: Journal Article
Citations: 2

Abstract

Abstract Meaning Representation (AMR) is one of the most popular semantic representation languages in Natural Language Processing (NLP). The formalism encodes the meaning of a single sentence as a rooted, directed graph. For English, a large annotated corpus provides high-quality, reusable data for building or improving NLP methods and applications. For non-English languages, including Brazilian Portuguese, AMR corpora have been built with both automatic and manual strategies. The automatic annotation methods essentially rely on the cross-linguistic alignment of parallel corpora and the inheritance of the AMR annotation. The manual strategies focus on adapting the English AMR guidelines to a target language. Both strategies have to deal with challenging linguistic phenomena. This paper explores in detail some characteristics of Portuguese for which the AMR model had to be adapted and introduces two annotated corpora: AMRNews, a corpus of 870 annotated sentences from journalistic texts, and OpiSums-PT-AMR, comprising 404 opinionated sentences annotated in AMR.
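To make the "rooted, directed graph" idea concrete, here is a minimal sketch in Python. It is not taken from the paper or its corpora: the sentence ("The boy wants to go"), the variable names, and the frame labels are illustrative assumptions, and the helper functions `reachable_from`, `is_rooted`, and `to_penman` are hypothetical names written for this example only.

```python
from collections import defaultdict

# Illustrative AMR for the English sentence "The boy wants to go",
# stored as (source, relation, target) triples. Variables and frame
# labels are assumptions for illustration, not data from the paper.
ROOT = "w"
INSTANCES = {"w": "want-01", "b": "boy", "g": "go-02"}
EDGES = [
    ("w", ":ARG0", "b"),
    ("w", ":ARG1", "g"),
    ("g", ":ARG0", "b"),  # re-entrancy: the boy is also the one who goes
]


def reachable_from(root, edges):
    """Return the set of nodes reachable from root along directed edges."""
    adjacency = defaultdict(list)
    for source, _, target in edges:
        adjacency[source].append(target)
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency[node])
    return seen


def is_rooted(root, instances, edges):
    """A graph is rooted at root if every node can be reached from it."""
    return reachable_from(root, edges) == set(instances)


def to_penman(node, instances, edges, visited=None):
    """Serialize the graph in the bracketed notation commonly used for AMR."""
    visited = set() if visited is None else visited
    if node in visited:
        return node  # re-entrant node: emit only the variable
    visited.add(node)
    parts = [f"{node} / {instances[node]}"]
    for source, relation, target in edges:
        if source == node:
            parts.append(f"{relation} {to_penman(target, instances, edges, visited)}")
    return "(" + " ".join(parts) + ")"


if __name__ == "__main__":
    print(is_rooted(ROOT, INSTANCES, EDGES))
    print(to_penman(ROOT, INSTANCES, EDGES))
```

The third triple shows why AMR needs a graph rather than a tree: the node `b` has two incoming edges, because the boy is both the one who wants and the one who goes.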
Source journal
DELTA Documentacao de Estudos em Linguistica Teorica e Aplicada (Social Sciences: Linguistics and Language)
CiteScore: 0.40
Self-citation rate: 0.00%
Articles published: 39
Review time: 52 weeks
About the journal: The journal Documentação de Estudos em Lingüística Teórica e Aplicada (DELTA) is published by the Pontifícia Universidade Católica de São Paulo (PUC-SP). DELTA has been published since 1985 and became a biannual publication in 1992, with issues appearing in February and August. The journal covers all areas of study concerning language and speech, whether theoretical or applied; only unpublished contributions are considered. The short title DELTA is recommended when referring to the journal in bibliographies, footnotes, bibliographical strips, and references.