Cross-domain analysis of discourse markers in European Portuguese

Q1 Arts and Humanities

Dialogue and Discourse. Publication date: 2018-06-08. DOI: 10.5087/dad.2018.103

Vera Cabarrão, Helena Moniz, Fernando Batista, Jaime Ferreira, I. Trancoso, Ana Isabel Mata

Pages: 79-106

Citations: 5

Abstract

This paper presents an analysis of discourse markers in two spontaneous speech corpora for European Portuguese (university lectures and map-task dialogues) and in a collection of tweets, with the aim of contributing to their categorization, which is scarcely available for European Portuguese. Our results show that the selection of discourse markers is domain- and speaker-dependent. We also found that the most frequent discourse markers are similar in all three corpora, although the tweets contain discourse markers not found in the other two corpora.

In this multidisciplinary study, which combines a linguistic perspective with a computational approach, discourse markers are also automatically discriminated from other structural metadata events, namely sentence-like units (SUs) and disfluencies. Our results show that discourse markers and disfluencies tend to co-occur in the dialogue corpus, but have a complementary distribution in the university lectures. We used three acoustic-prosodic feature sets and machine learning to automatically distinguish between discourse markers, disfluencies and sentence-like units. Our in-domain experiments achieved an accuracy of about 87% in university lectures and 84% in dialogues, in line with our previous results. The eGeMAPS features, commonly used for other paralinguistic tasks, achieved a considerable performance on our data, especially considering the small size of the feature set. Our results suggest that turn-initial discourse markers are usually easier to classify than disfluencies, a result also previously reported in the literature. We conducted a cross-domain evaluation to assess the robustness of the models across domains. The results achieved are about 11%-12% lower, but we conclude that data from one domain can still be used to classify the same events in the other. Overall, despite the complexity of this task, these are very encouraging state-of-the-art results.

Ultimately, using exclusively acoustic-prosodic cues, discourse markers can be discriminated fairly well from disfluencies and SUs. To better understand the contribution of each feature, we have also reported the impact of the features in both the dialogues and the university lectures. Pitch features, namely pitch slopes, are the most relevant ones for the distinction between discourse markers and disfluencies. These features are in line with the wide pitch range of discourse markers, in a continuum from a very compressed pitch range to a very wide one, expressed by totally deaccented material or H+L* L* contours, with upstep H tones.
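The difference between the in-domain and cross-domain evaluations described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic Gaussian points rather than the paper's corpora, the two features are toy stand-ins for acoustic-prosodic cues, the nearest-centroid classifier is not the authors' model, and the names `make_domain`, `split`, `fit_centroids`, `predict` and `accuracy` are hypothetical. It does not reproduce the reported accuracies.

```python
import random

# The three event classes named in the abstract.
LABELS = ("discourse_marker", "disfluency", "sentence_like_unit")


def make_domain(seed, shift, n_per_class=200):
    """Build a synthetic stand-in for one corpus (assumption, not real data).

    Each item is ([feature_0, feature_1], label). `shift` moves feature_0
    to mimic a mismatch between domains.
    """
    rng = random.Random(seed)
    data = []
    for k, label in enumerate(LABELS):
        for _ in range(n_per_class):
            features = [rng.gauss(2.0 * k + shift, 1.0), rng.gauss(-1.5 * k, 1.0)]
            data.append((features, label))
    rng.shuffle(data)
    return data


def split(data, train_fraction=0.5):
    """Split one domain into a training part and a held-out test part."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]


def fit_centroids(train):
    """Nearest-centroid 'training': compute the mean feature vector per label."""
    sums, counts = {}, {}
    for x, y in train:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def accuracy(centroids, test):
    """Fraction of test items whose predicted label matches the true label."""
    hits = sum(predict(centroids, x) == y for x, y in test)
    return hits / len(test)


if __name__ == "__main__":
    lectures_train, lectures_test = split(make_domain(seed=1, shift=0.0))
    _, dialogues_test = split(make_domain(seed=2, shift=0.8))

    model = fit_centroids(lectures_train)
    # In-domain: train and test on the same (synthetic) domain.
    print("in-domain accuracy:", accuracy(model, lectures_test))
    # Cross-domain: same model, tested on the other (synthetic) domain.
    print("cross-domain accuracy:", accuracy(model, dialogues_test))
```

The point of the sketch is only the protocol: the model is fitted once on one domain and then scored twice, on held-out data from the same domain and on data from the other domain, so that any gap between the two scores reflects the domain mismatch.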


Source journal

Dialogue and Discourse (Arts and Humanities: Language and Linguistics)

CiteScore: 1.90
Self-citation rate: 0.00%
Articles published: not listed
Review time: 12 weeks

Journal description: D&D seeks previously unpublished, high-quality articles on the analysis of discourse and dialogue that contain:
- experimental and/or theoretical studies related to the construction, representation, and maintenance of (linguistic) context
- linguistic analysis of phenomena characteristic of discourse and/or dialogue (including, but not limited to: reference and anaphora, presupposition and accommodation, topicality and salience, implicature, discourse structure and rhetorical relations, discourse markers and particles, the semantics and pragmatics of dialogue acts, questions, imperatives, non-sentential utterances, intonation, and meta-communicative phenomena such as repair and grounding)
- experimental and/or theoretical studies of agents' information states and their dynamics in conversational interaction
- new analytical frameworks that advance theoretical studies of discourse and dialogue
- research on systems performing coreference resolution, discourse structure parsing, event and temporal structure, and reference resolution in multimodal communication
- experimental and/or theoretical results yielding new insight into non-linguistic interaction in communication
- work on natural language understanding (including spoken language understanding), dialogue management, reasoning, and natural language generation (including text-to-speech) in dialogue systems
- work related to the design and engineering of dialogue systems (including, but not limited to: evaluation, usability design and testing, rapid application deployment, embodied agents, affect detection, mixed-initiative, adaptation, and user modeling)
- extremely well-written surveys of existing work

Highest priority is given to research reports that are specifically written for a multidisciplinary audience. The audience is primarily researchers on discourse and dialogue and its associated fields, including computer scientists, linguists, psychologists, philosophers, roboticists, and sociologists.