Enhancing Turkish Coreference Resolution: Insights from deep learning, dropped pronouns, and multilingual transfer learning

IF 3.1 · CAS Zone 3 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Computer Speech and Language Pub Date : 2024-06-18 DOI:10.1016/j.csl.2024.101681

Tuğba Pamay Arslan, Gülşen Eryiğit

Volume 89, Article 101681. Full text: https://www.sciencedirect.com/science/article/pii/S0885230824000640

Abstract

Coreference resolution (CR), the identification of in-text mentions that refer to the same entity, is a crucial step in natural language understanding. While CR in English has been studied for a long time, CR for pro-drop and morphologically rich languages is an active research area that has yet to reach sufficient maturity. Turkish, a morphologically very rich language, poses interesting challenges for natural language processing tasks, including CR, because of its agglutinative nature and the resulting pronoun-dropping phenomenon. This article explores the use of different neural CR architectures (i.e., mention-pair, mention-ranking, and end-to-end) on Turkish by formulating multiple research questions around the impacts of dropped pronouns, data quality, and interlingual transfer. The preparations made to explore these research questions, and the findings obtained, yielded: the first Turkish CR dataset that includes dropped-pronoun annotations (4K entities/22K mentions); new state-of-the-art results on Turkish CR; the first neural end-to-end Turkish CR results (70.4% F-score); the first multilingual end-to-end CR results including Turkish (a 1.0 percentage point improvement on Turkish); and the first demonstration in the literature of the positive impact of dropped pronouns on CR for pro-drop and morphologically rich languages. Our research has brought Turkish end-to-end CR performance (72.0% F-score) to a level similar to that of other languages, surpassing the baseline scores by 32.1 percentage points.
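To make the task definition concrete, here is a minimal sketch of how pairwise coreference decisions (the output of a mention-pair style model) can be grouped into entities. It is not the authors' system: the function name `cluster_mentions`, the `<pro>` placeholder for a dropped pronoun, and the toy sentence are all assumptions chosen for illustration, and the pairwise links are supplied by hand rather than predicted.

```python
from collections import defaultdict


def cluster_mentions(mentions, coreferent_pairs):
    """Group mentions into entities from pairwise links using union-find.

    Assumes every mention string is unique within the document.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the cluster representative (path halving).
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    # Each positive pairwise decision merges two clusters.
    for a, b in coreferent_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for m in mentions:
        clusters[find(m)].append(m)
    # Order entities by the position of their first mention.
    return sorted(clusters.values(), key=lambda c: mentions.index(c[0]))


# Toy document: "Ali eve geldi. Yorgundu." ("Ali came home. [He] was tired.")
# The subject of the second sentence is dropped; "<pro>" is a hypothetical
# placeholder standing in for that dropped pronoun.
mentions = ["Ali", "ev", "<pro>"]
pairs = [("<pro>", "Ali")]  # hand-written link, not a model prediction
entities = cluster_mentions(mentions, pairs)
print(entities)
```

Without an explicit `<pro>` mention, the second sentence would contribute nothing to the entity for "Ali", which is one way to picture why annotating dropped pronouns can matter in a pro-drop language.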

Source journal

Computer Speech and Language (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Review time: 22.9 weeks

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.