Morphological and structural complexity analysis of low-resource English-Turkish language pair using neural machine translation models.

Impact factor: 2.5 · Zone 4 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)

PeerJ Computer Science Pub Date : 2025-08-11 eCollection Date: 2025-01-01 DOI:10.7717/peerj-cs.3072

Mehmet Acı, Nisa Vuran Sarı, Çiğdem İnan Acı

PeerJ Computer Science, volume 11, article e3072. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453858/pdf/


Abstract

Neural machine translation (NMT) has achieved remarkable success in high-resource language pairs; however, its effectiveness for morphologically rich, low-resource languages like Turkish remains underexplored. As a highly agglutinative and morphologically complex language with limited high-quality parallel data, Turkish serves as a representative case for evaluating NMT systems in low-resource and linguistically challenging settings. Its structural divergence from English makes it a critical testbed for assessing tokenization strategies, attention mechanisms, and model generalizability in neural translation. This study compares the performance of two prominent NMT paradigms, the Transformer architecture and recurrent sequence-to-sequence (Seq2Seq) models with attention, in both English-to-Turkish and Turkish-to-English translation. The models are evaluated under various configurations, including different tokenization strategies (Byte Pair Encoding (BPE) vs. word-level tokenization), attention mechanisms (Bahdanau attention and an exploratory hybrid mechanism combining Bahdanau and scaled dot-product attention), and architectural settings (number of layers and number of attention heads). Extensive experiments using the automatic metrics BiLingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR), and Translation Error Rate (TER) show that the Transformer model with three layers, eight attention heads, and BPE tokenization achieved the best performance, obtaining a BLEU score of 47.85 and a METEOR score of 44.62 in the English-to-Turkish direction. Similar performance trends were observed in the reverse direction, suggesting that the model generalizes across translation directions. These findings highlight the potential of carefully optimized Transformer-based NMT systems for handling the complexities of morphologically rich, low-resource languages like Turkish in both translation directions.
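To make the tokenization comparison in the abstract more concrete, the following is a minimal sketch of how BPE learns subword units, and why that can help with an agglutinative language where one stem appears under many suffix combinations. It is not the authors' implementation: the toy Turkish word list, its frequencies, the number of merges, and the function names `learn_bpe`, `merge_pair`, and `segment` are all assumptions made for illustration. A real experiment would train a library implementation on the full parallel corpus.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of BPE merges from a {word: frequency} dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself so runs are deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical toy corpus: inflected forms of "ev" (house) and "kitap" (book).
    toy_corpus = {
        "ev": 8, "evler": 6, "evlerde": 4, "evlerimiz": 3,
        "kitap": 7, "kitaplar": 5, "kitaplarda": 3,
    }
    merges = learn_bpe(toy_corpus, num_merges=12)
    # "evlerimizde" and "kitaplarımız" never occur in the toy corpus.
    for word in ["evlerimizde", "kitaplarımız"]:
        print(word, "->", segment(word, merges))
```

A word-level tokenizer would map both unseen forms to a single unknown token, whereas the subword segmentation always reconstructs the full word from smaller pieces, falling back to single characters where no merge applies. This is one plausible reason for the BPE advantage reported above, though the abstract itself does not state the cause.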


Source journal: PeerJ Computer Science (Computer Science: General Computer Science)

CiteScore: 6.10
Self-citation rate: 5.30%
Articles published: 332
Review time: 10 weeks

About the journal: PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.