Title: Syntactic complexity recognition and analysis in Chinese-English machine translation: A comparative study based on the BLSTM-CRF model
Author: Yongli Tian
Journal: PLoS ONE, volume 20, issue 6, e0325721 (eCollection)
Publication date: 2025-06-12
Publication type: Journal Article
DOI: https://doi.org/10.1371/journal.pone.0325721
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12161555/pdf/
Journal impact factor: 2.9 (JCR Q1, Multidisciplinary Sciences)
Citations: 0
Abstract
To improve the recognition and preservation of syntactic complexity in Chinese-English translation, this study proposes an optimized Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) model. Using the Workshop on Machine Translation (WMT) Chinese-English parallel corpus, the experimental framework targets two types of specialized data: complex sentences and cross-linguistic sentence pairs. The model integrates explicit syntactic features (part-of-speech tags, dependency relations, and syntactic tree depth) and incorporates an attention mechanism to better capture syntactic complexity. The study also constructs an evaluation framework of eight indicators to assess syntactic complexity recognition and translation quality: (1) average syntactic node depth (higher values indicate greater complexity; typically 1.0 to 5.0); (2) number of embedded clause levels (higher values indicate greater complexity; typically 0-5); (3) long-distance dependency ratio (higher values indicate broader dependency spans; range 0-1, moderate values preferred); (4) average branching factor (higher values indicate denser modifiers; range 1.0-4.0); (5) syntactic change ratio (lower values indicate greater structural stability; range 0-1); (6) translation alignment consistency rate (higher values indicate better alignment; range 0-1); (7) syntactic tree reconstruction cost (lower values indicate smaller structural adjustment overhead; range 0-1); and (8) translation syntactic balance (higher values indicate more natural syntactic rendering; range 0-1). This indicator system supports a comprehensive evaluation of the model's capabilities in syntactic modeling, structural preservation, and cross-linguistic alignment. Experimental results show that the optimized model outperforms the baseline models on multiple core indicators.
On the complex sentence dataset, the optimized model achieves a long-distance dependency ratio of 0.658 (moderately high), an embedded clause level of 3.167 (indicating complex structure), an average branching factor of 2.897, and a syntactic change ratio of only 0.432. On all of these indicators it significantly outperforms comparison models such as Syntax-Transformer and Syntax-Bidirectional Encoder Representations from Transformers (Syntax-BERT). On the cross-linguistic sentence dataset, the optimized model attains a syntactic tree reconstruction cost of only 0.214 (low adjustment overhead) and a translation alignment consistency rate of 0.894 (high alignment accuracy). These results indicate clear advantages in structural preservation and adjustment. In contrast, the comparison models perform inconsistently on complex and cross-linguistic data. For example, Syntax-BERT reaches an embedded clause level of only 2.321, indicating difficulty in handling complex syntactic structures. In summary, by introducing explicit syntactic features and a multidimensional indicator system, the proposed approach demonstrates strong modeling capacity in syntactic complexity recognition and better preserves syntactic structures during translation. The study offers new insights into syntactic complexity modeling in natural language processing and contributes, both theoretically and practically, to syntactic processing in machine translation systems.
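The abstract names its structural indicators but does not give their formulas. As a rough illustration only, the sketch below shows one plausible way to compute three of them (average syntactic node depth, long-distance dependency ratio, and average branching factor) from a dependency parse. Everything specific here is an assumption rather than the paper's definition: the parse is a CoNLL-U style list of 1-based head indices with 0 for the root, the root has depth 1, an arc counts as "long-distance" when head and dependent are at least `min_span` tokens apart, and the branching factor averages over nodes that have at least one child. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def node_depths(heads):
    """Depth of every token. heads[i] is the 1-based head of token i+1; 0 marks the root."""
    depths = []
    for token in range(1, len(heads) + 1):
        depth, node = 1, token  # assumption: the root token has depth 1
        while heads[node - 1] != 0:
            node = heads[node - 1]
            depth += 1
            if depth > len(heads):  # a well-formed tree can never be this deep
                raise ValueError("cycle detected in head list")
        depths.append(depth)
    return depths


def average_node_depth(heads):
    """Mean depth over all tokens (assumed reading of 'average syntactic node depth')."""
    return mean(node_depths(heads))


def long_distance_ratio(heads, min_span=3):
    """Share of non-root arcs spanning at least min_span tokens (min_span is an assumption)."""
    spans = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    if not spans:
        return 0.0
    return sum(span >= min_span for span in spans) / len(spans)


def average_branching_factor(heads):
    """Mean number of children over tokens that have at least one child (assumed definition)."""
    children = defaultdict(int)
    for h in heads:
        if h != 0:
            children[h] += 1
    return mean(list(children.values())) if children else 0.0


# Hand-annotated toy parse of "The cat that I saw sleeps" (illustrative, not from the paper):
# The->cat, cat->sleeps, that->saw, I->saw, saw->cat, sleeps->ROOT
heads = [2, 6, 5, 5, 2, 0]
print(node_depths(heads))               # [3, 2, 4, 4, 3, 1]
print(average_node_depth(heads))        # 17/6, about 2.83
print(long_distance_ratio(heads))       # 2 of 5 arcs span >= 3 tokens: 0.4
print(average_branching_factor(heads))  # 5 children over 3 parents, about 1.67
```

Under these assumptions the toy sentence lands inside the ranges quoted in the abstract (depth between 1.0 and 5.0, ratio between 0 and 1, branching factor between 1.0 and 4.0), but the values are not comparable to the reported results, since the paper's actual definitions and parser may differ.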
About the journal:
PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage