Syntactic complexity recognition and analysis in Chinese-English machine translation: A comparative study based on the BLSTM-CRF model.

IF 2.9 | Zone 3 (multidisciplinary journals) | JCR Q1, MULTIDISCIPLINARY SCIENCES

PLoS ONE Pub Date : 2025-06-12 eCollection Date: 2025-01-01 DOI:10.1371/journal.pone.0325721

Yongli Tian

PLoS ONE, volume 20, issue 6, e0325721. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12161555/pdf/

Citations: 0

Abstract

To improve the recognition and preservation of syntactic complexity in Chinese-English translation, this study proposes an optimized Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) model. Using the Workshop on Machine Translation (WMT) Chinese-English parallel corpus, it designs an experimental framework around two specialized datasets: complex sentences and cross-linguistic sentence pairs. The model integrates explicit syntactic features (part-of-speech tags, dependency relations, and syntactic tree depth) and adds an attention mechanism to better capture syntactic complexity. The study also constructs an evaluation framework of eight indicators for assessing syntactic complexity recognition and translation quality: (1) average syntactic node depth (higher values indicate greater complexity; typically 1.0-5.0); (2) number of embedded clause levels (higher values indicate greater complexity; typically 0-5); (3) long-distance dependency ratio (higher values indicate broader dependency spans; range 0-1, moderate values preferred); (4) average branching factor (higher values indicate denser modifiers; range 1.0-4.0); (5) syntactic change ratio (lower values indicate greater structural stability; range 0-1); (6) translation alignment consistency rate (higher values indicate better alignment; range 0-1); (7) syntactic tree reconstruction cost (lower values indicate smaller structural adjustment overhead; range 0-1); (8) translation syntactic balance (higher values indicate more natural syntactic rendering; range 0-1). This indicator system supports a comprehensive evaluation of the model's capabilities in syntactic modeling, structural preservation, and cross-linguistic alignment. Experimental results show that the optimized model outperforms baseline models across multiple core indicators.
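The abstract names the indicators but does not give their formulas. As a minimal sketch, three of the tree-based indicators (average node depth, long-distance dependency ratio, average branching factor) could be computed from a dependency parse as follows. The function name `tree_indicators`, the head-index encoding, the convention that the root has depth 1, and the `min_span` threshold for what counts as "long-distance" are all assumptions made for illustration, not the paper's definitions.

```python
from collections import defaultdict


def tree_indicators(heads, min_span=3):
    """Compute toy syntactic-complexity indicators for one sentence.

    heads[i] is the 1-based index of the head of token i+1; 0 marks the root.
    The input is assumed to be a well-formed tree (single root, no cycles).
    min_span is an assumed cutoff: arcs spanning at least this many tokens
    are counted as long-distance.
    """
    n = len(heads)

    def depth(tok):
        # Walk up to the root, counting nodes on the path (root depth = 1).
        d = 1
        while heads[tok - 1] != 0:
            tok = heads[tok - 1]
            d += 1
        return d

    depths = [depth(t) for t in range(1, n + 1)]

    # Every non-root token contributes one (dependent, head) arc.
    arcs = [(t, h) for t, h in enumerate(heads, start=1) if h != 0]

    # Count children per head; only nodes with at least one child appear.
    children = defaultdict(int)
    for _, h in arcs:
        children[h] += 1

    long_arcs = sum(1 for t, h in arcs if abs(t - h) >= min_span)

    return {
        "avg_node_depth": sum(depths) / n,
        "long_distance_ratio": long_arcs / len(arcs) if arcs else 0.0,
        "avg_branching_factor": (
            sum(children.values()) / len(children) if children else 0.0
        ),
    }


# Toy parse of "The model that we trained preserves structure"
# (hand-annotated for illustration, not taken from the paper's data).
heads = [2, 6, 5, 5, 2, 0, 6]
print(tree_indicators(heads))
```

Under these assumed definitions the branching factor is averaged over internal nodes only, so leaves do not pull the value toward zero; a different averaging choice would give different ranges from those quoted in the abstract.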
On the complex sentence dataset, the optimized model achieves a long-distance dependency ratio of 0.658 (moderately high), an embedded clause level of 3.167 (indicating complex structure), an average branching factor of 2.897, and a syntactic change ratio of only 0.432. On all four indicators it significantly outperforms comparison models such as Syntax-Transformer and Syntax-Bidirectional Encoder Representations from Transformers (Syntax-BERT). On the cross-linguistic sentence dataset, the optimized model attains a syntactic tree reconstruction cost of only 0.214 (low adjustment overhead) and a translation alignment consistency rate of 0.894 (high alignment accuracy), indicating clear advantages in structural preservation and adjustment. In contrast, the comparison models perform unstably on complex and cross-linguistic data. For example, Syntax-BERT reaches an embedded clause level of only 2.321, which suggests difficulty in handling complex syntactic structures. In summary, by introducing explicit syntactic features and a multidimensional indicator system, the study demonstrates strong modeling capacity in syntactic complexity recognition and better preservation of syntactic structures during translation. It offers new insights into syntactic complexity modeling in natural language processing and contributes, in both theory and practice, to syntactic processing in machine translation systems.
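The abstract reports a translation alignment consistency rate without stating how it is computed. One plausible reading, shown here purely as an illustrative sketch, is the share of source dependency arcs whose two endpoints are word-aligned to target tokens that are linked by the same directed arc in the target parse. The function names, the head-index encoding, the one-to-one alignment dictionary, and the toy data are all assumptions, not the paper's method.

```python
def arcs(heads):
    """Return the set of (dependent, head) arcs; heads is 1-based, 0 = root."""
    return {(dep, head) for dep, head in enumerate(heads, start=1) if head != 0}


def alignment_consistency(src_heads, tgt_heads, alignment):
    """Fraction of aligned source arcs that are also arcs in the target tree.

    alignment maps 1-based source token indices to 1-based target token
    indices (assumed one-to-one). Source arcs with an unaligned endpoint
    are skipped rather than counted as failures.
    """
    tgt_arcs = arcs(tgt_heads)
    checked = 0
    preserved = 0
    for dep, head in arcs(src_heads):
        if dep in alignment and head in alignment:
            checked += 1
            if (alignment[dep], alignment[head]) in tgt_arcs:
                preserved += 1
    return preserved / checked if checked else 0.0


# Toy example: a 4-token source tree, and a flatter target tree in which
# tokens 3 and 4 appear in swapped order (indices only, no real sentences).
src_heads = [2, 0, 4, 2]
tgt_heads = [2, 0, 2, 2]
alignment = {1: 1, 2: 2, 3: 4, 4: 3}
print(alignment_consistency(src_heads, tgt_heads, alignment))
```

In this toy case the modifier arc from source token 3 to token 4 has no counterpart in the flattened target tree, so two of the three aligned arcs are preserved. Requiring the same arc direction is a strict choice; an undirected comparison would be more lenient toward head-direction differences between Chinese and English.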


Source journal

PLoS ONE (Biology)

CiteScore: 6.20

Self-citation rate: 5.40%

Articles published: 14242

Review time: 3.7 months

About the journal: PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage