Jun Wang;Chenghao Su;Yijie Ou;Yanhui Li;Jialiang Tan;Lin Chen;Yuming Zhou
{"title":"Translating to a Low-Resource Language with Compiler Feedback: A Case Study on Cangjie","authors":"Jun Wang;Chenghao Su;Yijie Ou;Yanhui Li;Jialiang Tan;Lin Chen;Yuming Zhou","doi":"10.1109/TSE.2025.3594908","DOIUrl":null,"url":null,"abstract":"In the rapidly advancing field of software development, the demand for practical code translation tools has surged, driven by the need for interoperability across different programming environments. Existing learning-based approaches often need help with low-resource programming languages that lack sufficient parallel code corpora for training. To address these limitations, we propose a novel training framework that begins with monolingual seed corpora, generating parallel datasets via back-translation and incorporating compiler feedback to optimize the translation model. As a case study, we apply our method to train a code translation model for a new-born low-resource programming language, Cangjie. We also construct a parallel test dataset for <inline-formula><tex-math>$\\mathsf{Java}$</tex-math></inline-formula>-to-<inline-formula><tex-math>$\\mathsf{Cangjie}$</tex-math></inline-formula> translation and test cases to evaluate the effectiveness of our approach. Experimental results demonstrate that compiler feedback greatly enhances syntactical correctness, semantic accuracy, and test pass rates of the translated <inline-formula><tex-math>$\\mathsf{Cangjie}$</tex-math></inline-formula> code. These findings highlight the potential of our method to support code translation in low-resource settings, expanding the capabilities of learning-based models for programming languages with limited data availability.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 9","pages":"2671-2692"},"PeriodicalIF":5.6000,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11106812/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0
Abstract
In the rapidly advancing field of software development, the demand for practical code translation tools has surged, driven by the need for interoperability across different programming environments. Existing learning-based approaches often need help with low-resource programming languages that lack sufficient parallel code corpora for training. To address these limitations, we propose a novel training framework that begins with monolingual seed corpora, generating parallel datasets via back-translation and incorporating compiler feedback to optimize the translation model. As a case study, we apply our method to train a code translation model for a new-born low-resource programming language, Cangjie. We also construct a parallel test dataset for $\mathsf{Java}$-to-$\mathsf{Cangjie}$ translation and test cases to evaluate the effectiveness of our approach. Experimental results demonstrate that compiler feedback greatly enhances syntactical correctness, semantic accuracy, and test pass rates of the translated $\mathsf{Cangjie}$ code. These findings highlight the potential of our method to support code translation in low-resource settings, expanding the capabilities of learning-based models for programming languages with limited data availability.
期刊介绍:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.