Context-aware Code Segmentation for C-to-Rust Translation using Large Language Models
Momoko Shiraishi, Takahiro Shinagawa
arXiv - CS - Software Engineering, DOI: arxiv-2409.10506, published 2024-09-16
Citation count: 0
Abstract
There is strong motivation to translate C code into Rust code due to the
continuing threat of memory safety vulnerabilities in existing C programs and
the significant attention paid to Rust as an alternative to the C language.
While large language models (LLMs) show promise for automating this translation
by generating more natural and safer code than rule-based methods, previous
studies have shown that LLM-generated Rust code often fails to compile, even
for relatively small C programs, due to significant differences between the two
languages and context window limitations. We propose an LLM-based translation
scheme that improves the success rate of translating large-scale C code into
compilable Rust code. Our approach involves three key techniques: (1)
pre-processing the C code to better align its structure and expressions with
Rust, (2) segmenting the code into optimally sized translation units to avoid
exceeding the LLM's context window limits, and (3) iteratively compiling and
repairing errors while maintaining consistency between translation units using
context-supplementing prompts. Compilation success is an essential first step
in achieving functional equivalence, as only compilable code can be further
tested. In experiments with 20 benchmark C programs, including programs exceeding
4,000 lines of code, we successfully translated all programs into compilable
Rust code without dropping any corresponding parts of the original code.
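
As a rough illustration of techniques (2) and (3), the sketch below packs top-level C items into translation units under a token budget, then runs a translate, compile, and repair loop over each unit while passing the already-translated Rust as context. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the segmentation criterion, the prompts, or the repair policy. All names here (`estimate_tokens`, `segment_units`, `translate_with_repair`, and the `translate`, `compile_check`, and `repair` callables) are hypothetical, and the whitespace-based token count stands in for a real tokenizer.

```python
from typing import Callable, List, Optional, Tuple


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption): count whitespace-separated words."""
    return len(text.split())


def segment_units(items: List[str], budget: int) -> List[List[str]]:
    """Greedily pack consecutive top-level C items into units of at most
    `budget` estimated tokens. An item larger than the budget gets a unit
    of its own rather than being split."""
    units: List[List[str]] = []
    current: List[str] = []
    used = 0
    for item in items:
        cost = estimate_tokens(item)
        if current and used + cost > budget:
            units.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        units.append(current)
    return units


def translate_with_repair(
    units: List[List[str]],
    translate: Callable[[str, str], str],
    compile_check: Callable[[str, str], Optional[str]],
    repair: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[List[str], bool]:
    """Translate each unit in order. The Rust produced so far is passed as
    context so later units can stay consistent with earlier ones.
    `compile_check` returns None on success or an error message otherwise.
    Returns the translated units and whether every unit compiled."""
    translated: List[str] = []
    all_ok = True
    for unit in units:
        context = "\n".join(translated)
        rust = translate("\n".join(unit), context)
        error = compile_check(context, rust)
        rounds = 0
        while error is not None and rounds < max_rounds:
            rust = repair(rust, error, context)
            error = compile_check(context, rust)
            rounds += 1
        all_ok = all_ok and error is None
        translated.append(rust)
    return translated, all_ok
```

In practice, `translate` and `repair` would wrap LLM calls and `compile_check` would invoke the Rust compiler on the accumulated output. Passing them in as callables keeps the control flow testable without either dependency.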