Rosetta-XAI: An automated evaluation and explainability framework for code translation models

Software Impacts (IF 1.2, Q3 Computer Science, Software Engineering)
Pub date: 2026-04-01 · Epub date: 2026-01-19 · DOI: 10.1016/j.simpa.2026.100811
Vishnu S. Pendyala, Neha Bais Thakur

Abstract

This paper presents Rosetta-XAI, a comprehensive software framework for evaluating and explaining Large Language Model (LLM) behavior in cross-language code conversion tasks. The system implements a four-stage automated pipeline: (1) code generation by LLMs accessed through the Ollama API inference service, (2) regex-based extraction of code blocks from markdown responses, (3) language-specific syntax and compilation validation with temporary artifact management, and (4) execution with timeout protections and CSV-based checkpoint recovery. The framework supports evaluation of 15 specialized code LLMs (1.3B–34B parameters), including DeepSeek Coder, Code Llama, CodeGemma, and Granite Code, across 17 Rosetta Code programming tasks, generating 42 bidirectional conversion pairs among seven languages (C, C++, Go, Java, JavaScript, Python, Rust). Beyond traditional pass@1 accuracy metrics, the system incorporates explainability analysis through Shapley Value Sampling and Feature Ablation techniques implemented via Captum and PyTorch, enabling researchers to quantify token-level feature importance during translation. All pipeline components include XAI-enhanced variants supporting follow-up question analysis for interpretability studies. The framework is built in Python, using pandas for metrics aggregation and subprocess management for multi-language execution; its modular architecture separates extraction, validation, and execution concerns. Results are systematically organized into structured directories tracking accepted code, compilation failures, syntax errors, and execution outputs, with comprehensive metrics exported to CSVs for reproducible research and comparative model analysis.
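The extraction, validation, and execution stages of the pipeline described above could be sketched roughly as follows. This is a minimal illustration, not the framework's actual implementation: the regex, function names, result schema, and the choice of Python as the candidate language are all assumptions for demonstration purposes.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

# Assumed pattern for fenced code blocks in an LLM's markdown response:
# an optional language tag after the opening fence, then the block body.
CODE_BLOCK_RE = re.compile(r"`{3}(\w+)?\n(.*?)`{3}", re.DOTALL)

def extract_code_blocks(markdown: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs extracted from fenced markdown blocks."""
    return [(lang or "", code) for lang, code in CODE_BLOCK_RE.findall(markdown)]

def validate_and_run(code: str, timeout_s: int = 10) -> dict:
    """Syntax-check a Python candidate in a temporary directory, then
    execute it in a subprocess with a timeout guard (illustrative only;
    compiled languages would invoke their compiler here instead)."""
    with tempfile.TemporaryDirectory() as tmp:  # temporary artifact management
        src = Path(tmp) / "candidate.py"
        src.write_text(code)
        try:
            compile(code, str(src), "exec")  # language-specific syntax validation
        except SyntaxError as exc:
            return {"status": "syntax_error", "detail": str(exc)}
        try:
            proc = subprocess.run(
                [sys.executable, str(src)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout"}  # timeout protection for runaway code
        status = "accepted" if proc.returncode == 0 else "runtime_error"
        return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}

# Demo: extract a block from a synthetic LLM response and run it.
FENCE = "`" * 3
demo_response = f"Here is the converted code:\n{FENCE}python\nprint('hi')\n{FENCE}\n"
blocks = extract_code_blocks(demo_response)  # -> [("python", "print('hi')\n")]
```

Each outcome label (`accepted`, `syntax_error`, `runtime_error`, `timeout`) maps naturally onto the structured result directories the abstract describes.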
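The metrics stage can likewise be sketched. Assuming a hypothetical per-attempt results table (the column names `model`, `pair`, and `status` are illustrative, not the framework's actual schema), pass@1 per model is simply the fraction of conversion attempts whose first generated sample is accepted:

```python
import pandas as pd

# Hypothetical schema: one row per (model, conversion pair) attempt, with a
# 'status' column produced by the validation/execution stages.
results = pd.DataFrame({
    "model": ["deepseek-coder", "deepseek-coder", "codellama", "codellama"],
    "pair": ["python->rust", "go->java", "python->rust", "go->java"],
    "status": ["accepted", "syntax_error", "accepted", "accepted"],
})

# pass@1 per model: share of attempts whose single sample compiled and ran.
pass_at_1 = (
    results.assign(passed=results["status"].eq("accepted"))
    .groupby("model")["passed"]
    .mean()
)

# Export for reproducible, comparative analysis, as the abstract describes.
pass_at_1.to_csv("pass_at_1.csv")
```

Persisting the per-attempt table to CSV after each batch would also provide the checkpoint-recovery behavior mentioned in the abstract: a restarted run can skip rows already present in the file.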