Rosetta-XAI: An automated evaluation and explainability framework for code translation models

Software Impacts (IF 1.2, Q3 Computer Science, Software Engineering)
Pub date: 2026-04-01 · Epub date: 2026-01-19 · DOI: 10.1016/j.simpa.2026.100811
Vishnu S. Pendyala, Neha Bais Thakur

Abstract

This paper presents Rosetta-XAI, a comprehensive software framework for evaluating and explaining Large Language Model (LLM) behavior in cross-language code conversion tasks. The system implements a four-stage automated pipeline: (1) code generation by LLMs accessed through the Ollama API inference service, (2) regex-based extraction of code blocks from markdown responses, (3) language-specific syntax and compilation validation with temporary artifact management, and (4) execution with timeout protections and CSV-based checkpoint recovery. The framework supports evaluation of 15 specialized code LLMs (1.3B–34B parameters), including DeepSeek Coder, Code Llama, CodeGemma, and Granite Code, across 17 Rosetta Code programming tasks, generating 42 bidirectional conversion pairs among seven languages (C, C++, Go, Java, JavaScript, Python, Rust). Beyond traditional pass@1 accuracy metrics, the system incorporates explainability analysis through Shapley Value Sampling and Feature Ablation techniques implemented via Captum and PyTorch, enabling researchers to quantify token-level feature importance during translation. All pipeline components include XAI-enhanced variants supporting follow-up question analysis for interpretability studies. The framework is built in Python, using pandas for metrics aggregation and subprocess management for multi-language execution; its modular architecture separates extraction, validation, and execution concerns. Results are systematically organized into structured directories tracking accepted code, compilation failures, syntax errors, and execution outputs, with comprehensive metrics exported to CSVs for reproducible research and comparative model analysis.
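The extraction, validation, and execution stages of the pipeline described above could be sketched roughly as follows. This is a minimal illustration, not the framework's actual implementation: the regex, function names, result schema, and the choice of Python as the candidate language are all assumptions for demonstration purposes.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

# Assumed pattern for fenced code blocks in an LLM's markdown response:
# an optional language tag after the opening fence, then the block body.
CODE_BLOCK_RE = re.compile(r"`{3}(\w+)?\n(.*?)`{3}", re.DOTALL)

def extract_code_blocks(markdown: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs extracted from fenced markdown blocks."""
    return [(lang or "", code) for lang, code in CODE_BLOCK_RE.findall(markdown)]

def validate_and_run(code: str, timeout_s: int = 10) -> dict:
    """Syntax-check a Python candidate in a temporary directory, then
    execute it in a subprocess with a timeout guard (illustrative only;
    compiled languages would invoke their compiler here instead)."""
    with tempfile.TemporaryDirectory() as tmp:  # temporary artifact management
        src = Path(tmp) / "candidate.py"
        src.write_text(code)
        try:
            compile(code, str(src), "exec")  # language-specific syntax validation
        except SyntaxError as exc:
            return {"status": "syntax_error", "detail": str(exc)}
        try:
            proc = subprocess.run(
                [sys.executable, str(src)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout"}  # timeout protection for runaway code
        status = "accepted" if proc.returncode == 0 else "runtime_error"
        return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}

# Demo: extract a block from a synthetic LLM response and run it.
FENCE = "`" * 3
demo_response = f"Here is the converted code:\n{FENCE}python\nprint('hi')\n{FENCE}\n"
blocks = extract_code_blocks(demo_response)  # -> [("python", "print('hi')\n")]
```

Each outcome label (`accepted`, `syntax_error`, `runtime_error`, `timeout`) maps naturally onto the structured result directories the abstract describes.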
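The metrics stage can likewise be sketched. Assuming a hypothetical per-attempt results table (the column names `model`, `pair`, and `status` are illustrative, not the framework's actual schema), pass@1 per model is simply the fraction of conversion attempts whose first generated sample is accepted:

```python
import pandas as pd

# Hypothetical schema: one row per (model, conversion pair) attempt, with a
# 'status' column produced by the validation/execution stages.
results = pd.DataFrame({
    "model": ["deepseek-coder", "deepseek-coder", "codellama", "codellama"],
    "pair": ["python->rust", "go->java", "python->rust", "go->java"],
    "status": ["accepted", "syntax_error", "accepted", "accepted"],
})

# pass@1 per model: share of attempts whose single sample compiled and ran.
pass_at_1 = (
    results.assign(passed=results["status"].eq("accepted"))
    .groupby("model")["passed"]
    .mean()
)

# Export for reproducible, comparative analysis, as the abstract describes.
pass_at_1.to_csv("pass_at_1.csv")
```

Persisting the per-attempt table to CSV after each batch would also provide the checkpoint-recovery behavior mentioned in the abstract: a restarted run can skip rows already present in the file.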