{"title":"Can GPT-O1 Kill All Bugs?","authors":"Haichuan Hu, Ye Shang, Guolin Xu, Congqing He, Quanjun Zhang","doi":"arxiv-2409.10033","DOIUrl":"https://doi.org/arxiv-2409.10033","url":null,"abstract":"ChatGPT has long been proven to be effective in automatic program repair\u0000(APR). With the continuous iterations and upgrades of the ChatGPT version, its\u0000performance in terms of fixes has already reached state-of-the-art levels.\u0000However, there are few works comparing the effectiveness and variations of\u0000different versions of ChatGPT on APR. In this work, we evaluate the performance\u0000of the latest version of ChatGPT (O1-preview and O1-mini), ChatGPT-4o, and\u0000historical version of ChatGPT on APR. We study the improvements of the O1 model\u0000over traditional ChatGPT in terms of APR from multiple perspectives (repair\u0000success rate, repair cost, behavior patterns), and find that O1's repair\u0000capability exceeds that of traditional ChatGPT, successfully fixing all 40 bugs\u0000in the benchmark. Our work can serve as a reference for further in-depth\u0000exploration of the applications of ChatGPT in APR.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA","authors":"Alexander Berndt, Thomas Bach, Sebastian Baltes","doi":"arxiv-2409.10062","DOIUrl":"https://doi.org/arxiv-2409.10062","url":null,"abstract":"Background: Test flakiness is a major problem in the software industry. Flaky\u0000tests fail seemingly at random without changes to the code and thus impede\u0000continuous integration (CI). Some researchers argue that all tests can be\u0000considered flaky and that tests only differ in their frequency of flaky\u0000failures. Aims: With the goal of developing mitigation strategies to reduce the\u0000negative impact of test flakiness, we study characteristics of tests and the\u0000test environment that potentially impact test flakiness. Method: We construct two datasets based on SAP HANA's test results over a\u000012-week period: one based on production data, the other based on targeted test\u0000executions from a dedicated flakiness experiment. We conduct correlation\u0000analysis for test and test environment characteristics with respect to their\u0000influence on the frequency of flaky test failures. Results: In our study, the average test execution time had the strongest\u0000positive correlation with the test flakiness rate (r = 0.79), which confirms\u0000previous studies. Potential reasons for higher flakiness include the larger\u0000test scope of long-running tests or test executions on a slower test\u0000infrastructure. Interestingly, the load on the testing infrastructure was not\u0000correlated with test flakiness. The relationship between test flakiness and\u0000required resources for test execution is inconclusive. Conclusions: Based on our findings, we conclude that splitting long-running\u0000tests can be an important measure for practitioners to cope with test\u0000flakiness, as it enables parallelization of test executions and also reduces\u0000the cost of re-executions. This effectively decreases the negative effects of\u0000test flakiness in complex testing environments. However, when splitting\u0000long-running tests, practitioners need to consider the potential test setup\u0000overhead of test splits.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code","authors":"Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia","doi":"arxiv-2409.10280","DOIUrl":"https://doi.org/arxiv-2409.10280","url":null,"abstract":"In recent years, the application of large language models (LLMs) to\u0000code-related tasks has gained significant attention. However, existing\u0000evaluation benchmarks often focus on limited scenarios, such as code generation\u0000or completion, which do not reflect the diverse challenges developers face in\u0000real-world contexts. To address this, we introduce ComplexCodeEval, a benchmark\u0000designed to assess LCMs in various development tasks, including code\u0000generation, completion, API recommendation, and test case generation. It\u0000includes 3,897 Java samples and 7,184 Python samples from high-star GitHub\u0000repositories, each annotated with function signatures, docstrings, and API\u0000references to simulate real development environments. Our experiments across\u0000ten LCMs reveal that context improves performance and that data leakage can\u0000lead to overestimation, highlighting the need for more accurate evaluations.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LeGEND: A Top-Down Approach to Scenario Generation of Autonomous Driving Systems Assisted by Large Language Models","authors":"Shuncheng Tang, Zhenya Zhang, Jixiang Zhou, Lei Lei, Yuan Zhou, Yinxing Xue","doi":"arxiv-2409.10066","DOIUrl":"https://doi.org/arxiv-2409.10066","url":null,"abstract":"Autonomous driving systems (ADS) are safety-critical and require\u0000comprehensive testing before their deployment on public roads. While existing\u0000testing approaches primarily aim at the criticality of scenarios, they often\u0000overlook the diversity of the generated scenarios that is also important to\u0000reflect system defects in different aspects. To bridge the gap, we propose\u0000LeGEND, that features a top-down fashion of scenario generation: it starts with\u0000abstract functional scenarios, and then steps downwards to logical and concrete\u0000scenarios, such that scenario diversity can be controlled at the functional\u0000level. However, unlike logical scenarios that can be formally described,\u0000functional scenarios are often documented in natural languages (e.g., accident\u0000reports) and thus cannot be precisely parsed and processed by computers. To\u0000tackle that issue, LeGEND leverages the recent advances of large language\u0000models (LLMs) to transform textual functional scenarios to formal logical\u0000scenarios. To mitigate the distraction of useless information in functional\u0000scenario description, we devise a two-phase transformation that features the\u0000use of an intermediate language; consequently, we adopt two LLMs in LeGEND, one\u0000for extracting information from functional scenarios, the other for converting\u0000the extracted information to formal logical scenarios. We experimentally\u0000evaluate LeGEND on Apollo, an industry-grade ADS from Baidu. Evaluation results\u0000show that LeGEND can effectively identify critical scenarios, and compared to\u0000baseline approaches, LeGEND exhibits evident superiority in diversity of\u0000generated scenarios. Moreover, we also demonstrate the advantages of our\u0000two-phase transformation framework, and the accuracy of the adopted LLMs.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Impact of Code Comment Inconsistency on Bug Introducing","authors":"Shiva Radmanesh, Aaron Imani, Iftekhar Ahmed, Mohammad Moshirpour","doi":"arxiv-2409.10781","DOIUrl":"https://doi.org/arxiv-2409.10781","url":null,"abstract":"Code comments are essential for clarifying code functionality, improving\u0000readability, and facilitating collaboration among developers. Despite their\u0000importance, comments often become outdated, leading to inconsistencies with the\u0000corresponding code. This can mislead developers and potentially introduce bugs.\u0000Our research investigates the impact of code-comment inconsistency on bug\u0000introduction using large language models, specifically GPT-3.5. We first\u0000compare the performance of the GPT-3.5 model with other state-of-the-art\u0000methods in detecting these inconsistencies, demonstrating the superiority of\u0000GPT-3.5 in this domain. Additionally, we analyze the temporal evolution of\u0000code-comment inconsistencies and their effect on bug proneness over various\u0000timeframes using GPT-3.5 and Odds ratio analysis. Our findings reveal that\u0000inconsistent changes are around 1.5 times more likely to lead to a\u0000bug-introducing commit than consistent changes, highlighting the necessity of\u0000maintaining consistent and up-to-date comments in software development. This\u0000study provides new insights into the relationship between code-comment\u0000inconsistency and software quality, offering a comprehensive analysis of its\u0000impact over time, demonstrating that the impact of code-comment inconsistency\u0000on bug introduction is highest immediately after the inconsistency is\u0000introduced and diminishes over time.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Large-Scale Privacy Assessment of Android Third-Party SDKs","authors":"Mark Huasong Meng, Chuan Yan, Yun Hao, Qing Zhang, Zeyu Wang, Kailong Wang, Sin Gee Teo, Guangdong Bai, Jin Song Dong","doi":"arxiv-2409.10411","DOIUrl":"https://doi.org/arxiv-2409.10411","url":null,"abstract":"Third-party Software Development Kits (SDKs) are widely adopted in Android\u0000app development, to effortlessly accelerate development pipelines and enhance\u0000app functionality. However, this convenience raises substantial concerns about\u0000unauthorized access to users' privacy-sensitive information, which could be\u0000further abused for illegitimate purposes like user tracking or monetization.\u0000Our study offers a targeted analysis of user privacy protection among Android\u0000third-party SDKs, filling a critical gap in the Android software supply chain.\u0000It focuses on two aspects of their privacy practices, including data\u0000exfiltration and behavior-policy compliance (or privacy compliance), utilizing\u0000techniques of taint analysis and large language models. It covers 158\u0000widely-used SDKs from two key SDK release platforms, the official one and a\u0000large alternative one. From them, we identified 338 instances of privacy data\u0000exfiltration. On the privacy compliance, our study reveals that more than 30%\u0000of the examined SDKs fail to provide a privacy policy to disclose their data\u0000handling practices. Among those that provide privacy policies, 37% of them\u0000over-collect user data, and 88% falsely claim access to sensitive data. We\u0000revisit the latest versions of the SDKs after 12 months. Our analysis\u0000demonstrates a persistent lack of improvement in these concerning trends. Based\u0000on our findings, we propose three actionable recommendations to mitigate the\u0000privacy leakage risks and enhance privacy protection for Android users. Our\u0000research not only serves as an urgent call for industry attention but also\u0000provides crucial insights for future regulatory interventions.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Centralization potential of automotive E/E architectures","authors":"Lucas Mauser, Stefan Wagner","doi":"arxiv-2409.10690","DOIUrl":"https://doi.org/arxiv-2409.10690","url":null,"abstract":"Current automotive E/E architectures are subject to significant\u0000transformations: Computing-power-intensive advanced driver-assistance systems,\u0000bandwidth-hungry infotainment systems, the connection of the vehicle with the\u0000internet and the consequential need for cyber-security drives the\u0000centralization of E/E architectures. A centralized architecture is often seen\u0000as a key enabler to master those challenges. Available research focuses mostly\u0000on the different types of E/E architectures and contrasts their advantages and\u0000disadvantages. There is a research gap on guidelines for system designers and\u0000function developers to analyze the potential of their systems for\u0000centralization. The present paper aims to quantify centralization potential\u0000reviewing relevant literature and conducting qualitative interviews with\u0000industry practitioners. In literature, we identified seven key automotive\u0000system properties reaching limitations in current automotive architectures:\u0000busload, functional safety, computing power, feature dependencies, development\u0000and maintenance costs, error rate, modularity and flexibility. These properties\u0000serve as quantitative evaluation criteria to estimate whether centralization\u0000would enhance overall system performance. In the interviews, we have validated\u0000centralization and its fundament - the conceptual systems engineering - as\u0000capabilities to mitigate these limitations. By focusing on practical insights\u0000and lessons learned, this research provides system designers with actionable\u0000guidance to optimize their systems, addressing the outlined challenges while\u0000avoiding monolithic architecture. This paper bridges the gap between\u0000theoretical research and practical application, offering valuable takeaways for\u0000practitioners.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Semantic Versioning of Open Pre-trained Language Model Releases on Hugging Face","authors":"Adekunle Ajibode, Abdul Ali Bangash, Filipe Roseiro Cogo, Bram Adams, Ahmed E. Hassan","doi":"arxiv-2409.10472","DOIUrl":"https://doi.org/arxiv-2409.10472","url":null,"abstract":"The proliferation of open Pre-trained Language Models (PTLMs) on model\u0000registry platforms like Hugging Face (HF) presents both opportunities and\u0000challenges for companies building products around them. Similar to traditional\u0000software dependencies, PTLMs continue to evolve after a release. However, the\u0000current state of release practices of PTLMs on model registry platforms are\u0000plagued by a variety of inconsistencies, such as ambiguous naming conventions\u0000and inaccessible model training documentation. Given the knowledge gap on\u0000current PTLM release practices, our empirical study uses a mixed-methods\u0000approach to analyze the releases of 52,227 PTLMs on the most well-known model\u0000registry, HF. Our results reveal 148 different naming practices for PTLM\u0000releases, with 40.87% of changes to model weight files not represented in the\u0000adopted name-based versioning practice or their documentation. In addition, we\u0000identified that the 52,227 PTLMs are derived from only 299 different base\u0000models (the modified original models used to create 52,227 PTLMs), with\u0000Fine-tuning and Quantization being the most prevalent modification methods\u0000applied to these base models. Significant gaps in release transparency, in\u0000terms of training dataset specifications and model card availability, still\u0000exist, highlighting the need for standardized documentation. While we\u0000identified a model naming practice explicitly differentiating between major and\u0000minor PTLM releases, we did not find any significant difference in the types of\u0000changes that went into either type of releases, suggesting that major/minor\u0000version numbers for PTLMs often are chosen arbitrarily. Our findings provide\u0000valuable insights to improve PTLM release practices, nudging the field towards\u0000more formal semantic versioning practices.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Code Change with Micro-Changes","authors":"Lei Chen, Michele Lanza, Shinpei Hayashi","doi":"arxiv-2409.09923","DOIUrl":"https://doi.org/arxiv-2409.09923","url":null,"abstract":"A crucial activity in software maintenance and evolution is the comprehension\u0000of the changes performed by developers, when they submit a pull request and/or\u0000perform a commit on the repository. Typically, code changes are represented in\u0000the form of code diffs, textual representations highlighting the differences\u0000between two file versions, depicting the added, removed, and changed lines.\u0000This simplistic representation must be interpreted by developers, and mentally\u0000lifted to a higher abstraction level, that more closely resembles natural\u0000language descriptions, and eases the creation of a mental model of the changes.\u0000However, the textual diff-based representation is cumbersome, and the lifting\u0000requires considerable domain knowledge and programming skills. We present an\u0000approach, based on the concept of micro-change, to overcome these difficulties,\u0000translating code diffs into a series of pre-defined change operations, which\u0000can be described in natural language. We present a catalog of micro-changes,\u0000together with an automated micro-change detector. To evaluate our approach, we\u0000performed an empirical study on a large set of open-source repositories,\u0000focusing on a subset of our micro-change catalog, namely those related to\u0000changes affecting the conditional logic. We found that our detector is capable\u0000of explaining more than 67% of the changes taking place in the systems under\u0000study.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"eWAPA: An eBPF-based WASI Performance Analysis Framework for WebAssembly Runtimes","authors":"Chenxi Mao, Yuxin Su, Shiwen Shan, Dan Li","doi":"arxiv-2409.10252","DOIUrl":"https://doi.org/arxiv-2409.10252","url":null,"abstract":"WebAssembly (Wasm) is a low-level bytecode format that can run in modern\u0000browsers. With the development of standalone runtimes and the improvement of\u0000the WebAssembly System Interface (WASI), Wasm has further provided a more\u0000complete sandboxed runtime experience for server-side applications, effectively\u0000expanding its application scenarios. However, the implementation of WASI varies\u0000across different runtimes, and suboptimal interface implementations can lead to\u0000performance degradation during interactions between the runtime and the\u0000operating system. Existing research mainly focuses on overall performance\u0000evaluation of runtimes, while studies on WASI implementations are relatively\u0000scarce. To tackle this problem, we propose an eBPF-based WASI performance\u0000analysis framework. It collects key performance metrics of the runtime under\u0000different I/O load conditions, such as total execution time, startup time, WASI\u0000execution time, and syscall time. We can comprehensively analyze the\u0000performance of the runtime's I/O interactions with the operating system.\u0000Additionally, we provide a detailed analysis of the causes behind two specific\u0000WASI performance anomalies. These analytical results will guide the\u0000optimization of standalone runtimes and WASI implementations, enhancing their\u0000efficiency.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}