{"title":"Insights from Benchmarking Frontier Language Models on Web App Code Generation","authors":"Yi Cui","doi":"arxiv-2409.05177","DOIUrl":"https://doi.org/arxiv-2409.05177","url":null,"abstract":"This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models possess similar underlying knowledge, their performance is differentiated by the frequency of mistakes they make. By analyzing lines of code (LOC) and failure distributions, we find that writing correct code is more complex than generating incorrect code. Furthermore, prompt engineering shows limited efficacy in reducing errors beyond specific cases. These findings suggest that further advancements in coding LLMs should emphasize model reliability and mistake minimization.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KModels: Unlocking AI for Business Applications","authors":"Roy Abitbol (IBM Research Israel), Eyal Cohen (IBM Research Israel), Muhammad Kanaan (IBM Research Israel), Bhavna Agrawal (IBM Research USA), Yingjie Li (IBM Research USA), Anuradha Bhamidipaty (IBM Research USA), Erez Bilgory (IBM Research Israel)","doi":"arxiv-2409.05919","DOIUrl":"https://doi.org/arxiv-2409.05919","url":null,"abstract":"As artificial intelligence (AI) continues to rapidly advance, there is a growing demand to integrate AI capabilities into existing business applications. However, a significant gap exists between the rapid progress in AI and how slowly AI is being embedded into business environments. Deploying well-performing lab models into production settings, especially in on-premise environments, often entails specialized expertise and imposes a heavy burden of model management, creating significant barriers to implementing AI models in real-world applications. KModels leverages proven libraries and platforms (Kubeflow Pipelines, KServe) to streamline AI adoption by supporting both AI developers and consumers. It allows model developers to focus solely on model development and share models as transportable units (Templates), abstracting away complex production deployment concerns. KModels enables AI consumers to eliminate the need for a dedicated data scientist, as the templates encapsulate most data science considerations while providing business-oriented control. This paper presents the architecture of KModels and the key decisions that shape it. We outline KModels' main components as well as its interfaces. Furthermore, we explain how KModels is highly suited for on-premise deployment but can also be used in cloud environments. The efficacy of KModels is demonstrated through the successful deployment of three AI models within an existing Work Order Management system. These models operate in a client's data center and are trained on local data, without data scientist intervention. One model improved the accuracy of Failure Code specification for work orders from 46% to 83%, showcasing the substantial benefit of accessible and localized AI solutions.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Role of Cultural Values in Adopting Large Language Models for Software Engineering","authors":"Stefano Lambiase, Gemma Catolino, Fabio Palomba, Filomena Ferrucci, Daniel Russo","doi":"arxiv-2409.05055","DOIUrl":"https://doi.org/arxiv-2409.05055","url":null,"abstract":"As a socio-technical activity, software development involves the close interconnection of people and technology. The integration of Large Language Models (LLMs) into this process exemplifies the socio-technical nature of software development. Although LLMs influence the development process, software development remains fundamentally human-centric, necessitating an investigation of the human factors in this adoption. Thus, with this study we explore the factors influencing the adoption of LLMs in software development, focusing on the role of professionals' cultural values. Guided by the Unified Theory of Acceptance and Use of Technology (UTAUT2) and Hofstede's cultural dimensions, we hypothesized that cultural values moderate the relationships within the UTAUT2 framework. Using Partial Least Squares-Structural Equation Modelling and data from 188 software engineers, we found that habit and performance expectancy are the primary drivers of LLM adoption, while cultural values do not significantly moderate this process. These findings suggest that, by highlighting how LLMs can boost performance and efficiency, organizations can encourage their use, no matter the cultural differences. Practical steps include offering training programs to demonstrate LLM benefits, creating a supportive environment for regular use, and continuously tracking and sharing performance improvements from using LLMs.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"102 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OSS License Identification at Scale: A Comprehensive Dataset Using World of Code","authors":"Mahmoud Jahanshahi, David Reid, Adam McDaniel, Audris Mockus","doi":"arxiv-2409.04824","DOIUrl":"https://doi.org/arxiv-2409.04824","url":null,"abstract":"The proliferation of open source software (OSS) has led to a complex landscape of licensing practices, making accurate license identification crucial for legal and compliance purposes. This study presents a comprehensive analysis of OSS licenses using the World of Code (WoC) infrastructure. We employ an exhaustive approach, scanning all files containing \"license\" in their filepath, and apply the winnowing algorithm for robust text matching. Our method identifies and matches over 5.5 million distinct license blobs across millions of OSS projects, creating a detailed project-to-license (P2L) map. We verify the accuracy of our approach through stratified sampling and manual review, achieving a final accuracy of 92.08%, with precision of 87.14%, recall of 95.45%, and an F1 score of 91.11%. This work enhances the understanding of OSS licensing practices and provides a valuable resource for developers, researchers, and legal professionals. Future work will expand the scope of license detection to include code files and references to licenses in project documentation.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
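The precision, recall, and F1 figures reported in the record above can be cross-checked, because F1 is the harmonic mean of precision and recall. The following is a minimal sketch of that check; the function name `f1_score` is illustrative and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, converted from percentages to fractions.
f1 = f1_score(0.8714, 0.9545)
print(f"{f1 * 100:.2f}%")  # prints 91.11%
```

The result agrees with the F1 score of 91.11% stated in the abstract.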
{"title":"MILE: A Mutation Testing Framework of In-Context Learning Systems","authors":"Zeming Wei, Yihao Zhang, Meng Sun","doi":"arxiv-2409.04831","DOIUrl":"https://doi.org/arxiv-2409.04831","url":null,"abstract":"In-context Learning (ICL) has achieved notable success in the applications of large language models (LLMs). By adding only a few input-output pairs that demonstrate a new task, the LLM can efficiently learn the task during inference without modifying the model parameters. Such mysterious ability of LLMs has attracted great research interest in understanding, formatting, and improving the in-context demonstrations, while still suffering from drawbacks like black-box mechanisms and sensitivity to the selection of examples. In this work, inspired by the foundations of adopting testing techniques in machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. First, we propose several mutation operators specialized for ICL demonstrations, as well as corresponding mutation scores for ICL test sets. With comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at https://github.com/weizeming/MILE.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
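To make the idea of a mutation score for an ICL test set concrete, here is a small sketch under stated assumptions. It is not MILE's implementation: the label-flip operator, the `toy_model` stand-in for an LLM, and the "killed if any test prediction changes" rule are all hypothetical choices that follow the classical mutation-testing definition (killed mutants divided by total mutants).

```python
from typing import Callable, List, Tuple

Demo = Tuple[str, str]                    # (input text, label)
Model = Callable[[List[Demo], str], str]  # (demonstrations, query) -> prediction


def toy_model(demos: List[Demo], query: str) -> str:
    """Stand-in for an LLM: copy the label of the demo sharing the most words with the query."""
    q = set(query.split())
    best = max(demos, key=lambda d: len(q & set(d[0].split())))
    return best[1]


def flip_label_mutants(demos: List[Demo], labels: List[str]) -> List[List[Demo]]:
    """Hypothetical mutation operator: one mutant per demo, with that demo's label changed."""
    mutants = []
    for i, (text, label) in enumerate(demos):
        other = next(l for l in labels if l != label)
        mutants.append(demos[:i] + [(text, other)] + demos[i + 1:])
    return mutants


def mutation_score(model: Model, demos: List[Demo],
                   mutants: List[List[Demo]], tests: List[str]) -> float:
    """Fraction of mutants 'killed', i.e. at least one test query's prediction changes."""
    baseline = [model(demos, t) for t in tests]
    killed = sum(
        any(model(m, t) != b for t, b in zip(tests, baseline))
        for m in mutants
    )
    return killed / len(mutants)


demos = [("great movie", "pos"), ("terrible plot", "neg"), ("boring acting", "neg")]
mutants = flip_label_mutants(demos, ["pos", "neg"])

# A test set that only exercises the first demonstration kills one of three mutants.
weak = mutation_score(toy_model, demos, mutants, ["great fun"])
# A test set that exercises every demonstration kills all three.
strong = mutation_score(toy_model, demos, mutants,
                        ["great fun", "terrible pacing", "boring script"])
print(weak, strong)
```

In this toy setting a higher score indicates a test set that is more sensitive to faults injected into the demonstrations, which is the intuition behind using mutation scores to assess test-set quality.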
{"title":"Reducing Events to Augment Log-based Anomaly Detection Models: An Empirical Study","authors":"Lingzhe Zhang, Tong Jia, Kangjin Wang, Mengxi Jia, Yang Yong, Ying Li","doi":"arxiv-2409.04834","DOIUrl":"https://doi.org/arxiv-2409.04834","url":null,"abstract":"As software systems grow increasingly intricate, the precise detection of anomalies has become both essential and challenging. Current log-based anomaly detection methods depend heavily on vast amounts of log data, leading to inefficient inference and potential misguidance by noise logs. However, the quantitative effects of log reduction on the effectiveness of anomaly detection remain unexplored. Therefore, we first conduct a comprehensive study on six distinct models spanning three datasets. Through the study, the impact of log quantity and their effectiveness in representing anomalies is quantified, uncovering three distinctive log event types that differently influence model performance. Drawing from these insights, we propose LogCleaner: an efficient methodology for the automatic reduction of log events in the context of anomaly detection. Serving as middleware between software systems and models, LogCleaner continuously updates and filters anti-events and duplicative-events in the raw generated logs. Experimental outcomes highlight LogCleaner's capability to reduce over 70% of log events in anomaly detection, accelerating the model's inference speed by approximately 300%, and universally improving the performance of models for anomaly detection.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond Dependencies: The Role of Copy-Based Reuse in Open Source Software Development","authors":"Mahmoud Jahanshahi, David Reid, Audris Mockus","doi":"arxiv-2409.04830","DOIUrl":"https://doi.org/arxiv-2409.04830","url":null,"abstract":"In Open Source Software, resources of any project are open for reuse by introducing dependencies or copying the resource itself. In contrast to dependency-based reuse, the infrastructure to systematically support copy-based reuse appears to be entirely missing. Our aim is to enable future research and tool development to increase efficiency and reduce the risks of copy-based reuse. We seek a better understanding of such reuse by measuring its prevalence and identifying factors affecting the propensity to reuse. To identify reused artifacts and trace their origins, our method exploits World of Code infrastructure. We begin with a set of theory-derived factors related to the propensity to reuse, sample instances of different reuse types, and survey developers to better understand their intentions. Our results indicate that copy-based reuse is common, with many developers being aware of it when writing code. The propensity for a file to be reused varies greatly among languages and between source code and binary files, consistently decreasing over time. Files introduced by popular projects are more likely to be reused, but at least half of reused resources originate from \"small\" and \"medium\" projects. Developers had various reasons for reuse but were generally positive about using a package manager.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring User Privacy Awareness on GitHub: An Empirical Study","authors":"Costanza Alfieri, Juri Di Rocco, Phuong T. Nguyen, Paola Inverardi","doi":"arxiv-2409.04048","DOIUrl":"https://doi.org/arxiv-2409.04048","url":null,"abstract":"GitHub provides developers with a practical way to distribute source code and collaboratively work on common projects. To enhance account security and privacy, GitHub allows its users to manage access permissions, review audit logs, and enable two-factor authentication. However, despite the endless effort, the platform still faces various issues related to the privacy of its users. This paper presents an empirical study delving into the GitHub ecosystem. Our focus is on investigating the utilization of privacy settings on the platform and identifying various types of sensitive information disclosed by users. Leveraging a dataset comprising 6,132 developers, we report and analyze their activities by means of comments on pull requests. Our findings indicate an active engagement by users with the available privacy settings on GitHub. Notably, we observe the disclosure of different forms of private information within pull request comments. This observation has prompted our exploration into sensitivity detection using a large language model and BERT, to pave the way for a personalized privacy assistant. Our work provides insights into the utilization of existing privacy protection tools, such as privacy settings, along with their inherent limitations. Essentially, we aim to advance research in this field by providing both the motivation for creating such privacy protection tools and a proposed methodology for personalizing them.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Buggy Contracts via Smart Testing","authors":"Sally Junsong Wang, Jianan Yao, Kexin Pei, Hidedaki Takahashi, Junfeng Yang","doi":"arxiv-2409.04597","DOIUrl":"https://doi.org/arxiv-2409.04597","url":null,"abstract":"Smart contracts are susceptible to critical vulnerabilities. Hybrid dynamic analyses, such as concolic execution assisted fuzzing and foundation model assisted fuzzing, have emerged as highly effective testing techniques for smart contract bug detection recently. This hybrid approach has shown initial promise in real-world benchmarks, but it still suffers from low scalability to find deep bugs buried in complex code patterns. We observe that performance bottlenecks of existing dynamic analyses and model hallucination are two main factors limiting the scalability of this hybrid approach in finding deep bugs. To overcome the challenges, we design an interactive, self-deciding foundation model based system, called SmartSys, to support hybrid smart contract dynamic analyses. The key idea is to teach foundation models about performance bottlenecks of different dynamic analysis techniques, making it possible to forecast the right technique and generate effective fuzz targets that can reach deep, hidden bugs. To prune hallucinated, incorrect fuzz targets, SmartSys feeds foundation models with feedback from dynamic analysis during compilation and at runtime. The interesting results of SmartSys include: i) discovering a smart contract protocol vulnerability that has escaped eleven tools and survived multiple audits for over a year; ii) improving coverage by up to 14.3% on real-world benchmarks compared to the baselines.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation","authors":"Luis Mayer, Christian Heumann, Matthias Aßenmacher","doi":"arxiv-2409.04164","DOIUrl":"https://doi.org/arxiv-2409.04164","url":null,"abstract":"In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five different state-of-the-art LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their capabilities for text-to-code generation. In an empirical study, we feed prompts with textual descriptions of coding problems sourced from the programming website LeetCode to the models with the task of creating solutions in Python. Subsequently, the quality of the generated outputs is assessed using the testing functionalities of LeetCode. The results indicate large differences in performance between the investigated models. ChatGPT can handle these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama. To gain further insights, we measure the runtime as well as the memory usage of the generated outputs and compare them to the other code submissions on LeetCode. A detailed error analysis, encompassing a comparison of the differences concerning correct indentation and form of the generated code as well as an assignment of the incorrectly solved tasks to certain error categories, allows us to obtain a more nuanced picture of the results and potential for improvement. The results also show a clear pattern of increasingly incorrect generated code when the models face a lot of context in the form of longer prompts.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"438 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}