An exploratory study on automatic identification of assumptions in the development of deep learning frameworks

IF 1.4 4区计算机科学 Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Science of Computer Programming Pub Date : 2024-10-03 DOI:10.1016/j.scico.2024.103218

Chen Yang , Peng Liang , Zinan Ma

{"title":"An exploratory study on automatic identification of assumptions in the development of deep learning frameworks","authors":"Chen Yang , Peng Liang , Zinan Ma","doi":"10.1016/j.scico.2024.103218","DOIUrl":null,"url":null,"abstract":"<div><h3>Context</h3><div>Stakeholders constantly make assumptions in the development of deep learning (DL) frameworks. These assumptions are related to various types of software artifacts (e.g., requirements, design decisions, and technical debt) and can turn out to be invalid, leading to system failures. Existing approaches and tools for assumption management usually depend on manual identification of assumptions. However, assumptions are scattered in various sources (e.g., code comments, commits, pull requests, and issues) of DL framework development, and manually identifying assumptions has high costs (e.g., time and resources).</div></div><div><h3>Objective</h3><div>The objective of the study is to evaluate different classification models for the purpose of identification with respect to assumptions from the point of view of developers and users in the context of DL framework projects (i.e., issues, pull requests, and commits) on GitHub.</div></div><div><h3>Method</h3><div>First, we constructed a new and largest dataset (i.e., the AssuEval dataset) of assumptions collected from the TensorFlow and Keras repositories on GitHub. Then we explored the performance of seven non-transformers based models (e.g., Support Vector Machine, Classification and Regression Trees), the ALBERT model, and three decoder-only models (i.e., ChatGPT, Claude, and Gemini) for identifying assumptions on the AssuEval dataset.</div></div><div><h3>Results</h3><div>The study results show that ALBERT achieves the best performance (f1-score: 0.9584) for identifying assumptions on the AssuEval dataset, which is much better than the other models (the 2nd best f1-score is 0.8858, achieved by the Claude 3.5 Sonnet model). Though ChatGPT, Claude, and Gemini are popular models, we do not recommend using them to identify assumptions in DL framework development because of their low performance. Fine-tuning ChatGPT, Claude, Gemini, or other language models (e.g., Llama3, Falcon, and BLOOM) specifically for assumptions might improve their performance for assumption identification.</div></div><div><h3>Conclusions</h3><div>This study provides researchers with the largest dataset of assumptions for further research (e.g., assumption classification, evaluation, and reasoning) and helps researchers and practitioners better understand assumptions and how to manage them in their projects (e.g., selection of classification models for identifying assumptions).</div></div>","PeriodicalId":49561,"journal":{"name":"Science of Computer Programming","volume":"240 ","pages":"Article 103218"},"PeriodicalIF":1.4000,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Science of Computer Programming","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167642324001412","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

Abstract

Context

Stakeholders constantly make assumptions in the development of deep learning (DL) frameworks. These assumptions are related to various types of software artifacts (e.g., requirements, design decisions, and technical debt) and can turn out to be invalid, leading to system failures. Existing approaches and tools for assumption management usually depend on manual identification of assumptions. However, assumptions are scattered in various sources (e.g., code comments, commits, pull requests, and issues) of DL framework development, and manually identifying assumptions has high costs (e.g., time and resources).

Objective

The objective of the study is to evaluate different classification models for the purpose of identification with respect to assumptions from the point of view of developers and users in the context of DL framework projects (i.e., issues, pull requests, and commits) on GitHub.

Method

First, we constructed a new and largest dataset (i.e., the AssuEval dataset) of assumptions collected from the TensorFlow and Keras repositories on GitHub. Then we explored the performance of seven non-transformers based models (e.g., Support Vector Machine, Classification and Regression Trees), the ALBERT model, and three decoder-only models (i.e., ChatGPT, Claude, and Gemini) for identifying assumptions on the AssuEval dataset.

Results

The study results show that ALBERT achieves the best performance (f1-score: 0.9584) for identifying assumptions on the AssuEval dataset, which is much better than the other models (the 2nd best f1-score is 0.8858, achieved by the Claude 3.5 Sonnet model). Though ChatGPT, Claude, and Gemini are popular models, we do not recommend using them to identify assumptions in DL framework development because of their low performance. Fine-tuning ChatGPT, Claude, Gemini, or other language models (e.g., Llama3, Falcon, and BLOOM) specifically for assumptions might improve their performance for assumption identification.

Conclusions

This study provides researchers with the largest dataset of assumptions for further research (e.g., assumption classification, evaluation, and reasoning) and helps researchers and practitioners better understand assumptions and how to manage them in their projects (e.g., selection of classification models for identifying assumptions).

查看原文本刊更多论文

关于自动识别深度学习框架开发中的假设的探索性研究

背景利益相关者在开发深度学习（DL）框架的过程中不断做出假设。这些假设与各种类型的软件工件（如需求、设计决策和技术债务）相关，可能会被证明是无效的，从而导致系统故障。现有的假设管理方法和工具通常依赖于人工识别假设。然而，假设分散在 DL 框架开发的各种来源（如代码注释、提交、拉取请求和问题）中，手动识别假设的成本很高（如时间和资源）、方法首先，我们构建了一个新的、最大的数据集（即 AssuEval 数据集），其中包含从 GitHub 上的 TensorFlow 和 Keras 存储库中收集的假设。然后，我们探索了七个基于非变换器的模型（如支持向量机、分类树和回归树）、ALBERT 模型和三个纯解码器模型（即 ChatGPT、Claude、Gemma）的性能、研究结果表明，ALBERT 在 AssuEval 数据集上识别假设的性能最好（f1 分数：0.9584），远远优于其他模型（第二好的 f1 分数是 0.8858，由 Claude 3.5 Sonnet 模型实现）。虽然 ChatGPT、Claude 和 Gemini 是流行的模型，但由于它们的性能较低，我们不建议在 DL 框架开发中使用它们来识别假设。针对假设对 ChatGPT、Claude、Gemini 或其他语言模型（如 Llama3、Falcon 和 BLOOM）进行微调，可能会提高它们在假设识别方面的性能。结论这项研究为研究人员提供了最大的假设数据集，以用于进一步研究（如假设分类、评估和推理），并帮助研究人员和从业人员更好地理解假设以及如何在项目中管理假设（如选择用于识别假设的分类模型）。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Science of Computer Programming 工程技术-计算机：软件工程

CiteScore

3.80

自引率

0.00%

发文量

审稿时长

67 days

期刊介绍： Science of Computer Programming is dedicated to the distribution of research results in the areas of software systems development, use and maintenance, including the software aspects of hardware design. The journal has a wide scope ranging from the many facets of methodological foundations to the details of technical issues andthe aspects of industrial practice. The subjects of interest to SCP cover the entire spectrum of methods for the entire life cycle of software systems, including • Requirements, specification, design, validation, verification, coding, testing, maintenance, metrics and renovation of software; • Design, implementation and evaluation of programming languages; • Programming environments, development tools, visualisation and animation; • Management of the development process; • Human factors in software, software for social interaction, software for social computing; • Cyber physical systems, and software for the interaction between the physical and the machine; • Software aspects of infrastructure services, system administration, and network management.