Why Do Machine Learning Notebooks Crash? An Empirical Study on Public Python Jupyter Notebooks

IF 5.6 · Zone 1, Computer Science · JCR Q1, Computer Science, Software Engineering

IEEE Transactions on Software Engineering Pub Date : 2025-06-03 DOI:10.1109/TSE.2025.3574500

Yiran Wang;Willem Meijer;José Antonio Hernández López;Ulf Nilsson;Dániel Varró

IEEE Transactions on Software Engineering, vol. 51, no. 7, pp. 2181-2196. Journal Article. Full text: https://ieeexplore.ieee.org/document/11022755/

Citations: 0

Abstract

Jupyter notebooks have become central in data science, integrating code, text and output in a flexible environment. With the rise of machine learning (ML), notebooks are increasingly used for prototyping and data analysis. However, due to their dependence on complex ML libraries and the flexible notebook semantics that allow cells to be run in any order, notebooks are susceptible to software bugs that may lead to program crashes. This paper presents a comprehensive empirical study focusing on crashes in publicly available Python ML notebooks. We collect 64,031 notebooks containing 92,542 crashes from GitHub and Kaggle, and manually analyze a sample of 746 crashes across various aspects, including crash types and root causes. Our analysis identifies unique ML-specific crash types, such as tensor shape mismatches and dataset value errors that violate API constraints. Additionally, we highlight unique root causes tied to notebook semantics, including out-of-order execution and residual errors from previous cells, which have been largely overlooked in prior research. Furthermore, we identify the most error-prone ML libraries, and analyze crash distribution across ML pipeline stages. We find that over 40% of crashes stem from API misuse and notebook-specific issues. Crashes frequently occur when using ML libraries like TensorFlow/Keras and Torch. Additionally, over 70% of the crashes occur during data preparation, model training, and evaluation or prediction stages of the ML pipeline, while data visualization errors tend to be unique to ML notebooks.
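The out-of-order execution root cause mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the three cells, the `CELLS` mapping, and the `run_cells` helper are hypothetical, and cells are modeled simply as source strings executed against one shared namespace, roughly the way notebook cells share a kernel's global state.

```python
# Hypothetical three-cell notebook (an assumption for illustration only).
# Each cell is a source string; all cells share one namespace, much like
# notebook cells share the kernel's global state.
CELLS = {
    1: "data = [1.0, 2.0, 3.0]",
    2: "mean = sum(data) / len(data)",
    3: "centered = [x - mean for x in data]",
}


def run_cells(order):
    """Run cells in the given order.

    Returns (namespace, crash), where crash is None if every cell ran,
    or a short description of the first exception otherwise.
    """
    namespace = {}
    for cell_id in order:
        try:
            exec(CELLS[cell_id], namespace)
        except Exception as exc:  # stop at the first crashing cell
            return namespace, f"cell {cell_id}: {type(exc).__name__}: {exc}"
    return namespace, None


# Top-to-bottom execution: every name is defined before it is used.
_, crash_in_order = run_cells([1, 2, 3])

# Cell 3 is run before cell 2, so `mean` does not exist yet.
_, crash_out_of_order = run_cells([1, 3, 2])

print("in order:    ", crash_in_order)
print("out of order:", crash_out_of_order)
```

The same three cells succeed or crash depending only on the order in which they are run, which is why this kind of failure does not arise in a script that always executes top to bottom.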

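Source journal

IEEE Transactions on Software Engineering (Engineering and Technology: Engineering, Electrical and Electronic)

CiteScore: 9.70

Self-citation rate: 10.80%

Publication volume: 724

Review time: 6 months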

About the journal: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of the Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:

a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.