{"title":"Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code!","authors":"Sebastian Schelter, Stefan Grafberger","doi":"arxiv-2409.10081","DOIUrl":null,"url":null,"abstract":"Machine learning (ML) applications that learn from data are increasingly used\nto automate impactful decisions. Unfortunately, these applications often fall\nshort of adequately managing critical data and complying with upcoming\nregulations. A technical reason for the persistence of these issues is that the\ndata pipelines in common ML libraries and cloud services lack fundamental\ndeclarative, data-centric abstractions. Recent research has shown how such\nabstractions enable techniques like provenance tracking and automatic\ninspection to help manage ML pipelines. Unfortunately, these approaches lack\nadoption in the real world because they require clean ML pipeline code written\nwith declarative APIs, instead of the messy imperative Python code that data\nscientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their\nestablished development practices. Instead, we propose to circumvent this \"code\nabstraction gap\" by leveraging the code generation capabilities of large\nlanguage models (LLMs). Our idea is to rewrite messy data science code to a\ncustom-tailored declarative pipeline abstraction, which we implement as a\nproof-of-concept in our prototype Lester. We detail its application for a\nchallenging compliance management example involving \"incremental view\nmaintenance\" of deployed ML pipelines. 
The code rewrites for our running\nexample show the potential of LLMs to make messy data science code declarative,\ne.g., by identifying hand-coded joins in Python and turning them into joins on\ndataframes, or by generating declarative feature encoders from NumPy code.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Databases","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Machine learning (ML) applications that learn from data are increasingly used
to automate impactful decisions. Unfortunately, these applications often fall
short of adequately managing critical data and complying with upcoming
regulations. A technical reason for the persistence of these issues is that the
data pipelines in common ML libraries and cloud services lack fundamental
declarative, data-centric abstractions. Recent research has shown how such
abstractions enable techniques like provenance tracking and automatic
inspection to help manage ML pipelines. Unfortunately, these approaches have
seen little real-world adoption because they require clean ML pipeline code written
with declarative APIs, instead of the messy imperative Python code that data
scientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their
established development practices. Instead, we propose to circumvent this "code
abstraction gap" by leveraging the code generation capabilities of large
language models (LLMs). Our idea is to rewrite messy data science code to a
custom-tailored declarative pipeline abstraction, which we implement as a
proof-of-concept in our prototype Lester. We detail its application to a
challenging compliance management example involving "incremental view
maintenance" of deployed ML pipelines. The code rewrites for our running
example show the potential of LLMs to make messy data science code declarative,
e.g., by identifying hand-coded joins in Python and turning them into joins on
dataframes, or by generating declarative feature encoders from NumPy code.
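To make the flavor of such a rewrite concrete, here is a minimal illustrative sketch of the before/after pattern the abstract describes. The function names and toy data are hypothetical, not Lester's actual output; the point is only that a hand-coded nested-loop join and a NumPy one-hot loop have direct declarative counterparts on dataframes.

```python
import numpy as np
import pandas as pd

# "Messy" imperative style: a hand-coded nested-loop equi-join
# over lists of dicts, as it might appear in ad-hoc pipeline code.
def join_by_hand(orders, customers):
    result = []
    for o in orders:
        for c in customers:
            if o["customer_id"] == c["id"]:
                result.append({**o, "name": c["name"]})
    return result

# Declarative rewrite: the same join expressed on dataframes,
# making the relational operation visible to provenance tracking.
def join_declarative(orders_df, customers_df):
    return orders_df.merge(
        customers_df.rename(columns={"id": "customer_id"}),
        on="customer_id", how="inner",
    )

# Hand-coded one-hot encoding in NumPy, with a manual category index.
def one_hot_by_hand(values):
    cats = sorted(set(values))
    mat = np.zeros((len(values), len(cats)))
    for i, v in enumerate(values):
        mat[i, cats.index(v)] = 1.0
    return mat

orders = [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 11}]
customers = [{"id": 10, "name": "Ada"}, {"id": 11, "name": "Grace"}]

manual = join_by_hand(orders, customers)
declarative = join_declarative(pd.DataFrame(orders), pd.DataFrame(customers))
assert manual == declarative.to_dict("records")

# Declarative encoder counterpart: get_dummies sorts categories,
# matching the sorted() in the hand-coded loop above.
colors = ["blue", "red", "blue"]
assert np.array_equal(one_hot_by_hand(colors),
                      pd.get_dummies(pd.Series(colors)).to_numpy(dtype=float))
```

The equivalence checks at the bottom illustrate why such rewrites are attractive: once the join and the encoder are declarative, downstream tooling can reason about them as relational and feature-encoding operations rather than opaque Python loops.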