{"title":"Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code!","authors":"Sebastian Schelter, Stefan Grafberger","doi":"arxiv-2409.10081","DOIUrl":"https://doi.org/arxiv-2409.10081","url":null,"abstract":"Machine learning (ML) applications that learn from data are increasingly used\u0000to automate impactful decisions. Unfortunately, these applications often fall\u0000short of adequately managing critical data and complying with upcoming\u0000regulations. A technical reason for the persistence of these issues is that the\u0000data pipelines in common ML libraries and cloud services lack fundamental\u0000declarative, data-centric abstractions. Recent research has shown how such\u0000abstractions enable techniques like provenance tracking and automatic\u0000inspection to help manage ML pipelines. Unfortunately, these approaches lack\u0000adoption in the real world because they require clean ML pipeline code written\u0000with declarative APIs, instead of the messy imperative Python code that data\u0000scientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their\u0000established development practices. Instead, we propose to circumvent this \"code\u0000abstraction gap\" by leveraging the code generation capabilities of large\u0000language models (LLMs). Our idea is to rewrite messy data science code to a\u0000custom-tailored declarative pipeline abstraction, which we implement as a\u0000proof-of-concept in our prototype Lester. We detail its application for a\u0000challenging compliance management example involving \"incremental view\u0000maintenance\" of deployed ML pipelines. The code rewrites for our running\u0000example show the potential of LLMs to make messy data science code declarative,\u0000e.g., by identifying hand-coded joins in Python and turning them into joins on\u0000dataframes, or by generating declarative feature encoders from NumPy code.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of Data Evaluation Benchmark for Data Wrangling Recommendation System","authors":"Yuqing Wang, Anna Fariha","doi":"arxiv-2409.10635","DOIUrl":"https://doi.org/arxiv-2409.10635","url":null,"abstract":"CoWrangler is a data-wrangling recommender system designed to streamline data\u0000processing tasks. Recognizing that data processing is often time-consuming and\u0000complex for novice users, we aim to simplify the decision-making process\u0000regarding the most effective subsequent data operation. By analyzing over\u000010,000 Kaggle notebooks spanning approximately 1,000 datasets, we derive\u0000insights into common data processing strategies employed by users across\u0000various tasks. This analysis helps us understand how dataset quality influences\u0000wrangling operations, informing our ongoing efforts to possibly expand our\u0000dataset sources in the future.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and Adaptive Bulk Loading of Multidimensional Points","authors":"Moin Hussain Moti, Dimitris Papadias","doi":"arxiv-2409.09447","DOIUrl":"https://doi.org/arxiv-2409.09447","url":null,"abstract":"Existing methods for bulk loading disk-based multidimensional points involve\u0000multiple applications of external sorting. In this paper, we propose techniques\u0000that apply linear scan, and are therefore significantly faster. The resulting\u0000FMBI Index possesses several desirable properties, including almost full and\u0000square nodes with zero overlap, and has excellent query performance. As a\u0000second contribution, we develop an adaptive version AMBI, which utilizes the\u0000query workload to build a partial index only for parts of the data space that\u0000contain query results. Finally, we extend FMBI and AMBI to parallel bulk\u0000loading and query processing in distributed systems. An extensive experimental\u0000evaluation with real datasets confirms that FMBI and AMBI clearly outperform\u0000competitors in terms of combined index construction and query processing cost,\u0000sometimes by orders of magnitude.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Matrix Profile for Anomaly Detection on Multidimensional Time Series","authors":"Chin-Chia Michael Yeh, Audrey Der, Uday Singh Saini, Vivian Lai, Yan Zheng, Junpeng Wang, Xin Dai, Zhongfang Zhuang, Yujie Fan, Huiyuan Chen, Prince Osei Aboagye, Liang Wang, Wei Zhang, Eamonn Keogh","doi":"arxiv-2409.09298","DOIUrl":"https://doi.org/arxiv-2409.09298","url":null,"abstract":"The Matrix Profile (MP), a versatile tool for time series data mining, has\u0000been shown effective in time series anomaly detection (TSAD). This paper delves\u0000into the problem of anomaly detection in multidimensional time series, a common\u0000occurrence in real-world applications. For instance, in a manufacturing\u0000factory, multiple sensors installed across the site collect time-varying data\u0000for analysis. The Matrix Profile, named for its role in profiling the matrix\u0000storing pairwise distance between subsequences of univariate time series,\u0000becomes complex in multidimensional scenarios. If the input univariate time\u0000series has n subsequences, the pairwise distance matrix is a n x n matrix. In a\u0000multidimensional time series with d dimensions, the pairwise distance\u0000information must be stored in a n x n x d tensor. In this paper, we first\u0000analyze different strategies for condensing this tensor into a profile vector.\u0000We then investigate the potential of extending the MP to efficiently find\u0000k-nearest neighbors for anomaly detection. Finally, we benchmark the\u0000multidimensional MP against 19 baseline methods on 119 multidimensional TSAD\u0000datasets. The experiments covers three learning setups: unsupervised,\u0000supervised, and semi-supervised. MP is the only method that consistently\u0000delivers high performance across all setups.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Systematic Review on Process Mining for Curricular Analysis","authors":"Daniel Calegari, Andrea Delgado","doi":"arxiv-2409.09204","DOIUrl":"https://doi.org/arxiv-2409.09204","url":null,"abstract":"Educational Process Mining (EPM) is a data analysis technique that is used to\u0000improve educational processes. It is based on Process Mining (PM), which\u0000involves gathering records (logs) of events to discover process models and\u0000analyze the data from a process-centric perspective. One specific application\u0000of EPM is curriculum mining, which focuses on understanding the learning\u0000program students follow to achieve educational goals. This is important for\u0000institutional curriculum decision-making and quality improvement. Therefore,\u0000academic institutions can benefit from organizing the existing techniques,\u0000capabilities, and limitations. We conducted a systematic literature review to\u0000identify works on applying PM to curricular analysis and provide insights for\u0000further research. From the analysis of 22 primary studies, we found that\u0000results can be classified into five categories concerning the objectives they\u0000pursue: the discovery of educational trajectories, the identification of\u0000deviations in the observed behavior of students, the analysis of bottlenecks,\u0000the analysis of stopout and dropout problems, and the generation of\u0000recommendation. Moreover, we identified some open challenges and opportunities,\u0000such as standardizing for replicating studies to perform cross-university\u0000curricular analysis and strengthening the connection between PM and data mining\u0000for improving curricular analysis.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extending predictive process monitoring for collaborative processes","authors":"Daniel Calegari, Andrea Delgado","doi":"arxiv-2409.09212","DOIUrl":"https://doi.org/arxiv-2409.09212","url":null,"abstract":"Process mining on business process execution data has focused primarily on\u0000orchestration-type processes performed in a single organization\u0000(intra-organizational). Collaborative (inter-organizational) processes, unlike\u0000those of orchestration type, expand several organizations (for example, in\u0000e-Government), adding complexity and various challenges both for their\u0000implementation and for their discovery, prediction, and analysis of their\u0000execution. Predictive process monitoring is based on exploiting execution data\u0000from past instances to predict the execution of current cases. It is possible\u0000to make predictions on the next activity and remaining time, among others, to\u0000anticipate possible deviations, violations, and delays in the processes to take\u0000preventive measures (e.g., re-allocation of resources). In this work, we\u0000propose an extension for collaborative processes of traditional process\u0000prediction, considering particularities of this type of process, which add\u0000information of interest in this context, for example, the next activity of\u0000which participant or the following message to be exchanged between two\u0000participants.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPconv: Super-Polynomially Faster Join Ordering","authors":"Mihail Stoian, Andreas Kipf","doi":"arxiv-2409.08013","DOIUrl":"https://doi.org/arxiv-2409.08013","url":null,"abstract":"We revisit the join ordering problem in query optimization. The standard\u0000exact algorithm, DPccp, has a worst-case running time of $O(3^n)$. This is\u0000prohibitively expensive for large queries, which are not that uncommon anymore.\u0000We develop a new algorithmic framework based on subset convolution. DPconv\u0000achieves a super-polynomial speedup over DPccp, breaking the $O(3^n)$\u0000time-barrier for the first time. We show that the instantiation of our\u0000framework for the $C_max$ cost function is up to 30x faster than DPccp for\u0000large clique queries.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ranked Enumeration for Database Queries","authors":"Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald","doi":"arxiv-2409.08142","DOIUrl":"https://doi.org/arxiv-2409.08142","url":null,"abstract":"Ranked enumeration is a query-answering paradigm where the query answers are\u0000returned incrementally in order of importance (instead of returning all answers\u0000at once). Importance is defined by a ranking function that can be specific to\u0000the application, but typically involves either a lexicographic order (e.g.,\u0000\"ORDER BY R.A, S.B\" in SQL) or a weighted sum of attributes (e.g., \"ORDER BY\u00003*R.A + 2*S.B\"). We recently introduced any-k algorithms for (multi-way) join\u0000queries, which push ranking into joins and avoid materializing intermediate\u0000results until necessary. The top-ranked answers are returned asymptotically\u0000faster than the common join-then-rank approach of database systems, resulting\u0000in orders-of-magnitude speedup in practice. In addition to their practical usefulness, our techniques complement a long\u0000line of theoretical research on unranked enumeration, where answers are also\u0000returned incrementally, but with no explicit ordering requirement. For a broad\u0000class of ranking functions with certain monotonicity properties, including\u0000lexicographic orders and sum-based rankings, the ordering requirement\u0000surprisingly does not increase the asymptotic time or space complexity, apart\u0000from logarithmic factors. A key insight of our work is the connection between ranked enumeration for\u0000database queries and the fundamental task of computing the kth-shortest path in\u0000a graph. Uncovering these connections allowed us to ground our approach in the\u0000rich literature of that problem and connect ideas that had been explored in\u0000isolation before. In this article, we adopt a pragmatic approach and present a\u0000slightly simplified version of the algorithm without the shortest-path\u0000interpretation. We believe that this will benefit practitioners looking to\u0000implement and optimize any-k approaches.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"meds_reader: A fast and efficient EHR processing library","authors":"Ethan Steinberg, Michael Wornow, Suhana Bedi, Jason Alan Fries, Matthew B. A. McDermott, Nigam H. Shah","doi":"arxiv-2409.09095","DOIUrl":"https://doi.org/arxiv-2409.09095","url":null,"abstract":"The growing demand for machine learning in healthcare requires processing\u0000increasingly large electronic health record (EHR) datasets, but existing\u0000pipelines are not computationally efficient or scalable. In this paper, we\u0000introduce meds_reader, an optimized Python package for efficient EHR data\u0000processing that is designed to take advantage of many intrinsic properties of\u0000EHR data for improved speed. We then demonstrate the benefits of meds_reader by\u0000reimplementing key components of two major EHR processing pipelines, achieving\u000010-100x improvements in memory, speed, and disk usage. The code for meds_reader\u0000can be found at https://github.com/som-shahlab/meds_reader.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"echemdb Toolkit -- a Lightweight Approach to Getting Data Ready for Data Management Solutions","authors":"Albert K. Engstfeld, Johannes M. Hermann, Nicolas G. Hörmann, Julian Rüth","doi":"arxiv-2409.07083","DOIUrl":"https://doi.org/arxiv-2409.07083","url":null,"abstract":"According to the FAIR (findability, accessibility, interoperability, and\u0000reusability) principles, scientific data should always be stored with\u0000machine-readable descriptive metadata. Existing solutions to store data with\u0000metadata, such as electronic lab notebooks (ELN), are often very\u0000domain-specific and not sufficiently generic for arbitrary experimental or\u0000computational results. In this work, we present open-source echemdb toolkit for creating and\u0000handling data and metadata. The toolkit is running entirely on the file system\u0000level using a file-based approach, which facilitates integration with other\u0000tools in a FAIR data life cycle and means that no complicated server setup is\u0000required. This also makes the toolkit more accessible to the average researcher\u0000since no understanding of more sophisticated database technologies is required. We showcase several aspects and applications of the toolkit: automatic\u0000annotation of raw research data with human- and machine-readable metadata, data\u0000conversion into standardised frictionless Data Packages, and an API for\u0000exploring the data. We also illustrate the web frameworks to illustrate the\u0000data using example data from research into energy conversion and storage.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}