{"title":"Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code!","authors":"Sebastian Schelter, Stefan Grafberger","doi":"arxiv-2409.10081","DOIUrl":"https://doi.org/arxiv-2409.10081","url":null,"abstract":"Machine learning (ML) applications that learn from data are increasingly used\u0000to automate impactful decisions. Unfortunately, these applications often fall\u0000short of adequately managing critical data and complying with upcoming\u0000regulations. A technical reason for the persistence of these issues is that the\u0000data pipelines in common ML libraries and cloud services lack fundamental\u0000declarative, data-centric abstractions. Recent research has shown how such\u0000abstractions enable techniques like provenance tracking and automatic\u0000inspection to help manage ML pipelines. Unfortunately, these approaches lack\u0000adoption in the real world because they require clean ML pipeline code written\u0000with declarative APIs, instead of the messy imperative Python code that data\u0000scientists typically write for data preparation. We argue that it is unrealistic to expect data scientists to change their\u0000established development practices. Instead, we propose to circumvent this \"code\u0000abstraction gap\" by leveraging the code generation capabilities of large\u0000language models (LLMs). Our idea is to rewrite messy data science code to a\u0000custom-tailored declarative pipeline abstraction, which we implement as a\u0000proof-of-concept in our prototype Lester. We detail its application for a\u0000challenging compliance management example involving \"incremental view\u0000maintenance\" of deployed ML pipelines. The code rewrites for our running\u0000example show the potential of LLMs to make messy data science code declarative,\u0000e.g., by identifying hand-coded joins in Python and turning them into joins on\u0000dataframes, or by generating declarative feature encoders from NumPy code.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of Data Evaluation Benchmark for Data Wrangling Recommendation System","authors":"Yuqing Wang, Anna Fariha","doi":"arxiv-2409.10635","DOIUrl":"https://doi.org/arxiv-2409.10635","url":null,"abstract":"CoWrangler is a data-wrangling recommender system designed to streamline data\u0000processing tasks. Recognizing that data processing is often time-consuming and\u0000complex for novice users, we aim to simplify the decision-making process\u0000regarding the most effective subsequent data operation. By analyzing over\u000010,000 Kaggle notebooks spanning approximately 1,000 datasets, we derive\u0000insights into common data processing strategies employed by users across\u0000various tasks. This analysis helps us understand how dataset quality influences\u0000wrangling operations, informing our ongoing efforts to possibly expand our\u0000dataset sources in the future.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and Adaptive Bulk Loading of Multidimensional Points","authors":"Moin Hussain Moti, Dimitris Papadias","doi":"arxiv-2409.09447","DOIUrl":"https://doi.org/arxiv-2409.09447","url":null,"abstract":"Existing methods for bulk loading disk-based multidimensional points involve\u0000multiple applications of external sorting. In this paper, we propose techniques\u0000that apply linear scan, and are therefore significantly faster. The resulting\u0000FMBI Index possesses several desirable properties, including almost full and\u0000square nodes with zero overlap, and has excellent query performance. As a\u0000second contribution, we develop an adaptive version AMBI, which utilizes the\u0000query workload to build a partial index only for parts of the data space that\u0000contain query results. Finally, we extend FMBI and AMBI to parallel bulk\u0000loading and query processing in distributed systems. An extensive experimental\u0000evaluation with real datasets confirms that FMBI and AMBI clearly outperform\u0000competitors in terms of combined index construction and query processing cost,\u0000sometimes by orders of magnitude.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Matrix Profile for Anomaly Detection on Multidimensional Time Series","authors":"Chin-Chia Michael Yeh, Audrey Der, Uday Singh Saini, Vivian Lai, Yan Zheng, Junpeng Wang, Xin Dai, Zhongfang Zhuang, Yujie Fan, Huiyuan Chen, Prince Osei Aboagye, Liang Wang, Wei Zhang, Eamonn Keogh","doi":"arxiv-2409.09298","DOIUrl":"https://doi.org/arxiv-2409.09298","url":null,"abstract":"The Matrix Profile (MP), a versatile tool for time series data mining, has\u0000been shown effective in time series anomaly detection (TSAD). This paper delves\u0000into the problem of anomaly detection in multidimensional time series, a common\u0000occurrence in real-world applications. For instance, in a manufacturing\u0000factory, multiple sensors installed across the site collect time-varying data\u0000for analysis. The Matrix Profile, named for its role in profiling the matrix\u0000storing pairwise distance between subsequences of univariate time series,\u0000becomes complex in multidimensional scenarios. If the input univariate time\u0000series has n subsequences, the pairwise distance matrix is a n x n matrix. In a\u0000multidimensional time series with d dimensions, the pairwise distance\u0000information must be stored in a n x n x d tensor. In this paper, we first\u0000analyze different strategies for condensing this tensor into a profile vector.\u0000We then investigate the potential of extending the MP to efficiently find\u0000k-nearest neighbors for anomaly detection. Finally, we benchmark the\u0000multidimensional MP against 19 baseline methods on 119 multidimensional TSAD\u0000datasets. The experiments covers three learning setups: unsupervised,\u0000supervised, and semi-supervised. MP is the only method that consistently\u0000delivers high performance across all setups.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Systematic Review on Process Mining for Curricular Analysis","authors":"Daniel Calegari, Andrea Delgado","doi":"arxiv-2409.09204","DOIUrl":"https://doi.org/arxiv-2409.09204","url":null,"abstract":"Educational Process Mining (EPM) is a data analysis technique that is used to\u0000improve educational processes. It is based on Process Mining (PM), which\u0000involves gathering records (logs) of events to discover process models and\u0000analyze the data from a process-centric perspective. One specific application\u0000of EPM is curriculum mining, which focuses on understanding the learning\u0000program students follow to achieve educational goals. This is important for\u0000institutional curriculum decision-making and quality improvement. Therefore,\u0000academic institutions can benefit from organizing the existing techniques,\u0000capabilities, and limitations. We conducted a systematic literature review to\u0000identify works on applying PM to curricular analysis and provide insights for\u0000further research. From the analysis of 22 primary studies, we found that\u0000results can be classified into five categories concerning the objectives they\u0000pursue: the discovery of educational trajectories, the identification of\u0000deviations in the observed behavior of students, the analysis of bottlenecks,\u0000the analysis of stopout and dropout problems, and the generation of\u0000recommendation. Moreover, we identified some open challenges and opportunities,\u0000such as standardizing for replicating studies to perform cross-university\u0000curricular analysis and strengthening the connection between PM and data mining\u0000for improving curricular analysis.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extending predictive process monitoring for collaborative processes","authors":"Daniel Calegari, Andrea Delgado","doi":"arxiv-2409.09212","DOIUrl":"https://doi.org/arxiv-2409.09212","url":null,"abstract":"Process mining on business process execution data has focused primarily on\u0000orchestration-type processes performed in a single organization\u0000(intra-organizational). Collaborative (inter-organizational) processes, unlike\u0000those of orchestration type, expand several organizations (for example, in\u0000e-Government), adding complexity and various challenges both for their\u0000implementation and for their discovery, prediction, and analysis of their\u0000execution. Predictive process monitoring is based on exploiting execution data\u0000from past instances to predict the execution of current cases. It is possible\u0000to make predictions on the next activity and remaining time, among others, to\u0000anticipate possible deviations, violations, and delays in the processes to take\u0000preventive measures (e.g., re-allocation of resources). In this work, we\u0000propose an extension for collaborative processes of traditional process\u0000prediction, considering particularities of this type of process, which add\u0000information of interest in this context, for example, the next activity of\u0000which participant or the following message to be exchanged between two\u0000participants.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPconv: Super-Polynomially Faster Join Ordering","authors":"Mihail Stoian, Andreas Kipf","doi":"arxiv-2409.08013","DOIUrl":"https://doi.org/arxiv-2409.08013","url":null,"abstract":"We revisit the join ordering problem in query optimization. The standard\u0000exact algorithm, DPccp, has a worst-case running time of $O(3^n)$. This is\u0000prohibitively expensive for large queries, which are not that uncommon anymore.\u0000We develop a new algorithmic framework based on subset convolution. DPconv\u0000achieves a super-polynomial speedup over DPccp, breaking the $O(3^n)$\u0000time-barrier for the first time. We show that the instantiation of our\u0000framework for the $C_max$ cost function is up to 30x faster than DPccp for\u0000large clique queries.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ranked Enumeration for Database Queries","authors":"Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald","doi":"arxiv-2409.08142","DOIUrl":"https://doi.org/arxiv-2409.08142","url":null,"abstract":"Ranked enumeration is a query-answering paradigm where the query answers are\u0000returned incrementally in order of importance (instead of returning all answers\u0000at once). Importance is defined by a ranking function that can be specific to\u0000the application, but typically involves either a lexicographic order (e.g.,\u0000\"ORDER BY R.A, S.B\" in SQL) or a weighted sum of attributes (e.g., \"ORDER BY\u00003*R.A + 2*S.B\"). We recently introduced any-k algorithms for (multi-way) join\u0000queries, which push ranking into joins and avoid materializing intermediate\u0000results until necessary. The top-ranked answers are returned asymptotically\u0000faster than the common join-then-rank approach of database systems, resulting\u0000in orders-of-magnitude speedup in practice. In addition to their practical usefulness, our techniques complement a long\u0000line of theoretical research on unranked enumeration, where answers are also\u0000returned incrementally, but with no explicit ordering requirement. For a broad\u0000class of ranking functions with certain monotonicity properties, including\u0000lexicographic orders and sum-based rankings, the ordering requirement\u0000surprisingly does not increase the asymptotic time or space complexity, apart\u0000from logarithmic factors. A key insight of our work is the connection between ranked enumeration for\u0000database queries and the fundamental task of computing the kth-shortest path in\u0000a graph. Uncovering these connections allowed us to ground our approach in the\u0000rich literature of that problem and connect ideas that had been explored in\u0000isolation before. In this article, we adopt a pragmatic approach and present a\u0000slightly simplified version of the algorithm without the shortest-path\u0000interpretation. We believe that this will benefit practitioners looking to\u0000implement and optimize any-k approaches.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"meds_reader: A fast and efficient EHR processing library","authors":"Ethan Steinberg, Michael Wornow, Suhana Bedi, Jason Alan Fries, Matthew B. A. McDermott, Nigam H. Shah","doi":"arxiv-2409.09095","DOIUrl":"https://doi.org/arxiv-2409.09095","url":null,"abstract":"The growing demand for machine learning in healthcare requires processing\u0000increasingly large electronic health record (EHR) datasets, but existing\u0000pipelines are not computationally efficient or scalable. In this paper, we\u0000introduce meds_reader, an optimized Python package for efficient EHR data\u0000processing that is designed to take advantage of many intrinsic properties of\u0000EHR data for improved speed. We then demonstrate the benefits of meds_reader by\u0000reimplementing key components of two major EHR processing pipelines, achieving\u000010-100x improvements in memory, speed, and disk usage. The code for meds_reader\u0000can be found at https://github.com/som-shahlab/meds_reader.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"echemdb Toolkit -- a Lightweight Approach to Getting Data Ready for Data Management Solutions","authors":"Albert K. Engstfeld, Johannes M. Hermann, Nicolas G. Hörmann, Julian Rüth","doi":"arxiv-2409.07083","DOIUrl":"https://doi.org/arxiv-2409.07083","url":null,"abstract":"According to the FAIR (findability, accessibility, interoperability, and\u0000reusability) principles, scientific data should always be stored with\u0000machine-readable descriptive metadata. Existing solutions to store data with\u0000metadata, such as electronic lab notebooks (ELN), are often very\u0000domain-specific and not sufficiently generic for arbitrary experimental or\u0000computational results. In this work, we present open-source echemdb toolkit for creating and\u0000handling data and metadata. The toolkit is running entirely on the file system\u0000level using a file-based approach, which facilitates integration with other\u0000tools in a FAIR data life cycle and means that no complicated server setup is\u0000required. This also makes the toolkit more accessible to the average researcher\u0000since no understanding of more sophisticated database technologies is required. We showcase several aspects and applications of the toolkit: automatic\u0000annotation of raw research data with human- and machine-readable metadata, data\u0000conversion into standardised frictionless Data Packages, and an API for\u0000exploring the data. We also illustrate the web frameworks to illustrate the\u0000data using example data from research into energy conversion and storage.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}