Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)最新文献

筛选
英文 中文
Modelling Machine Learning Algorithms on Relational Data with Datalog 关系型数据的机器学习算法建模
Nantia Makrynioti, N. Vasiloglou, E. Pasalic, V. Vassalos
{"title":"Modelling Machine Learning Algorithms on Relational Data with Datalog","authors":"Nantia Makrynioti, N. Vasiloglou, E. Pasalic, V. Vassalos","doi":"10.1145/3209889.3209893","DOIUrl":"https://doi.org/10.1145/3209889.3209893","url":null,"abstract":"The standard process of data science tasks is to prepare features inside a database, export them as a denormalized data frame and then apply machine learning algorithms. This process is not optimal for two reasons. First, it requires denormalization of the database that can convert a small data problem into a big data problem. The second shortcoming is that it assumes that the machine learning algorithm is disentangled from the relational model of the problem. That seems to be a serious limitation since the relational model contains very valuable domain expertise. In this paper we explore the use of convex optimization and specifically linear programming, for modelling machine learning algorithms on relational data in an integrated way with data processing operators. We are using SolverBlox, a framework that accepts as an input Datalog code and feeds it into a linear programming solver. We demonstrate the expression of common machine learning algorithms and present use case scenarios where combining data processing with modelling of optimization problems inside a database offers significant advantages.","PeriodicalId":92710,"journal":{"name":"Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73254717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
End-to-End Machine Learning with Apache AsterixDB 端到端机器学习与Apache AsterixDB
Wail Y. Alkowaileet, Sattam Alsubaiee, M. Carey, Chen Li, H. Ramampiaro, Phanwadee Sinthong, Xikui Wang
{"title":"End-to-End Machine Learning with Apache AsterixDB","authors":"Wail Y. Alkowaileet, Sattam Alsubaiee, M. Carey, Chen Li, H. Ramampiaro, Phanwadee Sinthong, Xikui Wang","doi":"10.1145/3209889.3209894","DOIUrl":"https://doi.org/10.1145/3209889.3209894","url":null,"abstract":"Recent developments in machine learning and data science provide a foundation for extracting underlying information from Big Data. Unfortunately, current platforms and tools often require data scientists to glue together and maintain custom-built platforms consisting of multiple Big Data component technologies. In this paper, we explain how Apache AsterixDB, an open source Big Data Management System, can help to reduce the burden involved in using machine learning algorithms in Big Data analytics. In particular, we describe how AsterixDB's built-in support for user-defined functions (UDFs), the availability of UDFs in data ingestion pipelines and queries, and the provision of machine learning platform and notebook inter-operation capabilities can together enable data analysts to more easily create and manage end-to-end analytical dataflows.","PeriodicalId":92710,"journal":{"name":"Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89630659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning 第二届端到端机器学习数据管理研讨会论文集
{"title":"Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning","authors":"","doi":"10.1145/3209889","DOIUrl":"https://doi.org/10.1145/3209889","url":null,"abstract":"","PeriodicalId":92710,"journal":{"name":"Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87187269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Avatar: Large Scale Entity Resolution of Heterogeneous User Profiles 头像:异构用户配置文件的大规模实体分辨率
Janani Balaji, Chris Min, F. Javed, Yun Zhu
{"title":"Avatar: Large Scale Entity Resolution of Heterogeneous User Profiles","authors":"Janani Balaji, Chris Min, F. Javed, Yun Zhu","doi":"10.1145/3209889.3209892","DOIUrl":"https://doi.org/10.1145/3209889.3209892","url":null,"abstract":"Entity Resolution (ER), also known as record linkage or de-duplication, has been a long-standing problem in the data management space. Though an ER system follows an established pipeline involving the Blocking -> Matching -> Clustering components, the Matching forms the core element of an ER system. At CareerBuilder, we perform de-duplication of massive datasets of people profiles collected from disparate sources with varying informational content. In this paper, we discuss the challenges of de-duplicating inherently heterogeneous data and illustrate the end-to-end process of building a functional and scalable machine learning-based matching platform. We also provide an incremental framework to enable differential ER assimilation for continuous de-duplication workflows.","PeriodicalId":92710,"journal":{"name":"Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)","volume":"46 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84109657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Towards Interactive Curation & Automatic Tuning of ML Pipelines 面向ML管道的交互式管理和自动调优
Carsten Binnig, Benedetto Buratti, Yeounoh Chung, Cyrus Cousins, Tim Kraska, Zeyuan Shang, E. Upfal, R. Zeleznik, Emanuel Zgraggen
{"title":"Towards Interactive Curation & Automatic Tuning of ML Pipelines","authors":"Carsten Binnig, Benedetto Buratti, Yeounoh Chung, Cyrus Cousins, Tim Kraska, Zeyuan Shang, E. Upfal, R. Zeleznik, Emanuel Zgraggen","doi":"10.1145/3209889.3209891","DOIUrl":"https://doi.org/10.1145/3209889.3209891","url":null,"abstract":"Democratizing Data Science requires a fundamental rethinking of the way data analytics and model discovery is done. Available tools for analyzing massive data sets and curating machine learning models are limited in a number of fundamental ways. First, existing tools require well-trained data scientists to select the appropriate techniques to build models and to evaluate their outcomes. Second, existing tools require heavy data preparation steps and are often too slow to give interactive feedback to domain experts in the model building process, severely limiting the possible interactions. Third, current tools do not provide adequate analysis of statistical risk factors in the model development. In this work, we present the first iteration of QuIC-M (pronounced quick-m), an interactive human-in-the-loop data exploration and model building suite. The goal is to enable domain experts to build the machine learning pipelines an order of magnitude faster than machine learning experts while having model qualities comparable to expert solutions.","PeriodicalId":92710,"journal":{"name":"Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74850220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 19
Learning Efficiently Over Heterogeneous Databases: Sampling and Constraints to the Rescue 异构数据库的高效学习:采样和约束
Jose Picado, Arash Termehchy, Sudhanshu Pathak
{"title":"Learning Efficiently Over Heterogeneous Databases: Sampling and Constraints to the Rescue","authors":"Jose Picado, Arash Termehchy, Sudhanshu Pathak","doi":"10.1145/3209889.3209899","DOIUrl":"https://doi.org/10.1145/3209889.3209899","url":null,"abstract":"Given a relational database and training examples for a target relation, relational learning algorithms learn a definition for the target relation in terms of the existing relations in the database. We propose a relational learning system called CastorX, which learns efficiently across multiple heterogeneous databases. The user specifies connections and relationships between different databases using a set of declarative constraints called matching dependencies (MDs). Each MD connects tuples across multiple databases that are related and can meaningfully join but the values of their join attributes may not be equal due to the different representations of these values in different databases. CastorX leverages these constraints during learning to find the information relevant to the training data and target definition across multiple databases. Since each tuple in a database may be connected to too many tuples in other databases according to an MD, the learning process will become very slow. Hence, CastorX uses sampling techniques to learn efficiently and output accurate definitions.","PeriodicalId":92710,"journal":{"name":"Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85995415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Snorkel MeTaL: Weak Supervision for Multi-Task Learning. Snorkel MeTaL:多任务学习的弱监督。
Alex Ratner, Braden Hancock, Jared Dunnmon, Roger Goldman, Christopher Ré
{"title":"Snorkel MeTaL: Weak Supervision for Multi-Task Learning.","authors":"Alex Ratner, Braden Hancock, Jared Dunnmon, Roger Goldman, Christopher Ré","doi":"10.1145/3209889.3209898","DOIUrl":"10.1145/3209889.3209898","url":null,"abstract":"<p><p>Many real-world machine learning problems are challenging to tackle for two reasons: (i) they involve multiple sub-tasks at different levels of granularity; and (ii) they require large volumes of labeled training data. We propose Snorkel MeTaL, an end-to-end system for multi-task learning that leverages <i>weak</i> supervision provided at <i>multiple levels of granularity</i> by domain expert users. In MeTaL, a user specifies a problem consisting of multiple, hierarchically-related sub-tasks-for example, classifying a document at multiple levels of granularity-and then provides <i>labeling functions</i> for each sub-task as weak supervision. MeTaL learns a re-weighted model of these labeling functions, and uses the combined signal to train a hierarchical multi-task network which is automatically compiled from the structure of the sub-tasks. Using MeTaL on a radiology report triage task and a fine-grained news classification task, we achieve average gains of 11.2 accuracy points over a baseline supervised approach and 9.5 accuracy points over the predictions of the user-provided labeling functions.</p>","PeriodicalId":92710,"journal":{"name":"Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)","volume":"2018 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6436830/pdf/nihms-993812.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37107048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Exploring the Utility of Developer Exhaust. 探索开发人员废气的效用。
Jian Zhang, Max Lam, Stephanie Wang, Paroma Varma, Luigi Nardi, Kunle Olukotun, Christopher Ré
{"title":"Exploring the Utility of Developer Exhaust.","authors":"Jian Zhang,&nbsp;Max Lam,&nbsp;Stephanie Wang,&nbsp;Paroma Varma,&nbsp;Luigi Nardi,&nbsp;Kunle Olukotun,&nbsp;Christopher Ré","doi":"10.1145/3209889.3209895","DOIUrl":"https://doi.org/10.1145/3209889.3209895","url":null,"abstract":"<p><p>Using machine learning to analyze data often results in <i>developer exhaust</i> - code, logs, or metadata that do not define the learning algorithm but are byproducts of the data analytics pipeline. We study how the rich information present in developer exhaust can be used to approximately solve otherwise complex tasks. Specifically, we focus on using log data associated with training deep learning models to perform model search by <i>predicting</i> performance metrics for untrained models. Instead of designing a different model for each performance metric, we present two preliminary methods that rely only on information present in logs to predict these characteristics for different architectures. We introduce (i) a nearest neighbor approach with a hand-crafted edit distance metric to compare model architectures and (ii) a more generalizable, end-to-end approach that trains an LSTM using model architectures and associated logs to predict performance metrics of interest. We perform model search optimizing for best validation accuracy, degree of overfitting, and best validation accuracy given a constraint on training time. Our approaches can predict validation accuracy within 1.37% error on average, while the baseline achieves 4.13% by using the performance of a trained model with the closest number of layers. When choosing the best performing model given constraints on training time, our approaches select the top-3 models that overlap with the true top- 3 models 82% of the time, while the baseline only achieves this 54% of the time. Our preliminary experiments hold promise for how developer exhaust can help <i>learn</i> models that can approximate various complex tasks efficiently.</p>","PeriodicalId":92710,"journal":{"name":"Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)","volume":"2018 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3209889.3209895","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37280256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities 加速人在循环机器学习:挑战与机遇
Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, Aditya G. Parameswaran
{"title":"Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities","authors":"Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, Aditya G. Parameswaran","doi":"10.1145/3209889.3209897","DOIUrl":"https://doi.org/10.1145/3209889.3209897","url":null,"abstract":"Development of machine learning (ML) workflows is a tedious process of iterative experimentation: developers repeatedly make changes to workflows until the desired accuracy is attained. We describe our vision for a \"human-in-the-loop\" ML system that accelerates this process: by intelligently tracking changes and intermediate results over time, such a system can enable rapid iteration, quick responsive feedback, introspection and debugging, and background execution and automation. We finally describe Helix, our preliminary attempt at such a system that has already led to speedups of upto 10x on typical iterative workflows against competing systems.","PeriodicalId":92710,"journal":{"name":"Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)","volume":"160 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90110347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 83
Learning State Representations for Query Optimization with Deep Reinforcement Learning 基于深度强化学习的查询优化学习状态表示
Jennifer Ortiz, M. Balazinska, J. Gehrke, S. Keerthi
{"title":"Learning State Representations for Query Optimization with Deep Reinforcement Learning","authors":"Jennifer Ortiz, M. Balazinska, J. Gehrke, S. Keerthi","doi":"10.1145/3209889.3209890","DOIUrl":"https://doi.org/10.1145/3209889.3209890","url":null,"abstract":"We explore the idea of using deep reinforcement learning for query optimization. The approach is to build queries incrementally by encoding properties of subqueries using a learned representation. In this paper, we focus specifically on the state representation problem and the formation of the state transition function. We show preliminary results and discuss how we can use the state representation to improve query optimization using reinforcement learning.","PeriodicalId":92710,"journal":{"name":"Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)","volume":"16 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91472090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 140
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信