Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning: Latest Publications

Using Word Embedding to Enable Semantic Queries in Relational Databases
R. Bordawekar, O. Shmueli
DOI: 10.1145/3076246.3076251 · Published: 2017-05-14

Abstract: We investigate opportunities for exploiting Artificial Intelligence (AI) techniques to enhance the capabilities of relational databases. In particular, we explore applications of Natural Language Processing (NLP) techniques to endow relational databases with capabilities that were very hard to realize in practice. We apply an unsupervised neural-network-based NLP idea, distributed representation via word embedding, to extract latent information from a relational table. The word embedding model is based on a meaningful textual view of a relational database and captures inter- and intra-attribute relationships between database tokens. For each database token, the model includes a vector that encodes these contextual semantic relationships. These vectors enable a new class of SQL-based business intelligence queries, called cognitive intelligence (CI) queries, that analyze contextual semantic relationships between database tokens. The cognitive capabilities enable complex queries such as semantic matching, reasoning queries such as analogies, predictive queries using entities not present in a database, and queries drawing on knowledge from external sources.

Citations: 54
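
The core mechanism is concrete enough to sketch: serialize each row into a "sentence" of database tokens, train a word-embedding model on those sentences, and use vector similarity as a query primitive. The sketch below is a minimal illustration under assumptions of my own, not the authors' implementation: gensim's word2vec as the trainer, an invented toy table, and a hypothetical `semantic_match` helper standing in for a SQL UDF.

```python
# A minimal sketch of CI-style semantic matching, assuming gensim's word2vec
# as the embedding trainer (the entry above does not specify a toolkit).
from gensim.models import Word2Vec

# Hypothetical relational table: (employee, department, project).
rows = [
    ("alice", "research", "nlp_search"),
    ("bob", "research", "embeddings"),
    ("carol", "sales", "crm_rollout"),
    ("dave", "sales", "lead_scoring"),
]

# Textify the table: each row becomes one "sentence" of database tokens,
# so tokens that co-occur in a row land near each other in vector space.
sentences = [list(row) for row in rows]
model = Word2Vec(sentences, vector_size=32, window=5, min_count=1, epochs=200)

def semantic_match(token_a: str, token_b: str) -> float:
    """Cosine similarity between two database tokens (illustrative UDF)."""
    return float(model.wv.similarity(token_a, token_b))

# A CI-style query: which employees are contextually closest to "research"?
# Plain SQL equality on the department column could not rank alice vs. carol.
for emp in ("alice", "bob", "carol", "dave"):
    print(emp, round(semantic_match(emp, "research"), 3))
```
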
EMT: End To End Model Training for MSR Machine Translation
Vishal Chowdhary, Scott Greenwood
DOI: 10.1145/3076246.3076247 · Published: 2017-05-14

Abstract: Machine translation, at its core, is a Machine Learning (ML) problem that involves learning language translation by looking at large amounts of parallel data, i.e., translations of the same dataset in two or more languages. If we have parallel data between languages L1 and L2, we can build translation systems between these two languages. When training a complete system, we train several different models, each containing a different type of information about either one of the languages or the relationship between the two. We end up training thousands of models to support hundreds of languages. In this article, we explain our end-to-end architecture for automatically training and deploying models at scale. The goal of this project is to create a fully automated system responsible for gathering new data, training systems, and shipping them to production with little or no guidance from an administrator. By using the ever-changing and always-expanding contents of the web, we have a system that can quietly improve our existing systems over time. In this article, we detail the architecture and discuss the various problems and the solutions we arrived at. Finally, we present experiments and data showing the impact of our work. Specifically, this system has enabled us to ship much more frequently and to eliminate the human errors that occur when running repetitive tasks. The principles of this pipeline can be applied to any ML training and deployment system.

Citations: 2
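
The contribution here is architectural, but the control loop the abstract describes (gather, train, validate, ship with no human in the loop) is easy to sketch. Every function name below is a hypothetical placeholder of mine; the actual MSR pipeline interfaces are not published in this entry.

```python
# A minimal sketch of the automated train-evaluate-deploy loop described
# above. All functions are hypothetical stand-ins, not MSR's real APIs.
from dataclasses import dataclass

@dataclass
class Model:
    language_pair: str
    bleu: float  # quality score on a held-out test set

def gather_parallel_data(pair: str) -> list[tuple[str, str]]:
    """Placeholder: collect newly crawled parallel sentences for this pair."""
    return [("hello", "hola")]

def train(pair: str, data: list[tuple[str, str]]) -> Model:
    """Placeholder: train the full stack of models for one language pair."""
    return Model(language_pair=pair, bleu=31.7)

def production_model(pair: str) -> Model:
    """Placeholder: fetch the currently deployed model for comparison."""
    return Model(language_pair=pair, bleu=30.9)

def deploy(model: Model) -> None:
    print(f"shipping {model.language_pair} (BLEU {model.bleu})")

# The loop runs per language pair; a candidate ships only if it beats the
# production model, removing the repetitive manual steps that the article
# credits with most human error.
for pair in ["en-es", "en-fr"]:
    candidate = train(pair, gather_parallel_data(pair))
    if candidate.bleu > production_model(pair).bleu:
        deploy(candidate)
```
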
Versioning for End-to-End Machine Learning Pipelines
T. V. D. Weide, D. Papadopoulos, O. Smirnov, Michal Zielinski, T. V. Kasteren
DOI: 10.1145/3076246.3076248 · Published: 2017-05-14

Abstract: End-to-end machine learning pipelines that run in shared environments are challenging to implement. Production pipelines typically consist of multiple interdependent processing stages. Between stages, the intermediate results are persisted to reduce redundant computation and to improve robustness. Those results might come in the form of datasets for data-processing pipelines or model coefficients in the case of model-training pipelines. Reusing persisted results improves efficiency but at the same time creates complicated dependencies. Every time one of the processing stages is changed, whether through a code change or a parameter change, it becomes difficult to determine which datasets can be reused and which should be recomputed. In this paper we build upon previous work on producing derivations of datasets to ensure that multiple versions of a pipeline can run in parallel while minimizing the amount of redundant computation. Our extensions include partial derivations to simplify navigation and reuse, explicit support for schema changes of pipelines, and a central registry of running pipelines to coordinate upgrading pipelines across teams.

Citations: 23
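
One way to read the reuse condition in the abstract: an intermediate result can be reused only if the code, the parameters, and the versions of the inputs that would produce it are all unchanged. The sketch below implements that condition with content hashes; it is an illustration of the general idea under that assumption, not the paper's richer derivation scheme (which also covers partial derivations and schema changes).

```python
# A minimal sketch of derivation-based reuse: a stage's output version is a
# hash of its code, its parameters, and the versions of its inputs, so any
# upstream change forces recomputation and anything else is safely reused.
import hashlib
import json

def stage_version(code: str, params: dict, input_versions: list[str]) -> str:
    payload = json.dumps(
        {"code": code, "params": params, "inputs": sorted(input_versions)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

cache: dict[str, str] = {}  # version -> persisted result location

def run_stage(name: str, code: str, params: dict, inputs: list[str]) -> str:
    version = stage_version(code, params, inputs)
    if version in cache:
        print(f"{name}: reuse {cache[version]}")
    else:
        cache[version] = f"/artifacts/{name}/{version}"
        print(f"{name}: recompute -> {cache[version]}")
    return version

# Two pipeline runs that differ only in the training parameters: the cleaning
# stage is reused, the training stage is recomputed.
raw = "raw-data-v1"
v_clean = run_stage("clean", "clean.py@abc", {"drop_nulls": True}, [raw])
run_stage("train", "train.py@def", {"lr": 0.1}, [v_clean])
run_stage("clean", "clean.py@abc", {"drop_nulls": True}, [raw])   # reused
run_stage("train", "train.py@def", {"lr": 0.01}, [v_clean])       # recomputed
```
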
Model-based Pricing: Do Not Pay for More than What You Learn!
Lingjiao Chen, Paraschos Koutris, Arun Kumar
DOI: 10.1145/3076246.3076250 · Published: 2017-05-14

Abstract: While a lot of work has focused on improving the efficiency, scalability, and usability of machine learning (ML), little work has studied the cost of data acquisition for ML-based analytics. Datasets are already being bought and sold in marketplaces for various tasks, including ML. But current marketplaces force users to buy such data in whole or as fixed subsets, without any awareness of the ML tasks they are used for. This leads to sub-optimal choices and missed opportunities for both data sellers and buyers. In this paper, we outline our vision for a formal and practical pricing framework, which we call model-based pricing, that aims to resolve these issues. Our key observation is that ML users typically need only as much data as their accuracy goals require, which leads to novel trade-offs between price, accuracy, and runtime. We explain how this raises interesting new research questions at the intersection of data management, ML, and microeconomics.

Citations: 5
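
The key observation translates into a simple interface: the buyer names an accuracy target and pays for that, rather than buying the whole dataset. The toy curve below is purely my assumption for illustration; the paper is a vision statement and does not fix any concrete pricing function.

```python
# A toy sketch of model-based pricing: price grows monotonically with the
# buyer's accuracy target, so nobody pays for more than what they learn.
# The concrete curve is an assumption for illustration only.

FULL_DATA_PRICE = 1000.0  # seller's price for unrestricted access
BASELINE_ACC = 0.5        # accuracy of a trivial model with no purchased data
BEST_ACC = 0.95           # accuracy achievable with the full dataset

def price(target_accuracy: float) -> float:
    """Zero at the free baseline, full price at BEST_ACC, monotone between."""
    if target_accuracy <= BASELINE_ACC:
        return 0.0
    if target_accuracy >= BEST_ACC:
        return FULL_DATA_PRICE
    frac = (target_accuracy - BASELINE_ACC) / (BEST_ACC - BASELINE_ACC)
    return round(FULL_DATA_PRICE * frac, 2)

# A buyer who needs only 80% accuracy pays for exactly that much "learning".
for acc in (0.6, 0.8, 0.95):
    print(f"accuracy {acc:.2f} -> ${price(acc):,.2f}")
```
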
Towards Automatically Setting Language Bias in Relational Learning
Jose Picado, Arash Termehchy, Alan Fern, Sudhanshu Pathak
DOI: 10.1145/3076246.3076249 · Published: 2017-05-14

Abstract: Relational databases are valuable resources for learning novel and interesting relations and concepts. Relational learning algorithms learn the definition of new relations in terms of the existing relations in the database. In order to constrain the search through the large space of candidate definitions, users must specify a language bias. Unfortunately, specifying the language bias is done via trial and error, guided by the expert's intuitions, so it normally takes a great deal of time and effort to use these algorithms effectively. We report our ongoing work on building AutoMode, a system that leverages information in the schema and content of the database to automatically induce the language bias used by popular relational learning algorithms.

Citations: 2
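
For readers unfamiliar with the term: in inductive logic programming systems, language bias is typically given as mode declarations that tell the learner which predicates may appear in a clause and how their arguments may be used. The sketch below generates Aleph-style declarations of that kind from a schema; the toy schema and the heuristic of treating foreign-key columns as input variables are my assumptions, not AutoMode's actual induction procedure.

```python
# A minimal sketch of inducing Aleph-style mode declarations from a schema,
# i.e., the kind of language bias AutoMode aims to set automatically.
# Marking foreign-key columns as input variables ('+') and other columns as
# constants ('#') is a simplifying assumption for illustration.

# Hypothetical schema: table -> (columns, set of foreign-key columns).
schema = {
    "movie": (["movie_id", "title", "year"], {"movie_id"}),
    "directed": (["person_id", "movie_id"], {"person_id", "movie_id"}),
    "person": (["person_id", "name"], {"person_id"}),
}

def induce_modes(schema: dict) -> list[str]:
    modes = []
    for table, (columns, fkeys) in schema.items():
        args = ["+id" if col in fkeys else "#const" for col in columns]
        modes.append(f":- modeb(*, {table}({', '.join(args)})).")
    return modes

# The induced bias constrains the learner's search: clause bodies may join
# tables only through key arguments, instead of the user hand-tuning these
# declarations by trial and error.
for mode in induce_modes(schema):
    print(mode)
```
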
On Model Discovery For Hosted Data Science Projects
Hui Miao, Ang Li, L. Davis, A. Deshpande
DOI: 10.1145/3076246.3076252 · Published: 2017-05-14

Abstract: Alongside the development of systems for scalable machine learning and collaborative data science, there is an increasing trend toward publicly shared data science projects, hosted in general-purpose or dedicated hosting services such as GitHub and DataHub. The artifacts of hosted projects are rich and include not only text files but also versioned datasets, trained models, project documents, and more. Given the fast pace and expectations of data science activities, model discovery, i.e., finding relevant data science projects to reuse, is an important task in the context of data management for end-to-end machine learning. In this paper, we study this task and present ongoing work on ModelHub Discovery, a system for finding relevant models in hosted data science projects. Instead of prescribing a structured data model for data science projects, we take an information retrieval approach by decomposing the discovery task into three major steps: project query and matching, model comparison and ranking, and processing and building ensembles with the returned models. We describe the motivation and desiderata, propose techniques, and present opportunities and challenges for model discovery over hosted data science projects.

Citations: 20
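
The first of the three steps, project query and matching, is conventional information retrieval and easy to sketch: rank hosted projects against a free-text query. Using scikit-learn's TF-IDF and the toy project descriptions below are my choices for illustration; the paper commits to an IR approach but not to any particular retrieval implementation.

```python
# A minimal sketch of the "project query and matching" step as plain IR:
# rank hosted projects against a free-text query via TF-IDF and cosine
# similarity. scikit-learn is an assumed choice, not the paper's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical project descriptions (e.g., README text from GitHub/DataHub).
projects = {
    "face-resnet": "face recognition with resnet, pretrained model weights",
    "sentiment-lstm": "lstm sentiment analysis on movie reviews, trained model",
    "yolo-traffic": "object detection for traffic cameras using yolo",
}

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(projects.values())

def match(query: str, top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top-k projects ranked by cosine similarity to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    ranked = sorted(zip(projects, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

# The ranked candidates then feed the later steps: model comparison and
# ranking, and building ensembles with the returned models.
print(match("pretrained face recognition model"))
```
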
Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning
DOI: 10.1145/3076246

Citations: 0