Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning: Latest Publications

Using Word Embedding to Enable Semantic Queries in Relational Databases
R. Bordawekar, O. Shmueli
DOI: 10.1145/3076246.3076251 · Published: 2017-05-14

Abstract: We investigate opportunities for exploiting Artificial Intelligence (AI) techniques to enhance the capabilities of relational databases. In particular, we explore applications of Natural Language Processing (NLP) techniques to endow relational databases with capabilities that were very hard to realize in practice. We apply an unsupervised neural-network-based NLP idea, distributed representation via word embedding, to extract latent information from a relational table. The word embedding model is based on a meaningful textual view of a relational database and captures inter- and intra-attribute relationships between database tokens. For each database token, the model includes a vector that encodes these contextual semantic relationships. These vectors enable a new class of SQL-based business intelligence queries, called cognitive intelligence (CI) queries, that analyze contextual semantic relationships between database tokens. The cognitive capabilities enable complex queries such as semantic matching, reasoning queries such as analogies, predictive queries using entities not present in a database, and queries drawing on knowledge from external sources.

Citations: 54
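
The core mechanism is concrete enough to sketch: serialize each row into a "sentence" of database tokens, train a word-embedding model on those sentences, and use vector similarity as a query primitive. The sketch below is a minimal illustration under assumptions of my own, not the authors' implementation: gensim's word2vec as the trainer, an invented toy table, and a hypothetical `semantic_match` helper standing in for a SQL UDF.

```python
# A minimal sketch of CI-style semantic matching, assuming gensim's word2vec
# as the embedding trainer (the entry above does not specify a toolkit).
from gensim.models import Word2Vec

# Hypothetical relational table: (employee, department, project).
rows = [
    ("alice", "research", "nlp_search"),
    ("bob", "research", "embeddings"),
    ("carol", "sales", "crm_rollout"),
    ("dave", "sales", "lead_scoring"),
]

# Textify the table: each row becomes one "sentence" of database tokens,
# so tokens that co-occur in a row land near each other in vector space.
sentences = [list(row) for row in rows]
model = Word2Vec(sentences, vector_size=32, window=5, min_count=1, epochs=200)

def semantic_match(token_a: str, token_b: str) -> float:
    """Cosine similarity between two database tokens (illustrative UDF)."""
    return float(model.wv.similarity(token_a, token_b))

# A CI-style query: which employees are contextually closest to "research"?
# Plain SQL equality on the department column could not rank alice vs. carol.
for emp in ("alice", "bob", "carol", "dave"):
    print(emp, round(semantic_match(emp, "research"), 3))
```
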
EMT: End To End Model Training for MSR Machine Translation
Vishal Chowdhary, Scott Greenwood
DOI: 10.1145/3076246.3076247 · Published: 2017-05-14

Abstract: Machine translation, at its core, is a Machine Learning (ML) problem that involves learning language translation by looking at large amounts of parallel data, i.e., translations of the same dataset in two or more languages. If we have parallel data between languages L1 and L2, we can build translation systems between these two languages. When training a complete system, we train several different models, each containing a different type of information about either one of the languages or the relationship between the two. We end up training thousands of models to support hundreds of languages. In this article, we explain our end-to-end architecture for automatically training and deploying models at scale. The goal of this project is to create a fully automated system responsible for gathering new data, training systems, and shipping them to production with little or no guidance from an administrator. By using the ever-changing and always-expanding contents of the web, we have a system that can quietly improve our existing systems over time. In this article, we detail the architecture and discuss the various problems and the solutions we arrived at. Finally, we present experiments and data showing the impact of our work. Specifically, this system has enabled us to ship much more frequently and to eliminate the human errors that occur when running repetitive tasks. The principles of this pipeline can be applied to any ML training and deployment system.

Citations: 2
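
The contribution here is architectural, but the control loop the abstract describes (gather, train, validate, ship with no human in the loop) is easy to sketch. Every function name below is a hypothetical placeholder of mine; the actual MSR pipeline interfaces are not published in this entry.

```python
# A minimal sketch of the automated train-evaluate-deploy loop described
# above. All functions are hypothetical stand-ins, not MSR's real APIs.
from dataclasses import dataclass

@dataclass
class Model:
    language_pair: str
    bleu: float  # quality score on a held-out test set

def gather_parallel_data(pair: str) -> list[tuple[str, str]]:
    """Placeholder: collect newly crawled parallel sentences for this pair."""
    return [("hello", "hola")]

def train(pair: str, data: list[tuple[str, str]]) -> Model:
    """Placeholder: train the full stack of models for one language pair."""
    return Model(language_pair=pair, bleu=31.7)

def production_model(pair: str) -> Model:
    """Placeholder: fetch the currently deployed model for comparison."""
    return Model(language_pair=pair, bleu=30.9)

def deploy(model: Model) -> None:
    print(f"shipping {model.language_pair} (BLEU {model.bleu})")

# The loop runs per language pair; a candidate ships only if it beats the
# production model, removing the repetitive manual steps that the article
# credits with most human error.
for pair in ["en-es", "en-fr"]:
    candidate = train(pair, gather_parallel_data(pair))
    if candidate.bleu > production_model(pair).bleu:
        deploy(candidate)
```
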
Versioning for End-to-End Machine Learning Pipelines
T. V. D. Weide, D. Papadopoulos, O. Smirnov, Michal Zielinski, T. V. Kasteren
DOI: 10.1145/3076246.3076248 · Published: 2017-05-14

Abstract: End-to-end machine learning pipelines that run in shared environments are challenging to implement. Production pipelines typically consist of multiple interdependent processing stages. Between stages, the intermediate results are persisted to reduce redundant computation and to improve robustness. Those results might come in the form of datasets for data-processing pipelines or model coefficients in the case of model-training pipelines. Reusing persisted results improves efficiency but at the same time creates complicated dependencies. Every time one of the processing stages is changed, whether through a code change or a parameter change, it becomes difficult to determine which datasets can be reused and which should be recomputed. In this paper we build upon previous work on producing derivations of datasets to ensure that multiple versions of a pipeline can run in parallel while minimizing the amount of redundant computation. Our extensions include partial derivations to simplify navigation and reuse, explicit support for schema changes of pipelines, and a central registry of running pipelines to coordinate upgrading pipelines across teams.

Citations: 23
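
One way to read the reuse condition in the abstract: an intermediate result can be reused only if the code, the parameters, and the versions of the inputs that would produce it are all unchanged. The sketch below implements that condition with content hashes; it is an illustration of the general idea under that assumption, not the paper's richer derivation scheme (which also covers partial derivations and schema changes).

```python
# A minimal sketch of derivation-based reuse: a stage's output version is a
# hash of its code, its parameters, and the versions of its inputs, so any
# upstream change forces recomputation and anything else is safely reused.
import hashlib
import json

def stage_version(code: str, params: dict, input_versions: list[str]) -> str:
    payload = json.dumps(
        {"code": code, "params": params, "inputs": sorted(input_versions)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

cache: dict[str, str] = {}  # version -> persisted result location

def run_stage(name: str, code: str, params: dict, inputs: list[str]) -> str:
    version = stage_version(code, params, inputs)
    if version in cache:
        print(f"{name}: reuse {cache[version]}")
    else:
        cache[version] = f"/artifacts/{name}/{version}"
        print(f"{name}: recompute -> {cache[version]}")
    return version

# Two pipeline runs that differ only in the training parameters: the cleaning
# stage is reused, the training stage is recomputed.
raw = "raw-data-v1"
v_clean = run_stage("clean", "clean.py@abc", {"drop_nulls": True}, [raw])
run_stage("train", "train.py@def", {"lr": 0.1}, [v_clean])
run_stage("clean", "clean.py@abc", {"drop_nulls": True}, [raw])   # reused
run_stage("train", "train.py@def", {"lr": 0.01}, [v_clean])       # recomputed
```
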
Model-based Pricing: Do Not Pay for More than What You Learn!
Lingjiao Chen, Paraschos Koutris, Arun Kumar
DOI: 10.1145/3076246.3076250 · Published: 2017-05-14

Abstract: While a lot of work has focused on improving the efficiency, scalability, and usability of machine learning (ML), little work has studied the cost of data acquisition for ML-based analytics. Datasets are already being bought and sold in marketplaces for various tasks, including ML. But current marketplaces force users to buy such data in whole or as fixed subsets, without any awareness of the ML tasks they are used for. This leads to sub-optimal choices and missed opportunities for both data sellers and buyers. In this paper, we outline our vision for a formal and practical pricing framework, which we call model-based pricing, that aims to resolve these issues. Our key observation is that ML users typically need only as much data as their accuracy goals require, which leads to novel trade-offs between price, accuracy, and runtime. We explain how this raises interesting new research questions at the intersection of data management, ML, and microeconomics.

Citations: 5
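
The key observation translates into a simple interface: the buyer names an accuracy target and pays for that, rather than buying the whole dataset. The toy curve below is purely my assumption for illustration; the paper is a vision statement and does not fix any concrete pricing function.

```python
# A toy sketch of model-based pricing: price grows monotonically with the
# buyer's accuracy target, so nobody pays for more than what they learn.
# The concrete curve is an assumption for illustration only.

FULL_DATA_PRICE = 1000.0  # seller's price for unrestricted access
BASELINE_ACC = 0.5        # accuracy of a trivial model with no purchased data
BEST_ACC = 0.95           # accuracy achievable with the full dataset

def price(target_accuracy: float) -> float:
    """Zero at the free baseline, full price at BEST_ACC, monotone between."""
    if target_accuracy <= BASELINE_ACC:
        return 0.0
    if target_accuracy >= BEST_ACC:
        return FULL_DATA_PRICE
    frac = (target_accuracy - BASELINE_ACC) / (BEST_ACC - BASELINE_ACC)
    return round(FULL_DATA_PRICE * frac, 2)

# A buyer who needs only 80% accuracy pays for exactly that much "learning".
for acc in (0.6, 0.8, 0.95):
    print(f"accuracy {acc:.2f} -> ${price(acc):,.2f}")
```
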
Towards Automatically Setting Language Bias in Relational Learning
Jose Picado, Arash Termehchy, Alan Fern, Sudhanshu Pathak
DOI: 10.1145/3076246.3076249 · Published: 2017-05-14

Abstract: Relational databases are valuable resources for learning novel and interesting relations and concepts. Relational learning algorithms learn the definition of new relations in terms of the existing relations in the database. In order to constrain the search through the large space of candidate definitions, users must specify a language bias. Unfortunately, specifying the language bias is done via trial and error, guided by the expert's intuitions, so it normally takes a great deal of time and effort to use these algorithms effectively. We report our ongoing work on building AutoMode, a system that leverages information in the schema and content of the database to automatically induce the language bias used by popular relational learning algorithms.

Citations: 2
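
For readers unfamiliar with the term: in inductive logic programming systems, language bias is typically given as mode declarations that tell the learner which predicates may appear in a clause and how their arguments may be used. The sketch below generates Aleph-style declarations of that kind from a schema; the toy schema and the heuristic of treating foreign-key columns as input variables are my assumptions, not AutoMode's actual induction procedure.

```python
# A minimal sketch of inducing Aleph-style mode declarations from a schema,
# i.e., the kind of language bias AutoMode aims to set automatically.
# Marking foreign-key columns as input variables ('+') and other columns as
# constants ('#') is a simplifying assumption for illustration.

# Hypothetical schema: table -> (columns, set of foreign-key columns).
schema = {
    "movie": (["movie_id", "title", "year"], {"movie_id"}),
    "directed": (["person_id", "movie_id"], {"person_id", "movie_id"}),
    "person": (["person_id", "name"], {"person_id"}),
}

def induce_modes(schema: dict) -> list[str]:
    modes = []
    for table, (columns, fkeys) in schema.items():
        args = ["+id" if col in fkeys else "#const" for col in columns]
        modes.append(f":- modeb(*, {table}({', '.join(args)})).")
    return modes

# The induced bias constrains the learner's search: clause bodies may join
# tables only through key arguments, instead of the user hand-tuning these
# declarations by trial and error.
for mode in induce_modes(schema):
    print(mode)
```
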
On Model Discovery For Hosted Data Science Projects
Hui Miao, Ang Li, L. Davis, A. Deshpande
DOI: 10.1145/3076246.3076252 · Published: 2017-05-14

Abstract: Alongside the development of systems for scalable machine learning and collaborative data science, there is an increasing trend toward publicly shared data science projects, hosted in general-purpose or dedicated hosting services such as GitHub and DataHub. The artifacts of hosted projects are rich and include not only text files but also versioned datasets, trained models, project documents, and more. Given the fast pace and expectations of data science activities, model discovery, i.e., finding relevant data science projects to reuse, is an important task in the context of data management for end-to-end machine learning. In this paper, we study this task and present ongoing work on ModelHub Discovery, a system for finding relevant models in hosted data science projects. Instead of prescribing a structured data model for data science projects, we take an information retrieval approach by decomposing the discovery task into three major steps: project query and matching, model comparison and ranking, and processing and building ensembles with the returned models. We describe the motivation and desiderata, propose techniques, and present opportunities and challenges for model discovery over hosted data science projects.

Citations: 20
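
The first of the three steps, project query and matching, is conventional information retrieval and easy to sketch: rank hosted projects against a free-text query. Using scikit-learn's TF-IDF and the toy project descriptions below are my choices for illustration; the paper commits to an IR approach but not to any particular retrieval implementation.

```python
# A minimal sketch of the "project query and matching" step as plain IR:
# rank hosted projects against a free-text query via TF-IDF and cosine
# similarity. scikit-learn is an assumed choice, not the paper's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical project descriptions (e.g., README text from GitHub/DataHub).
projects = {
    "face-resnet": "face recognition with resnet, pretrained model weights",
    "sentiment-lstm": "lstm sentiment analysis on movie reviews, trained model",
    "yolo-traffic": "object detection for traffic cameras using yolo",
}

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(projects.values())

def match(query: str, top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top-k projects ranked by cosine similarity to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    ranked = sorted(zip(projects, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

# The ranked candidates then feed the later steps: model comparison and
# ranking, and building ensembles with the returned models.
print(match("pretrained face recognition model"))
```
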
Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning
DOI: 10.1145/3076246

Citations: 0