Machine Learning for Data Management: Problems and Solutions

Proceedings of the 2018 International Conference on Management of Data. Pub Date: 2018-05-27. DOI: 10.1145/3183713.3199515

Pedro M. Domingos


Citations: 9

Abstract

Machine learning has made great strides in recent years, and its applications are spreading rapidly. Unfortunately, the standard machine learning formulation does not match well with data management problems. For example, most learning algorithms assume that the data is contained in a single table, and consists of i.i.d. (independent and identically distributed) samples. This leads to a proliferation of ad hoc solutions, slow development, and suboptimal results. Fortunately, a body of machine learning theory and practice is being developed that dispenses with such assumptions, and promises to make machine learning for data management much easier and more effective [1]. In particular, representations like Markov logic, which includes many types of deep networks as special cases, allow us to define very rich probability distributions over non-i.i.d., multi-relational data [2]. Despite their generality, learning the parameters of these models is still a convex optimization problem, allowing for efficient solution. Learning structure (in the case of Markov logic, a set of formulas in first-order logic) is intractable, as in more traditional representations, but can be done effectively using inductive logic programming techniques. Inference is performed using probabilistic generalizations of theorem proving, and takes linear time and space in tractable Markov logic, an object-oriented specialization of Markov logic [3]. These techniques have led to state-of-the-art, principled solutions to problems like entity resolution, schema matching, ontology alignment, and information extraction. Using tractable Markov logic, we have extracted from the Web a probabilistic knowledge base with millions of objects and billions of parameters, which can be queried exactly in subsecond times using an RDBMS backend [3]. With these foundations in place, we expect the pace of machine learning applications in data management to continue to accelerate in coming years.
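To make the idea of a probability distribution over non-i.i.d., multi-relational data concrete, here is a minimal sketch of a Markov logic network in its usual log-linear form, where a possible world x has probability proportional to exp(sum_i w_i * n_i(x)), with w_i the weight of formula i and n_i(x) the number of its true groundings in x. Everything specific below is an assumption chosen for illustration and does not come from the abstract: the two-person domain, the predicates `Smokes`, `Cancer`, and `Friends`, the two formulas, and the weights. Inference here is brute-force enumeration of all worlds, which only works for toy domains; it is not the probabilistic theorem proving or the tractable Markov logic inference that the abstract describes.

```python
import itertools
import math

# Hypothetical toy domain: two people and three predicates.
PEOPLE = ["A", "B"]
ATOMS = (
    [("Smokes", p) for p in PEOPLE]
    + [("Cancer", p) for p in PEOPLE]
    + [("Friends", p, q) for p in PEOPLE for q in PEOPLE]
)


def count_smoking_implies_cancer(world):
    """Number of true groundings of Smokes(x) => Cancer(x)."""
    return sum(
        (not world[("Smokes", p)]) or world[("Cancer", p)] for p in PEOPLE
    )


def count_friends_smoke_alike(world):
    """Number of true groundings of Friends(x, y) and Smokes(x) => Smokes(y)."""
    return sum(
        (not (world[("Friends", p, q)] and world[("Smokes", p)]))
        or world[("Smokes", q)]
        for p in PEOPLE
        for q in PEOPLE
    )


# (weight, grounding counter) pairs; the weights are illustrative, not learned.
FORMULAS = [
    (1.5, count_smoking_implies_cancer),
    (1.1, count_friends_smoke_alike),
]


def log_potential(world):
    """Weighted count of satisfied groundings: sum_i w_i * n_i(world)."""
    return sum(weight * count(world) for weight, count in FORMULAS)


def all_worlds():
    """Yield every truth assignment to the ground atoms (2**8 = 256 worlds)."""
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def conditional_probability(query, evidence):
    """P(query atom is true | evidence), by enumerating all worlds."""
    numerator = 0.0
    denominator = 0.0
    for world in all_worlds():
        if any(world[atom] != value for atom, value in evidence.items()):
            continue  # world contradicts the evidence
        mass = math.exp(log_potential(world))
        denominator += mass
        if world[query]:
            numerator += mass
    return numerator / denominator


# Evidence about A changes the belief about B: the rows are not independent.
base = conditional_probability(("Cancer", "B"), {("Friends", "A", "B"): True})
informed = conditional_probability(
    ("Cancer", "B"),
    {("Friends", "A", "B"): True, ("Smokes", "A"): True},
)
print(f"P(Cancer(B) | Friends(A,B))            = {base:.3f}")
print(f"P(Cancer(B) | Friends(A,B), Smokes(A)) = {informed:.3f}")
```

The second probability is higher than the first: learning that A smokes raises the belief that A's friend B smokes, and therefore that B has cancer. A single-table i.i.d. model cannot express this dependence between rows. Because the log-probability of a world is linear in the weights minus a log-partition term, the log-likelihood is concave in the weights, which is the sense in which parameter learning is a convex optimization problem.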



