Machine Learning for Data Management: Problems and Solutions
Pedro M. Domingos
Proceedings of the 2018 International Conference on Management of Data. Published 2018-05-27.
DOI: 10.1145/3183713.3199515 (https://doi.org/10.1145/3183713.3199515)
Citations: 9
Abstract
Machine learning has made great strides in recent years, and its applications are spreading rapidly. Unfortunately, the standard machine learning formulation does not match well with data management problems. For example, most learning algorithms assume that the data is contained in a single table and consists of i.i.d. (independent and identically distributed) samples. This mismatch leads to a proliferation of ad hoc solutions, slow development, and suboptimal results. Fortunately, a body of machine learning theory and practice is being developed that dispenses with such assumptions, and promises to make machine learning for data management much easier and more effective [1]. In particular, representations like Markov logic, which includes many types of deep networks as special cases, allow us to define very rich probability distributions over non-i.i.d., multi-relational data [2]. Despite the generality of these models, learning their parameters is still a convex optimization problem, allowing for efficient solution. Learning structure (in the case of Markov logic, a set of formulas in first-order logic) is intractable, as in more traditional representations, but can be done effectively using inductive logic programming techniques. Inference is performed using probabilistic generalizations of theorem proving, and takes linear time and space in tractable Markov logic, an object-oriented specialization of Markov logic [3]. These techniques have led to state-of-the-art, principled solutions to problems like entity resolution, schema matching, ontology alignment, and information extraction. Using tractable Markov logic, we have extracted from the Web a probabilistic knowledge base with millions of objects and billions of parameters, which can be queried exactly in sub-second times using an RDBMS backend [3]. With these foundations in place, we expect the pace of machine learning applications in data management to continue to accelerate in coming years.
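To make concrete how Markov logic defines a probability distribution over multi-relational, non-i.i.d. data, the following is a minimal sketch and not the system described in the abstract. It assumes the standard Markov logic definition, in which a world x has probability proportional to exp(sum_i w_i * n_i(x)), where n_i(x) is the number of true groundings of formula i. The domain (two people), the predicates Smokes, Cancer, and Friends, the two formulas, and the weights are all hypothetical choices made for illustration. Inference here is brute-force enumeration of all worlds, which is exponential in the number of ground atoms; it stands in for, and is not, the linear-time inference of tractable Markov logic.

```python
import itertools
import math

# Hypothetical toy domain: two people.
PEOPLE = ["anna", "bob"]

# Unknown ground atoms: Smokes(p) and Cancer(p) for each person.
ATOMS = [("Smokes", p) for p in PEOPLE] + [("Cancer", p) for p in PEOPLE]

# Friends(x, y) is treated as fixed, fully observed evidence (an assumption).
FRIENDS = {("anna", "bob"), ("bob", "anna")}

# Illustrative weights for the two formulas below; not taken from the paper.
WEIGHTS = (1.5, 1.1)


def count_true_groundings(world):
    """Return (n1, n2), the number of true groundings of each formula."""
    # Formula 1: Smokes(x) => Cancer(x)
    n1 = sum(
        (not world[("Smokes", p)]) or world[("Cancer", p)]
        for p in PEOPLE
    )
    # Formula 2: Friends(x, y) and Smokes(x) => Smokes(y)
    n2 = sum(
        (not ((x, y) in FRIENDS and world[("Smokes", x)]))
        or world[("Smokes", y)]
        for x in PEOPLE
        for y in PEOPLE
    )
    return n1, n2


def unnormalized(world):
    """exp(sum_i w_i * n_i(world)), the unnormalized probability of a world."""
    counts = count_true_groundings(world)
    return math.exp(sum(w * n for w, n in zip(WEIGHTS, counts)))


def all_worlds():
    """Enumerate every truth assignment to the unknown ground atoms."""
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def probability(query, evidence=None):
    """P(query | evidence) by exhaustive enumeration (exponential cost)."""
    evidence = evidence or {}
    numerator = 0.0
    denominator = 0.0
    for world in all_worlds():
        if any(world[atom] != value for atom, value in evidence.items()):
            continue  # world contradicts the evidence
        score = unnormalized(world)
        denominator += score
        if all(world[atom] == value for atom, value in query.items()):
            numerator += score
    return numerator / denominator


if __name__ == "__main__":
    bob_cancer = {("Cancer", "bob"): True}
    print("P(Cancer(bob)) =", probability(bob_cancer))
    print(
        "P(Cancer(bob) | Smokes(anna)) =",
        probability(bob_cancer, {("Smokes", "anna"): True}),
    )
```

In this sketch, observing Smokes(anna) raises the probability of Cancer(bob) even though the two atoms concern different people: the Friends relation couples them, which is exactly the kind of dependence an i.i.d., single-table formulation cannot express.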