异构提取的合成与机器学习

Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation Pub Date : 2019-06-08 DOI:10.1145/3314221.3322485

Arun Shankar Iyer, Manohar Jonnalagedda, Suresh Parthasarathy, Arjun Radhakrishna, S. Rajamani

{"title":"异构提取的合成与机器学习","authors":"Arun Shankar Iyer, Manohar Jonnalagedda, Suresh Parthasarathy, Arjun Radhakrishna, S. Rajamani","doi":"10.1145/3314221.3322485","DOIUrl":null,"url":null,"abstract":"We present a way to combine techniques from the program synthesis and machine learning communities to extract structured information from heterogeneous data. Such problems arise in several situations such as extracting attributes from web pages, machine-generated emails, or from data obtained from multiple sources. Our goal is to extract a set of structured attributes from such data. We use machine learning models (\"ML models\") such as conditional random fields to get an initial labeling of potential attribute values. However, such models are typically not interpretable, and the noise produced by such models is hard to manage or debug. We use (noisy) labels produced by such ML models as inputs to program synthesis, and generate interpretable programs that cover the input space. We also employ type specifications (called \"field constraints\") to certify well-formedness of extracted values. Using synthesized programs and field constraints, we re-train the ML models with improved confidence on the labels. We then use these improved labels to re-synthesize a better set of programs. We iterate the process of re-synthesizing the programs and re-training the ML models, and find that such an iterative process improves the quality of the extraction process. This iterative approach, called HDEF, is novel, not only the in way it combines the ML models with program synthesis, but also in the way it adapts program synthesis to deal with noise and heterogeneity. More broadly, our approach points to ways by which machine learning and programming language techniques can be combined to get the best of both worlds --- handling noise, transferring signals from one context to another using ML, producing interpretable programs using PL, and minimizing user intervention.","PeriodicalId":441774,"journal":{"name":"Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation","volume":"168 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":"{\"title\":\"Synthesis and machine learning for heterogeneous extraction\",\"authors\":\"Arun Shankar Iyer, Manohar Jonnalagedda, Suresh Parthasarathy, Arjun Radhakrishna, S. Rajamani\",\"doi\":\"10.1145/3314221.3322485\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a way to combine techniques from the program synthesis and machine learning communities to extract structured information from heterogeneous data. Such problems arise in several situations such as extracting attributes from web pages, machine-generated emails, or from data obtained from multiple sources. Our goal is to extract a set of structured attributes from such data. We use machine learning models (\\\"ML models\\\") such as conditional random fields to get an initial labeling of potential attribute values. However, such models are typically not interpretable, and the noise produced by such models is hard to manage or debug. We use (noisy) labels produced by such ML models as inputs to program synthesis, and generate interpretable programs that cover the input space. We also employ type specifications (called \\\"field constraints\\\") to certify well-formedness of extracted values. Using synthesized programs and field constraints, we re-train the ML models with improved confidence on the labels. We then use these improved labels to re-synthesize a better set of programs. We iterate the process of re-synthesizing the programs and re-training the ML models, and find that such an iterative process improves the quality of the extraction process. This iterative approach, called HDEF, is novel, not only the in way it combines the ML models with program synthesis, but also in the way it adapts program synthesis to deal with noise and heterogeneity. More broadly, our approach points to ways by which machine learning and programming language techniques can be combined to get the best of both worlds --- handling noise, transferring signals from one context to another using ML, producing interpretable programs using PL, and minimizing user intervention.\",\"PeriodicalId\":441774,\"journal\":{\"name\":\"Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation\",\"volume\":\"168 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-06-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"17\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3314221.3322485\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3314221.3322485","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 17

摘要

我们提出了一种结合程序合成和机器学习社区的技术来从异构数据中提取结构化信息的方法。在许多情况下都会出现这样的问题，例如从网页、机器生成的电子邮件或从多个来源获得的数据中提取属性。我们的目标是从这些数据中提取一组结构化属性。我们使用机器学习模型(“ML模型”)，例如条件随机场来获得潜在属性值的初始标记。然而，这样的模型通常是不可解释的，并且这种模型产生的噪声很难管理或调试。我们使用由这种ML模型产生的(噪声)标签作为程序合成的输入，并生成覆盖输入空间的可解释程序。我们还使用类型规范(称为“字段约束”)来证明提取值的格式良好性。使用合成程序和字段约束，我们以提高标签置信度的方式重新训练ML模型。然后，我们使用这些改进的标签来重新合成一组更好的程序。我们迭代了重新合成程序和重新训练ML模型的过程，发现这样的迭代过程提高了提取过程的质量。这种被称为HDEF的迭代方法是新颖的，它不仅将ML模型与程序合成结合在一起，而且还使程序合成适应处理噪声和异构性的方式。更广泛地说，我们的方法指出了机器学习和编程语言技术可以结合起来的方法，以获得两个世界的最佳效果——处理噪声，使用ML将信号从一个上下文中传输到另一个上下文中，使用PL生成可解释的程序，并最大限度地减少用户干预。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Synthesis and machine learning for heterogeneous extraction

We present a way to combine techniques from the program synthesis and machine learning communities to extract structured information from heterogeneous data. Such problems arise in several situations such as extracting attributes from web pages, machine-generated emails, or from data obtained from multiple sources. Our goal is to extract a set of structured attributes from such data. We use machine learning models ("ML models") such as conditional random fields to get an initial labeling of potential attribute values. However, such models are typically not interpretable, and the noise produced by such models is hard to manage or debug. We use (noisy) labels produced by such ML models as inputs to program synthesis, and generate interpretable programs that cover the input space. We also employ type specifications (called "field constraints") to certify well-formedness of extracted values. Using synthesized programs and field constraints, we re-train the ML models with improved confidence on the labels. We then use these improved labels to re-synthesize a better set of programs. We iterate the process of re-synthesizing the programs and re-training the ML models, and find that such an iterative process improves the quality of the extraction process. This iterative approach, called HDEF, is novel, not only the in way it combines the ML models with program synthesis, but also in the way it adapts program synthesis to deal with noise and heterogeneity. More broadly, our approach points to ways by which machine learning and programming language techniques can be combined to get the best of both worlds --- handling noise, transferring signals from one context to another using ML, producing interpretable programs using PL, and minimizing user intervention.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation

自引率

0.00%

发文量