Sanitizing Synthetic Training Data Generation for Question Answering over Knowledge Graphs

Trond Linjordet, K. Balog
{"title":"Sanitizing Synthetic Training Data Generation for Question Answering over Knowledge Graphs","authors":"Trond Linjordet, K. Balog","doi":"10.1145/3409256.3409836","DOIUrl":null,"url":null,"abstract":"Synthetic data generation is important to training and evaluating neural models for question answering over knowledge graphs. The quality of the data and the partitioning of the datasets into training, validation and test splits impact the performance of the models trained on this data. If the synthetic data generation depends on templates, as is the predominant approach for this task, there may be a leakage of information via a shared basis of templates across data splits if the partitioning is not performed hygienically. This paper investigates the extent of such information leakage across data splits, and the ability of trained models to generalize to test data when the leakage is controlled. We find that information leakage indeed occurs and that it affects performance. At the same time, the trained models do generalize to test data under the sanitized partitioning presented here. Importantly, these findings extend beyond the particular flavor of question answering task we studied and raise a series of difficult questions around template-based synthetic data generation that will necessitate additional research.","PeriodicalId":430907,"journal":{"name":"Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval","volume":"52 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3409256.3409836","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Synthetic data generation is important to training and evaluating neural models for question answering over knowledge graphs. The quality of the data and the partitioning of the datasets into training, validation and test splits impact the performance of the models trained on this data. If the synthetic data generation depends on templates, as is the predominant approach for this task, there may be a leakage of information via a shared basis of templates across data splits if the partitioning is not performed hygienically. This paper investigates the extent of such information leakage across data splits, and the ability of trained models to generalize to test data when the leakage is controlled. We find that information leakage indeed occurs and that it affects performance. At the same time, the trained models do generalize to test data under the sanitized partitioning presented here. Importantly, these findings extend beyond the particular flavor of question answering task we studied and raise a series of difficult questions around template-based synthetic data generation that will necessitate additional research.
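To make the leakage mechanism concrete, below is a minimal sketch of a sanitized partitioning, not the procedure from the paper itself: whole templates, rather than the individual questions instantiated from them, are assigned to splits, so no template is shared between training, validation, and test data. The example assumes each generated question carries a hypothetical template_id field identifying the template that produced it.

import random
from collections import defaultdict

def sanitized_split(examples, train_frac=0.8, valid_frac=0.1, seed=42):
    # Group generated questions by the template that produced them.
    # "template_id" is an assumed field name, used here for illustration.
    by_template = defaultdict(list)
    for ex in examples:
        by_template[ex["template_id"]].append(ex)

    # Shuffle template identifiers, not individual questions.
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    n_train = int(train_frac * len(template_ids))
    n_valid = int(valid_frac * len(template_ids))
    partition = {
        "train": template_ids[:n_train],
        "valid": template_ids[n_train:n_train + n_valid],
        "test": template_ids[n_train + n_valid:],
    }

    # Each question inherits the split of its template, so the splits
    # share no templates and the leakage channel is closed.
    return {
        name: [ex for tid in ids for ex in by_template[tid]]
        for name, ids in partition.items()
    }

Contrast this with an instance-level split, where questions generated from the same template can land in both training and test data, allowing a model to score well by memorizing surface patterns of the shared templates rather than generalizing.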