On the Use of Optimal Transportation Theory to Recode Variables and Application to Database Merging

IF 1.2 · Zone 4, Mathematics

International Journal of Biostatistics · Pub Date: 2019-09-14 · DOI: 10.1515/ijb-2018-0106

Valérie Garès, C. Dimeglio, G. Guernec, Romain Fantin, B. Lepage, M. Kosorok, N. Savy


Citations: 5

Abstract

Merging databases is a strategy of paramount interest, especially in medical research. A common problem in this context arises when a variable is not coded on the same scale in the two databases to be merged. This paper considers the problem of finding a relevant way to recode the variable so that the two databases can be merged. To address this issue, an algorithm based on optimal transportation theory is proposed. Optimal transportation theory provides a map that sends the measure associated with the variable in database A to the measure associated with the same variable in database B. To do so, a cost function has to be introduced and an allocation rule has to be defined. Such a function and such a rule are proposed, both involving the information contained in the covariates. The method is compared to multiple imputation by chained equations and to a statistical learning method, and it demonstrates better average accuracy in many situations. Applications on both simulated and real datasets show that the efficiency of the proposed merging algorithm depends on how the covariates are linked with the variable of interest.
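The abstract names three ingredients: a transport problem between the two databases, a cost function built from the covariates, and an allocation rule. The following is a minimal sketch of how these pieces could fit together, not the authors' algorithm. It assumes uniform weights on individuals, a squared Euclidean cost between covariate profiles, and a "largest transported mass" allocation rule; the abstract specifies none of these, and the function names `ot_plan` and `recode` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def ot_plan(cost):
    """Solve a discrete optimal transport problem with uniform marginals.

    cost[i, j] is the cost of pairing individual i of database A with
    individual j of database B. Returns the (n_a, n_b) transport plan.
    """
    n_a, n_b = cost.shape
    # Each individual of A sends out a total mass of 1 / n_a.
    row_con = np.kron(np.eye(n_a), np.ones((1, n_b)))
    # Each individual of B receives a total mass of 1 / n_b.
    col_con = np.kron(np.ones((1, n_a)), np.eye(n_b))
    a_eq = np.vstack([row_con, col_con])
    b_eq = np.concatenate([np.full(n_a, 1.0 / n_a), np.full(n_b, 1.0 / n_b)])
    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(n_a, n_b)


def recode(cov_a, cov_b, z_b):
    """Predict, for each individual of A, the variable on B's scale.

    cov_a, cov_b: covariates shared by both databases, shape (n, p).
    z_b: the variable of interest as coded in database B, shape (n_b,).
    """
    # Assumed cost: squared Euclidean distance between covariate profiles.
    diff = cov_a[:, None, :] - cov_b[None, :, :]
    cost = (diff ** 2).sum(axis=2)
    plan = ot_plan(cost)
    levels = np.unique(z_b)
    # Assumed allocation rule: choose the level of B's coding that
    # receives the largest share of the individual's transported mass.
    mass = np.stack([plan[:, z_b == lev].sum(axis=1) for lev in levels], axis=1)
    return levels[mass.argmax(axis=1)]


if __name__ == "__main__":
    # Toy data: one covariate that separates two groups cleanly.
    cov_a = np.array([[0.0], [0.1], [5.0], [5.1]])
    cov_b = np.array([[0.05], [0.0], [5.0], [5.05]])
    z_b = np.array([0, 0, 1, 1])
    print(recode(cov_a, cov_b, z_b))
```

In this sketch the quality of the recoding rests entirely on how well the covariates separate the levels of the variable, which is consistent with the abstract's remark that efficiency depends on the link between the covariates and the variable of interest.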


Source journal

International Journal of Biostatistics Mathematics-Statistics and Probability

CiteScore

2.30

Self-citation rate

8.30%

Articles published

About the journal: The International Journal of Biostatistics (IJB) seeks to publish new biostatistical models and methods, new statistical theory, and original applications of statistical methods for important practical problems arising from the biological, medical, public health, and agricultural sciences, with an emphasis on semiparametric methods. Given that many alternatives for publishing exist within biostatistics, IJB offers a venue for biostatistical research focusing on modern methods, often based on machine learning and other data-adaptive methodologies, and provides a unique reading experience that compels the author to be explicit about the statistical inference problem addressed by the paper. The journal is intended to cover the entire range of biostatistics, from theoretical advances to relevant and sensible translations of a practical problem into a statistical framework. Electronic publication also allows data and software code to be appended, opening the door to reproducible research in which readers can easily replicate the analyses described in a paper. Both original research and review articles will be warmly received, as will articles applying sound statistical methods to practical problems.