G. Guernec, Valérie Garès, J. Omer, Philippe Saint-Pierre, N. Savy
OTrecod: An R Package for Data Fusion using Optimal Transportation Theory
R J., pp. 195-222. Published 2023-02-10. DOI: https://doi.org/10.32614/rj-2023-006
Abstract
Advances in information technology often confront users with large amounts of data that must be integrated easily. In this context, creating a single database from multiple separate data sources is an attractive but complex task when the same information of interest is stored in at least two distinct encodings. In this situation, merging the data sources consists of finding a common recoding scale to fill in the incomplete information in a synthetic database. The OTrecod package provides R users with two functions dedicated to solving this recoding problem using optimal transportation theory. Specific arguments of these functions enrich the algorithms by relaxing distributional constraints or adding a regularization term to make the data fusion more flexible. The OTrecod package also provides a set of support functions dedicated to the harmonization of separate data sources, the handling of incomplete information, and the selection of matching variables. This paper gives all the keys needed to quickly understand and master the original algorithms implemented in the OTrecod package, guiding users step by step through their data fusion projects.
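
To make the recoding idea concrete, here is a minimal toy sketch in Python. It is not the OTrecod algorithm and does not use the package's API. It assumes two hypothetical ordered scales (3 levels in source A, 4 levels in source B) described only by made-up marginal frequencies, whereas the package is described as working with matching variables. The function `monotone_coupling` and all numbers are illustrative assumptions. The north-west corner rule used here yields a transport plan that is known to be optimal when the cost matrix has the Monge property, for example a cost of |i - j| between level ranks.

```python
from fractions import Fraction


def monotone_coupling(p, q):
    """North-west corner rule: build a plan whose row sums are p and column sums are q."""
    p, q = list(p), list(q)  # work on copies; they are consumed below
    if sum(p) != sum(q):
        raise ValueError("both marginals must carry the same total mass")
    plan = [[Fraction(0)] * len(q) for _ in p]
    i = j = 0
    while i < len(p) and j < len(q):
        moved = min(p[i], q[j])  # ship as much mass as both levels allow
        plan[i][j] += moved
        p[i] -= moved
        q[j] -= moved
        if p[i] == 0:
            i += 1  # level i of scale A is fully assigned
        else:
            j += 1  # level j of scale B is saturated
    return plan


# Hypothetical marginal frequencies of the same information in two encodings.
freq_a = [Fraction(5, 10), Fraction(3, 10), Fraction(2, 10)]                   # 3-level scale
freq_b = [Fraction(3, 10), Fraction(2, 10), Fraction(3, 10), Fraction(2, 10)]  # 4-level scale

plan = monotone_coupling(freq_a, freq_b)

# Recode each level of scale A to the level of scale B that receives most of its mass.
recoding = [max(range(len(row)), key=row.__getitem__) for row in plan]
print(recoding)
```

Each row of `plan`, once divided by its row sum, can be read as a conditional distribution of the B-scale level given the A-scale level. Taking the most likely level is only one possible way to turn that distribution into a recoding rule.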