Discovering Data Source Stability Patterns in Biomedical Repositories Based on Simplicial Projections from Probability Distribution Distances
Authors: Pablo Ferri Borreda, C. Sáez, J. M. García-Gómez
Venue: 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS)
Published: 2017-06-01
DOI: 10.1109/CBMS.2017.153 (https://doi.org/10.1109/CBMS.2017.153)
Citation count: 0
Abstract
The degree of homogeneity of statistical distributions among data sources is a critical issue when reusing data from Integrated Data Repositories (IDRs). Evaluating this data source stability is essential to ensure that data can be reused with confidence. This work tackles the task of discovering and classifying patterns among the statistical distributions of multiple sources in IDRs by means of a novel approach based on simplicial projections from probability distribution distances, combined with density-based spatial clustering of applications with noise (DBSCAN). Results on 20 public repositories support the existence of four main data source stability patterns in biomedical repositories: the global stability pattern (GSP), the local stability pattern (LSP), the sparse stability pattern (SSP), and the instability pattern (IP).
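
The core idea of comparing sources through distances between their probability distributions and then grouping them with DBSCAN can be illustrated with a minimal sketch. Several things here are assumptions and do not come from the abstract: the Jensen-Shannon distance is used as the distribution distance (the abstract does not name one), the toy histograms and the `eps` and `min_pts` values are purely illustrative, and the simplicial projection step is left out, so the sketch clusters directly on the pairwise distance matrix. It is a simplification, not the authors' pipeline.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base-2 logs).
    Assumed here as the distribution distance; the abstract does not specify one."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))

def distance_matrix(dists):
    """Pairwise distances between the distributions of all sources."""
    n = len(dists)
    return [[js_distance(dists[i], dists[j]) for j in range(n)] for i in range(n)]

def dbscan(dm, eps, min_pts):
    """Minimal DBSCAN on a precomputed distance matrix; label -1 marks noise."""
    n = len(dm)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dm[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = [k for k in range(n) if dm[j][k] <= eps]
            if len(nb) >= min_pts:  # only core points expand the cluster
                queue.extend(nb)
    return labels

# Hypothetical per-source histograms over the same three categories.
sources = {
    "A": [0.50, 0.30, 0.20],
    "B": [0.48, 0.32, 0.20],
    "C": [0.52, 0.28, 0.20],
    "D": [0.05, 0.05, 0.90],
}
dm = distance_matrix(list(sources.values()))
labels = dbscan(dm, eps=0.1, min_pts=2)  # illustrative parameter values
print(dict(zip(sources, labels)))
```

In this toy input, sources A, B and C have nearly identical histograms and end up in one cluster, while D is far from all of them and is marked as noise. How such cluster structures map onto the four stability patterns (GSP, LSP, SSP, IP) is defined in the paper itself and is not reproduced here.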