Shuo Xiang, Lei Yuan, Wei Fan, Yalin Wang, P. Thompson, Jieping Ye
{"title":"基于块缺失数据的多源学习用于阿尔茨海默病预测","authors":"Shuo Xiang, Lei Yuan, Wei Fan, Yalin Wang, P. Thompson, Jieping Ye","doi":"10.1145/2487575.2487594","DOIUrl":null,"url":null,"abstract":"With the advances and increasing sophistication in data collection techniques, we are facing with large amounts of data collected from multiple heterogeneous sources in many applications. For example, in the study of Alzheimer's Disease (AD), different types of measurements such as neuroimages, gene/protein expression data, genetic data etc. are often collected and analyzed together for improved predictive power. It is believed that a joint learning of multiple data sources is beneficial as different data sources may contain complementary information, and feature-pruning and data source selection are critical for learning interpretable models from high-dimensional data. Very often the collected data comes with block-wise missing entries; for example, a patient without the MRI scan will have no information in the MRI data block, making his/her overall record incomplete. There has been a growing interest in the data mining community on expanding traditional techniques for single-source complete data analysis to the study of multi-source incomplete data. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of block-wise missing data. In this paper we first investigate the situation of complete data and present a unified ``bi-level\" learning model for multi-source data. Then we give a natural extension of this model to the more challenging case with incomplete data. Our major contributions are threefold: (1) the proposed models handle both feature-level and source-level analysis in a unified formulation and include several existing feature learning approaches as special cases; (2) the model for incomplete data avoids direct imputation of the missing elements and thus provides superior performances. Moreover, it can be easily generalized to other applications with block-wise missing data sources; (3) efficient optimization algorithms are presented for both the complete and incomplete models. We have performed comprehensive evaluations of the proposed models on the application of AD diagnosis. Our proposed models compare favorably against existing approaches.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"75","resultStr":"{\"title\":\"Multi-source learning with block-wise missing data for Alzheimer's disease prediction\",\"authors\":\"Shuo Xiang, Lei Yuan, Wei Fan, Yalin Wang, P. Thompson, Jieping Ye\",\"doi\":\"10.1145/2487575.2487594\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the advances and increasing sophistication in data collection techniques, we are facing with large amounts of data collected from multiple heterogeneous sources in many applications. For example, in the study of Alzheimer's Disease (AD), different types of measurements such as neuroimages, gene/protein expression data, genetic data etc. are often collected and analyzed together for improved predictive power. It is believed that a joint learning of multiple data sources is beneficial as different data sources may contain complementary information, and feature-pruning and data source selection are critical for learning interpretable models from high-dimensional data. Very often the collected data comes with block-wise missing entries; for example, a patient without the MRI scan will have no information in the MRI data block, making his/her overall record incomplete. There has been a growing interest in the data mining community on expanding traditional techniques for single-source complete data analysis to the study of multi-source incomplete data. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of block-wise missing data. In this paper we first investigate the situation of complete data and present a unified ``bi-level\\\" learning model for multi-source data. Then we give a natural extension of this model to the more challenging case with incomplete data. Our major contributions are threefold: (1) the proposed models handle both feature-level and source-level analysis in a unified formulation and include several existing feature learning approaches as special cases; (2) the model for incomplete data avoids direct imputation of the missing elements and thus provides superior performances. Moreover, it can be easily generalized to other applications with block-wise missing data sources; (3) efficient optimization algorithms are presented for both the complete and incomplete models. We have performed comprehensive evaluations of the proposed models on the application of AD diagnosis. Our proposed models compare favorably against existing approaches.\",\"PeriodicalId\":20472,\"journal\":{\"name\":\"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-08-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"75\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2487575.2487594\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2487575.2487594","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multi-source learning with block-wise missing data for Alzheimer's disease prediction
With the advances and increasing sophistication in data collection techniques, we are facing with large amounts of data collected from multiple heterogeneous sources in many applications. For example, in the study of Alzheimer's Disease (AD), different types of measurements such as neuroimages, gene/protein expression data, genetic data etc. are often collected and analyzed together for improved predictive power. It is believed that a joint learning of multiple data sources is beneficial as different data sources may contain complementary information, and feature-pruning and data source selection are critical for learning interpretable models from high-dimensional data. Very often the collected data comes with block-wise missing entries; for example, a patient without the MRI scan will have no information in the MRI data block, making his/her overall record incomplete. There has been a growing interest in the data mining community on expanding traditional techniques for single-source complete data analysis to the study of multi-source incomplete data. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of block-wise missing data. In this paper we first investigate the situation of complete data and present a unified ``bi-level" learning model for multi-source data. Then we give a natural extension of this model to the more challenging case with incomplete data. Our major contributions are threefold: (1) the proposed models handle both feature-level and source-level analysis in a unified formulation and include several existing feature learning approaches as special cases; (2) the model for incomplete data avoids direct imputation of the missing elements and thus provides superior performances. Moreover, it can be easily generalized to other applications with block-wise missing data sources; (3) efficient optimization algorithms are presented for both the complete and incomplete models. We have performed comprehensive evaluations of the proposed models on the application of AD diagnosis. Our proposed models compare favorably against existing approaches.