Multi-source learning with block-wise missing data for Alzheimer's disease prediction

Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pub Date: 2013-08-11. DOI: 10.1145/2487575.2487594

Shuo Xiang, Lei Yuan, Wei Fan, Yalin Wang, P. Thompson, Jieping Ye


Citations: 75

Abstract

With advances in data collection techniques and their increasing sophistication, we are faced with large amounts of data collected from multiple heterogeneous sources in many applications. For example, in the study of Alzheimer's disease (AD), different types of measurements, such as neuroimages, gene/protein expression data, and genetic data, are often collected and analyzed together for improved predictive power. Joint learning from multiple data sources is believed to be beneficial because different sources may contain complementary information, and feature pruning and data source selection are critical for learning interpretable models from high-dimensional data. Very often the collected data come with block-wise missing entries; for example, a patient without an MRI scan will have no information in the MRI data block, making his or her overall record incomplete. There has been growing interest in the data mining community in extending traditional techniques for single-source complete data analysis to the study of multi-source incomplete data. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of block-wise missing data. In this paper we first investigate the case of complete data and present a unified "bi-level" learning model for multi-source data. We then give a natural extension of this model to the more challenging case of incomplete data. Our major contributions are threefold: (1) the proposed models handle both feature-level and source-level analysis in a unified formulation and include several existing feature learning approaches as special cases; (2) the model for incomplete data avoids direct imputation of the missing elements and thus provides superior performance; moreover, it can be easily generalized to other applications with block-wise missing data sources; (3) efficient optimization algorithms are presented for both the complete and incomplete models. We have performed comprehensive evaluations of the proposed models on the application of AD diagnosis. Our proposed models compare favorably against existing approaches.
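
To make the notion of block-wise missing data concrete, the following is a minimal illustrative sketch, not the authors' model or algorithm. It assumes missing source blocks are encoded as NaN in a single feature matrix, and it groups subjects by which sources they have fully observed, so that each group can be handled without imputing the absent blocks. The function name `group_by_availability`, the two source names, and the toy values are all hypothetical.

```python
import numpy as np


def group_by_availability(X, blocks):
    """Group sample indices by the set of source blocks that are fully observed.

    X      : 2-D array, one row per subject, NaN marks a missing entry (assumption).
    blocks : dict mapping a source name to the column slice it occupies.
    """
    groups = {}
    for i, row in enumerate(X):
        # A source counts as available only if none of its columns are NaN.
        pattern = tuple(
            name for name, cols in blocks.items()
            if not np.isnan(row[cols]).any()
        )
        groups.setdefault(pattern, []).append(i)
    return groups


# Toy data: 4 subjects, two hypothetical sources (MRI: columns 0-1, PET: columns 2-3).
X = np.array([
    [1.0, 2.0, 3.0, 4.0],        # both sources present
    [1.5, 2.5, np.nan, np.nan],  # PET block missing
    [np.nan, np.nan, 0.3, 0.7],  # MRI block missing
    [0.9, 1.1, 2.2, 3.3],        # both sources present
])
blocks = {"MRI": slice(0, 2), "PET": slice(2, 4)}

groups = group_by_availability(X, blocks)
for pattern, idx in groups.items():
    print(pattern, idx)
```

Each availability pattern identifies a subset of subjects with a complete sub-matrix over the listed sources. This partitioning only illustrates the data structure the abstract describes; how the paper's incomplete-data model actually combines such subsets is not specified in the abstract.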


