Composite large margin classifiers with latent subclasses for heterogeneous biomedical data.

IF 3.6; Zone 4 (Mathematics); JCR Q3; Computer Science, Artificial Intelligence

Statistical Analysis and Data Mining. Pub Date: 2016-04-01. Epub Date: 2016-01-08. DOI: 10.1002/sam.11300

Guanhua Chen, Yufeng Liu, Dinggang Shen, Michael R Kosorok

Volume 9, Issue 2, pages 75-88. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4912001/pdf/nihms737408.pdf


Abstract

High dimensional classification problems are prevalent in a wide range of modern scientific applications. Despite a large number of candidate classification techniques available to use, practitioners often face a dilemma of choosing between linear and general nonlinear classifiers. Specifically, simple linear classifiers have good interpretability, but may have limitations in handling data with complex structures. In contrast, general nonlinear classifiers are more flexible, but may lose interpretability and have higher tendency for overfitting. In this paper, we consider data with potential latent subgroups in the classes of interest. We propose a new method, namely the Composite Large Margin Classifier (CLM), to address the issue of classification with latent subclasses. The CLM aims to find three linear functions simultaneously: one linear function to split the data into two parts, with each part being classified by a different linear classifier. Our method has comparable prediction accuracy to a general nonlinear classifier, and it maintains the interpretability of traditional linear classifiers. We demonstrate the competitive performance of the CLM through comparisons with several existing linear and nonlinear classifiers by Monte Carlo experiments. Analysis of the Alzheimer's disease classification problem using CLM not only provides a lower classification error in discriminating cases and controls, but also identifies subclasses in controls that are more likely to develop the disease in the future.
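The composite decision rule described in the abstract (one linear function routes each observation to one of two linear classifiers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `clm_predict`, the hard routing by the sign of the splitting function, and the hand-picked weights are all assumptions made for illustration, and no fitting procedure is shown.

```python
import numpy as np


def clm_predict(X, split, clf_pos, clf_neg):
    """Composite prediction from three linear functions (illustrative only).

    Each of split, clf_pos, clf_neg is a hypothetical (w, b) pair defining
    the linear function w.x + b. Assumption: an observation is routed to
    clf_pos when split(x) >= 0 and to clf_neg otherwise.
    """
    X = np.asarray(X, dtype=float)

    def linear(params):
        w, b = params
        return X @ np.asarray(w, dtype=float) + b

    # The splitting function decides which linear classifier applies.
    gate = linear(split) >= 0
    scores = np.where(gate, linear(clf_pos), linear(clf_neg))
    # Labels are coded as +1 / -1.
    return np.where(scores >= 0, 1, -1)


# XOR-style data: no single linear classifier separates these labels,
# but a split on the first coordinate plus two linear classifiers does.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
split = ([1, 0], 0.0)     # route on the sign of x1
clf_pos = ([0, 1], 0.0)   # where x1 >= 0: label follows the sign of x2
clf_neg = ([0, -1], 0.0)  # where x1 < 0: label follows the sign of -x2

labels = clm_predict(X, split, clf_pos, clf_neg).tolist()
# labels == [1, 1, -1, -1]
```

Each of the three functions remains an ordinary weight vector that can be inspected directly, which is the sense in which such a composite rule keeps the interpretability of linear classifiers while representing a nonlinear boundary.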




Source journal

Statistical Analysis and Data Mining. Categories: Computer Science, Artificial Intelligence; Computer Science, Interdisciplinary Applications

CiteScore: 3.20

Self-citation rate: 7.70%

Journal description: Statistical Analysis and Data Mining addresses the broad area of data analysis, including statistical approaches, machine learning, data mining, and applications. Topics include statistical and computational approaches for analyzing massive and complex datasets, novel statistical and/or machine learning methods and theory, and state-of-the-art applications with high impact. Of special interest are articles that describe innovative analytical techniques and discuss their application to real problems in a way that is accessible and beneficial to domain experts across science, engineering, and commerce. The journal focuses on papers that satisfy one or more of the following criteria: (1) solve data analysis problems associated with massive, complex datasets; (2) develop innovative statistical approaches, machine learning algorithms, or methods integrating ideas across disciplines, e.g., statistics, computer science, electrical engineering, operations research; (3) formulate and solve high-impact real-world problems that challenge existing paradigms via new statistical and/or computational models; (4) provide surveys of prominent research topics.