Keynote 3: Random forests for biomedical data classification

Conference proceedings. IEEE International Conference on Signal and Image Processing Applications. Pub Date: 2017-09-01. DOI: 10.1109/ICSIPA.2017.8120567

L. Heutte

Citations: 2

Abstract

Keynote 3: Random forests for biomedical data classification

Learning robust machine learning models remains a challenge when classifying biomedical data. Random Forests (RF) have been widely adopted in this field because they cope with high dimensionality, low sample size, and imbalanced classes. An RF is a classifier ensemble in which randomization is used to produce a diverse pool of tree-based classifiers. Since their introduction by Leo Breiman in 2001, RF have been studied extensively, both theoretically and experimentally, and have shown performance competitive with state-of-the-art classifiers. However, only a few studies have addressed the choice of the hyper-parameters and its influence on RF performance.

This talk will first address our attempts to better understand and explain the performance of RF through their hyper-parameters. These attempts have led us to propose variants of RF, namely Forest-RK and Dynamic Random Forests, that are less sensitive to the parametrization, a choice that is sometimes critical for generalization performance.

In the second part, I will illustrate the use of RF in two medical applications: the classification of endomicroscopic images of the lungs, and cancer stage/patient prediction with Radiomics, a domain that is attracting increasing attention. When dealing with medical data, it can happen that only data from one class (e.g. healthy patients) are available for training. This is typically the case for endomicroscopic images of the lungs. We have proposed an original approach to deal with outliers in medical image classification, namely One Class Random Forests, which has been shown to be effective for our problem and competitive with other state-of-the-art one-class classifiers.
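To make the role of randomization and hyper-parameters more concrete, here is a minimal sketch of a randomized ensemble, not the method presented in the talk. It assumes one-level trees (decision stumps) in place of full trees, and the names `fit_stump`, `fit_forest`, `predict`, `n_trees` and `k` are illustrative only. Each stump is trained on a bootstrap sample and may only split on `k` randomly drawn features; `n_trees` and `k` are the kind of hyper-parameters whose choice the abstract refers to.

```python
import random
from collections import Counter


def fit_stump(X, y, features):
    """Find the single-threshold split with the fewest training errors."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in features:
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, thr, left_label, right_label)
    if best is None:
        # Every candidate feature is constant: always predict the majority class.
        majority = Counter(y).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]


def fit_forest(X, y, n_trees=25, k=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and k random features."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: draw n rows with replacement
        features = rng.sample(range(d), k)          # random feature subset of size k
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict(forest, row):
    """Majority vote over the stumps."""
    votes = Counter(left if row[f] <= thr else right
                    for f, thr, left, right in forest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): class 0 has small values, class 1 large ones.
X = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.4], [0.3, 0.4, 0.1], [0.4, 0.3, 0.2],
     [0.6, 0.7, 0.8], [0.7, 0.6, 0.9], [0.8, 0.9, 0.6], [0.9, 0.8, 0.7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y, n_trees=25, k=1, seed=0)
print(predict(forest, [0.0, 0.0, 0.0]), predict(forest, [1.0, 1.0, 1.0]))
```

On real data, changing `k` alters how diverse and how individually accurate the trees are, which is one reason the parametrization can matter for generalization.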
The second application of RF is Radiomics, a recent concept (2012) that refers to the analysis of a large number of quantitative tumor features, extracted from multimodal medical images, together with other information such as clinical data and gene or protein data, to predict the patient's evolution and/or survival rate. In this case, the data are both high-dimensional and heterogeneous. As part of ongoing work, we have proposed a dissimilarity-based multi-view learning model with random forests, in which each data view (or group of features) is processed separately so that the data dimension is smaller in each view. By combining the different views, we can take advantage of the heterogeneity between views while avoiding the conventional feature selection methods otherwise used to reduce the high dimensionality of the data.
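The multi-view idea can be sketched as follows, under stated assumptions rather than as the authors' exact model. The sketch assumes that a forest has already been trained on each view and that, for every sample, the leaf it reaches in each tree is known (the `leaves` inputs below are made-up values). It uses the common random-forest dissimilarity, one minus the fraction of trees in which two samples share a leaf, and it combines views by plain averaging, which is an assumption since the abstract does not specify the combination rule. The function names are illustrative.

```python
def rf_dissimilarity(leaves):
    """leaves[i][t] is the leaf reached by sample i in tree t.

    Returns an n x n matrix where entry (i, j) is
    1 - (fraction of trees in which samples i and j share a leaf).
    """
    n, n_trees = len(leaves), len(leaves[0])
    return [[1.0 - sum(a == b for a, b in zip(leaves[i], leaves[j])) / n_trees
             for j in range(n)]
            for i in range(n)]


def combine_views(matrices):
    """Average per-view dissimilarity matrices into one joint matrix (assumed rule)."""
    n, n_views = len(matrices[0]), len(matrices)
    return [[sum(m[i][j] for m in matrices) / n_views for j in range(n)]
            for i in range(n)]


# Hypothetical leaf assignments for 4 samples, 2 trees per view.
view_a = [[0, 0], [0, 1], [1, 2], [1, 2]]  # e.g. a forest trained on imaging features
view_b = [[0, 0], [0, 0], [1, 0], [1, 1]]  # e.g. a forest trained on clinical features

joint = combine_views([rf_dissimilarity(view_a), rf_dissimilarity(view_b)])
for row in joint:
    print(row)
```

Each forest only ever sees the features of its own view, so no view has to be reduced by feature selection before training; the rows of the joint matrix could then serve as a dissimilarity representation for a final classifier.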
