Keynote 3: Random forests for biomedical data classification
Author: L. Heutte
Published in: Conference proceedings. IEEE International Conference on Signal and Image Processing Applications, volume 41 1, page ix, 2017-09-01
DOI: 10.1109/ICSIPA.2017.8120567 (https://doi.org/10.1109/ICSIPA.2017.8120567)
Learning robust machine learning models is still a challenge when classifying biomedical data. To cope with high dimensionality, low sample size, and imbalanced classes, Random Forests (RF) have been widely adopted in this field. An RF is a classifier ensemble in which randomization is used to produce a diverse pool of tree-based classifiers. Since their introduction by Leo Breiman in 2001, RF have been studied extensively, both theoretically and experimentally, and have shown performance competitive with state-of-the-art classifiers. However, only a few studies have addressed the choice of the hyper-parameters and its influence on RF performance.

This talk will first present our attempts to better understand and explain the performance of RF through their hyper-parameters. This work has led us to propose variants of RF, namely Forest-RK and Dynamic Random Forests, that are less sensitive to the parametrization, a choice that is sometimes critical for generalization performance.

In the second part, I will illustrate the use of RF in two medical applications: the classification of endomicroscopic images of the lungs, and cancer stage/patient prediction with Radiomics, a domain that is attracting increasing attention. With medical data, it may happen that only data from one class (e.g., healthy patients) are available for training. This is typically the case for endomicroscopic images of the lungs, and we have proposed an original approach to dealing with outliers in medical image classification, namely One Class Random Forests, which has been shown to be effective for our problem and competitive with other state-of-the-art one-class classifiers.

The second application of RF is Radiomics, a recent concept (2012) that refers to the analysis of large numbers of quantitative tumor features, extracted from multimodal medical images and from other information such as clinical data and gene or protein data, in order to predict the patient's evolution and/or survival rate. In this case, the data are both high-dimensional and heterogeneous. As part of ongoing work, we have proposed a dissimilarity-based multi-view learning model with random forests, in which each data view (or group of features) is processed separately so that the data dimension is smaller in each view. By combining the different views, we can take advantage of the heterogeneity between views while avoiding conventional feature selection methods for reducing the high dimensionality of the data.
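
One way to picture the dissimilarity-based multi-view idea described above is the minimal sketch below. It is not the model presented in the talk. The tiny pure-Python forest, the use of shared leaves as a proximity measure, the plain averaging across views, the synthetic two-view data, and all names (`fit_forest`, `dissimilarity`, `joint_dissimilarity`) are assumptions made purely for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (an empty list yields 1.0)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, idx, rng, n_try, depth=0, max_depth=4):
    """Grow one randomized tree on the rows in idx; None marks a leaf."""
    if depth == max_depth or len({y[i] for i in idx}) <= 1:
        return None
    best = None
    for f in rng.sample(range(len(X[0])), n_try):  # random feature subset per node
        values = sorted({X[i][f] for i in idx})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [i for i in idx if X[i][f] <= t]
            right = [i for i in idx if X[i][f] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no usable split among the sampled features
        return None
    _, f, t, left, right = best
    return (f, t,
            build_tree(X, y, left, rng, n_try, depth + 1, max_depth),
            build_tree(X, y, right, rng, n_try, depth + 1, max_depth))


def fit_forest(X, y, n_trees, rng):
    """Bagging: every tree is grown on its own bootstrap sample of the rows."""
    n, n_try = len(X), max(1, int(len(X[0]) ** 0.5))
    return [build_tree(X, y, [rng.randrange(n) for _ in range(n)], rng, n_try)
            for _ in range(n_trees)]


def leaf_of(tree, x):
    """Identify the leaf reached by x through its left/right path."""
    path = ""
    while tree is not None:
        f, t, left, right = tree
        tree, path = (left, path + "L") if x[f] <= t else (right, path + "R")
    return path


def dissimilarity(forest, rows):
    """1 minus the fraction of trees in which two rows share a leaf."""
    leaves = [[leaf_of(tree, x) for tree in forest] for x in rows]
    return [[1.0 - sum(p == q for p, q in zip(a, b)) / len(forest)
             for b in leaves] for a in leaves]


def joint_dissimilarity(X, y, views, n_trees=25, seed=0):
    """Fit one forest per view and average the per-view dissimilarity matrices."""
    rng = random.Random(seed)
    n = len(X)
    total = [[0.0] * n for _ in range(n)]
    for cols in views:
        Xv = [[row[c] for c in cols] for row in X]  # a view only sees its own columns
        D = dissimilarity(fit_forest(Xv, y, n_trees, rng), Xv)
        for i in range(n):
            for j in range(n):
                total[i][j] += D[i][j] / len(views)
    return total


# Synthetic data (hypothetical): 30 samples, 2 classes, 6 features in two views.
data_rng = random.Random(1)
y = [i % 2 for i in range(30)]
X = [[data_rng.gauss(label, 1.0) for _ in range(6)] for label in y]
views = [[0, 1, 2], [3, 4, 5]]

D = joint_dissimilarity(X, y, views)
print(len(D), len(D[0]))
```

In this sketch each forest only ever sees the columns of its own view, so no feature selection step is needed, and the views are merged only at the level of the dissimilarity matrix. That matrix could then serve as the input representation for a final classifier; the abstract does not say which one the proposed model uses.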