Feature importance-based interpretation of UMAP-visualized polymer space.

IF 2.8 · Zone 4, Medicine · Q3 CHEMISTRY, MEDICINAL

Molecular Informatics Pub Date : 2023-08-01 Epub Date: 2023-06-16 DOI:10.1002/minf.202300061

Takuya Ehiro

Molecular Informatics, volume 42, issue 8-9, article e2300061 · Journal Article · https://doi.org/10.1002/minf.202300061

Citations: 0

Abstract


Feature importance-based interpretation of UMAP-visualized polymer space.

Dimensionality reduction (DR) techniques are used for various purposes, such as exploratory data analysis. Principal component analysis (PCA) is one of the most popular linear DR methods. Owing to its linear nature, PCA allows the axes of the low-dimensional space to be determined and the corresponding loading vectors to be calculated. However, PCA cannot necessarily extract the important features of non-linearly distributed data. This study presents a technique to aid the interpretation of data reduced by non-linear DR methods. In the proposed method, the non-linearly dimensionally reduced data were clustered with a density-based clustering method, and random forest (RF) classifiers were then trained to predict the resulting cluster labels. The feature importance (FI) of the RF classifiers, together with Spearman's rank correlation coefficients between the predicted cluster probabilities and the original feature values, was used to characterize the visualized low-dimensional data. The method produced interpretable FI-based images for the handwritten digits dataset and was also applied to a polymer dataset. Incorporating signed FI proved advantageous for achieving a meaningful interpretation. Gaussian process regression was used to produce intuitive FI-based heatmaps on the 2-dimensional space. To enhance the interpretability of the obtained clusters, the Boruta feature selection technique was applied; it was effective for interpreting the clusters using a limited set of commonly important features. The study also suggested that computing FI solely from substructure-based descriptors could further enhance interpretability. Finally, automation of the proposed method was investigated: by maximizing a target score based on the quality of both the DR and the clustering, indicative results were obtained automatically for both the handwritten digits and polymer datasets.
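The interpretation step described in the abstract (RF classifiers on cluster labels, FI, and Spearman correlation with predicted cluster probabilities) can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: it assumes the DR step (e.g. UMAP) and the density-based clustering have already produced integer cluster labels with `-1` marking noise, it trains one one-vs-rest classifier per cluster, and it defines signed FI as the impurity-based FI multiplied by the sign of the Spearman coefficient. The function name `signed_feature_importance` and the synthetic demo data are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier


def signed_feature_importance(X, labels, random_state=0):
    """Return {cluster: signed FI vector} for each non-noise cluster.

    Assumption (not confirmed by the abstract): signed FI is the RF
    impurity-based importance times the sign of Spearman's rho between
    the predicted cluster probability and each original feature.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    result = {}
    for cluster in np.unique(labels):
        if cluster == -1:  # common convention for noise in density-based clustering
            continue
        y = (labels == cluster).astype(int)
        if y.all():  # only one cluster present: nothing to discriminate
            continue
        rf = RandomForestClassifier(n_estimators=200, random_state=random_state)
        rf.fit(X, y)
        proba = rf.predict_proba(X)[:, 1]  # probability of belonging to this cluster
        signs = np.zeros(X.shape[1])
        for j in range(X.shape[1]):
            if np.ptp(X[:, j]) > 0:  # skip constant features
                rho = spearmanr(proba, X[:, j])[0]
                signs[j] = 0.0 if np.isnan(rho) else np.sign(rho)
        result[int(cluster)] = rf.feature_importances_ * signs
    return result


# Synthetic stand-in for "original features + cluster labels":
# cluster 1 is shifted upward along feature 0 only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
labels_demo = np.repeat([0, 1], 100)
X_demo[labels_demo == 1, 0] += 4.0

signed_fi = signed_feature_importance(X_demo, labels_demo)
```

In this sketch, a positive entry means that higher values of the feature go with higher predicted membership in the cluster, and a negative entry means the opposite; the magnitude is the ordinary RF importance.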


Source journal

Molecular Informatics (CHEMISTRY, MEDICINAL; MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 7.30

Self-citation rate: 2.80%

Articles published:

Review time: 3 months

About the journal: Molecular Informatics is a peer-reviewed, international forum for publication of high-quality, interdisciplinary research on all molecular aspects of bio/cheminformatics and computer-assisted molecular design. Molecular Informatics succeeded QSAR & Combinatorial Science in 2010. Molecular Informatics presents methodological innovations that will lead to a deeper understanding of ligand-receptor interactions, macromolecular complexes, molecular networks, design concepts and processes that demonstrate how ideas and design concepts lead to molecules with a desired structure or function, preferably including experimental validation. The journal's scope includes but is not limited to the fields of drug discovery and chemical biology, protein and nucleic acid engineering and design, the design of nanomolecular structures, strategies for modeling of macromolecular assemblies, molecular networks and systems, pharmaco- and chemogenomics, computer-assisted screening strategies, as well as novel technologies for the de novo design of biologically active molecules. As a unique feature, Molecular Informatics publishes so-called "Methods Corner" review-type articles, which feature important technological concepts and advances within the scope of the journal.