Usage of Topic Modeling Method for High Dimensional Gene Expression Data Analysis

S. Senadheera, A. Weerasinghe
2021 6th International Conference on Information Technology Research (ICITR), published 2021-12-01
DOI: 10.1109/ICITR54349.2021.9657380 (https://doi.org/10.1109/ICITR54349.2021.9657380)

Abstract

Gene expression data analysis is a major area in the interpretation of biological systems. Because gene expression data have large numbers of variables, high dimensional clustering methods are required for analysis. The objectives of this study were to assess the effectiveness of different clustering methods in gene expression data analysis based on biological relatedness, and to examine the advantages and disadvantages of different clustering strategies in gene expression analysis. The data were obtained from the GSE19830 dataset and brain tumor data from the TCGA project. Hard, hierarchical, and fuzzy clustering were tested with the K-means algorithm, HClust, and topic modeling, respectively. Prior knowledge about the dataset was required to define the number of clusters (K). First, the GSE19830 dataset (a mixture of brain, lung, and liver tissue) was used to develop the clusters. All models produced clusters in close agreement with the physical tags in the dataset. Second, the clustering methods were applied to the brain tumor dataset of 202 samples (four physically categorized tumor types). When similar tissues were analyzed, the gene expression tumor subtypes (clusters) found by hierarchical clustering and topic modeling did not align with the physical categorization. Finally, 81 cancer genes were filtered and used to generate a topic model. To understand the biological relevance of the final model, the Reactome and PCViz tools were used. The Reactome results supported the topics produced by topic modeling. According to the results, topic modeling was found to be a promising approach for gene expression based clustering in high dimensional data analysis, while K-means was found to be inappropriate for gene clustering.
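The contrast the abstract draws between hard and fuzzy clustering can be illustrated with a small sketch. It is not the authors' pipeline and it is not a topic model: the soft weights below come from the standard fuzzy c-means membership formula, used here only as a simple stand-in for soft assignment. The three-gene tissue signatures, the 70/30 mixed sample, and all function names are made-up assumptions for illustration.

```python
import math

# Hypothetical 3-gene expression signatures for three tissues (made-up numbers).
CENTROIDS = {
    "brain": [9.0, 1.0, 1.0],
    "lung": [1.0, 9.0, 1.0],
    "liver": [1.0, 1.0, 9.0],
}


def distance(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hard_assign(sample, centroids):
    """K-means-style assignment: the single nearest centroid wins."""
    return min(centroids, key=lambda name: distance(sample, centroids[name]))


def fuzzy_memberships(sample, centroids, m=2.0):
    """Fuzzy c-means membership degrees for one sample; they sum to 1."""
    dists = {name: distance(sample, c) for name, c in centroids.items()}
    # A sample sitting exactly on a centroid belongs fully to it.
    for name, d in dists.items():
        if d == 0.0:
            return {n: 1.0 if n == name else 0.0 for n in dists}
    power = 2.0 / (m - 1.0)
    return {
        name: 1.0 / sum((d / other) ** power for other in dists.values())
        for name, d in dists.items()
    }


# A made-up mixed sample: 70% brain signature plus 30% liver signature.
mixed = [0.7 * b + 0.3 * l for b, l in zip(CENTROIDS["brain"], CENTROIDS["liver"])]

label = hard_assign(mixed, CENTROIDS)
weights = fuzzy_memberships(mixed, CENTROIDS)

print("hard label:", label)
print("soft weights:", {name: round(w, 3) for name, w in weights.items()})
```

The hard label keeps only the dominant tissue, while the soft weights still show that liver contributes more than lung. For mixed or closely related samples, that retained information is what a soft method offers over a single cluster label.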