Latent Dirichlet Allocation for Classification using Gene Expression Data

H. Yalamanchili, S. Kho, M. Raymer
{"title":"Latent Dirichlet Allocation for Classification using Gene Expression Data","authors":"H. Yalamanchili, S. Kho, M. Raymer","doi":"10.1109/BIBE.2017.00-81","DOIUrl":null,"url":null,"abstract":"Understanding the role of differential gene expression in the development of, and molecular response to, cancer is a complex problem that remains challenging, in part due to the sheer number of genes, gene products, and metabolites involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA) to explore patterns of gene expression in healthy and cancer tissues. An important advantage of LDA compared to alternative statistical and machine learning methods is its proven ability to handle sparse inputs over an extremely large numbers of features in an unsupervised manner. LDA has been recently applied for clustering and exploring genomic data but not for classification and prediction. In this paper, we try to optimize the protocol and parameters for efficient implementation of LDA. Here, messenger RNA (mRNA) sequence data from breast cancer and healthy tissue is used to determine an effective approach for the application of LDA to classification of cancer versus healthy tissue. We describe our study in two phases: First, various parameters like the number of topics, bins and passes were optimized for LDA. Next we developed a novel LDA-based classification approach to classify unknown samples based on similarity of co-expression patterns. Evaluation to assess the effectiveness of this approach shows that LDA can achieve high accuracy compared to alternative approaches. Overall, our results project LDA as a promising approach for classification of tissue types based on gene expression data in cancer studies.","PeriodicalId":262603,"journal":{"name":"2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BIBE.2017.00-81","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9

Abstract

Understanding the role of differential gene expression in the development of, and molecular response to, cancer is a complex problem that remains challenging, in part due to the sheer number of genes, gene products, and metabolites involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA) to explore patterns of gene expression in healthy and cancer tissues. An important advantage of LDA compared to alternative statistical and machine learning methods is its proven ability to handle sparse inputs over an extremely large numbers of features in an unsupervised manner. LDA has been recently applied for clustering and exploring genomic data but not for classification and prediction. In this paper, we try to optimize the protocol and parameters for efficient implementation of LDA. Here, messenger RNA (mRNA) sequence data from breast cancer and healthy tissue is used to determine an effective approach for the application of LDA to classification of cancer versus healthy tissue. We describe our study in two phases: First, various parameters like the number of topics, bins and passes were optimized for LDA. Next we developed a novel LDA-based classification approach to classify unknown samples based on similarity of co-expression patterns. Evaluation to assess the effectiveness of this approach shows that LDA can achieve high accuracy compared to alternative approaches. Overall, our results project LDA as a promising approach for classification of tissue types based on gene expression data in cancer studies.
利用基因表达数据进行分类的潜在狄利克雷分配
了解差异基因表达在癌症发展和分子反应中的作用是一个复杂的问题,仍然具有挑战性,部分原因在于涉及的基因、基因产物和代谢物的数量之多。在本文中,我们采用一种无监督主题模型,潜狄利克雷分配(LDA)来探索健康和癌症组织中的基因表达模式。与其他统计和机器学习方法相比,LDA的一个重要优势是它能够以无监督的方式处理大量特征上的稀疏输入。LDA最近被应用于聚类和探索基因组数据,但没有用于分类和预测。在本文中,我们尝试优化协议和参数,以便有效地实现LDA。在这里,来自乳腺癌和健康组织的信使RNA (mRNA)序列数据被用来确定一种有效的方法,用于将LDA应用于癌症和健康组织的分类。我们将我们的研究分为两个阶段:首先,针对LDA优化各种参数,如主题数,箱数和通道数。接下来,我们开发了一种新的基于lda的分类方法,基于共表达模式的相似性对未知样本进行分类。对该方法有效性的评估表明,与其他方法相比,LDA可以达到较高的准确性。总的来说,我们的研究结果表明,LDA是一种很有前途的方法,可以根据癌症研究中的基因表达数据对组织类型进行分类。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信