Latent Dirichlet Allocation for Classification using Gene Expression Data

2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE). Pub Date: 2017-10-01. DOI: 10.1109/BIBE.2017.00-81

H. Yalamanchili, S. Kho, M. Raymer


Citations: 9


Latent Dirichlet Allocation for Classification using Gene Expression Data

Understanding the role of differential gene expression in the development of, and molecular response to, cancer remains a challenging problem, in part because of the sheer number of genes, gene products, and metabolites involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA), to explore patterns of gene expression in healthy and cancerous tissue. An important advantage of LDA over alternative statistical and machine learning methods is its proven ability to handle sparse inputs over an extremely large number of features in an unsupervised manner. LDA has recently been applied to clustering and exploring genomic data, but not to classification and prediction. We seek to optimize the protocol and parameters for an efficient implementation of LDA, using messenger RNA (mRNA) sequence data from breast cancer and healthy tissue to determine an effective approach for classifying cancer versus healthy tissue. The study has two phases. First, we optimized LDA parameters such as the number of topics, bins, and passes. Next, we developed a novel LDA-based approach that classifies unknown samples by the similarity of their co-expression patterns. Our evaluation shows that LDA can achieve high accuracy compared to alternative approaches. Overall, our results suggest that LDA is a promising approach for classifying tissue types from gene expression data in cancer studies.
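The abstract does not spell out how expression values are binned or how similarity is measured, so the following is only a minimal sketch of the general idea, not the authors' protocol. It assumes that each sample is turned into a "document" by repeating a gene name according to the bin its expression level falls in, and that an unknown sample is assigned to the class whose mean topic mixture is most similar by cosine similarity. The function names, the binning rule, the similarity measure, and the expression values are all hypothetical; the topic mixtures would come from a trained LDA model (for example, a library such as gensim, whose `LdaModel` exposes `num_topics` and `passes`), which is left out here.

```python
from math import sqrt


def expression_to_document(expression, n_bins, max_value):
    """Turn one sample's expression levels into a bag of gene 'words'.

    Assumed binning rule (hypothetical): a gene is repeated once per bin
    it reaches, so highly expressed genes occur more often in the document.
    """
    doc = []
    for gene, value in expression.items():
        # Clamp to [0, max_value], then map to an integer count 0..n_bins.
        clamped = min(max(value, 0.0), max_value)
        count = int(clamped / max_value * n_bins)
        doc.extend([gene] * count)
    return doc


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(topic_mix, labelled_mixes):
    """Pick the label whose mean topic mixture is most similar.

    topic_mix: topic proportions of the unknown sample.
    labelled_mixes: dict mapping a label to the topic proportions
    of its training samples.
    """
    centroids = {label: centroid(mixes) for label, mixes in labelled_mixes.items()}
    return max(centroids, key=lambda label: cosine(topic_mix, centroids[label]))


# Illustrative values only; these are not data from the paper.
sample = {"BRCA1": 10.0, "TP53": 5.0, "GAPDH": 0.0}
document = expression_to_document(sample, n_bins=4, max_value=10.0)

training = {
    "cancer": [[0.9, 0.1], [0.7, 0.3]],
    "healthy": [[0.2, 0.8], [0.1, 0.9]],
}
predicted = classify([0.8, 0.2], training)
```

In this sketch the number of bins controls how finely expression levels are resolved in the documents, which is one reason it would need tuning alongside the number of topics and passes.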
