Integrated Theory- and Data-driven Feature Selection in Gene Expression Data Analysis.

Proceedings. International Conference on Data Engineering. Pub Date: 2017-04-01. Epub Date: 2017-05-18. DOI: 10.1109/ICDE.2017.223

Vineet K Raghu, Xiaoyu Ge, Panos K Chrysanthis, Panayiotis V Benos

Volume 2017, pages 1525-1532. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5799807/pdf/nihms937517.pdf


Abstract

The exponential growth of high-dimensional biological data has led to a rapid increase in demand for automated approaches for knowledge production. Existing methods rely on two general approaches to address this challenge: 1) the theory-driven approach, which utilizes prior accumulated knowledge, and 2) the data-driven approach, which utilizes only the data to deduce scientific knowledge. Each of these approaches alone suffers from bias toward past or present knowledge, as neither incorporates all of the current knowledge that is available to make new discoveries. In this paper, we show how an integrated method can effectively address the high dimensionality of big biological data, which is a major problem for purely data-driven analysis approaches. We realize our approach in a novel two-step analytical workflow: the first step applies a new feature selection paradigm to high-throughput gene expression data, and the second step uses graphical causal modeling to automatically extract causal relationships. Our results on real-world clinical datasets from The Cancer Genome Atlas (TCGA) demonstrate that our method is capable of intelligently selecting genes for learning effective causal networks.
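The abstract does not specify how theory-driven and data-driven evidence are combined, so the following is only a minimal sketch of the general idea behind the first step, not the authors' method. It assumes a hypothetical per-gene prior-knowledge score in [0, 1], uses absolute Pearson correlation with an outcome as a stand-in data-driven score, and blends the two with a hypothetical mixing weight. The gene names, values, and function names are all illustrative. The second step (graphical causal modeling on the selected genes) is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_genes(expression, target, prior, k, weight=0.5):
    """Return the k genes with the highest blended score.

    expression: dict mapping gene -> list of expression values (one per sample)
    target:     list of outcome values (one per sample)
    prior:      dict mapping gene -> hypothetical prior-knowledge score in [0, 1]
    weight:     hypothetical mixing weight given to the prior-knowledge score
    """
    scores = {}
    for gene, values in expression.items():
        data_score = abs(pearson(values, target))   # data-driven evidence
        theory_score = prior.get(gene, 0.0)         # theory-driven evidence
        scores[gene] = weight * theory_score + (1 - weight) * data_score
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (illustrative only, not from TCGA).
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],  # tracks the outcome closely
    "GENE_B": [2.0, 2.0, 2.0, 2.0],  # constant, so no data-driven signal
    "GENE_C": [4.0, 1.0, 3.0, 2.0],  # weakly related to the outcome
}
target = [1.0, 2.0, 3.0, 4.0]
prior = {"GENE_B": 0.8}  # a gene assumed to be supported by prior knowledge

selected = select_genes(expression, target, prior, k=2)
print(selected)
```

In this toy setup, GENE_B has no data-driven signal but is still ranked above GENE_C because of its assumed prior score, which illustrates how an integrated score can retain genes that a purely data-driven filter would drop.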


