Employing Machine Learning Techniques for Data Enrichment: Increasing the Number of Samples for Effective Gene Expression Data Analysis

2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW). Pub Date: 2011-11-12. DOI: 10.1109/BIBM.2011.105

U. Erdogdu, Mehmet Tan, R. Alhajj, Faruk Polat, D. Demetrick, J. Rokne

Pages: 238-242

Citations: 5

Abstract

In some domains, such as bioinformatics, producing additional real samples is costly, error-prone, and time-consuming. There is therefore a need for an intelligent, automated process that can substitute artificial samples for real ones, where the artificial samples carry the same characteristics as the real samples and can therefore be used for comprehensive testing of new methodologies. Motivated by this need, we describe a novel approach that integrates Probabilistic Boolean Network and genetic-algorithm-based techniques into a framework that takes existing real samples as input and produces new samples as output. The new samples inherit the characteristics of the existing samples without duplicating them. This leads to diversity in the samples and hence a richer set of samples for testing. The framework incorporates two models (perspectives) for sample generation. We illustrate its applicability by producing new gene expression data samples, a high-demand area that has not received attention. The two perspectives are based on models that are not closely related; this independence avoids the bias of an approach that covers only certain characteristics of the domain and yields samples skewed in one direction. The results are very promising and indicate the effectiveness, usefulness, and applicability of the proposed multi-model framework.
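
The abstract does not say how the two models are built or combined, so the following is only a toy sketch of the general idea, not the authors' method. It assumes binarized expression values (0/1 per gene), a hand-written Probabilistic Boolean Network, and genetic-algorithm-style operators (one-point crossover and bit-flip mutation) with no fitness function or selection step. All names (`pbn_step`, `pbn_samples`, `ga_samples`, `enrich`), the predictor functions, and the data are hypothetical.

```python
import random


def pbn_step(state, predictor_sets, rng):
    """Advance a Probabilistic Boolean Network by one synchronous step.

    predictor_sets[i] is a list of (probability, function) pairs for gene i.
    One function is drawn per gene and applied to the current state.
    """
    next_state = []
    for candidates in predictor_sets:
        weights = [p for p, _ in candidates]
        funcs = [f for _, f in candidates]
        chosen = rng.choices(funcs, weights=weights, k=1)[0]
        next_state.append(int(chosen(state)))
    return tuple(next_state)


def pbn_samples(real, predictor_sets, n, rng, steps=3):
    """Model 1: run the PBN a few steps starting from randomly chosen real samples."""
    out = []
    for _ in range(n):
        state = tuple(rng.choice(real))
        for _ in range(steps):
            state = pbn_step(state, predictor_sets, rng)
        out.append(state)
    return out


def ga_samples(real, n, rng, mutation_rate=0.1):
    """Model 2: one-point crossover of two real samples, then bit-flip mutation."""
    out = []
    length = len(real[0])
    for _ in range(n):
        a, b = rng.sample(real, 2)
        cut = rng.randrange(1, length)
        child = list(a[:cut]) + list(b[cut:])
        child = [1 - g if rng.random() < mutation_rate else g for g in child]
        out.append(tuple(child))
    return out


def enrich(real, predictor_sets, n_per_model, seed=0):
    """Pool candidates from both models, dropping copies of real or earlier samples."""
    rng = random.Random(seed)
    candidates = (pbn_samples(real, predictor_sets, n_per_model, rng)
                  + ga_samples(real, n_per_model, rng))
    seen = set(real)
    new = []
    for sample in candidates:
        if sample not in seen:
            seen.add(sample)
            new.append(sample)
    return new


# Hypothetical binarized "real" samples over four genes.
real = [(1, 0, 1, 0), (0, 1, 1, 0), (1, 1, 0, 1)]

# Hypothetical predictor functions; the probabilities for each gene sum to 1.
predictors = [
    [(0.7, lambda s: s[1] and s[2]), (0.3, lambda s: s[0])],
    [(1.0, lambda s: not s[0])],
    [(0.5, lambda s: s[0] or s[3]), (0.5, lambda s: s[2])],
    [(1.0, lambda s: s[1] != s[2])],
]

new = enrich(real, predictors, n_per_model=5)
```

Pooling the output of two unrelated generators mirrors the abstract's argument that independent perspectives keep the samples from being skewed in one direction. Filtering against the real samples reflects the stated goal of not duplicating them. How many new samples survive the filter depends on the seed and on the size of the state space.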
