A simulation framework for correlated count data of features subsets in high-throughput sequencing or proteomics experiments

IF 0.9; Zone 4 (Mathematics); JCR Q3 (Mathematics)

Statistical Applications in Genetics and Molecular Biology. Pub Date: 2016-10-01. DOI: 10.1515/sagmb-2015-0082

Jochen Kruppa, F. Kramer, T. Beissbarth, K. Jung

Volume 15, pages 401-414

Citations: 3

Abstract

As part of the data processing of high-throughput sequencing experiments, count data are produced that represent the number of reads mapping to specific genomic regions. Count data also arise in mass spectrometric experiments for the detection of protein-protein interactions. Artificial count data are thus required for evaluating new computational methods for the analysis of sequencing count data or of spectral count data from proteomics experiments. Although some methods for generating artificial sequencing count data have been proposed, all of them either simulate single sequencing runs, thus omitting the correlation structure between the individual genomic features, or are limited to specific structures. We propose to draw correlated data from the multivariate normal distribution and to round these continuous data in order to obtain discrete counts. In our approach, the required distribution parameters can either be constructed in different ways or estimated from real count data. Because rounding affects the correlation structure, we evaluate the use of shrinkage estimators that have already been used in the context of artificial expression data from DNA microarrays. Our approach turned out to be useful for the simulation of counts for defined subsets of features such as individual pathways or GO categories.
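The draw-then-round idea described in the abstract can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `simulate_correlated_counts` and `shrink_covariance`, the example means, standard deviations and correlations, the clipping of negative values to zero, and the hand-picked shrinkage intensity are all hypothetical choices made for this sketch. The shrinkage shown is a simple linear shrinkage toward the diagonal, which may differ from the estimators evaluated in the paper.

```python
import numpy as np


def simulate_correlated_counts(mean, cov, n_samples, seed=0):
    """Draw from N(mean, cov) and round to integers to obtain counts.

    Clipping negative values to zero is an assumption of this sketch;
    the abstract only states that continuous draws are rounded.
    """
    rng = np.random.default_rng(seed)
    continuous = rng.multivariate_normal(mean, cov, size=n_samples)
    counts = np.rint(continuous).astype(int)
    return np.clip(counts, 0, None)


def shrink_covariance(sample_cov, intensity):
    """Linear shrinkage of a covariance matrix toward its diagonal.

    `intensity` in [0, 1] is fixed by hand here (an assumption);
    intensity 0 returns the input, intensity 1 removes all covariances.
    """
    target = np.diag(np.diag(sample_cov))
    return (1.0 - intensity) * sample_cov + intensity * target


# Hypothetical parameters for a subset of three features (e.g. one pathway).
mean = np.array([50.0, 80.0, 120.0])
sd = np.array([8.0, 10.0, 15.0])
corr = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])
cov = corr * np.outer(sd, sd)  # covariance from correlations and SDs

# Simulate counts (rows = samples, columns = features).
counts = simulate_correlated_counts(mean, cov, n_samples=2000)

# Compare the correlation of the rounded counts with the target.
estimated_corr = np.corrcoef(counts, rowvar=False)

# Parameters can also be re-estimated from count data and then shrunk.
sample_cov = np.cov(counts, rowvar=False)
shrunk_cov = shrink_covariance(sample_cov, intensity=0.2)
```

Comparing `estimated_corr` with `corr` gives a quick check of how much rounding distorts the target correlation structure for a given choice of means and variances.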


Source journal: Statistical Applications in Genetics and Molecular Biology (Biology: Biochemistry and Molecular Biology)

CiteScore: 1.20

Self-citation rate: 11.10%

Review time: 6-12 weeks

About the journal: Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. Papers should focus on the relevant statistical issues but should contain a succinct description of the biological problem being considered. The range of topics is wide and includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies. Both original research and review articles will be warmly received.