Is newer better?--evaluating the effects of data curation on integrated analyses in Saccharomyces cerevisiae.

IF 1.4

Integrative biology : quantitative biosciences from nano to macro Pub Date : 2012-07-01 Epub Date: 2012-04-23 DOI:10.1039/c2ib00123c

Katherine James, Anil Wipat, Jennifer Hallinan

Pages: 715-27. Publication type: Journal Article.

Citations: 0

Abstract

Recent high-throughput experiments have produced a wealth of heterogeneous datasets, each of which provides information about different aspects of the cell. Consequently, integration of diverse data types is essential in order to address many biological questions. The quality of any integrated analysis system depends on the quality of its component data and of the Gold Standard data used to evaluate it. It is commonly assumed that data quality improves as databases grow and change, particularly for manually curated databases. However, this assumption can be questioned, given the constant changes in the data coupled with the high level of noise associated with high-throughput experimental techniques. One of the most powerful approaches to data integration is the use of Probabilistic Functional Integrated Networks (PFINs). Here, we systematically analyse the changes in four highly curated and widely used online databases and evaluate the extent to which these changes affect the protein function prediction performance of PFINs in the yeast Saccharomyces cerevisiae. We find that, globally, network performance improves over time. Where individual areas of biology are concerned, however, the most recent files do not always produce the best results. Individual datasets have unique biases towards different biological processes, and performance can be improved by selecting and integrating relevant datasets. When using any type of integrated system to answer a specific biological question, careful selection of raw data and Gold Standard is vital, since the most recent data may not be the most appropriate.
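The idea of scoring datasets against a Gold Standard before integrating them can be illustrated with a minimal sketch. The abstract does not state how the PFINs were scored or combined, so the log-likelihood scoring and the plain summation used here are assumptions, as are all function names, dataset names, and the toy protein pairs, which are invented for illustration and are not data from the paper.

```python
import math


def pair(a, b):
    """Order-independent key for an interaction between two proteins."""
    return frozenset((a, b))


def log_likelihood_score(dataset_pairs, gold_pos, gold_neg):
    """Score one dataset against a Gold Standard.

    Assumed scheme: ln of (odds of a positive among the dataset's pairs)
    divided by (odds of a positive in the Gold Standard as a whole).
    Returns None when the dataset has no overlap with either class.
    """
    pos = len(dataset_pairs & gold_pos)
    neg = len(dataset_pairs & gold_neg)
    if pos == 0 or neg == 0:
        return None
    posterior_odds = pos / neg
    prior_odds = len(gold_pos) / len(gold_neg)
    return math.log(posterior_odds / prior_odds)


def integrate(datasets, gold_pos, gold_neg):
    """Build a weighted network from datasets that beat the prior.

    Assumption: edge weight is the plain sum of the scores of every
    dataset that reports the pair; datasets scoring <= 0 are dropped.
    """
    network = {}
    for pairs in datasets.values():
        score = log_likelihood_score(pairs, gold_pos, gold_neg)
        if score is None or score <= 0:
            continue
        for p in pairs:
            network[p] = network.get(p, 0.0) + score
    return network


# Toy Gold Standard (hypothetical): pairs known to share a function,
# and pairs assumed not to.
gold_pos = {pair("A", "B"), pair("C", "D")}
gold_neg = {pair("A", "C"), pair("A", "D"), pair("B", "C"), pair("B", "D")}

# Two hypothetical versions of the same database file.
datasets = {
    "release_old": {pair("A", "B"), pair("C", "D"), pair("A", "C")},
    "release_new": {pair("A", "B"), pair("A", "C"), pair("A", "D"), pair("E", "F")},
}

scores = {
    name: log_likelihood_score(pairs, gold_pos, gold_neg)
    for name, pairs in datasets.items()
}
network = integrate(datasets, gold_pos, gold_neg)
```

In this toy setup the larger, newer file scores no better than the prior against the chosen Gold Standard and contributes nothing to the network. That outcome is an artefact of the invented data, but it shows why the choice of both dataset version and Gold Standard affects the integrated result.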

