High-Throughput Screening Assay Datasets from the PubChem Database.

Chemical informatics (Wilmington, Del.). Pub Date: 2017-01-01. Epub Date: 2017-04-26.

Mariusz Butkiewicz, Yanli Wang, Stephen H Bryant, Edward W Lowe, David C Weaver, Jens Meiler

Full text (PDF): https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/e1/a7/nihms936862.PMC5962024.pdf

Abstract

The availability of high-throughput screening (HTS) data in the public domain offers great potential to foster the development of ligand-based computer-aided drug discovery (LB-CADD) methods, which are crucial for drug discovery efforts in academia and industry. LB-CADD method development depends on high-quality HTS assay data, i.e., datasets that contain both active and inactive compounds. These active compounds are hits from primary screens that have been tested in concentration-response experiments and whose target specificity has been validated through suitable secondary screening experiments. Publicly available HTS repositories such as PubChem often provide such data in a convoluted way. Compounds classified as inactive need to be extracted from the primary screening record. Compounds classified as active in the primary screening record, however, are not suitable as a set of actives for LB-CADD experiments because of the high false-positive rate. A suitable set of actives can be derived by carefully analyzing the results of the assays used to confirm and classify compound activity, often five or more per target. These assays build, in part, on each other, but often not all hit compounds from the previous screen have been tested. Sometimes a compound is labeled 'active' even though the label means 'inactive' on the target of interest, because the compound is 'active' on a different target protein. Here, a curation process for hierarchically related confirmatory screens is illustrated using two specifically chosen protein use cases, and the subsequent re-upload of the curated results into PubChem is described for both. Further, we provide nine publicly accessible, high-quality datasets for future LB-CADD method development, giving the scientific community a common baseline for comparing future methods. We also provide a protocol that researchers can follow to upload additional datasets for benchmarking.
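
The curation logic summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' protocol: the function name `curate`, the outcome labels, and the compound IDs are hypothetical, and real PubChem assay records would first have to be loaded into plain dictionaries of this shape. The sketch assumes that inactives are taken from the primary screen only, that a primary hit counts as active only if it was retested and found active in every confirmatory assay, and that a hit found active in a counter screen against a different protein is dropped.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical representation: compound ID -> "active" or "inactive".
Outcomes = Dict[int, str]


def curate(
    primary: Outcomes,
    confirmatory: Iterable[Outcomes],
    counter_screens: Iterable[Outcomes] = (),
) -> Tuple[Set[int], Set[int]]:
    """Return (actives, inactives) for one target of interest."""
    # Inactives are extracted from the primary screening record only.
    inactives = {cid for cid, outcome in primary.items() if outcome == "inactive"}
    # Primary hits are only candidates because of the false-positive rate.
    candidates = {cid for cid, outcome in primary.items() if outcome == "active"}

    # Confirmatory assays build on each other. A hit that was never
    # retested is dropped rather than assumed to be active.
    for assay in confirmatory:
        candidates = {cid for cid in candidates if assay.get(cid) == "active"}

    # "Active" in a counter screen means activity on a different protein,
    # so the compound is not specific to the target of interest.
    for assay in counter_screens:
        candidates = {cid for cid in candidates if assay.get(cid) != "active"}

    return candidates, inactives


# Made-up compound IDs and outcomes, for illustration only.
primary = {1: "active", 2: "active", 3: "active", 4: "inactive", 5: "inactive"}
dose_response = {1: "active", 2: "active"}  # compound 3 was never retested
counter = {2: "active"}                     # compound 2 hits another target

actives, inactives = curate(primary, [dose_response], [counter])
print(sorted(actives), sorted(inactives))
```

In this sketch, primary hits that fail confirmation are excluded from both sets instead of being relabeled as inactive. That is one conservative choice among several; whether such compounds should be added to the inactive set depends on the assay hierarchy for the target in question.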
