DMLS: an automated pipeline to extract the Drosophila modular transcription regulators and targets from massive literature articles.

IF 3.4 | Zone 4 (Biology) | JCR Q1, MATHEMATICAL & COMPUTATIONAL BIOLOGY
Tzu-Hsien Yang, Yu-Huai Yu, Sheng-Hang Wu, Fang-Yuan Chang, Hsiu-Chun Tsai, Ya-Chiao Yang
Journal: Database: The Journal of Biological Databases and Curation, vol. 2024. Published 2024-06-20. Journal Article.
DOI: 10.1093/database/baae049 (https://doi.org/10.1093/database/baae049)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11188685/pdf/

Abstract

Transcription regulation in multicellular species is mediated by modular combinations of transcription factor (TF) binding sites termed cis-regulatory modules (CRMs). Such CRM-mediated transcription regulation determines gene expression patterns during development. Biologists frequently investigate how CRMs regulate gene expression. However, knowledge of the target genes and regulatory TFs participating in the CRMs under study is mostly fragmented across the literature. Researchers must invest substantial manual effort to search through the articles deposited in biomedical literature databases to obtain this information. Although several text-mining systems are now available for literature triage, these tools do not specifically focus on prescreening CRM-related literature and fail to correctly extract the CRM target genes and regulatory TFs from the literature. For this reason, we constructed a supportive automated literature prescreener called Drosophila Modular transcription-regulation Literature Screener (DMLS) that (i) prescreens articles describing experiments on modular transcription regulation, (ii) identifies the described target genes and TFs of the CRMs under study for each such article and (iii) features an automated and extendable pipeline to perform the task. We demonstrated that DMLS, in extracting the described target gene and regulatory TF lists of the CRMs under study for given articles, achieved a test macro area under the ROC curve (auROC) of 89.7% and an area under the precision-recall curve (auPRC) of 77.6%, outperforming the intuitive gene name-occurrence-counting method by at least 19.9% in auROC and 30.5% in auPRC. The web service and the command-line version of DMLS are available at https://cobis.bme.ncku.edu.tw/DMLS/ and https://github.com/cobisLab/DMLS/, respectively.
Database Tool URL: https://cobis.bme.ncku.edu.tw/DMLS/.
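
To make the comparison in the abstract concrete, the following is a minimal Python sketch of a gene name-occurrence-counting baseline scored with auROC. It is not the DMLS implementation: the function names, the toy sentence, the candidate list and the labels are all hypothetical, and "macro" is assumed here to mean an unweighted mean over groups, since the abstract does not say how the averaging is grouped.

```python
import re
from collections import Counter


def count_gene_mentions(text, gene_names):
    """Count whole-word, case-insensitive occurrences of each candidate name."""
    counts = Counter()
    for name in gene_names:
        # Lookarounds keep "eve" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        counts[name] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("auROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(groups):
    """Unweighted mean of auROC over (scores, labels) groups (assumed grouping)."""
    values = [auroc(scores, labels) for scores, labels in groups]
    return sum(values) / len(values)


# Hypothetical toy article text and candidate gene names.
text = ("The eve stripe 2 enhancer drives eve expression. "
        "Bicoid and Hunchback activate eve, whereas Kruppel was used only as a marker.")
candidates = ["eve", "bicoid", "hunchback", "Kruppel", "giant"]
# Hypothetical labels: 1 = described as a CRM target gene or regulatory TF.
labels = [1, 1, 1, 0, 0]

counts = count_gene_mentions(text, candidates)
scores = [counts[name] for name in candidates]
print(scores, auroc(scores, labels))
```

In this toy case Kruppel is mentioned as often as the true regulators, so counting alone cannot separate them. That is the kind of error a context-aware extractor is meant to reduce.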

Source journal
Database: The Journal of Biological Databases and Curation (MATHEMATICAL & COMPUTATIONAL BIOLOGY)
CiteScore: 9.00
Self-citation rate: 3.40%
Articles published: 100
Review time: >12 weeks
About the journal: Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.