Journal of Cheminformatics: latest articles

Enhancing molecular property prediction with auxiliary learning and task-specific adaptation
IF 7.1, Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2024-07-24 DOI: 10.1186/s13321-024-00880-7
Vishal Dey, Xia Ning
Abstract: Pretrained Graph Neural Networks have been widely adopted for various molecular property prediction tasks. Despite their ability to encode structural and relational features of molecules, traditional fine-tuning of such pretrained GNNs on the target task can lead to poor generalization. To address this, we explore the adaptation of pretrained GNNs to the target task by jointly training them with multiple auxiliary tasks. This could enable the GNNs to learn both general and task-specific features, which may benefit the target task. However, a major challenge is to determine the relatedness of auxiliary tasks with the target task. To address this, we investigate multiple strategies to measure the relevance of auxiliary tasks and integrate such tasks by adaptively combining task gradients or by learning task weights via bi-level optimization. Additionally, we propose a novel gradient surgery-based approach, Rotation of Conflicting Gradients (RCGrad), that learns to align conflicting auxiliary task gradients through rotation. Our experiments with state-of-the-art pretrained GNNs demonstrate the efficacy of our proposed methods, with improvements of up to 7.7% over fine-tuning. This suggests that incorporating auxiliary tasks along with target task fine-tuning can be an effective way to improve the generalizability of pretrained GNNs for molecular property prediction.
Scientific contribution: We introduce a novel framework for adapting pretrained GNNs to molecular tasks using auxiliary learning to address the critical issue of negative transfer. Leveraging novel gradient surgery techniques such as RCGrad, the proposed adaptation framework represents a significant departure from the dominant pretraining fine-tuning approach for molecular GNNs. Our contributions are significant for drug discovery research, especially for tasks with limited data, filling a notable gap in the efficient adaptation of pretrained models for molecular GNNs.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00880-7
Citations: 0
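The abstract above by Dey and Ning describes gradient surgery on auxiliary-task gradients that conflict with the target task. The listing gives no details of how RCGrad learns its rotation, so the sketch below does not implement RCGrad. It shows the simpler, widely known PCGrad-style projection instead, purely to illustrate what "conflicting" means: an auxiliary gradient with a negative dot product against the target gradient has its opposing component removed before the gradients are summed. The function names and toy vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflicting(aux_grad: Vector, target_grad: Vector) -> Vector:
    """PCGrad-style step (an illustration, not RCGrad): if aux_grad opposes
    target_grad, subtract its component along target_grad."""
    d = dot(aux_grad, target_grad)
    norm_sq = dot(target_grad, target_grad)
    if d >= 0.0 or norm_sq == 0.0:
        return list(aux_grad)  # no conflict, keep the gradient unchanged
    scale = d / norm_sq
    return [a - scale * t for a, t in zip(aux_grad, target_grad)]


def combine(target_grad: Vector, aux_grads: List[Vector]) -> Vector:
    """Sum the target gradient with the de-conflicted auxiliary gradients."""
    total = list(target_grad)
    for g in aux_grads:
        adjusted = project_conflicting(g, target_grad)
        total = [t + a for t, a in zip(total, adjusted)]
    return total


target = [1.0, 0.0]
aux_conflicting = [-1.0, 1.0]   # negative dot product with target
aux_aligned = [0.5, 0.5]        # positive dot product with target

print(project_conflicting(aux_conflicting, target))
print(combine(target, [aux_conflicting, aux_aligned]))
```

A projection discards the conflicting component entirely. According to the abstract, RCGrad instead learns a rotation that aligns the conflicting gradient.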
Estimating the synthetic accessibility of molecules with building block and reaction-aware SAScore
IF 7.1, Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2024-07-23 DOI: 10.1186/s13321-024-00879-0
Shuan Chen, Yousung Jung
Abstract: Synthetic accessibility prediction is the task of estimating how easily a given molecule can be synthesized in the laboratory, and it plays a crucial role in computer-aided molecular design. Although synthesis planning programs can determine synthesis routes, their slow processing times make them impractical for large-scale molecule screening. On the other hand, existing rapid synthetic accessibility estimation methods offer speed but typically lack integration with actual synthesis routes and building block information. In this work, we introduce BR-SAScore, an enhanced version of SAScore that integrates the available building block information (B) and reaction knowledge (R) from synthesis planning programs into the scoring process. In particular, we differentiate fragments inherent in building blocks and fragments to be derived from synthesis (reactions) when scoring synthetic accessibility. Compared to existing methods, our experimental findings demonstrate that BR-SAScore offers more accurate and precise identification of a molecule's synthetic accessibility by the synthesis planning program with a fast calculation time. Moreover, we illustrate how BR-SAScore provides chemically interpretable results, aligning with the capability of the synthesis planning program embedded with the same reaction knowledge and available building blocks.
Scientific contribution: We introduce BR-SAScore, an extension of SAScore, to estimate the synthetic accessibility of molecules by leveraging known building-block and reactivity information. In our experiments, BR-SAScore shows superior performance in predicting molecule synthetic accessibility compared to previous methods, including SAScore and deep-learning models, while requiring significantly less computation time. In addition, we show that BR-SAScore is able to precisely identify the chemical fragment contributing to the synthetic infeasibility, holding great potential for future molecule synthesizability optimization.
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11267797/pdf/
Citations: 0
piscesCSM: prediction of anticancer synergistic drug combinations
IF 7.1, Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2024-07-19 DOI: 10.1186/s13321-024-00859-4
Raghad AlJarf, Carlos H. M. Rodrigues, Yoochan Myung, Douglas E. V. Pires, David B. Ascher
Abstract: While drug combination therapies are of great importance, particularly in cancer treatment, identifying novel synergistic drug combinations has been a challenging venture. Computational methods have emerged in this context as a promising tool for prioritizing drug combinations for further evaluation, though they have presented limited performance, utility, and interpretability. Here, we propose a novel predictive tool, piscesCSM, that leverages graph-based representations to model small molecule chemical structures to accurately predict drug combinations with favourable anticancer synergistic effects against one or multiple cancer cell lines. Leveraging these insights, we developed a general supervised machine learning model to guide the prediction of anticancer synergistic drug combinations in over 30 cell lines. It achieved an area under the receiver operating characteristic curve (AUROC) of up to 0.89 on independent non-redundant blind tests, outperforming state-of-the-art approaches on both large-scale oncology screening data and an independent test set generated by AstraZeneca (with more than a 16% improvement in predictive accuracy). Moreover, by exploring the interpretability of our approach, we found that simple physicochemical properties and graph-based signatures are predictive of chemotherapy synergism. To provide a simple and integrated platform to rapidly screen potential candidate pairs with favourable synergistic anticancer effects, we made piscesCSM freely available online at https://biosig.lab.uq.edu.au/piscescsm/ as a web server and API. We believe that our predictive tool will provide a valuable resource for optimizing and augmenting combinatorial screening libraries to identify effective and safe synergistic anticancer drug combinations.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00859-4
Citations: 0
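The piscesCSM abstract above reports performance as AUROC. As a reminder of what that number measures, here is a minimal sketch that computes AUROC directly from its definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The labels and scores are made-up toy values, not data from the paper, and the function name is hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Quadratic in the number of examples; fine for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy data: 1 = synergistic pair, 0 = non-synergistic pair (invented values).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
print(round(auroc(labels, scores), 3))
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which puts the reported 0.89 in context.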
Reaction rebalancing: a novel approach to curating reaction databases
IF 7.1, Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2024-07-19 DOI: 10.1186/s13321-024-00875-4
Tieu-Long Phan, Klaus Weinbauer, Thomas Gärtner, Daniel Merkle, Jakob L. Andersen, Rolf Fagerberg, Peter F. Stadler
Purpose: Reaction databases are a key resource for a wide variety of applications in computational chemistry and biochemistry, including Computer-aided Synthesis Planning (CASP) and the large-scale analysis of metabolic networks. The full potential of these resources can only be realized if datasets are accurate and complete. Missing co-reactants and co-products, i.e., unbalanced reactions, however, are the rule rather than the exception. The curation and correction of such incomplete entries is thus an urgent need.
Methods: The SynRBL framework addresses this issue with a dual strategy: a rule-based method for non-carbon compounds, using atomic symbols and counts for prediction, alongside a Maximum Common Subgraph (MCS)-based technique for carbon compounds, aimed at aligning reactants and products to infer missing entities.
Results: The rule-based method exceeded 99% accuracy, while MCS-based accuracy varied from 81.19 to 99.33%, depending on reaction properties. Furthermore, an applicability domain and a machine learning scoring function were devised to quantify prediction confidence. The overall efficacy of this framework was delineated through its success rate and accuracy metrics, which spanned from 89.83 to 99.75% and 90.85 to 99.05%, respectively.
Conclusion: The SynRBL framework offers a novel solution for recalibrating chemical reactions, significantly enhancing reaction completeness. With rigorous validation, it achieved groundbreaking accuracy in reaction rebalancing. This sets the stage for future improvement, in particular of atom-atom mapping techniques as well as of downstream tasks such as automated synthesis planning.
Scientific contribution: SynRBL features a novel computational approach to correcting unbalanced entries in chemical reaction databases. By combining heuristic rules for inferring non-carbon compounds and common subgraph searches to address carbon imbalance, SynRBL successfully addresses most instances of this problem, which affects the majority of data in most large-scale resources. Compared to alternative solutions, SynRBL achieves a dramatic increase in both success rate and accuracy, and provides the first freely available open source solution for this problem.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00875-4
Citations: 0
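The SynRBL abstract above mentions a rule-based method that uses atomic symbols and counts to predict missing non-carbon compounds. The actual SynRBL rules are not given in this listing, so the following is only a minimal sketch of the underlying idea: count atoms on each side of a reaction, take the difference, and look the difference up in a small table of common by-products. The formula parser handles flat formulas only (no brackets or charges), and the lookup table `KNOWN_SMALL_MOLECULES` and all function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional

_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

# Hypothetical rule table: element-count difference -> small molecule.
KNOWN_SMALL_MOLECULES = {
    frozenset({("H", 2), ("O", 1)}): "H2O",
    frozenset({("H", 1), ("Cl", 1)}): "HCl",
}


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat molecular formula such as 'C2H6O'."""
    counts: Counter = Counter()
    for symbol, number in _TOKEN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def total_atoms(formulas: Iterable[str]) -> Counter:
    """Sum atom counts over several molecules."""
    result: Counter = Counter()
    for formula in formulas:
        result.update(parse_formula(formula))
    return result


def atom_imbalance(reactants: Iterable[str], products: Iterable[str]) -> Dict[str, int]:
    """Positive values are atoms missing from the product side."""
    left, right = total_atoms(reactants), total_atoms(products)
    diff = {}
    for element in sorted(set(left) | set(right)):
        delta = left[element] - right[element]
        if delta != 0:
            diff[element] = delta
    return diff


def suggest_missing_product(diff: Dict[str, int]) -> Optional[str]:
    """Return a known small molecule matching the imbalance, if any."""
    return KNOWN_SMALL_MOLECULES.get(frozenset(diff.items()))


# Esterification recorded without its water by-product:
# acetic acid + ethanol -> ethyl acetate
diff = atom_imbalance(["C2H4O2", "C2H6O"], ["C4H8O2"])
print(diff, suggest_missing_product(diff))
```

Carbon imbalance cannot be resolved by counting alone, which is presumably why the abstract pairs the rules with an MCS-based alignment of reactants and products.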
UAlign: pushing the limit of template-free retrosynthesis prediction with unsupervised SMILES alignment
IF 7.1, Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2024-07-15 DOI: 10.1186/s13321-024-00877-2
Kaipeng Zeng, Bo Yang, Xin Zhao, Yu Zhang, Fan Nie, Xiaokang Yang, Yaohui Jin, Yanyan Xu
Motivation: Retrosynthesis planning poses a formidable challenge in the organic chemical industry, particularly in pharmaceuticals. Single-step retrosynthesis prediction, a crucial step in the planning process, has witnessed a surge in interest in recent years due to advancements in AI for science. Various deep learning-based methods have been proposed for this task, incorporating diverse levels of additional chemical knowledge dependency.
Results: This paper introduces UAlign, a template-free graph-to-sequence pipeline for retrosynthesis prediction. By combining graph neural networks and Transformers, our method can more effectively leverage the inherent graph structure of molecules. Based on the fact that the majority of molecule structures remain unchanged during a chemical reaction, we propose a simple yet effective SMILES alignment technique to facilitate the reuse of unchanged structures for reactant generation. Extensive experiments show that our method substantially outperforms state-of-the-art template-free and semi-template-based approaches. Importantly, our template-free method achieves effectiveness comparable to, or even surpassing, that of established powerful template-based methods.
Scientific contribution: We present a novel graph-to-sequence template-free retrosynthesis prediction pipeline that overcomes the limitations of Transformer-based methods in molecular representation learning and insufficient utilization of chemical information. We propose an unsupervised learning mechanism for establishing product-atom correspondence with reactant SMILES tokens, achieving even better results than supervised SMILES alignment methods. Extensive experiments demonstrate that UAlign significantly outperforms state-of-the-art template-free methods and rivals or surpasses template-based approaches, with up to 5% (top-5) and 5.4% (top-10) increased accuracy over the strongest baseline.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00877-2
Citations: 0
LVPocket: integrated 3D global-local information to protein binding pockets prediction with transfer learning of protein structure classification
IF 7.1, Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2024-07-07 DOI: 10.1186/s13321-024-00871-8
Ruifeng Zhou, Jing Fan, Sishu Li, Wenjie Zeng, Yilun Chen, Xiaoshan Zheng, Hongyang Chen, Jun Liao
Background: Previous deep learning methods for predicting protein binding pockets mainly employed 3D convolution, yet an abundance of convolution operations may lead the model to excessively prioritize local information, thus overlooking global information. Moreover, it is essential to account for the influence of diverse protein folding structural classes, because proteins in different structural classes exhibit varying biological functions, whereas those within the same structural class share similar functional attributes.
Results: We proposed LVPocket, a novel method that synergistically captures both local and global information of protein structure through the integration of Transformer encoders, which help the model achieve better performance in binding pocket prediction. We then tailored prediction models for data of four distinct structural classes of proteins using transfer learning. The four fine-tuned models were derived from the baseline LVPocket model, which was trained on the sc-PDB dataset. LVPocket exhibits superior performance on three independent datasets compared to current state-of-the-art methods. Additionally, the fine-tuned models outperform the baseline model.
Scientific contribution: We present a novel model structure for predicting protein binding pockets that addresses the problem of relying on extensive convolutional computation while neglecting global information about protein structures. Furthermore, we tackle the impact of different protein folding structures on binding pocket prediction tasks through the application of transfer learning methods.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00871-8
Citations: 0
Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture
IF 7.1, Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2024-07-05 DOI: 10.1186/s13321-024-00872-7
Kohulan Rajan, Henning Otto Brinkhaus, Achim Zielesny, Christoph Steinbeck
Abstract: Accurate recognition of hand-drawn chemical structures is crucial for digitising hand-written chemical information in traditional laboratory notebooks or facilitating stylus-based structure entry on tablets or smartphones. However, the inherent variability in hand-drawn structures poses challenges for existing Optical Chemical Structure Recognition (OCSR) software. To address this, we present an enhanced Deep lEarning for Chemical ImagE Recognition (DECIMER) architecture that leverages a combination of Convolutional Neural Networks (CNNs) and Transformers to improve the recognition of hand-drawn chemical structures. The model incorporates an EfficientNetV2 CNN encoder that extracts features from hand-drawn images, followed by a Transformer decoder that converts the extracted features into Simplified Molecular Input Line Entry System (SMILES) strings. Our models were trained using synthetic hand-drawn images generated by RanDepict, a tool for depicting chemical structures with different style elements. A benchmark was performed using a real-world dataset of hand-drawn chemical structures to evaluate the model's performance. The results indicate that our improved DECIMER architecture exhibits a significantly enhanced recognition accuracy compared to other approaches.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00872-7
Citations: 0
PromptSMILES: prompting for scaffold decoration and fragment linking in chemical language models
IF 7.1, Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2024-07-04 DOI: 10.1186/s13321-024-00866-5
Morgan Thomas, Mazen Ahmad, Gary Tresadern, Gianni de Fabritiis
Abstract: SMILES-based generative models are amongst the most robust and successful recent methods used to augment drug design. They are typically used for complete de novo generation; however, scaffold decoration and fragment linking applications are sometimes desirable, which requires a different grammar, architecture, and training dataset and, therefore, re-training of a new model. In this work, we describe a simple procedure to conduct constrained molecule generation with a SMILES-based generative model to extend applicability to scaffold decoration and fragment linking by providing SMILES prompts, without the need for re-training. In combination with reinforcement learning, we show that pre-trained, decoder-only models adapt to these applications quickly and can further optimize molecule generation towards a specified objective. We compare the performance of this approach to a variety of orthogonal approaches and show that performance is comparable or better. For convenience, we provide an easy-to-use Python package to facilitate model sampling, which can be found on GitHub and the Python Package Index.
Scientific contribution: This novel method extends an autoregressive chemical language model to scaffold decoration and fragment linking scenarios. This doesn't require re-training, the use of a bespoke grammar, or curation of a custom dataset, as commonly required by other approaches.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00866-5
Citations: 0
Application of machine reading comprehension techniques for named entity recognition in materials science
IF 7.1, Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2024-07-02 DOI: 10.1186/s13321-024-00874-5
Zihui Huang, Liqiang He, Yuhang Yang, Andi Li, Zhiwen Zhang, Siwei Wu, Yang Wang, Yan He, Xujie Liu
Abstract: Materials science is an interdisciplinary field that studies the properties, structures, and behaviors of different materials. A large amount of scientific literature contains rich knowledge in the field of materials science, but manually analyzing these papers to find material-related data is a daunting task. In information processing, named entity recognition (NER) plays a crucial role as it can automatically extract entities in the field of materials science, which have significant value in tasks such as building knowledge graphs. The sequence labeling methods typically used for traditional named entity recognition in materials science (MatNER) tasks often fail to fully utilize the semantic information in the dataset and cannot effectively extract nested entities. Herein, we proposed to convert the sequence labeling task into a machine reading comprehension (MRC) task. The MRC method can effectively solve the challenge of extracting multiple overlapping entities by transforming it into the form of answering multiple independent questions. Moreover, the MRC framework allows for a more comprehensive understanding of the contextual information and semantic relationships within materials science literature, by integrating prior knowledge from queries. State-of-the-art (SOTA) performance was achieved on the Matscholar, BC4CHEMD, NLMChem, SOFC, and SOFC-Slot datasets, with F1-scores of 89.64%, 94.30%, 85.89%, 85.95%, and 71.73%, respectively, using the MRC approach. By effectively utilizing semantic information and extracting nested entities, this approach holds great significance for knowledge extraction and data analysis in the field of materials science, thus accelerating the development of materials science.
Scientific contribution: We have developed an innovative NER method that enhances the efficiency and accuracy of automatic entity extraction in the field of materials science by transforming the sequence labeling task into an MRC task. This approach provides robust support for constructing knowledge graphs and other data analysis tasks.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00874-5
Citations: 0
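The abstract above describes recasting sequence labeling as machine reading comprehension, so that overlapping or nested entities become answers to independent questions. The sketch below shows only that data transformation, under stated assumptions: the label names `MAT` and `APL`, the question wording, and the example sentence are all hypothetical, and no model is trained or queried.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # character offsets, end exclusive


@dataclass
class MRCExample:
    question: str
    context: str
    answers: List[Span]


def find_span(context: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase in context."""
    start = context.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in context")
    return (start, start + len(phrase))


def to_mrc(
    context: str,
    entities: List[Tuple[Span, str]],
    queries: Dict[str, str],
) -> List[MRCExample]:
    """Build one question-answer example per entity type.

    Each entity type gets its own question, so spans of different
    types may overlap freely without competing for one tag per token.
    """
    examples = []
    for label, question in queries.items():
        answers = [span for span, lab in entities if lab == label]
        examples.append(MRCExample(question, context, answers))
    return examples


# Hypothetical queries; a real system would craft these per dataset.
QUERIES = {
    "MAT": "Which materials are mentioned in the text?",
    "APL": "Which applications or devices are mentioned in the text?",
}

text = "The lithium cobalt oxide cathode shows high capacity."
entities = [
    (find_span(text, "lithium cobalt oxide"), "MAT"),
    (find_span(text, "lithium cobalt oxide cathode"), "APL"),  # nested span
]

examples = to_mrc(text, entities, QUERIES)
for ex in examples:
    print(ex.question, [text[s:e] for s, e in ex.answers])
```

A single tag-per-token scheme cannot label "lithium cobalt oxide" as both a material and part of a device name, whereas the two questions here return both spans.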
CPSign: conformal prediction for cheminformatics modeling
IF 7.1, Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2024-06-28 DOI: 10.1186/s13321-024-00870-9
Staffan Arvidsson McShane, Ulf Norinder, Jonathan Alvarsson, Ernst Ahlberg, Lars Carlsson, Ola Spjuth
Abstract: Conformal prediction has seen many applications in pharmaceutical science, being able to calibrate outputs of machine learning models and produce valid prediction intervals. We here present the open source software CPSign, a complete implementation of conformal prediction for cheminformatics modeling. CPSign implements inductive and transductive conformal prediction for classification and regression, and probabilistic prediction with the Venn-ABERS methodology. The main chemical representation is signatures, but other types of descriptors are also supported. The main modeling methodology is support vector machines (SVMs), but additional modeling methods are supported via an extension mechanism, e.g. DeepLearning4J models. We also describe features for visualizing results from conformal models, including calibration and efficiency plots, as well as features to publish predictive models as REST services. We compare CPSign against other common cheminformatics modeling approaches, including random forest and a directed message-passing neural network. The results show that CPSign produces robust predictive performance with comparative predictive efficiency, with superior runtime and lower hardware requirements compared to neural network based models. CPSign has been used in several studies and is in production use in multiple organizations. The ability to work directly with chemical input files and to perform descriptor calculation and modeling with SVM in the conformal prediction framework, within a single software package that has a low footprint and fast execution time, makes CPSign a convenient yet flexible package for training, deploying, and predicting on chemical data. CPSign can be downloaded from GitHub at https://github.com/arosbio/cpsign.
Scientific contribution: CPSign provides a single software package that allows users to perform data preprocessing and modeling, and to make predictions directly on chemical structures, using conformal and probabilistic prediction. Building and evaluating new models can be achieved at a high abstraction level, without sacrificing flexibility and predictive performance. This is showcased with a method evaluation against contemporary modeling approaches, where CPSign performs on par with a state-of-the-art deep learning based model.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00870-9
Citations: 0
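The CPSign abstract above refers to inductive conformal prediction for regression. The sketch below is not CPSign's API, and it uses a toy stand-in model rather than an SVM. It illustrates only the generic inductive (split) recipe under those assumptions: compute absolute residuals on a held-out calibration set, take the ceil((n + 1)(1 - alpha))-th smallest residual as the half-width, and report prediction plus or minus that half-width. All names and data are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple


def calibration_residuals(
    model: Callable[[float], float],
    xs: Sequence[float],
    ys: Sequence[float],
) -> list:
    """Absolute errors of the model on held-out calibration data."""
    return [abs(y - model(x)) for x, y in zip(xs, ys)]


def conformal_halfwidth(residuals: Sequence[float], alpha: float) -> float:
    """Half-width giving at least (1 - alpha) coverage under exchangeability."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[k - 1]


def predict_interval(
    model: Callable[[float], float], x: float, halfwidth: float
) -> Tuple[float, float]:
    """Symmetric prediction interval around the point prediction."""
    y_hat = model(x)
    return (y_hat - halfwidth, y_hat + halfwidth)


def toy_model(x: float) -> float:
    """Stand-in for an already-trained regressor (hypothetical)."""
    return 2.0 * x


# Invented calibration set, kept separate from any training data.
cal_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
cal_y = [2.5, 3.0, 6.25, 8.0, 9.5, 12.75, 14.0]

residuals = calibration_residuals(toy_model, cal_x, cal_y)
halfwidth = conformal_halfwidth(residuals, alpha=0.25)
print(halfwidth, predict_interval(toy_model, 10.0, halfwidth))
```

The interval width here is the same for every input. Conformal regressors usually normalize the residuals by a per-example difficulty estimate to obtain input-dependent widths, which this sketch omits.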