Rapid Adaptation of Chemical Named Entity Recognition Using Few-Shot Learning and LLM Distillation.

IF 5.3, Zone 2 (Chemistry), JCR Q1: CHEMISTRY, MEDICINAL

Journal of Chemical Information and Modeling, Pub Date: 2025-05-01, DOI: 10.1021/acs.jcim.5c00248

Yue Zhang, Dionisios G Vlachos, Dongxia Liu, Hui Fang


Citations: 0

Abstract

Named entity recognition (NER) is widely used in chemical text mining to automatically identify and extract chemical entities. However, existing chemical NER systems primarily target scenarios with abundant training data, which require significant human annotation effort. This poses challenges for applications in chemical fields such as catalysis, where many advances have traditionally relied on trial-and-error investigation and incremental adjustment of variables, and it hinders progress in catalysis science and technology toward addressing emerging energy and environmental crises. In this work, we propose a few-shot NER model that can quickly adapt to extract new types of chemical entities using only a limited number of annotated examples. Our model employs a metric-learning approach to transfer entity-similarity knowledge from high-resource chemical domains (with abundant annotations) to low-resource specialized domains (with limited annotations), enabling effective entity recognition there. We validate the model on a few-shot chemical NER benchmark built from six existing chemical NER data sets. Experiments show that the proposed few-shot NER model achieves reasonable performance with only 5 examples per entity type and improves consistently as the number of examples increases. Furthermore, we demonstrate how the proposed model can be trained on data annotated by a large language model (LLM), opening a new pathway for rapid adaptation of NER systems. Our approach leverages the breadth of chemical knowledge in LLMs while distilling this knowledge into a lightweight model suitable for efficient, in-house use.
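To make the metric-learning idea in the abstract concrete, the sketch below shows one common form of it: nearest-prototype classification, where each entity type is represented by the mean embedding of its few annotated examples and a new token takes the label of the most similar prototype. This is a generic illustration, not the authors' model. The function names, the entity labels (`CATALYST`, `SOLVENT`, `O`), and the toy 3-dimensional vectors are all assumptions made for the example; a real system would obtain embeddings from an encoder trained on high-resource chemical NER data.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_prototypes(support):
    """Average the support embeddings of each label into one prototype.

    support: list of (embedding, label) pairs, i.e. the few-shot examples.
    Returns a dict mapping each label to its mean embedding.
    """
    grouped = defaultdict(list)
    for vec, label in support:
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify(vec, prototypes):
    """Assign the label whose prototype is most similar to vec."""
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))


# Hypothetical toy embeddings standing in for encoder outputs.
support = [
    ([0.9, 0.1, 0.0], "CATALYST"),
    ([0.8, 0.2, 0.1], "CATALYST"),
    ([0.1, 0.9, 0.0], "SOLVENT"),
    ([0.0, 0.8, 0.2], "SOLVENT"),
    ([0.0, 0.1, 0.9], "O"),
    ([0.1, 0.0, 0.8], "O"),
]
prototypes = build_prototypes(support)
prediction = classify([0.7, 0.3, 0.0], prototypes)
print(prediction)
```

Under this scheme, adapting to a new entity type only requires adding a few labeled examples to `support` and rebuilding the prototypes; the encoder itself is not retrained, which is what makes this kind of approach attractive in low-resource settings.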


Source journal: Journal of Chemical Information and Modeling (Chemistry: Chemistry, Multidisciplinary)

CiteScore: 9.80

Self-citation rate: 10.70%

Articles published: 529

Review time: 1.4 months

About the journal: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.