A knowledge-graph-based pharmaceutical engineering chatbot for drug discovery
Naz Pinar Taskiran, Chia-En Jacklyn Tsai, Shuxin Huang, Arijit Chakraborty, Venkat Venkatasubramanian
Computers & Chemical Engineering, volume 203, Article 109318. Published 2025-08-19. DOI: 10.1016/j.compchemeng.2025.109318
https://www.sciencedirect.com/science/article/pii/S0098135425003205
Citations: 0
Abstract
Despite their success in day-to-day applications, ChatGPT and other large language models (LLMs) have not covered as much ground in scientific and engineering domains. One key challenge is the abundance of domain-specific terminology, which an LLM is not trained to extract in accordance with the underlying physical laws. Such black-box models can also lead to unreliable results or hallucinations. Hybrid AI, which combines data-driven and symbolic methods, leverages domain knowledge to add explainability and reliability to answers. Our group has previously developed a domain-informed, ontology-based information extraction tool called SUSIE, which extracts key terms and their context and presents them to the user as knowledge graphs (KGs). Although KGs are used to visualize relationships between different entities, they cannot easily be queried directly with user questions. They can, however, serve as structured input for LLMs. KGs can thus be used to query a corpus of pharmaceutical documents efficiently, streamlining drug discovery and manufacturing processes. In this work, we propose methods to improve the information extraction capabilities of SUSIE by expanding its knowledge base and improving its ability to understand scientific material through a sentence-restructuring module. Additionally, we present a customized question-and-answer module that enables the user to query the generated KGs and get an answer in natural language. Unlike black-box models such as those purely powered by OpenAI’s models and the LangChain GraphQA packages, combining our KGs with Neo4j limits hallucinations and provides reliable and traceable answers in a user-friendly chatbot interface.
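The idea of answering only from knowledge-graph facts, with each answer traceable to its source, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract says the KGs are combined with Neo4j, whereas this sketch uses an in-memory list as a stand-in. The `Triple` class, the `query` and `answer` helpers, the relation names, and the example triples are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    source: str  # sentence the triple was extracted from (provenance)


# Hypothetical triples, invented for illustration; not data from the paper.
KG = [
    Triple("ibuprofen", "dissolved_in", "ethanol",
           "Ibuprofen was dissolved in ethanol."),
    Triple("ibuprofen", "crystallized_at", "5 C",
           "Ibuprofen was crystallized at 5 C."),
    Triple("aspirin", "dissolved_in", "acetone",
           "Aspirin was dissolved in acetone."),
]


def query(kg, subject, relation):
    """Return all triples whose subject and relation match exactly."""
    return [t for t in kg if t.subject == subject and t.relation == relation]


def answer(kg, subject, relation):
    """Build an answer only from matched triples and cite their sources."""
    hits = query(kg, subject, relation)
    if not hits:
        # Decline rather than guess: nothing in the graph supports an answer.
        return "No supporting fact found in the knowledge graph."
    facts = "; ".join(
        f"{t.subject} {t.relation.replace('_', ' ')} {t.obj}" for t in hits
    )
    sources = " | ".join(t.source for t in hits)
    return f"{facts} (source: {sources})"


if __name__ == "__main__":
    print(answer(KG, "ibuprofen", "dissolved_in"))
    print(answer(KG, "caffeine", "dissolved_in"))
```

Because every answer is assembled from stored triples and carries the sentence it came from, a question with no matching fact yields an explicit "not found" reply instead of a generated guess, which is the behavior the abstract contrasts with purely LLM-driven pipelines.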
About the journal:
Computers & Chemical Engineering is primarily a journal of record for new developments in the application of computing and systems technology to chemical engineering problems.