Topic analysis on publications and patents toward fully automated translational science benefits model impact extraction.

Impact factor (IF): 1.6

Frontiers in Research Metrics and Analytics. Pub Date: 2025-09-23. eCollection Date: 2025-01-01. DOI: 10.3389/frma.2025.1596687

Tejaswini Manjunath, Eline Appelmans, Sinem Balta, Dominick DiMercurio, Claudia Avalos, Karen Stark

Volume 10, article 1596687. Publication type: Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12500706/pdf/

Citations: 0

Abstract

Background: The Clinical and Translational Science Award (CTSA) program, funded by the National Center for Advancing Translational Sciences (NCATS), has supported over 65 hubs, generating 118,490 publications from 2006 to 2021. Measuring the impact of these outputs remains challenging, as traditional bibliometric methods fail to capture patents, policy contributions, and clinical implementation. The Translational Science Benefits Model (TSBM) provides a structured framework for assessing clinical, community, economic, and policy benefits, but its manual application is resource-intensive. Advances in Natural Language Processing (NLP) and Artificial Intelligence (AI) offer a scalable solution for automating benefit extraction from large research datasets.

Objective: This study presents an NLP-driven pipeline that automates the extraction of TSBM benefits from research outputs using Latent Dirichlet Allocation (LDA) topic modeling, enabling efficient, scalable, and reproducible impact analysis. Applying NLP lets topics and benefits emerge from the very large corpus of CTSA documents without directed searches or a preconceived list of benefits to mine for.

Methods: We applied LDA topic modeling to publications, patents, and grants and mapped the topics to TSBM benefits using subject matter expert (SME) validation. Impact visualizations, including heatmaps and t-SNE plots, highlighted benefit distributions across the corpus and CTSA hubs.

Results: Spanning CTSA hub grants awarded from 2006 to 2023, our analysis corpus comprised 1,296 projects, 127,958 publications, and 352 patents. Applying our NLP-driven pipeline to deduplicated data, we found that clinical and community benefits were the benefits most frequently extracted from publications and projects, reflecting the patient-centered and community-driven nature of CTSA research. Economic and policy benefits were identified less frequently, prompting the inclusion of patent data to better capture commercialization impacts. The Publications LDA Model proved the most effective at extracting benefits from publications and projects. All patents were automatically tagged as economic benefits, given their intrinsic focus on commercialization and in accordance with TSBM guidelines.

Conclusion: Automated NLP-driven benefit extraction enabled a data-driven approach to applying the TSBM at the scale of the CTSA program's entire output.
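
The mapping step described in the Methods and Results can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each document already carries a topic-weight vector from a fitted LDA model, that a document receives the benefit of its single highest-weight topic, and that a hypothetical table `TOPIC_TO_BENEFIT` stands in for the SME-validated topic-to-benefit mapping. The hub names, weights, and four-topic layout are invented for illustration. The only rule taken directly from the abstract is that patents are always tagged as economic benefits.

```python
from collections import Counter, defaultdict

# Hypothetical SME-validated mapping from LDA topic index to TSBM benefit domain.
# The real mapping in the study is not given in the abstract.
TOPIC_TO_BENEFIT = {
    0: "clinical",
    1: "community",
    2: "economic",
    3: "policy",
}


def dominant_topic(topic_weights):
    """Return the index of the highest-weight topic for one document."""
    return max(range(len(topic_weights)), key=lambda i: topic_weights[i])


def tag_benefit(doc):
    """Assign one TSBM benefit domain to a document record."""
    if doc["type"] == "patent":
        # Patents are tagged as economic benefits without consulting the topic model.
        return "economic"
    # Assumption: one benefit per document, taken from its dominant topic.
    return TOPIC_TO_BENEFIT[dominant_topic(doc["topic_weights"])]


def benefit_counts_by_hub(docs):
    """Build a hub -> benefit -> count table, the raw input for a heatmap."""
    table = defaultdict(Counter)
    for doc in docs:
        table[doc["hub"]][tag_benefit(doc)] += 1
    return {hub: dict(counts) for hub, counts in table.items()}


# Toy records; hub names and topic weights are invented for illustration.
docs = [
    {"hub": "Hub A", "type": "publication", "topic_weights": [0.7, 0.1, 0.1, 0.1]},
    {"hub": "Hub A", "type": "publication", "topic_weights": [0.2, 0.6, 0.1, 0.1]},
    {"hub": "Hub A", "type": "patent", "topic_weights": [0.5, 0.3, 0.1, 0.1]},
    {"hub": "Hub B", "type": "project", "topic_weights": [0.1, 0.1, 0.2, 0.6]},
]

counts = benefit_counts_by_hub(docs)
print(counts)
```

The resulting hub-by-benefit table is the kind of matrix a heatmap of benefit distributions across hubs would be drawn from. A pipeline that allows several benefits per document would replace `dominant_topic` with a threshold over the topic weights.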
