Shelly Soffer, Mahmud Omar, Moran Gendler, Benjamin S Glicksberg, Patricia Kovatch, Orly Efros, Robert Freeman, Alexander W Charney, Girish N Nadkarni, Eyal Klang
{"title":"用于语义医疗保健任务中基准嵌入模型的可扩展框架。","authors":"Shelly Soffer, Mahmud Omar, Moran Gendler, Benjamin S Glicksberg, Patricia Kovatch, Orly Efros, Robert Freeman, Alexander W Charney, Girish N Nadkarni, Eyal Klang","doi":"10.1093/jamia/ocaf149","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>Text embeddings are promising for semantic tasks, such as retrieval augmented generation (RAG). However, their application in health care is underexplored due to a lack of benchmarking methods. We introduce a scalable benchmarking method to test embeddings for health-care semantic tasks.</p><p><strong>Materials and methods: </strong>We evaluated 39 embedding models across 7 medical semantic similarity tasks using diverse datasets. These datasets comprised real-world patient data (from the Mount Sinai Health System and MIMIC IV), biomedical texts from PubMed, and synthetic data generated with Llama-3-70b. We first assessed semantic textual similarity (STS) by correlating the model-generated similarity scores with noise levels using Spearman rank correlation. We then reframed the same tasks as retrieval problems, evaluated by mean reciprocal rank and recall at k.</p><p><strong>Results: </strong>In total, evaluating 2000 text pairs per 7 tasks for STS and retrieval yielded 3.28 million model assessments. Larger models (>7b parameters), such as those based on Mistral-7b and Gemma-2-9b, consistently performed well, especially in long-context tasks. The NV-Embed-v1 model (7b parameters), although top in short tasks, underperformed in long tasks. For short tasks, smaller models such as b1ade-embed (335M parameters) performed on-par to the larger models. 
For long retrieval tasks, the larger models significantly outperformed the smaller ones.</p><p><strong>Discussion: </strong>The proposed benchmarking framework demonstrates scalability and flexibility, offering a structured approach to guide the selection of embedding models for a wide range of health-care tasks.</p><p><strong>Conclusion: </strong>By matching the appropriate model with the task, the framework enables more effective deployment of embedding models, enhancing critical applications such as semantic search and retrieval-augmented generation (RAG).</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6000,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A scalable framework for benchmark embedding models in semantic health-care tasks.\",\"authors\":\"Shelly Soffer, Mahmud Omar, Moran Gendler, Benjamin S Glicksberg, Patricia Kovatch, Orly Efros, Robert Freeman, Alexander W Charney, Girish N Nadkarni, Eyal Klang\",\"doi\":\"10.1093/jamia/ocaf149\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objectives: </strong>Text embeddings are promising for semantic tasks, such as retrieval augmented generation (RAG). However, their application in health care is underexplored due to a lack of benchmarking methods. We introduce a scalable benchmarking method to test embeddings for health-care semantic tasks.</p><p><strong>Materials and methods: </strong>We evaluated 39 embedding models across 7 medical semantic similarity tasks using diverse datasets. These datasets comprised real-world patient data (from the Mount Sinai Health System and MIMIC IV), biomedical texts from PubMed, and synthetic data generated with Llama-3-70b. 
We first assessed semantic textual similarity (STS) by correlating the model-generated similarity scores with noise levels using Spearman rank correlation. We then reframed the same tasks as retrieval problems, evaluated by mean reciprocal rank and recall at k.</p><p><strong>Results: </strong>In total, evaluating 2000 text pairs per 7 tasks for STS and retrieval yielded 3.28 million model assessments. Larger models (>7b parameters), such as those based on Mistral-7b and Gemma-2-9b, consistently performed well, especially in long-context tasks. The NV-Embed-v1 model (7b parameters), although top in short tasks, underperformed in long tasks. For short tasks, smaller models such as b1ade-embed (335M parameters) performed on-par to the larger models. For long retrieval tasks, the larger models significantly outperformed the smaller ones.</p><p><strong>Discussion: </strong>The proposed benchmarking framework demonstrates scalability and flexibility, offering a structured approach to guide the selection of embedding models for a wide range of health-care tasks.</p><p><strong>Conclusion: </strong>By matching the appropriate model with the task, the framework enables more effective deployment of embedding models, enhancing critical applications such as semantic search and retrieval-augmented generation (RAG).</p>\",\"PeriodicalId\":50016,\"journal\":{\"name\":\"Journal of the American Medical Informatics Association\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2025-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of the American Medical Informatics 
Association\",\"FirstCategoryId\":\"91\",\"ListUrlMain\":\"https://doi.org/10.1093/jamia/ocaf149\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Medical Informatics Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1093/jamia/ocaf149","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
A scalable framework for benchmark embedding models in semantic health-care tasks.
Objectives: Text embeddings are promising for semantic tasks, such as retrieval-augmented generation (RAG). However, their application in health care is underexplored due to a lack of benchmarking methods. We introduce a scalable benchmarking method to test embeddings for health-care semantic tasks.
Materials and methods: We evaluated 39 embedding models across 7 medical semantic similarity tasks using diverse datasets. These datasets comprised real-world patient data (from the Mount Sinai Health System and MIMIC IV), biomedical texts from PubMed, and synthetic data generated with Llama-3-70b. We first assessed semantic textual similarity (STS) by correlating the model-generated similarity scores with noise levels using Spearman rank correlation. We then reframed the same tasks as retrieval problems, evaluated by mean reciprocal rank and recall at k.
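The two evaluation modes described above can be sketched as follows. This is a minimal illustration, not the authors' code: it uses random vectors as stand-ins for real embeddings, and simulates the noise-level setup by perturbing each "document" vector by a known amount so that a Spearman correlation between similarity scores and noise levels can be computed, alongside mean reciprocal rank and recall@k for the retrieval reframing.

```python
# Hypothetical sketch of the two evaluation modes: STS scored by Spearman
# correlation against known noise levels, and retrieval scored by mean
# reciprocal rank (MRR) and recall@k. Random vectors stand in for real
# model embeddings.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_pairs, dim = 200, 64

# --- STS mode ---
# Simulate text pairs: each "noisy" embedding is the original plus
# Gaussian noise scaled by a known per-pair noise level.
noise_levels = rng.uniform(0.0, 2.0, size=n_pairs)
base = rng.normal(size=(n_pairs, dim))
noisy = base + noise_levels[:, None] * rng.normal(size=(n_pairs, dim))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = [cosine(base[i], noisy[i]) for i in range(n_pairs)]
# A good embedding model should show a strong negative correlation:
# more noise -> lower similarity score.
rho, _ = spearmanr(sims, noise_levels)
print(f"Spearman rho: {rho:.3f}")

# --- Retrieval mode ---
def mrr_and_recall_at_k(queries, corpus, gold_idx, k=5):
    """Rank the corpus for each query by dot product; score against gold."""
    reciprocal_ranks, hits = [], 0
    for q, gold in zip(queries, gold_idx):
        ranking = np.argsort(-(corpus @ q))          # best-first order
        rank = int(np.where(ranking == gold)[0][0]) + 1
        reciprocal_ranks.append(1.0 / rank)
        hits += rank <= k
    return float(np.mean(reciprocal_ranks)), hits / len(gold_idx)

# Each query's gold document is its own noisy counterpart.
mrr, recall = mrr_and_recall_at_k(base, noisy, np.arange(n_pairs), k=5)
print(f"MRR: {mrr:.3f}, recall@5: {recall:.3f}")
```

Reframing the same pairs as a retrieval task, as the paper does, lets a single dataset drive both metrics; the function and variable names here are illustrative only.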
Results: In total, evaluating 2000 text pairs for each of the 7 tasks, in both the STS and retrieval settings, yielded 3.28 million model assessments. Larger models (>7b parameters), such as those based on Mistral-7b and Gemma-2-9b, consistently performed well, especially in long-context tasks. The NV-Embed-v1 model (7b parameters), although top in short tasks, underperformed in long tasks. For short tasks, smaller models such as b1ade-embed (335M parameters) performed on par with the larger models. For long retrieval tasks, the larger models significantly outperformed the smaller ones.
Discussion: The proposed benchmarking framework demonstrates scalability and flexibility, offering a structured approach to guide the selection of embedding models for a wide range of health-care tasks.
Conclusion: By matching the appropriate model with the task, the framework enables more effective deployment of embedding models, enhancing critical applications such as semantic search and retrieval-augmented generation (RAG).
About the journal:
JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.