A scalable framework for benchmarking embedding models in semantic health-care tasks
Shelly Soffer, Mahmud Omar, Moran Gendler, Benjamin S Glicksberg, Patricia Kovatch, Orly Efros, Robert Freeman, Alexander W Charney, Girish N Nadkarni, Eyal Klang
Journal of the American Medical Informatics Association (2025). DOI: 10.1093/jamia/ocaf149
Abstract
Objectives: Text embeddings are promising for semantic tasks, such as retrieval-augmented generation (RAG). However, their application in health care is underexplored due to a lack of benchmarking methods. We introduce a scalable benchmarking method to test embeddings for health-care semantic tasks.
Materials and methods: We evaluated 39 embedding models across 7 medical semantic similarity tasks using diverse datasets. These datasets comprised real-world patient data (from the Mount Sinai Health System and MIMIC-IV), biomedical texts from PubMed, and synthetic data generated with Llama-3-70b. We first assessed semantic textual similarity (STS) by correlating the model-generated similarity scores with noise levels using Spearman rank correlation. We then reframed the same tasks as retrieval problems, evaluated by mean reciprocal rank (MRR) and recall at k.
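To make the two evaluation framings concrete, the sketch below shows how such metrics are typically computed. This is an illustrative reconstruction, not the authors' code: embed is a hypothetical function returning a unit-normalized vector for a text, and the sign of the STS correlation depends on how noise levels are encoded.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    # vectors assumed unit-normalized, so the dot product equals cosine similarity
    return float(np.dot(a, b))

def sts_score(pairs, noise_levels, embed):
    """STS framing: Spearman correlation between similarity and noise.
    A well-behaved model's similarity score should fall monotonically
    as more noise is injected into one side of each text pair."""
    sims = [cosine(embed(a), embed(b)) for a, b in pairs]
    rho, _ = spearmanr(sims, noise_levels)
    return rho

def mrr_and_recall_at_k(queries, candidates, gold_idx, embed, k=5):
    """Retrieval framing: mean reciprocal rank and recall@k,
    where gold_idx[i] is the index of the correct candidate for queries[i]."""
    cand_vecs = np.stack([embed(c) for c in candidates])
    reciprocal_ranks, hits = [], 0
    for q, gold in zip(queries, gold_idx):
        scores = cand_vecs @ embed(q)      # cosine scores against all candidates
        ranking = np.argsort(-scores)      # candidate indices, best first
        rank = int(np.where(ranking == gold)[0][0]) + 1
        reciprocal_ranks.append(1.0 / rank)
        hits += int(rank <= k)
    return float(np.mean(reciprocal_ranks)), hits / len(queries)
```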
Results: In total, evaluating 2000 text pairs in each of the 7 tasks for both STS and retrieval yielded 3.28 million model assessments. Larger models (>7B parameters), such as those based on Mistral-7b and Gemma-2-9b, consistently performed well, especially in long-context tasks. The NV-Embed-v1 model (7B parameters), although a top performer in short tasks, underperformed in long tasks. For short tasks, smaller models such as b1ade-embed (335M parameters) performed on par with the larger models. For long retrieval tasks, the larger models significantly outperformed the smaller ones.
Discussion: The proposed benchmarking framework demonstrates scalability and flexibility, offering a structured approach to guide the selection of embedding models for a wide range of health-care tasks.
Conclusion: By matching the appropriate model with the task, the framework enables more effective deployment of embedding models, enhancing critical applications such as semantic search and retrieval-augmented generation (RAG).
About the journal:
JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.