Oishi Banerjee, Lucas Bijnens, Subathra Adithan, Pranav Rajpurkar
{"title":"The Intention-Execution Disconnect in Medical AI: The ReXecution Framework for Evaluating Real-World Clinical Performance.","authors":"Oishi Banerjee, Lucas Bijnens, Subathra Adithan, Pranav Rajpurkar","doi":"10.1142/9789819824755_0021","DOIUrl":"10.1142/9789819824755_0021","url":null,"abstract":"<p><p>We present the ReXecution framework for conducting clinician-centered assessments of medical AI assistants, providing detailed insights into their reliability in realistic clinical settings. Using this framework, we assessed AI assistants for chest X-ray (CXR) interpretation, exploring the gap between current model capabilities and real-world radiological needs. Unlike prior benchmarks that rely on automatically generated questions with limited clinical relevance, our dataset consists of 100 expert-curated tasks that radiologists might realistically present to an AI assistant in their day-to-day workflow. Through detailed manual review by a radiologist, we evaluated two leading foundation models, ChatGPTo3 and MedGemma, on our tasks. While both models demonstrated considerable medical knowledge and reasoning capabilities on our tasks, they frequently struggled to interpret images and execute tasks accurately, producing correct outputs in only 5-10% of cases. Our detailed manual evaluation highlights a critical mismatch: models often abstractly understand radiology concepts but cannot reliably execute their plans when interpreting specific medical images. This work identifies key gaps in current models' ability to serve as comprehensive radiology assistants and provides insights into how the development and evaluation of models can better align with real-world clinician needs, enabling seamless clinician-AI collaboration.</p>","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"31 ","pages":"294-308"},"PeriodicalIF":0.0,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ben Viggiano, Wenhui Sophia Lu, Xiaowei Zhang, Luis S Mille-Fragoso, Xiaojing J Gao, Euan Ashley, Wing Hung Wong
{"title":"Steering Protein Generative Models at Test-Time for Guided AAV2 Capsid Design.","authors":"Ben Viggiano, Wenhui Sophia Lu, Xiaowei Zhang, Luis S Mille-Fragoso, Xiaojing J Gao, Euan Ashley, Wing Hung Wong","doi":"10.1142/9789819824755_0031","DOIUrl":"10.1142/9789819824755_0031","url":null,"abstract":"<p><p>Recent advances in protein generative models have created new opportunities for protein engineering. However, a significant challenge remains in effectively steering these models to generate sequences with specific, desired functionalities, especially when these properties are defined by \"black-box\" or non-differentiable fitness functions. To address this, we present ProVADA+, a model-agnostic framework that guides pretrained generative models at test-time without costly retraining. Our approach introduces a reinforcement learning-based adaptive masking technique (MADA-DUCB) that significantly accelerates convergence. We demonstrate this framework on the challenging task of designing novel Adeno-Associated Virus 2 (AAV2) capsids. By coupling a ProteinMPNN generative prior with a fine-tuned AAV viability oracle, our method successfully navigates the rugged fitness landscape where unguided random mutagenesis is ineffective, with prior experiments showing that as few as 0.3% of variants with six or more mutations are viable. In its final iterations, ProVADA generated a pool of novel candidates with a mean viral selection score of 2.72, consistently scoring highly viable variants while maintaining a diverse range of sequence similarity to the wildtype sequence. Our results show that ProVADA provides a powerful and efficient framework for accelerating the design of proteins with complex, user-defined properties.</p>","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"31 ","pages":"438-451"},"PeriodicalIF":0.0,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12952671/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ColonCrafter: A Depth Estimation Model for Colonoscopy Videos Using Diffusion Priors.","authors":"Romain Hardy, Tyler M Berzin, Pranav Rajpurkar","doi":"10.1142/9789819824755_0003","DOIUrl":"10.1142/9789819824755_0003","url":null,"abstract":"<p><p>Three-dimensional (3D) scene understanding in colonoscopy presents significant challenges that necessitate automated methods for accurate depth estimation. However, existing depth estimation models for endoscopy struggle with temporal consistency across video sequences, limiting their applicability for 3D reconstruction. We present ColonCrafter, a diffusion-based depth estimation model that generates temporally consistent depth maps from monocular colonoscopy videos. Our approach learns robust geometric priors from synthetic colonoscopy sequences, enabling reliable depth estimation across frames. We also introduce a style transfer technique that preserves geometric structure while adapting realistic clinical videos to match our synthetic training domain. ColonCrafter achieves state-of-the-art zero-shot performance on the C3VD dataset, outperforming both general-purpose and endoscopy-specific approaches. Although full trajectory 3D reconstruction remains a challenge, we demonstrate clinically relevant applications of ColonCrafter, including 3D point cloud generation and surface coverage assessment. Our code will be made publicly available at https://github.com/rajpurkarlab/ColonCrafter.</p>","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"31 ","pages":"27-41"},"PeriodicalIF":0.0,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seowon Chang, Anna Shcherbina, Tal Ashuach, Shahin Mohammadi, Stephanie See, Ninad Ranadive, Emily Fox, Navpreet Ranu
{"title":"PertSpectra: Interpretable Matrix Factorization for Predicting Functional Impact of Genetic Perturbation Experiments.","authors":"Seowon Chang, Anna Shcherbina, Tal Ashuach, Shahin Mohammadi, Stephanie See, Ninad Ranadive, Emily Fox, Navpreet Ranu","doi":"10.1142/9789819824755_0033","DOIUrl":"10.1142/9789819824755_0033","url":null,"abstract":"<p><p>In drug discovery, measuring the effects of genetic perturbations is a powerful tool for studying unknown disease mechanisms, but biological interpretation of these effects, especially with the advent of screens involving combinatorial perturbations, remains challenging. To address limitations in current methodology, we introduce PertSpectra, a guided triple matrix factorization that incorporates perturbation information and regularizes the model using a known gene-gene interaction graph prior to generate sparse, biologically relevant latent factors that capture perturbational effects. We evaluate PertSpectra on three single-cell RNAseq datasets with both single and combinatorial genetic perturbations, measuring latent space interpretability, predictive ability on unseen combinations of observed perturbations, and stratification of functionally similar perturbations. We show that PertSpectra provides an integrated modeling approach to understanding combinatorial perturbation data in the context of drug discovery.</p>","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"31 ","pages":"465-479"},"PeriodicalIF":0.0,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implicitly and Differentiably Representing Protein Surfaces and Interfaces.","authors":"Cory B Scott, Charlie Rothschild, Benjamin E Nye","doi":"10.1142/9789819824755_0030","DOIUrl":"10.1142/9789819824755_0030","url":null,"abstract":"<p><p>We introduce a pipeline for implicitly representing a protein, or protein complex, as the union of signed distance functions (SDFs) by representing each atom as a sphere with the appropriate van der Waals radius. While this idea has been used previously as a way to render images of proteins, it has not, to our knowledge, been widely adopted in a machine learning setting. Mirroring recent successful work applying SDFs to represent 3D geometry, we present a proof of concept that this representation of proteins could be useful in several biologically relevant applications. We also propose further experiments that are necessary to validate the proposed approach.</p>","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"31 ","pages":"425-437"},"PeriodicalIF":0.0,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eric Chen, Sam Postelnik, Kameron Black, Yixing Jiang, Jonathan H Chen
{"title":"MedAgentBench v2: Improving Medical LLM Agent Design.","authors":"Eric Chen, Sam Postelnik, Kameron Black, Yixing Jiang, Jonathan H Chen","doi":"10.1142/9789819824755_0025","DOIUrl":"10.1142/9789819824755_0025","url":null,"abstract":"<p><p>MedAgentBench is the first benchmark for evaluating LLM agents on clinical tasks in a FHIR-compliant EHR. In this paper, we present significant prompt engineering and tool design improvements over the original agent implementation and introduce a memory component that enables the agent to learn from prior failures. We added new tools for the agent to properly format its output for tasks, interact with an EHR without constructing explicit HTTP requests, which were prone to syntax errors, and make math calculations. We also wrote a new system prompt that asked the agent to outline its plan before making any tool calls and think step by step using chain-of-thought reasoning, and provided few-shot examples of good vs. bad outputs. Using GPT-4.1 as the base model, our agent achieved a success rate of 91.0% without memory and 98.0% with memory. A surprising consequence is that the agent performed better on a different task that had no associated memory entry, possibly demonstrating that LLMs can adapt to the style of tasks presented by users. To contribute to the benchmark and evaluate the generalization of our agent, we developed 300 new multi-step clinically-driven tasks in collaboration with a physician. Lastly, we show the current limitations of these benchmarks and highlight the necessary next steps and challenges for the responsible deployment of AI agents in real-world healthcare settings. We hope that this paper leads to further development of EHR agents and benchmarks.</p>","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"31 ","pages":"354-371"},"PeriodicalIF":0.0,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Liam G McCoy, David Wu, Sarita Khemani, Saloni Kumar Maharaj, Arth Pahwa, Leah Rosengaus, Lena Giang, Olivia Jee, Ethan Goh, Fateme Nateghi Haredasht, Kanav Chopra, David Jh Wu, Abass Conteh, Vishnu Ravi, Yingjie Weng, Kelvin Zhenghao Li, Daniel Shirvani, Jonathan H Chen
{"title":"Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates.","authors":"Liam G McCoy, David Wu, Sarita Khemani, Saloni Kumar Maharaj, Arth Pahwa, Leah Rosengaus, Lena Giang, Olivia Jee, Ethan Goh, Fateme Nateghi Haredasht, Kanav Chopra, David Jh Wu, Abass Conteh, Vishnu Ravi, Yingjie Weng, Kelvin Zhenghao Li, Daniel Shirvani, Jonathan H Chen","doi":"10.1142/9789819824755_0028","DOIUrl":"10.1142/9789819824755_0028","url":null,"abstract":"<p><p>This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford's eConsult team, we assess frontier models (including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro) for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model's ability to prioritize clinically salient information within the time constraints of real-world physician communication. Limitations include reliance on Stanford-specific templates and concordance-based grading, which may not capture all clinically reasonable outputs.</p>","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"31 ","pages":"400-416"},"PeriodicalIF":0.0,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13138846/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Steven E Brenner, Nilah M Ioannidis, Tayo Obafemi-Ajayi, Anne O'Donnell-Luria
{"title":"Session Introduction: Precision Medicine: Integrating Large-Scale Data and Intermediate Phenotypes for Understanding Health and Treating Disease.","authors":"Steven E Brenner, Nilah M Ioannidis, Tayo Obafemi-Ajayi, Anne O'Donnell-Luria","doi":"10.1142/9789819824755_0043","DOIUrl":"10.1142/9789819824755_0043","url":null,"abstract":"<p><p>The field of precision medicine has undergone rapid development over the past three decades, driven by advances in high-throughput molecular profiling, large-scale electronic health data, and computational modeling. The central objective is to refine disease risk prediction, diagnosis, and treatment strategies by incorporating genetic, molecular, environmental, and clinical information into individualized care. However, the effective integration of these heterogeneous data sources presents substantial analytical challenges. The 2026 Precision Medicine session of the Pacific Symposium on Biocomputing (PSB) highlights computational methods that bridge large-scale biological data and intermediate phenotypes, emphasizing approaches that advance mechanistic understanding, risk prediction, and clinical utility. The contributions span multi-modal risk modeling, biomarker discovery, and causal inference frameworks, demonstrating the breadth and depth of research in computational precision medicine.</p>","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"31 ","pages":"596-599"},"PeriodicalIF":0.0,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gene-R1: Reasoning with Data-Augmented Lightweight LLMs for Gene Set Analysis.","authors":"Zhizheng Wang, Yifan Yang, Qiao Jin, Zhiyong Lu","doi":"10.1142/9789819824755_0035","DOIUrl":"10.1142/9789819824755_0035","url":null,"abstract":"<p><p>The gene set analysis (GSA) is a foundational approach for uncovering the molecular functions associated with a group of genes. Recently, LLM-powered methods have emerged to annotate gene sets with biological functions together with coherent explanatory insights. However, existing studies primarily focus on proprietary models, which have been shown to outperform their open-source counterparts despite concerns over cost and data privacy. Furthermore, no research has investigated the application of advanced reasoning strategies to the GSA task. To address this gap, we introduce Gene-R1, a data-augmented learning framework that equips lightweight and open-source LLMs with step-by-step reasoning capabilities tailored to GSA. Experiments on 1,508 in-distribution gene sets demonstrate that Gene-R1 achieves substantial performance gains, matching commercial LLMs. On 106 out-of-distribution gene sets, Gene-R1 performs comparably to both commercial and large-scale LLMs, exhibiting robust generalizability across diverse gene sources.</p>","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"31 ","pages":"494-507"},"PeriodicalIF":0.0,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovery of Disease Relationships via Transcriptomic Signature Analysis Powered by Agentic AI.","authors":"Ke Chen, Haohan Wang","doi":"10.1142/9789819824755_0058","DOIUrl":"10.1142/9789819824755_0058","url":null,"abstract":"<p><p>Modern disease classification often overlooks molecular commonalities hidden beneath divergent clinical presentations. This study introduces a transcriptomics-driven framework for discovering disease relationships by analyzing over 1,300 disease-condition pairs using GenoMAS, a fully automated agentic AI system. Beyond identifying robust gene-level overlaps, we develop a novel pathway-based similarity framework that integrates multi-database enrichment analysis to quantify functional convergence across diseases. The resulting disease similarity network reveals both known comorbidities and previously undocumented cross-category links. By examining shared biological pathways, we explore potential molecular mechanisms underlying these connections, offering functional hypotheses that go beyond symptom-based taxonomies. We further show how background conditions such as obesity and hypertension modulate transcriptomic similarity, and identify therapeutic repurposing opportunities for rare diseases like autism spectrum disorder based on their molecular proximity to better-characterized conditions. In addition, this work demonstrates how biologically grounded agentic AI can scale transcriptomic analysis while enabling mechanistic interpretation across complex disease landscapes. All results are publicly accessible at https://github.com/KeeeeChen/Pathway_Similarity_Network.</p>","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"31 ","pages":"799-814"},"PeriodicalIF":0.0,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}