Biodata Mining. Pub Date: 2025-08-05. DOI: 10.1186/s13040-025-00466-5
Zhiping Paul Wang, Xi Li, Nicholas Matsumoto, Mythreye Venkatesan, Jui-Hsuan Chang, Jay Moran, Hyunjun Choi, Binglan Li, Yufei Meng, Miguel E Hernandez, Jason H Moore
{"title":"Drug repurposing for Alzheimer's disease using a graph-of-thoughts based large language model to infer drug-disease relationships in a comprehensive knowledge graph.","authors":"Zhiping Paul Wang, Xi Li, Nicholas Matsumoto, Mythreye Venkatesan, Jui-Hsuan Chang, Jay Moran, Hyunjun Choi, Binglan Li, Yufei Meng, Miguel E Hernandez, Jason H Moore","doi":"10.1186/s13040-025-00466-5","DOIUrl":"10.1186/s13040-025-00466-5","url":null,"abstract":"<p><p>Drug repurposing (DR) offers a promising alternative to the high cost and low success rate of traditional drug development, especially for complex diseases like Alzheimer's disease (AD). This study addressed DR for AD from three key angles: (1) demonstrating how disease-specific knowledge graphs can improve DR performance, (2) evaluating the role of large language models (LLMs) in enhancing the usability and efficiency of these graphs, and (3) assessing whether Graph-of-Thoughts (GoT)-enhanced LLMs, when integrated with AD knowledge graphs, can outperform traditional machine learning and LLM-based approaches. We tested five distinct DR strategies (DR1-DR5) for AD: DR1, a machine learning method using TxGNN; DR2, a machine learning model leveraging the Alzheimer's KnowledgeBase (AlzKB); DR3, an LLM-based chatbot built on AlzKB; DR4, our ESCARGOT framework combining GoT-enhanced LLMs with AlzKB; and DR5, a general reasoning-driven LLM approach. Results showed that AlzKB significantly improved DR outcomes. ESCARGOT further enhanced performance while reducing the need for coding or advanced expertise in knowledge graph analysis. 
Because the architecture of AlzKB is easily adaptable to other diseases and ESCARGOT can integrate with various knowledge graph platforms, this framework offers a broadly applicable, innovative tool for accelerating drug discovery through repurposing.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"51"},"PeriodicalIF":6.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12326721/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144790506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-08-01. DOI: 10.1186/s13040-025-00468-3
Petr Ryšavý, Alikhan Anuarbekov, Michaela Dostálová Merkerová, Jiří Kléma
{"title":"circGPAcorr: an integrative tool for functional annotation of circular RNAs using expression data.","authors":"Petr Ryšavý, Alikhan Anuarbekov, Michaela Dostálová Merkerová, Jiří Kléma","doi":"10.1186/s13040-025-00468-3","DOIUrl":"10.1186/s13040-025-00468-3","url":null,"abstract":"<p><p>Circular RNAs play a crucial role in cell development and serve as biomarkers in many diseases. Nevertheless, the function of many circular RNAs remains unknown. This function can be inferred from sponging and silencing interactions with microRNAs and messenger RNAs. We recently proposed a network-based circRNA functional annotation tool, circGPA. However, validation data for RNA interactions are often sparse and predicted interactions contain many false positives. To address this issue, we propose an extended algorithm named circGPAcorr, which uses expression data to weight the interactions, resulting in more precise functional annotation. To assess the significance of the results, the p-value is calculated using a reduction to circGPA, a generating-polynomial-based method. We show that the problem is #P-hard, and thus computationally difficult. The circGPAcorr algorithm is tested on publicly available myelodysplastic syndrome expression data, providing gene ontology annotations that align with the literature on myelodysplastic syndromes. 
At the same time, we demonstrate its performance in the circRNA-disease annotation task.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"50"},"PeriodicalIF":6.1,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12317645/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144765669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-07-29. DOI: 10.1186/s13040-025-00467-4
Muhammad Arshad, Chengliang Wang, Muhammad Wajeeh Us Sima, Jamshed Ali Shaikh, Hanen Karamti, Raed Alharthi, Julius Selecky
{"title":"BioAug-Net: a bioimage sensor-driven attention-augmented segmentation framework with physiological coupling for early prostate cancer detection in T2-weighted MRI.","authors":"Muhammad Arshad, Chengliang Wang, Muhammad Wajeeh Us Sima, Jamshed Ali Shaikh, Hanen Karamti, Raed Alharthi, Julius Selecky","doi":"10.1186/s13040-025-00467-4","DOIUrl":"10.1186/s13040-025-00467-4","url":null,"abstract":"<p><p>Accurate segmentation of the prostate peripheral zone (PZ) in T2-weighted MRI is critical for the early detection of prostate cancer. Existing segmentation methods are hindered by significant inter-observer variability (37.4 ± 5.6%), poor boundary localization, and the presence of motion artifacts, along with challenges in clinical integration. In this study, we propose BioAug-Net, a novel framework that integrates real-time physiological signal feedback with MRI data, leveraging transformer-based attention mechanisms and a probabilistic clinical decision support system (PCDSS). BioAug-Net features a dual-branch asymmetric attention mechanism: one branch processes spatial MRI features, while the other incorporates temporal sensor signals through a BiGRU-driven adaptive masking module. Additionally, a Markov Decision Process-based PCDSS maps segmentation outputs to clinical PI-RADS scores, with uncertainty quantification. We validated BioAug-Net on a multi-institutional dataset (n=1,542) and demonstrated state-of-the-art performance, achieving a Dice Similarity Coefficient of 89.7% (p < 0.001), sensitivity of 91.2% (p < 0.001), specificity of 88.4% (p < 0.001), and HD95 of 2.14 mm (p < 0.001), outperforming U-Net, Attention U-Net, and TransUNet. Sensor integration improved segmentation accuracy by 12.6% (p < 0.001) and reduced inter-observer variation by 48.3% (p < 0.001). Radiologist evaluations (n=3) confirmed a 15.0% reduction in diagnosis time (p = 0.003) and an increase in inter-reader agreement from K = 0.68 to K = 0.82 (p = 0.001). 
Our results show that BioAug-Net offers a clinically viable solution for early prostate cancer detection through enhanced physiological coupling and explainable AI diagnostics.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"49"},"PeriodicalIF":6.1,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12309236/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144745615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-07-24. DOI: 10.1186/s13040-025-00463-8
Stefan Lenz, Arsenij Ustjanzew, Marco Jeray, Meike Ressing, Torsten Panholzer
{"title":"Can open source large language models be used for tumor documentation in Germany?-An evaluation on urological doctors' notes.","authors":"Stefan Lenz, Arsenij Ustjanzew, Marco Jeray, Meike Ressing, Torsten Panholzer","doi":"10.1186/s13040-025-00463-8","DOIUrl":"10.1186/s13040-025-00463-8","url":null,"abstract":"<p><strong>Background: </strong>Tumor documentation in Germany is currently a largely manual process. It involves reading the textual patient documentation and filling in forms in dedicated databases to obtain structured data. Advances in information extraction techniques that build on large language models (LLMs) could enhance the efficiency and reliability of this process. Evaluating LLMs in the German medical domain, especially their ability to interpret specialized language, is essential to determine their suitability for use in clinical documentation. Due to data protection regulations, only locally deployed open source LLMs are generally suitable for this application.</p><p><strong>Methods: </strong>The evaluation employed eleven different open source LLMs with sizes ranging from 1 to 70 billion model parameters. Three basic tasks were selected as representative examples for the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors' notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general.</p><p><strong>Results: </strong>The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably well in the tasks. Models with less extensive training data or with fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. 
Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation.</p><p><strong>Conclusions: </strong>Open source LLMs show a strong potential for automating tumor documentation. Models from 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from https://github.com/stefan-m-lenz/UroLlmEval . We also release the data set under https://huggingface.co/datasets/stefan-m-lenz/UroLlmEvalSet providing a valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"48"},"PeriodicalIF":6.1,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12291363/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144709599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-07-11. DOI: 10.1186/s13040-025-00461-w
Mohamed Mustaf Ahmed, Olalekan John Okesanya, Majd Oweidat, Zhinya Kawa Othman, Shuaibu Saidu Musa, Don Eliseo Lucero-Prisno III
{"title":"The ethics of data mining in healthcare: challenges, frameworks, and future directions.","authors":"Mohamed Mustaf Ahmed, Olalekan John Okesanya, Majd Oweidat, Zhinya Kawa Othman, Shuaibu Saidu Musa, Don Eliseo Lucero-Prisno III","doi":"10.1186/s13040-025-00461-w","DOIUrl":"10.1186/s13040-025-00461-w","url":null,"abstract":"<p><p>Data mining in healthcare offers transformative insights yet surfaces multilayered ethical and governance challenges that extend beyond privacy alone. Privacy and consent concerns remain paramount when handling sensitive medical data, particularly as healthcare organizations increasingly share patient information with large digital platforms. The risks of data breaches and unauthorized access are stark: 725 reportable incidents in 2023 alone exposed more than 133 million patient records, and hacking-related breaches have surged by 239% since 2018. Algorithmic bias further threatens equity; models trained on historically prejudiced data can reinforce health disparities across protected groups. Therefore, transparency must span three levels (dataset documentation, model interpretability, and post-deployment audit logging) to make algorithmic reasoning and failures traceable. Security vulnerabilities in the Internet of Medical Things (IoMT) and cloud-based health platforms amplify these risks, while corporate data-sharing deals complicate questions of data ownership and patient autonomy. A comprehensive response requires (i) dataset-level artifacts such as \"datasheets,\" (ii) model cards that disclose fairness metrics, and (iii) continuous logging of predictions and LIME/SHAP explanations for independent audits. Technical safeguards must blend differential privacy (with empirically validated noise budgets), homomorphic encryption for high-value queries, and federated learning to maintain the locality of raw data. Governance frameworks must also mandate routine bias and robustness audits and harmonized penalties for non-compliance. 
Regular reassessments, thorough documentation, and active engagement with clinicians, patients, and regulators are critical to accountability. This paper synthesizes current evidence, from a 2019 European re-identification study demonstrating 99.98% uniqueness with 15 quasi-identifiers to recent clinical audits that trimmed false-negative rates via threshold recalibration, and proposes an integrated set of fairness, privacy, and security controls aligned with SPIRIT-AI, CONSORT-AI, and emerging PROBAST-AI guidelines. Implementing these solutions will help healthcare systems harness the benefits of data mining while safeguarding patient rights and sustaining public trust.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"47"},"PeriodicalIF":4.0,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12255135/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144620971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-07-01. DOI: 10.1186/s13040-025-00462-9
Jason H Moore, Nicholas Tatonetti
{"title":"Vibe coding: a new paradigm for biomedical software development.","authors":"Jason H Moore, Nicholas Tatonetti","doi":"10.1186/s13040-025-00462-9","DOIUrl":"10.1186/s13040-025-00462-9","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"46"},"PeriodicalIF":4.0,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12217882/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144545739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-06-20. DOI: 10.1186/s13040-025-00460-x
Namareq Widatalla, Sona Al Younis, Ahsan Khandoker
{"title":"Heart rate transition patterns reveal autonomic dysfunction in heart failure with renal function decline: a symbolic and Markov model approach.","authors":"Namareq Widatalla, Sona Al Younis, Ahsan Khandoker","doi":"10.1186/s13040-025-00460-x","DOIUrl":"10.1186/s13040-025-00460-x","url":null,"abstract":"<p><p>Around half of heart failure (HF) patients develop chronic kidney disease (CKD) and early detection of renal impairment in HF remains a clinical challenge. Both HF and CKD are characterized by autonomic dysfunction, suggesting that early identification of autonomic dysregulation may assist in early diagnosis and intervention. Conventional heart rate variability (HRV) metrics serve as non-invasive markers of autonomic nervous system (ANS) function; however, they are limited in their ability to capture directional and nonlinear dynamics associated with autonomic impairment during renal function decline. In this study, we digitized heart rate (HR) changes from 5-minute electrocardiogram (ECG) recordings in 358 patients with chronic HF (CHF). We applied a first-order Markov model and motif pattern analyses to compare HR transition dynamics between patients with normal and reduced estimated glomerular filtration rate (eGFR). The results revealed decreased monotonic HR transitions and increased tonic fluctuations in patients with reduced eGFR. Building on these findings, we introduced a transition stability index (TSI), which was significantly lower in patients with reduced eGFR compared to those with normal eGFR (p < 0.05). These results suggest that TSI may serve as a novel indicator of autonomic dysfunction associated with renal decline. 
Motif analysis further supported these findings by identifying distinctive HR transition patterns in patients with low eGFR.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"45"},"PeriodicalIF":4.0,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12180264/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144337166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-06-19. DOI: 10.1186/s13040-025-00459-4
Yasaman Fatapour, James P Brody
{"title":"A compact encoding of the genome suitable for machine learning prediction of traits and genetic risk scores.","authors":"Yasaman Fatapour, James P Brody","doi":"10.1186/s13040-025-00459-4","DOIUrl":"10.1186/s13040-025-00459-4","url":null,"abstract":"<p><p>Genotype to phenotype prediction is a central problem in biology and medicine. Machine learning is a natural tool to address this problem. However, a person's genotype is usually represented by a few million single-nucleotide polymorphisms and most datasets only have a few thousand patients. Thus, this problem typically has many more predictors than the number of samples (patients), making it unsuitable for machine learning. The objective of this paper is to examine the efficacy of a compact genotype representation, which employs a limited number of predictors, in predicting a person's phenotype through the application of machine learning. We characterized a person's genotype using chromosome-scale length variation, a measure that is computed as the average value of reported log R ratios across a portion of a chromosome. We computed these numbers from data collected by the NIH All of Us program. We used the AutoML function (h2o.ai) in binary classification mode to identify the best models to differentiate between male/female, Black/white, white/Asian, and Black/Asian. We also used the AutoML function in regression mode to predict the height of people based on their age and genotype. Our results showed that we could effectively classify a person, using only information from chromosomes 1-22, as Male/Female (AUC = 0.9988 ± 0.0001), White/Black (AUC = 0.970 ± 0.002), Asian/White (AUC = 0.877 ± 0.002), and Black/Asian (AUC = 0.966 ± 0.002). This approach also effectively predicted height. 
In conclusion, we have shown that this compact representation of a person's genotype, along with machine learning, can effectively predict a person's phenotype.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"44"},"PeriodicalIF":4.0,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12180147/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144334213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recent advances in deep learning for protein-protein interaction: a review.","authors":"Jiafu Cui, Siqi Yang, Litai Yi, Qilemuge Xi, Dezhi Yang, Yongchun Zuo","doi":"10.1186/s13040-025-00457-6","DOIUrl":"10.1186/s13040-025-00457-6","url":null,"abstract":"<p><p>Deep learning, a cornerstone of artificial intelligence, is driving rapid advancements in computational biology. Protein-protein interactions (PPIs) are fundamental regulators of biological functions. With the inclusion of deep learning in PPI research, the field is undergoing transformative changes. Therefore, there is an urgent need for a comprehensive review and assessment of recent developments to improve analytical methods and open up a wider range of biomedical applications. This review assesses deep learning progress in PPI prediction from 2021 to 2025. We evaluate core architectures (GNNs, CNNs, RNNs) and pioneering approaches: attention-driven Transformers, multi-task frameworks, multimodal integration of sequence and structural data, transfer learning via BERT and ESM, and autoencoders for interaction characterization. Moreover, we examine enhanced algorithms for dealing with data imbalances, variations, and high-dimensional feature sparsity, as well as industry challenges (including shifting protein interactions, interactions with non-model organisms, and rare or unannotated protein interactions), and offer perspectives on the future of the field. 
In summary, this review systematically summarizes the latest advances and existing challenges in deep learning in the field of protein interaction analysis, providing a valuable reference for researchers in the fields of computational biology and deep learning.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"43"},"PeriodicalIF":4.0,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12168265/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144310649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata Mining. Pub Date: 2025-06-14. DOI: 10.1186/s13040-025-00458-5
Xiongbin Gui, Hanlin Lv, Xiao Wang, Longting Lv, Yi Xiao, Lei Wang
{"title":"Enhancing hepatopathy clinical trial efficiency: a secure, large language model-powered pre-screening pipeline.","authors":"Xiongbin Gui, Hanlin Lv, Xiao Wang, Longting Lv, Yi Xiao, Lei Wang","doi":"10.1186/s13040-025-00458-5","DOIUrl":"10.1186/s13040-025-00458-5","url":null,"abstract":"<p><strong>Background: </strong>Recruitment for cohorts involving complex liver diseases, such as hepatocellular carcinoma and liver cirrhosis, often requires interpreting semantically complex criteria. Traditional manual screening methods are time-consuming and prone to errors. While AI-powered pre-screening offers potential solutions, challenges remain regarding accuracy, efficiency, and data privacy.</p><p><strong>Methods: </strong>We developed a novel patient pre-screening pipeline that leverages clinical expertise to guide the precise, safe, and efficient application of large language models. The pipeline breaks down complex criteria into a series of composite questions and then employs two strategies to perform semantic question-answering through electronic health records: (1) Pathway A, the Anthropomorphized Experts' Chain of Thought strategy; and (2) Pathway B, the Preset Stances within an Agent Collaboration strategy, intended particularly for managing complex clinical reasoning scenarios. The pipeline was evaluated on key metrics including precision, recall, time consumption, and counterfactual inference, at both the question and criterion levels.</p><p><strong>Results: </strong>Our pipeline achieved a notable balance of high precision (e.g., 0.921, criteria level) and good overall recall (e.g., ~0.82, criteria level), alongside high efficiency (0.44 s per task). Pathway B excelled in high-precision complex reasoning (while exhibiting a specific recall profile conducive to accuracy), whereas Pathway A was particularly effective for tasks requiring both robust precision and recall (e.g., direct data extraction), often with faster processing times. 
Both pathways achieved comparable overall precision while offering different strengths in the precision-recall trade-off. The pipeline showed promising precision-focused results in hepatocellular carcinoma (0.878) and cirrhosis trials (0.843).</p><p><strong>Conclusions: </strong>This data-secure and time-efficient pipeline shows high precision and achieves good recall in hepatopathy trials, providing promising solutions for streamlining clinical trial workflows. Its efficiency, adaptability, and balanced performance profile make it suitable for improving patient recruitment. Its capability to function in resource-constrained environments further enhances its utility in clinical settings.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"42"},"PeriodicalIF":4.0,"publicationDate":"2025-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12167571/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144295174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}