Bioinformatics advances Pub Date: 2025-03-17 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf055
Jackie Rao, Paul D W Kirk
{"title":"VICatMix: variational Bayesian clustering and variable selection for discrete biomedical data.","authors":"Jackie Rao, Paul D W Kirk","doi":"10.1093/bioadv/vbaf055","DOIUrl":"https://doi.org/10.1093/bioadv/vbaf055","url":null,"abstract":"<p><strong>Summary: </strong>Effective clustering of biomedical data is crucial in precision medicine, enabling accurate stratification of patients or samples. However, the growth in availability of high-dimensional categorical data, including 'omics data, necessitates computationally efficient clustering algorithms. We present VICatMix, a variational Bayesian finite mixture model designed for the clustering of categorical data. The use of variational inference (VI) in its training allows the model to outperform competitors in terms of computational time and scalability, while maintaining high accuracy. VICatMix furthermore performs variable selection, enhancing its performance on high-dimensional, noisy data. The proposed model incorporates summarization and model averaging to mitigate poor local optima in VI, allowing for improved estimation of the true number of clusters simultaneously with feature saliency. We demonstrate the performance of VICatMix with both simulated and real-world data, including applications to datasets from The Cancer Genome Atlas, showing its use in cancer subtyping and driver gene discovery. 
We demonstrate VICatMix's potential utility in integrative cluster analysis with different 'omics datasets, enabling the discovery of novel disease subtypes.</p><p><strong>Availability and implementation: </strong>VICatMix is freely available as an R package via CRAN, incorporating C++ for faster computation, at https://CRAN.R-project.org/package=VICatMix.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf055"},"PeriodicalIF":2.4,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11981716/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144036621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances Pub Date: 2025-03-14 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf057
Bernardo Aguzzoli Heberle, Madeline L Page, Emil K Gustavsson, Mina Ryten, Mark T W Ebbert
{"title":"RNApysoforms: fast rendering interactive visualization of RNA isoform structure and expression in Python.","authors":"Bernardo Aguzzoli Heberle, Madeline L Page, Emil K Gustavsson, Mina Ryten, Mark T W Ebbert","doi":"10.1093/bioadv/vbaf057","DOIUrl":"10.1093/bioadv/vbaf057","url":null,"abstract":"<p><strong>Summary: </strong>Alternative splicing generates multiple RNA isoforms from a single gene, enriching genetic diversity and impacting gene function. Effective visualization of these isoforms and their expression patterns is crucial but challenging due to limitations in existing tools. Traditional genome browsers lack programmability, while other tools offer limited customization, produce static plots, or cannot simultaneously display structures and expression levels. RNApysoforms was developed to address these gaps by providing a Python-based package that enables concurrent visualization of RNA isoform structures and expression data. Leveraging plotly and polars libraries, it offers an interactive, customizable, and faster-rendering framework suitable for web applications, enhancing the analysis and dissemination of RNA isoform research.</p><p><strong>Availability and implementation: </strong>RNApysoforms is a Python package available at (https://github.com/UK-SBCoA-EbbertLab/RNApysoforms) and (https://zenodo.org/records/14941190) via an open-source MIT license. It can be easily installed using the pip package installer for Python. 
Thorough documentation and usage vignettes are available at: https://rna-pysoforms.readthedocs.io/en/latest/.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf057"},"PeriodicalIF":2.4,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11964586/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143775074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances Pub Date: 2025-03-14 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf056
Gilberto P Pereira, Corentin Gouzien, Paulo C T Souza, Juliette Martin
{"title":"Challenges in predicting PROTAC-mediated protein-protein interfaces with AlphaFold reveal a general limitation on small interfaces.","authors":"Gilberto P Pereira, Corentin Gouzien, Paulo C T Souza, Juliette Martin","doi":"10.1093/bioadv/vbaf056","DOIUrl":"10.1093/bioadv/vbaf056","url":null,"abstract":"<p><strong>Motivation: </strong>Proteolysis Targeting Chimeras (PROTACs) are heterobifunctional molecules composed of ligands binding to a target protein and an E3-ligase complex, connected by a linker, that induce proximity-based target protein degradation. PROTACs are promising alternatives to conventional drugs against cancer. Predicting PROTAC-mediated complexes is often the first step for <i>in silico</i> PROTAC design pipelines. We previously noted that AlphaFold2 (AF2) fails to predict PROTAC-mediated complexes.</p><p><strong>Results: </strong>Here, we investigate the potential causes of this limitation. We consider a set of 326 protein heterodimers orthogonal to the AF2 training set, and evaluate AF2 models focusing on the interface size and the presence of an interface ligand. Our results show that AF2-multimer predictions are sensitive to the size of the interface to be predicted, even in the absence of ligands, with the majority of models being incorrect for the smallest interfaces. We also benchmark both AF2 and AF3 on a set of 28 PROTAC-mediated dimers and show that AF3 does not significantly improve upon the accuracy of AF2. 
The low accuracy of AF2 on complexes with small interfaces has strong implications for computational pipelines for PROTAC design, as these typically stabilize small interfaces, and more generally for any prediction task that involves small interfaces.</p><p><strong>Availability and implementation: </strong>All the models analyzed in this article are available in the Zenodo archive https://zenodo.org/records/14810843.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf056"},"PeriodicalIF":2.4,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11938821/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143722845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances Pub Date: 2025-03-13 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf052
Myriam Brossard, Delnaz Roshandel, Kexin Luo, Fatemeh Yavartanoo, Andrew D Paterson, Yun J Yoo, Shelley B Bull
{"title":"RegionScan: a comprehensive R package for region-level genome-wide association testing with integration and visualization of multiple-variant and single-variant hypothesis testing.","authors":"Myriam Brossard, Delnaz Roshandel, Kexin Luo, Fatemeh Yavartanoo, Andrew D Paterson, Yun J Yoo, Shelley B Bull","doi":"10.1093/bioadv/vbaf052","DOIUrl":"10.1093/bioadv/vbaf052","url":null,"abstract":"<p><strong>Summary: </strong>RegionScan is designed for scalable genome-wide association testing of both multiple-variant and single-variant region-level statistics, with visualization of the results. For detection of association under various regional architectures, it implements three classes of state-of-the-art region-level tests, including multiple-variant linear/logistic regression (with and without dimension reduction), a variance-component score test, and region-level min<i>P</i> tests. RegionScan also supports the analysis of multi-allelic variants and unbalanced binary phenotypes and is compatible with widely used variant call format (VCF) files for both genotyped and imputed variants. Association testing leverages linkage disequilibrium (LD) structure in pre-defined regions, for example, LD-adaptive regions obtained by genomic partitioning, and accommodates parallel processing to improve computational and memory efficiency. Detailed outputs (with allele frequencies, variant-LD bin assignment, single/joint variant effect estimates and region-level results) and utility functions are provided to assist comparison, visualization, and interpretation of results. 
Thus, RegionScan analysis offers valuable insights into region-level genetic architecture, which supports a wide range of potential applications.</p><p><strong>Availability and implementation: </strong>RegionScan is freely available for download on GitHub (https://github.com/brossardMyriam/RegionScan).</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf052"},"PeriodicalIF":2.4,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951254/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143756193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances Pub Date: 2025-03-13 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf054
Jiqing Zhu, Rebecca Y Wang, Xiaoting Wang, Ricardo Azevedo, Alexander Moreno, Julia A Kuhn, Zia Khan
{"title":"Enhancing gene set overrepresentation analysis with large language models.","authors":"Jiqing Zhu, Rebecca Y Wang, Xiaoting Wang, Ricardo Azevedo, Alexander Moreno, Julia A Kuhn, Zia Khan","doi":"10.1093/bioadv/vbaf054","DOIUrl":"10.1093/bioadv/vbaf054","url":null,"abstract":"<p><strong>Motivation: </strong>Gene set overrepresentation analysis (ORA) is widely used to interpret high-throughput transcriptomics and proteomics data, but traditional methods rely on human-curated gene set databases that lack flexibility.</p><p><strong>Results: </strong>We introduce <i>llm2geneset</i>, a framework that leverages large language models (LLMs) to dynamically generate gene set databases tailored to input query genes, such as differentially expressed genes, and a biological context specified in natural language. These databases integrate with methods, such as ORA, to assign biological functions to input genes. Benchmarking against human-curated gene sets demonstrates that LLMs generate gene sets comparable in quality to those curated by humans. <i>llm2geneset</i> also identifies biological processes represented by input gene sets, outperforming traditional ORA and direct LLM prompting. 
Applying the framework to RNA-seq data from iPSC-derived microglia treated with a <i>TREM2</i> agonist highlights its potential for flexible, context-aware gene set generation and improved interpretation of high-throughput biological data.</p><p><strong>Availability and implementation: </strong><i>llm2geneset</i> is available as open source at https://github.com/Alector-BIO/llm2geneset and via a web interface at https://llm2geneset.streamlit.app.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf054"},"PeriodicalIF":2.4,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12093311/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144121572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances Pub Date: 2025-03-13 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf053
Abdur Rafi, Ahmed Mahir Sultan Rumi, Sheikh Azizul Hakim, Sohaib, Md Toki Tahmid, Rabib Jahin Ibn Momin, Tanjeem Azwad Zaman, Rezwana Reaz, Md Shamsuzzoha Bayzid
{"title":"wQFM-TREE: highly accurate and scalable quartet-based species tree inference from gene trees.","authors":"Abdur Rafi, Ahmed Mahir Sultan Rumi, Sheikh Azizul Hakim, Sohaib, Md Toki Tahmid, Rabib Jahin Ibn Momin, Tanjeem Azwad Zaman, Rezwana Reaz, Md Shamsuzzoha Bayzid","doi":"10.1093/bioadv/vbaf053","DOIUrl":"10.1093/bioadv/vbaf053","url":null,"abstract":"<p><strong>Motivation: </strong>Summary methods are becoming increasingly popular for species tree estimation from multi-locus data in the presence of gene tree discordance. Accurate Species TRee Algorithm (ASTRAL), a leading method in this class, solves the Maximum Quartet Support Species Tree problem within a constrained solution space, while heuristics like Weighted Quartet Fiduccia-Mattheyses (wQFM) and Weighted Quartet MaxCut (wQMC) use weighted quartets and a divide-and-conquer strategy. Recent studies showed wQFM to be more accurate than ASTRAL and wQMC, though its scalability is hindered by the computational demands of explicitly generating and weighting <math><mrow><mi>Θ</mi> <mo>(</mo> <mrow> <msup><mrow><mi>n</mi></mrow> <mn>4</mn></msup> </mrow> <mo>)</mo></mrow> </math> quartets. Here, we introduce wQFM-TREE, a novel summary method that enhances wQFM by avoiding explicit quartet generation and weighting, enabling its application to large datasets.</p><p><strong>Results: </strong>Extensive simulations under diverse and challenging model conditions, with hundreds or thousands of taxa and genes, consistently demonstrate that wQFM-TREE matches or improves upon the accuracy of ASTRAL. It outperformed ASTRAL in 25 of 27 model conditions (statistically significant in 20) involving 200-1000 taxa. Moreover, applying wQFM-TREE to re-analyze the green plant dataset from the One Thousand Plant Transcriptomes Initiative produced a tree highly congruent with established evolutionary relationships of plants. wQFM-TREE's remarkable accuracy and scalability make it a strong competitor to leading methods. 
Its algorithmic and combinatorial innovations also enhance quartet-based computations, advancing phylogenetic estimation.</p><p><strong>Availability and implementation: </strong>wQFM-TREE is freely available in open source form at https://github.com/abdur-rafi/wQFM-TREE.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf053"},"PeriodicalIF":2.4,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11932941/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143712300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances Pub Date: 2025-03-11 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf050
Ankita Pal, Debasisa Mohanty
{"title":"Machine learning-based approach for identification of new resistance associated mutations from whole genome sequences of <i>Mycobacterium tuberculosis</i>.","authors":"Ankita Pal, Debasisa Mohanty","doi":"10.1093/bioadv/vbaf050","DOIUrl":"10.1093/bioadv/vbaf050","url":null,"abstract":"<p><strong>Motivation: </strong>Currently available methods for the prediction of genotypic drug resistance in <i>Mycobacterium tuberculosis</i> utilize information on known markers of drug resistance. Hence, machine learning approaches are needed that can discover new resistance markers.</p><p><strong>Results: </strong>Whole genome sequences with known phenotypic drug resistance profiles have been utilized to train XGBoost and ANN classifiers for 5 first-line and 8 second-line tuberculosis drugs. Benchmarking on a completely independent dataset from the CRyPTIC database revealed that our method has high sensitivity (90%-95%) and specificity (94%-99%) for five first-line drugs and robust performance for six second-line drugs with a sensitivity of 77%-89% at over 95% specificity. An explainable AI method, SHapley Additive exPlanations, has successfully identified resistance mutations for each drug in a completely automated way. This approach not only identified known resistance associated mutations in agreement with the WHO mutation catalogue, but also predicted >100 other potential resistance associated mutations for 13 antibiotics in new genes outside the known resistance loci. 
Identification of new resistance markers opens up the opportunity for the discovery of novel mechanisms of drug resistance.</p><p><strong>Availability and implementation: </strong>Our prediction method has been implemented as TB-AMRpred webserver and command line tool, available freely at http://www.nii.ac.in/TB-AMRpred.html and https://github.com/Ankitapal1995/TB-AMRprd.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf050"},"PeriodicalIF":2.4,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11930343/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143694157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances Pub Date: 2025-03-11 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf045
Asahi Adachi, Fan Zhang, Shigehiko Kanaya, Naoaki Ono
{"title":"Quantifying uncertainty in microbiome-based prediction using Gaussian processes with microbial community dissimilarities.","authors":"Asahi Adachi, Fan Zhang, Shigehiko Kanaya, Naoaki Ono","doi":"10.1093/bioadv/vbaf045","DOIUrl":"10.1093/bioadv/vbaf045","url":null,"abstract":"<p><strong>Summary: </strong>The human microbiome is closely associated with the health and disease of the human host. Machine learning models have recently utilized the human microbiome to predict health conditions and disease status. Quantifying predictive uncertainty is essential for the reliable application of these microbiome-based prediction models in clinical settings. However, uncertainty quantification in such prediction models remains unexplored. In this study, we have developed a probabilistic prediction model using a Gaussian process (GP) with a kernel function that incorporates microbial community dissimilarities. We evaluated the performance of probabilistic prediction across three regression tasks: chronological age, body mass index, and disease severity, using publicly available human gut microbiome datasets. The results demonstrated that our model outperformed existing methods in terms of probabilistic prediction accuracy. Furthermore, we found that the confidence levels closely matched the empirical coverage and that data points predicted with lower uncertainty corresponded to lower prediction errors. These findings suggest that GP regression models incorporating community dissimilarities effectively capture the characteristics of phylogenetic, high-dimensional, and sparse microbial abundance data. 
Our study provides a more reliable framework for microbiome-based prediction, potentially advancing the application of microbiome data in health monitoring and disease diagnosis in clinical settings.</p><p><strong>Availability and implementation: </strong>The code is available at https://github.com/asahiadachi/gp4microbiome.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf045"},"PeriodicalIF":2.4,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11919817/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143665536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances Pub Date: 2025-03-11 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf049
Francesco Costa, Rob Barringer, Ioannis Riziotis, Antonina Andreeva, Alex Bateman
{"title":"Isopeptor: a tool for detecting intramolecular isopeptide bonds in protein structures.","authors":"Francesco Costa, Rob Barringer, Ioannis Riziotis, Antonina Andreeva, Alex Bateman","doi":"10.1093/bioadv/vbaf049","DOIUrl":"10.1093/bioadv/vbaf049","url":null,"abstract":"<p><strong>Motivation: </strong>Intramolecular isopeptide bonds contribute to the structural stability of proteins, and have primarily been identified in domains of bacterial fibrillar adhesins and pili. At present, there is no systematic method available to detect them in newly determined molecular structures. This can result in mis-annotations and incorrect modeling.</p><p><strong>Results: </strong>Here, we present Isopeptor, a computational tool designed to predict the presence of intramolecular isopeptide bonds in experimentally determined structures. Isopeptor utilizes structure-guided template matching via the Jess software, combined with a logistic regression classifier that incorporates root mean square deviation and relative solvent accessible area as key features. The tool demonstrates a precision of 1.0 and a recall of 0.947 when tested on a Protein Data Bank subset of domains known to contain intramolecular isopeptide bonds that have been deposited with incorrectly modeled geometries.</p><p><strong>Availability and implementation: </strong>Isopeptor's Python-based implementation supports integration into bioinformatics workflows and can be accessed via the command line, through a Python API or via a Google Colaboratory implementation (https://colab.research.google.com/github/FranceCosta/Isopeptor_development/blob/main/notebooks/Isopeptide_finder.ipynb). 
Source code is hosted on GitHub (https://github.com/FranceCosta/isopeptor) and can be installed via the Python package installation manager PIP.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf049"},"PeriodicalIF":2.4,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11919812/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143665529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances Pub Date: 2025-03-10 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf046
Demetris Avraam, Rebecca C Wilson, Noemi Aguirre Chan, Soumya Banerjee, Tom R P Bishop, Olly Butters, Tim Cadman, Luise Cederkvist, Liesbeth Duijts, Xavier Escribà Montagut, Hugh Garner, Gonçalo Gonçalves, Juan R González, Sido Haakma, Mette Hartlev, Jan Hasenauer, Manuel Huth, Eleanor Hyde, Vincent W V Jaddoe, Yannick Marcon, Michaela Th Mayrhofer, Fruzsina Molnar-Gabor, Andrei Scott Morgan, Madeleine Murtagh, Marc Nestor, Anne-Marie Nybo Andersen, Simon Parker, Angela Pinot de Moira, Florian Schwarz, Katrine Strandberg-Larsen, Morris A Swertz, Marieke Welten, Stuart Wheater, Paul Burton
{"title":"DataSHIELD: mitigating disclosure risk in a multi-site federated analysis platform.","authors":"Demetris Avraam, Rebecca C Wilson, Noemi Aguirre Chan, Soumya Banerjee, Tom R P Bishop, Olly Butters, Tim Cadman, Luise Cederkvist, Liesbeth Duijts, Xavier Escribà Montagut, Hugh Garner, Gonçalo Gonçalves, Juan R González, Sido Haakma, Mette Hartlev, Jan Hasenauer, Manuel Huth, Eleanor Hyde, Vincent W V Jaddoe, Yannick Marcon, Michaela Th Mayrhofer, Fruzsina Molnar-Gabor, Andrei Scott Morgan, Madeleine Murtagh, Marc Nestor, Anne-Marie Nybo Andersen, Simon Parker, Angela Pinot de Moira, Florian Schwarz, Katrine Strandberg-Larsen, Morris A Swertz, Marieke Welten, Stuart Wheater, Paul Burton","doi":"10.1093/bioadv/vbaf046","DOIUrl":"10.1093/bioadv/vbaf046","url":null,"abstract":"<p><strong>Motivation: </strong>The validity of epidemiologic findings can be increased using triangulation, i.e. comparison of findings across contexts, and by having sufficiently large amounts of relevant data to analyse. However, access to data is often constrained by practical considerations and by ethico-legal and data governance restrictions. Gaining access to such data can be time-consuming due to the governance requirements associated with data access requests to institutions in different jurisdictions.</p><p><strong>Results: </strong>DataSHIELD is a software solution that enables remote analysis without the need for data transfer (federated analysis). DataSHIELD is a scientifically mature, open-source data access and analysis platform aligned with the 'Five Safes' framework, the international framework governing safe research access to data. It allows real-time analysis while mitigating disclosure risk through an active multi-layer system of disclosure-preventing mechanisms. 
This combination of real-time remote statistical analysis, disclosure prevention mechanisms, and federation capabilities makes DataSHIELD a solution for addressing many of the technical and regulatory challenges in performing the large-scale statistical analysis of health and biomedical data. This paper describes the key components that comprise the disclosure protection system of DataSHIELD. These broadly fall into three classes: (i) system protection elements, (ii) analysis protection elements, and (iii) governance protection elements.</p><p><strong>Availability and implementation: </strong>Information about the DataSHIELD software is available at https://datashield.org/ and https://github.com/datashield.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf046"},"PeriodicalIF":2.4,"publicationDate":"2025-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11968321/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143797198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}