Journal of Big Data | Pub Date: 2026-01-01 | Epub Date: 2026-02-24 | DOI: 10.1186/s40537-026-01365-0
Peter-John Mäntylä Noble, Sean Oliver Farrell, Noura Al-Moubayed, Alan David Radford
{"title":"Comprehensive representation of health-related phenotypes in one million dogs using topic modelling of electronic health records.","authors":"Peter-John Mäntylä Noble, Sean Oliver Farrell, Noura Al-Moubayed, Alan David Radford","doi":"10.1186/s40537-026-01365-0","DOIUrl":"https://doi.org/10.1186/s40537-026-01365-0","url":null,"abstract":"<p><p>Historically, veterinary studies screening for breed, age and sex predisposition to disease have relied on collating small-scale studies of clinical datasets. The availability of larger datasets through groups such as the Small Animal Veterinary Surveillance Network (SAVSNET) promises access to information regarding a wide range of clinical presentations at scale; however, methodological limitations surrounding the extraction of specific disease information or screening for disease predispositions result in a substantial reduction in the number of animals studied. These studies often address very focused hypotheses, leveraging only a small fraction of the intrinsic value of the data at any one time. Here, we implemented an unsupervised machine learning methodology, creating a representation of a large volume of clinical notes collected by SAVSNET from veterinary practices across the UK. We utilise BERTopic, a topic-modelling tool based on the Bidirectional Encoder Representations from Transformers (BERT) architecture, and show it is able to surface known phenotypes, such as breed predispositions to hypoadrenocorticism, diabetes mellitus and mitral valve disease, as well as potential novel patterns of disease phenotypes. 
This scalable and granular modelling technique facilitates the rapid interrogation of large clinical datasets, enabling the identification of a broad range of phenotypes within the population and the early detection of temporal changes indicative of emerging infectious or environmental diseases.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1186/s40537-026-01365-0.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"13 1","pages":"50"},"PeriodicalIF":6.4,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13035608/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147592339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Journal of Big Data | Pub Date: 2026-01-01 | Epub Date: 2026-03-04 | DOI: 10.1186/s40537-026-01395-8
Partho Ghose, Hasan Jamil
{"title":"CardiaTics: An explainable AI integrated heart disease diagnosis model with feature engineering and stacked ensemble approach.","authors":"Partho Ghose, Hasan Jamil","doi":"10.1186/s40537-026-01395-8","DOIUrl":"https://doi.org/10.1186/s40537-026-01395-8","url":null,"abstract":"<p><p>Heart disease is a leading global cause of morbidity and mortality. Accurate and prompt diagnoses are crucial for its effective prevention and management. Integrating multiple machine learning algorithms, this research introduces a stacked ensemble machine learning model, called CardiaTics (short for Cardiac DiagnosTics), to improve heart disease detection. As a first step, we detect and remove outliers to ensure data quality and integrity. Ten distinct machine learning algorithms are then individually applied, culminating in the creation of a stacked ensemble model. We use feature engineering to refine the model further, applying three well-known techniques: Pearson correlation, the Chi-Square Test (Chi-2), and Recursive Feature Elimination. The implementation of these techniques on the benchmark dataset results in an optimized feature set. Experimental results show that CardiaTics delivers 89.3% accuracy on raw data and improves to 93.3% after feature selection, outperforming the individual classifiers. However, can human professionals rely on algorithms for prediction when the underlying process is not fully understood? To address concerns regarding interpretability, trust, and transparency in black-box predictions, we propose utilizing SHapley Additive exPlanations (SHAP) and Explain Like I'm 5 (ELI5) in the second phase to elucidate feature importance in our model. 
The SHAP summary plots of CardiaTics reveal that the positive and negative contributors to heart disease are comparable, thereby enhancing the model's interpretability and reliability and helping refine the decision-making process.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"13 1","pages":"59"},"PeriodicalIF":6.4,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13068695/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147673668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"<i>F</i>u<i>n</i>Da: scalable serverless data analytics and in situ query processing.","authors":"Elyes Lounissi, Suvam Kumar Das, Ronnit Peter, Xiaozheng Zhang, Suprio Ray, Lianyin Jia","doi":"10.1186/s40537-025-01141-6","DOIUrl":"https://doi.org/10.1186/s40537-025-01141-6","url":null,"abstract":"<p><p>The pay-what-you-use model of serverless Cloud computing (or serverless, for short) offers significant benefits to users. This computing paradigm is ideal for short-running ephemeral tasks; however, it is not suitable for stateful, long-running tasks, such as complex data analytics and query processing. We propose <i>F</i>u<i>n</i>Da, an on-premises serverless data analytics framework, which extends our previously proposed system for unified data analytics and in situ SQL query processing called DaskDB. Unlike existing serverless solutions, which struggle with stateful and long-running data analytics tasks, <i>F</i>u<i>n</i>Da overcomes their limitations. Our ongoing research focuses on developing a robust architecture for <i>F</i>u<i>n</i>Da, enabling true serverless computing in on-premises environments, while being able to operate on a public Cloud, such as AWS Cloud. We have evaluated our system on several benchmarks with different scale factors. 
Our experimental results in both on-premises and AWS Cloud settings demonstrate <i>F</i>u<i>n</i>Da's ability to support automatic scaling, low-latency execution of data analytics workloads, and more flexibility to serverless users.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"12 1","pages":"116"},"PeriodicalIF":8.6,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12064580/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143991480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Journal of Big Data | Pub Date: 2025-01-01 | Epub Date: 2025-11-17 | DOI: 10.1186/s40537-025-01307-2
Abdulrauf A Gidado, C I Ezeife
{"title":"UniqueNOSD: a novel framework for NoSQL over SQL databases.","authors":"Abdulrauf A Gidado, C I Ezeife","doi":"10.1186/s40537-025-01307-2","DOIUrl":"https://doi.org/10.1186/s40537-025-01307-2","url":null,"abstract":"<p><p>To date, most large corporations still have their core solutions on relational databases and use non-relational (i.e. NoSQL) database management systems (DBMS) only for their non-core systems that favour availability and scalability through partitioning while trading off consistency. NoSQL systems are built based on the CAP (i.e., Consistency, Availability and Partition tolerance) theorem, which trades off one of these features while maintaining the others. The need for systems availability and scalability drives the use of NoSQL, while the lack of the consistency and robust query engines obtainable in relational databases impedes their usage. To mitigate these drawbacks, researchers and companies like Amazon, Google, and Facebook run 'SQL over NoSQL' systems such as Dynamo, Google's Spanner, Memcache, Zidian, Apache Hive and SparkSQL. These systems create a query engine layer over NoSQL systems but suffer from data redundancy and lack the consistency obtainable in relational DBMS. Also, their query engines are not relationally complete because they cannot process all relational algebra-based queries as obtainable in a relational database. In this paper, we present a 'Unique NoSQL over SQL Database' (UniqueNOSD) system, an extension of NOSD and an inverse of existing approaches. This approach is motivated by the need for existing systems to fully deploy NoSQL data store functionalities without the limitation of building an extra SQL layer for querying. To allow appropriate storage and retrieval of data on document-based NoSQL databases without data redundancy and inconsistency while encouraging both horizontal and vertical partitioning, we propose NoSQL over SQL Block as a Value ([Formula: see text]) data storage strategy. 
Unlike relational database model where a relation is represented as [Formula: see text], with a key attribute [Formula: see text] and [Formula: see text] is the primary key to the set of attributes [Formula: see text] of the relation, in [Formula: see text] (represented as a tuple (<i>K</i>, <i>B</i>) where <i>K</i> means key and <i>B</i> means block). We represent a relation as [Formula: see text] with a key attribute <i>K</i> and a set of <i>n</i> relations (i.e., <i>r</i>) called blocks <i>B</i> and each <i>r</i> [Formula: see text] contains a set of its own attributes and is denoted as [Formula: see text] with a key attribute <i>k</i> and a set of <i>n</i> attributes typical to a relational model. The relations [Formula: see text] in <i>R</i> of [Formula: see text] are related through foreign key relationships. Using existing benchmark systems of 'SQL over NoSQL', relational databases and real-life datasets for our experiments, we demonstrated that our NoSQL over SQL system outperforms existing relational databases, SQL over NoSQL systems and is novel in ensuring data consistency, scalability, query execution and improving data storage and retrieval in large database systems without data loss and enhancing improved performan","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"12 1","pages":"255"},"PeriodicalIF":6.4,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12628391/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145563910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Esraa Hassan, Samar Elbedwehy, Mahmoud Y. Shams, Tarek Abd El-Hafeez, Nora El-Rashidy
{"title":"Optimizing poultry audio signal classification with deep learning and burn layer fusion","authors":"Esraa Hassan, Samar Elbedwehy, Mahmoud Y. Shams, Tarek Abd El-Hafeez, Nora El-Rashidy","doi":"10.1186/s40537-024-00985-8","DOIUrl":"https://doi.org/10.1186/s40537-024-00985-8","url":null,"abstract":"<p>This study introduces a novel deep learning-based approach for classifying poultry audio signals, incorporating a custom Burn Layer to enhance model robustness. The methodology integrates digital audio signal processing, convolutional neural networks (CNNs), and the innovative Burn Layer, which injects controlled random noise during training to reinforce the model's resilience to input signal variations. The proposed architecture is streamlined, with convolutional blocks, densely connected layers, dropout, and an additional Burn Layer to fortify robustness. The model demonstrates efficiency by reducing trainable parameters to 191,235, compared to traditional architectures with over 1.7 million parameters. The proposed model utilizes a Burn Layer with burn intensity as a parameter and an Adamax optimizer to address the overfitting problem. Thorough evaluation using six standard classification metrics showcases the model's superior performance, achieving exceptional sensitivity (96.77%), specificity (100.00%), precision (100.00%), negative predictive value (NPV) (95.00%), accuracy (98.55%), F1 score (98.36%), and Matthews correlation coefficient (MCC) (95.88%). This research contributes valuable insights into the fields of audio signal processing, animal health monitoring, and robust deep-learning classification systems. The proposed model presents a systematic approach for developing and evaluating a deep learning-based poultry audio classification system. 
It processes raw audio data and labels to generate digital representations, utilizes a Burn Layer for training variability, and constructs a CNN model with convolutional blocks, pooling, and dense layers. The model is optimized using the Adamax algorithm and trained with data augmentation and early-stopping techniques. Rigorous assessment on a test dataset using standard metrics demonstrates the model's robustness and efficiency, with the potential to significantly advance animal health monitoring and disease detection through audio signal analysis.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"23 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142253832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Doaa El-Shahat, Ahmed Tolba, Mohamed Abouhawwash, Mohamed Abdel-Basset
{"title":"Machine learning and deep learning models based grid search cross validation for short-term solar irradiance forecasting","authors":"Doaa El-Shahat, Ahmed Tolba, Mohamed Abouhawwash, Mohamed Abdel-Basset","doi":"10.1186/s40537-024-00991-w","DOIUrl":"https://doi.org/10.1186/s40537-024-00991-w","url":null,"abstract":"<p>In late 2023, the United Nations conference on climate change (COP28), which was held in Dubai, encouraged a quick move from fossil fuels to renewable energy. Solar energy is one of the most promising forms of energy that is both sustainable and renewable. Generally, photovoltaic systems transform solar irradiance into electricity. Unfortunately, instability and intermittency in solar radiation can lead to interruptions in electricity production. The accurate forecasting of solar irradiance guarantees sustainable power production even when solar irradiance is not present. Batteries can store solar energy to be used during periods of solar absence. Additionally, deterministic models take into account the specification of technical PV systems and may not be accurate for low solar irradiance. This paper presents a comparative study of the most common Deep Learning (DL) and Machine Learning (ML) algorithms employed for short-term solar irradiance forecasting. The dataset was gathered in Islamabad during a five-year period, from 2015 to 2019, at hourly intervals with accurate meteorological sensors. Furthermore, the Grid Search Cross Validation (GSCV) with five folds is introduced to ML and DL models for optimizing the hyperparameters of these models. Several performance metrics are used to assess the algorithms, such as the <i>Adjusted R</i><sup><i>2</i></sup><i> score</i>, <i>Normalized Root Mean Square Error</i> (NRMSE), <i>Mean Absolute Deviation</i> (MAD), <i>Mean Absolute Error</i> (MAE) and <i>Mean Square Error</i> (MSE). 
The statistical analysis shows that CNN-LSTM outperforms its counterparts of nine well-known DL models with <i>Adjusted R</i><sup><i>2</i></sup><i> score</i> value of 0.984. For ML algorithms, gradient boosting regression is an effective forecasting method with <i>Adjusted R</i><sup><i>2</i></sup><i> score</i> value of 0.962, beating its rivals of six ML models. Furthermore, SHAP and LIME are examples of explainable Artificial Intelligence (XAI) utilized for understanding the reasons behind the obtained results.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"13 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142253831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ali Mohammed Alsaffar, Mostafa Nouri-Baygi, Hamed M. Zolbanin
{"title":"Shielding networks: enhancing intrusion detection with hybrid feature selection and stack ensemble learning","authors":"Ali Mohammed Alsaffar, Mostafa Nouri-Baygi, Hamed M. Zolbanin","doi":"10.1186/s40537-024-00994-7","DOIUrl":"https://doi.org/10.1186/s40537-024-00994-7","url":null,"abstract":"<p>The frequent usage of computer networks and the Internet has made computer networks vulnerable to numerous attacks, highlighting the critical need to enhance the precision of security mechanisms. One of the most essential measures to safeguard networking resources and infrastructures is an intrusion detection system (IDS). IDSs are widely used to detect, identify, and track malicious threats. Although various machine learning algorithms have been used successfully in IDSs, they still suffer from low prediction performance. One reason behind the low accuracy of IDSs is that existing network traffic datasets have high computational complexities that are mainly caused by redundant, incomplete, and irrelevant features. Furthermore, standalone classifiers exhibit restricted classification performance and typically fail to produce satisfactory outcomes when dealing with imbalanced, multi-category traffic data. To address these issues, we propose an efficient intrusion detection model, which is based on hybrid feature selection and stack ensemble learning. Our hybrid feature selection method, called MI-Boruta, combines mutual information (MI) as a filter method and the Boruta algorithm as a wrapper method to determine optimal features from our datasets. Then, we apply stacked ensemble learning by using random forest (RF), CatBoost, and XGBoost algorithms as base learners with a multilayer perceptron (MLP) as the meta-learner. We test our intrusion detection model on two widely recognized benchmark datasets, namely UNSW-NB15 and CICIDS2017. 
We show that our proposed IDS outperforms existing IDSs in almost all performance criteria, including accuracy, recall, precision, F1-Score, false positive rate, true positive rate, and error rate.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"19 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142253792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating microarray-based spatial transcriptomics and RNA-seq reveals tissue architecture in colorectal cancer","authors":"Zheng Li, Xiaojie Zhang, Chongyuan Sun, Zefeng Li, He Fei, Dongbing Zhao","doi":"10.1186/s40537-024-00992-9","DOIUrl":"https://doi.org/10.1186/s40537-024-00992-9","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Background</h3><p>The tumor microenvironment (TME) provides a region for intricate interactions within or between immune and non-immune cells. We aimed to reveal the tissue architecture and comprehensive landscape of cells within the TME of colorectal cancer (CRC).</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>Fresh frozen invasive adenocarcinoma of the large intestine tissue from 10× Genomics Datasets was obtained from BioIVT Asterand. The integration of microarray-based spatial transcriptomics (ST) and RNA sequencing (RNA-seq) was applied to characterize gene expression and cell landscape within the TME of CRC tissue architecture. Multiple R packages and deconvolution algorithms including MCPcounter, XCELL, EPIC, and ESTIMATE methods were performed for further immune distribution analysis.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>The subpopulations of immune and non-immune cells within the TME of the CRC tissue architecture were appropriately annotated. According to ST and RNA-seq analyses, a heterogeneous spatial atlas of gene distribution and cell landscape was comprehensively characterized. We distinguished between the cancer and stromal regions of CRC tissues. As expected, epithelial cells were located in the cancerous region, whereas fibroblasts were mainly located in the stroma. In addition, the fibroblasts were further subdivided into two subgroups (F1 and F2) according to the differentially expressed genes (DEGs), which were mainly enriched in pathways including hallmark-oxidative-phosphorylation, hallmark-e2f-targets and hallmark-unfolded-protein-response. 
Furthermore, the top 5 DEGs, SPP1, CXCL10, APOE, APOC1, and LYZ, were found to be closely related to immunoregulation of the TME, methylation, and survival of CRC patients.</p><h3 data-test=\"abstract-sub-heading\">Conclusions</h3><p>This study characterized the heterogeneous spatial landscape of various cell subtypes within the TME of the tissue architecture. The TME-related roles of fibroblast subsets addressed the potential crosstalk among diverse cells.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"26 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142253793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei Feng, Bingjie Wang, Dan Song, Mengda Li, Anming Chen, Jing Wang, Siyong Lin, Yiran Zhao, Bin Wang, Zongyuan Ge, Shuyi Xu, Yuntao Hu
{"title":"Development and evaluation of a deep learning model for automatic segmentation of non-perfusion area in fundus fluorescein angiography","authors":"Wei Feng, Bingjie Wang, Dan Song, Mengda Li, Anming Chen, Jing Wang, Siyong Lin, Yiran Zhao, Bin Wang, Zongyuan Ge, Shuyi Xu, Yuntao Hu","doi":"10.1186/s40537-024-00968-9","DOIUrl":"https://doi.org/10.1186/s40537-024-00968-9","url":null,"abstract":"<p>Diabetic retinopathy (DR) is the most prevalent cause of preventable vision loss worldwide, imposing a significant economic and medical burden on society today, and early identification is the cornerstone of its management. The diagnosis and severity grading of DR rely on scales based on clinically visualized features, but lack detailed quantitative parameters. Retinal non-perfusion area (NPA) is a pathogenic characteristic of DR that symbolizes retinal hypoxia conditions, and was found to be intimately associated with disease progression, prognosis, and management. However, the practical value of NPA is constrained since it appears on fundus fluorescein angiography (FFA) as distributed, irregularly shaped, darker plaques that are challenging to measure manually. In this study, we propose a deep learning-based method, NPA-Net, for accurate and automatic segmentation of NPAs from FFA images acquired in clinical practice. NPA-Net uses the U-net structure as the basic backbone, which has an encoder-decoder model structure. To enhance the recognition performance of the model for NPA, we adaptively incorporate multi-scale features and contextual information in feature learning and design three modules: Adaptive Encoder Feature Fusion (AEFF) module, Multilayer Deep Supervised Loss, and Atrous Spatial Pyramid Pooling (ASPP) module, which enhance the recognition ability of the model for NPAs of different sizes from different perspectives. 
We conducted extensive experiments on a clinical dataset with 163 eyes with NPAs manually annotated by ophthalmologists, and NPA-Net achieved better segmentation performance compared to other existing methods with an area under the receiver operating characteristic curve (AUC) of 0.9752, accuracy of 0.9431, sensitivity of 0.8794, specificity of 0.9459, IOU of 0.3876 and Dice of 0.5686. This new automatic segmentation model is useful for identifying NPA in clinical practice, generating quantitative parameters that can be useful for further research as well as guiding DR detection, grading severity, treatment planning, and prognosis.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"37 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging large-scale genetic data to assess the causal impact of COVID-19 on multisystemic diseases","authors":"Xiangyang Zhang, Zhaohui Jiang, Jiayao Ma, Yaru Qi, Yin Li, Yan Zhang, Yihan Liu, Chaochao Wei, Yihong Chen, Ping Liu, Yinghui Peng, Jun Tan, Ying Han, Shan Zeng, Changjing Cai, Hong Shen","doi":"10.1186/s40537-024-00997-4","DOIUrl":"https://doi.org/10.1186/s40537-024-00997-4","url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Background</h3><p>The long-term impacts of COVID-19 on human health are a major concern, yet comprehensive evaluations of its effects on various health conditions are lacking.</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>This study aims to evaluate the role of various diseases in relation to COVID-19 by analyzing genetic data from a large-scale population of over 2,000,000 individuals. A bidirectional two-sample Mendelian randomization approach was used, with exposures including COVID-19 susceptibility, hospitalization, and severity, and outcomes encompassing 86 different diseases or traits. A reverse Mendelian randomization analysis was performed to assess the impact of these diseases on COVID-19.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>Our analysis identified causal relationships between COVID-19 susceptibility and several conditions, including breast cancer (OR = 1.0073, 95% CI = 1.0032–1.0114, <i>p</i> = 5 × 10<sup>−4</sup>), ER+ breast cancer (OR = 0.5252, 95% CI = 0.3589–0.7685, <i>p</i> = 9 × 10<sup>−4</sup>), and heart failure (OR = 1.0026, 95% CI = 1.001–1.0042, <i>p</i> = 0.002). COVID-19 hospitalization was causally linked to heart failure (OR = 1.0017, 95% CI = 1.0006–1.0028, <i>p</i> = 0.002) and Alzheimer’s disease (OR = 1.5092, 95% CI = 1.1942–1.9072, <i>p</i> = 0.0006). 
COVID-19 severity had causal effects on primary biliary cirrhosis (OR = 2.6333, 95% CI = 1.8274–3.7948, <i>p</i> = 2.059 × 10<sup>−7</sup>), celiac disease (OR = 0.0708, 95% CI = 0.0538–0.0932, <i>p</i> = 9.438 × 10<sup>−80</sup>), and Alzheimer’s disease (OR = 1.5092, 95% CI = 1.1942–1.9072, <i>p</i> = 0.0006). Reverse MR analysis indicated that rheumatoid arthritis, diabetic nephropathy, multiple sclerosis, and total testosterone (female) influence COVID-19 outcomes. We assessed heterogeneity and horizontal pleiotropy to ensure result reliability and employed the Steiger directionality test to confirm the direction of causality.</p><h3 data-test=\"abstract-sub-heading\">Conclusions</h3><p>This study provides a comprehensive analysis of the causal relationships between COVID-19 and diverse health conditions. Our findings highlight the long-term impacts of COVID-19 on human health, emphasizing the need for continuous monitoring and targeted interventions for affected individuals. Future research should explore these relationships to develop comprehensive healthcare strategies.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"1 1","pages":""},"PeriodicalIF":8.1,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}