Biodata MiningPub Date : 2023-11-25DOI: 10.1186/s13040-023-00350-0
Massimiliano Datres, Elisa Paolazzi, Marco Chierici, Matteo Pozzi, Antonio Colangelo, Marcello Dorian Donzella, Giuseppe Jurman
{"title":"Endoscopy-based IBD identification by a quantized deep learning pipeline.","authors":"Massimiliano Datres, Elisa Paolazzi, Marco Chierici, Matteo Pozzi, Antonio Colangelo, Marcello Dorian Donzella, Giuseppe Jurman","doi":"10.1186/s13040-023-00350-0","DOIUrl":"10.1186/s13040-023-00350-0","url":null,"abstract":"<p><strong>Background: </strong>Discrimination between patients affected by inflammatory bowel diseases and healthy controls on the basis of endoscopic imaging is an challenging problem for machine learning models. Such task is used here as the testbed for a novel deep learning classification pipeline, powered by a set of solutions enhancing characterising elements such as reproducibility, interpretability, reduced computational workload, bias-free modeling and careful image preprocessing.</p><p><strong>Results: </strong>First, an automatic preprocessing procedure is devised, aimed to remove artifacts from clinical data, feeding then the resulting images to an aggregated per-patient model to mimic the clinicians decision process. The predictions are based on multiple snapshots obtained through resampling, reducing the risk of misleading outcomes by removing the low confidence predictions. Each patient's outcome is explained by returning the images the prediction is based upon, supporting clinicians in verifying diagnoses without the need for evaluating the full set of endoscopic images. As a major theoretical contribution, quantization is employed to reduce the complexity and the computational cost of the model, allowing its deployment on small power devices with an almost negligible 3% performance degradation. Such quantization procedure holds relevance not only in the context of per-patient models but also for assessing its feasibility in providing real-time support to clinicians even in low-resources environments. The pipeline is demonstrated on a private dataset of endoscopic images of 758 IBD patients and 601 healthy controls, achieving Matthews Correlation Coefficient 0.9 as top performance on test set.</p><p><strong>Conclusion: </strong>We highlighted how a comprehensive pre-processing pipeline plays a crucial role in identifying and removing artifacts from data, solving one of the principal challenges encountered when working with clinical data. Furthermore, we constructively showed how it is possible to emulate clinicians decision process and how it offers significant advantages, particularly in terms of explainability and trust within the healthcare context. Last but not least, we proved that quantization can be a useful tool to reduce the time and resources consumption with an acceptable degradation of the model performs. The quantization study proposed in this work points up the potential development of real-time quantized algorithms as valuable tools to support clinicians during endoscopy procedures.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"33"},"PeriodicalIF":4.5,"publicationDate":"2023-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10675910/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138435274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata MiningPub Date : 2023-11-15DOI: 10.1186/s13040-023-00349-7
Sana Munquad, Asim Bikas Das
{"title":"DeepAutoGlioma: a deep learning autoencoder-based multi-omics data integration and classification tools for glioma subtyping.","authors":"Sana Munquad, Asim Bikas Das","doi":"10.1186/s13040-023-00349-7","DOIUrl":"10.1186/s13040-023-00349-7","url":null,"abstract":"<p><strong>Background and objective: </strong>The classification of glioma subtypes is essential for precision therapy. Due to the heterogeneity of gliomas, the subtype-specific molecular pattern can be captured by integrating and analyzing high-throughput omics data from different genomic layers. The development of a deep-learning framework enables the integration of multi-omics data to classify the glioma subtypes to support the clinical diagnosis.</p><p><strong>Results: </strong>Transcriptome and methylome data of glioma patients were preprocessed, and differentially expressed features from both datasets were identified. Subsequently, a Cox regression analysis determined genes and CpGs associated with survival. Gene set enrichment analysis was carried out to examine the biological significance of the features. Further, we identified CpG and gene pairs by mapping them in the promoter region of corresponding genes. The methylation and gene expression levels of these CpGs and genes were embedded in a lower-dimensional space with an autoencoder. Next, ANN and CNN were used to classify subtypes using the latent features from embedding space. CNN performs better than ANN for subtyping lower-grade gliomas (LGG) and glioblastoma multiforme (GBM). The subtyping accuracy of CNN was 98.03% (± 0.06) and 94.07% (± 0.01) in LGG and GBM, respectively. The precision of the models was 97.67% in LGG and 90.40% in GBM. The model sensitivity was 96.96% in LGG and 91.18% in GBM. Additionally, we observed the superior performance of CNN with external datasets. The genes and CpGs pairs used to develop the model showed better performance than the random CpGs-gene pairs, preprocessed data, and single omics data.</p><p><strong>Conclusions: </strong>The current study showed that a novel feature selection and data integration strategy led to the development of DeepAutoGlioma, an effective framework for diagnosing glioma subtypes.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"32"},"PeriodicalIF":4.5,"publicationDate":"2023-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10652591/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134650252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Research agenda for using artificial intelligence in health governance: interpretive scoping review and framework.","authors":"Maryam Ramezani, Amirhossein Takian, Ahad Bakhtiari, Hamid R Rabiee, Sadegh Ghazanfari, Saharnaz Sazgarnejad","doi":"10.1186/s13040-023-00346-w","DOIUrl":"10.1186/s13040-023-00346-w","url":null,"abstract":"<p><strong>Background: </strong>The governance of health systems is complex in nature due to several intertwined and multi-dimensional factors contributing to it. Recent challenges of health systems reflect the need for innovative approaches that can minimize adverse consequences of policies. Hence, there is compelling evidence of a distinct outlook on the health ecosystem using artificial intelligence (AI). Therefore, this study aimed to investigate the roles of AI and its applications in health system governance through an interpretive scoping review of current evidence.</p><p><strong>Method: </strong>This study intended to offer a research agenda and framework for the applications of AI in health systems governance. To include shreds of evidence with a greater focus on the application of AI in health governance from different perspectives, we searched the published literature from 2000 to 2023 through PubMed, Scopus, and Web of Science Databases.</p><p><strong>Results: </strong>Our findings showed that integrating AI capabilities into health systems governance has the potential to influence three cardinal dimensions of health. These include social determinants of health, elements of governance, and health system tasks and goals. AI paves the way for strengthening the health system's governance through various aspects, i.e., intelligence innovations, flexible boundaries, multidimensional analysis, new insights, and cognition modifications to the health ecosystem area.</p><p><strong>Conclusion: </strong>AI is expected to be seen as a tool with new applications and capabilities, with the potential to change each component of governance in the health ecosystem, which can eventually help achieve health-related goals.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"31"},"PeriodicalIF":4.5,"publicationDate":"2023-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10617108/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71414915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata MiningPub Date : 2023-10-26DOI: 10.1186/s13040-023-00347-9
Qian Yan, Wenjiang Zheng, Boqing Wang, Baoqian Ye, Huiyan Luo, Xinqian Yang, Ping Zhang, Xiongwen Wang
{"title":"Correction: A prognostic model based on seven immune-related genes predicts the overall survival of patients with hepatocellular carcinoma.","authors":"Qian Yan, Wenjiang Zheng, Boqing Wang, Baoqian Ye, Huiyan Luo, Xinqian Yang, Ping Zhang, Xiongwen Wang","doi":"10.1186/s13040-023-00347-9","DOIUrl":"10.1186/s13040-023-00347-9","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"30"},"PeriodicalIF":4.5,"publicationDate":"2023-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10605871/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"54231823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prescription pattern analysis of Type 2 Diabetes Mellitus: a cross-sectional study in Isfahan, Iran.","authors":"Elnaz Ziad, Somayeh Sadat, Farshad Farzadfar, Mohammad-Reza Malekpour","doi":"10.1186/s13040-023-00344-y","DOIUrl":"10.1186/s13040-023-00344-y","url":null,"abstract":"<p><strong>Background: </strong>Patients with Type 2 Diabetes Mellitus (T2DM) are at a higher risk of polypharmacy and more susceptible to irrational prescriptions; therefore, pharmacological therapy patterns are important to be monitored. The primary objective of this study was to highlight current prescription patterns in T2DM patients and compare them with existing Standards of Medical Care in Diabetes. The second objective was to analyze whether age and gender affect prescription patterns.</p><p><strong>Method: </strong>This cross-sectional study was conducted using the Iran Health Insurance Organization (IHIO) prescription database. It was mined by an Association Rule Mining (ARM) technique, FP-Growth, in order to find co-prescribed drugs with anti-diabetic medications. The algorithm was implemented at different levels of the Anatomical Therapeutic Chemical (ATC) classification system, which assigns different codes to drugs based on their anatomy, pharmacological, therapeutic, and chemical properties to provide an in-depth analysis of co-prescription patterns.</p><p><strong>Results: </strong>Altogether, the prescriptions of 914,652 patients were analyzed, of whom 91,505 were found to have diabetes. According to our results, prescribing Lipid Modifying Agents (C10) (56.3%), Agents Acting on The Renin-Angiotensin System (C09) (48.9%), Antithrombotic Agents (B01) (35.7%), and Beta Blocking Agents (C07) (30.1%) were meaningfully associated with the prescription of Drugs Used in Diabetes. Our study also revealed that female diabetic patients have a higher lift for taking Thyroid Preparations, and the older the patients were, the more they were prone to take neuropathy-related medications. Additionally, the results suggest that there are gender differences in the association between aspirin and diabetes drugs, with the differences becoming less pronounced in old age.</p><p><strong>Conclusions: </strong>Almost all of the association rules found in this research were clinically meaningful, proving the potential of ARM for co-prescription pattern discovery. Moreover, implementing level-based ARM was effective in detecting difficult-to-spot rules. Additionally, the majority of drugs prescribed by physicians were consistent with the Standards of Medical Care in Diabetes.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"29"},"PeriodicalIF":4.5,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10588025/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49683949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata MiningPub Date : 2023-10-09DOI: 10.1186/s13040-023-00345-x
Zhenxiang He, Xiaoxia Li, Yuling Chen, Nianzu Lv, Yong Cai
{"title":"Attention-based dual-path feature fusion network for automatic skin lesion segmentation.","authors":"Zhenxiang He, Xiaoxia Li, Yuling Chen, Nianzu Lv, Yong Cai","doi":"10.1186/s13040-023-00345-x","DOIUrl":"10.1186/s13040-023-00345-x","url":null,"abstract":"<p><p>Automatic segmentation of skin lesions is a critical step in Computer Aided Diagnosis (CAD) of melanoma. However, due to the blurring of the lesion boundary, uneven color distribution, and low image contrast, resulting in poor segmentation result. Aiming at the problem of difficult segmentation of skin lesions, this paper proposes an Attention-based Dual-path Feature Fusion Network (ADFFNet) for automatic skin lesion segmentation. Firstly, in the spatial path, a Boundary Refinement (BR) module is designed for the output of low-level features to filter out irrelevant background information and retain more boundary details of the lesion area. Secondly, in the context path, a Multi-scale Feature Selection (MFS) module is constructed for high-level feature output to capture multi-scale context information and use the attention mechanism to filter out redundant semantic information. Finally, we design a Dual-path Feature Fusion (DFF) module, which uses high-level global attention information to guide the step-by-step fusion of high-level semantic features and low-level detail features, which is beneficial to restore image detail information and further improve the pixel-level segmentation accuracy of skin lesion. In the experiment, the ISIC 2018 and PH2 datasets are employed to evaluate the effectiveness of the proposed method. It achieves a performance of 0.890/ 0.925 and 0.933 /0.954 on the F1-score and SE index, respectively. Comparative analysis with state-of-the-art segmentation methods reveals that the ADFFNet algorithm exhibits superior segmentation performance.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"28"},"PeriodicalIF":4.5,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10561442/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41155445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata MiningPub Date : 2023-10-06DOI: 10.1186/s13040-023-00343-z
Naya Nagy, Matthew Stuart-Edwards, Marius Nagy, Liam Mitchell, Athanasios Zovoilis
{"title":"Quantum analysis of squiggle data.","authors":"Naya Nagy, Matthew Stuart-Edwards, Marius Nagy, Liam Mitchell, Athanasios Zovoilis","doi":"10.1186/s13040-023-00343-z","DOIUrl":"10.1186/s13040-023-00343-z","url":null,"abstract":"<p><p>Squiggle data is the numerical output of DNA and RNA sequencing by the Nanopore next generation sequencing platform. Nanopore sequencing offers expanded applications compared to previous sequencing techniques but produces a large amount of data in the form of current measurements over time. The analysis of these segments of current measurements require more complex and computationally intensive algorithms than previous sequencing technologies. The purpose of this study is to investigate in principle the potential of using quantum computers to speed up Nanopore data analysis. Quantum circuits are designed to extract major features of squiggle current measurements. The circuits are analyzed theoretically in terms of size and performance. Practical experiments on IBM QX show the limitations of the state of the art quantum computer to tackle real life squiggle data problems. Nevertheless, pre-processing of the squiggle data using the inverse wavelet transform, as experimented and analyzed in this paper as well, reduces the dimensionality of the problem in order to fit a reasonable size quantum computer in the hopefully near future.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"27"},"PeriodicalIF":4.5,"publicationDate":"2023-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10557310/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41135068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata MiningPub Date : 2023-09-26DOI: 10.1186/s13040-023-00341-1
Sofia Martins, Roberta Coletti, Marta B Lopes
{"title":"Disclosing transcriptomics network-based signatures of glioma heterogeneity using sparse methods.","authors":"Sofia Martins, Roberta Coletti, Marta B Lopes","doi":"10.1186/s13040-023-00341-1","DOIUrl":"10.1186/s13040-023-00341-1","url":null,"abstract":"<p><p>Gliomas are primary malignant brain tumors with poor survival and high resistance to available treatments. Improving the molecular understanding of glioma and disclosing novel biomarkers of tumor development and progression could help to find novel targeted therapies for this type of cancer. Public databases such as The Cancer Genome Atlas (TCGA) provide an invaluable source of molecular information on cancer tissues. Machine learning tools show promise in dealing with the high dimension of omics data and extracting relevant information from it. In this work, network inference and clustering methods, namely Joint Graphical lasso and Robust Sparse K-means Clustering, were applied to RNA-sequencing data from TCGA glioma patients to identify shared and distinct gene networks among different types of glioma (glioblastoma, astrocytoma, and oligodendroglioma) and disclose new patient groups and the relevant genes behind groups' separation. The results obtained suggest that astrocytoma and oligodendroglioma have more similarities compared with glioblastoma, highlighting the molecular differences between glioblastoma and the others glioma subtypes. After a comprehensive literature search on the relevant genes pointed our from our analysis, we identified potential candidates for biomarkers of glioma. Further molecular validation of these genes is encouraged to understand their potential role in diagnosis and in the design of novel therapies.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"26"},"PeriodicalIF":4.5,"publicationDate":"2023-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10523751/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41161853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata MiningPub Date : 2023-09-04DOI: 10.1186/s13040-023-00342-0
John T Gregg, Jason H Moore
{"title":"STAR_outliers: a python package that separates univariate outliers from non-normal distributions.","authors":"John T Gregg, Jason H Moore","doi":"10.1186/s13040-023-00342-0","DOIUrl":"10.1186/s13040-023-00342-0","url":null,"abstract":"<p><p>There are not currently any univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, even fewer model kurtosis, and none of them model bimodality and monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and it also models the outlier bounds of real data distributions with greater accuracy.Background Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness fails to remove outliers, while underestimating tail heaviness incorrectly removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them to be inadequate for univariate outlier removal. They usually do not allow for univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions.Results In order to effectively model arbitrarily shaped univariate distributions, we have combined several well-established algorithms into a new algorithm called STAR_outliers. STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Since the IF algorithm's default model poorly fit the outliership scores, we also compared the isolation forest algorithm with a model that entails removing as many datapoints as STAR_outliers does in order of decreasing outliership scores. We also compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the main 99.3 percent of the fitted model's domain. We show that our STAR_outliers algorithm removes signif","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"25"},"PeriodicalIF":4.5,"publicationDate":"2023-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10476292/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10166430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biodata MiningPub Date : 2023-08-22DOI: 10.1186/s13040-023-00340-2
Nelson E Ordoñez-Guillen, Jose Luis Gonzalez-Compean, Ivan Lopez-Arevalo, Miguel Contreras-Murillo, Edwin Aldana-Bobadilla
{"title":"Machine learning based study for the classification of Type 2 diabetes mellitus subtypes.","authors":"Nelson E Ordoñez-Guillen, Jose Luis Gonzalez-Compean, Ivan Lopez-Arevalo, Miguel Contreras-Murillo, Edwin Aldana-Bobadilla","doi":"10.1186/s13040-023-00340-2","DOIUrl":"10.1186/s13040-023-00340-2","url":null,"abstract":"<p><strong>Purpose: </strong>Data-driven diabetes research has increased its interest in exploring the heterogeneity of the disease, aiming to support in the development of more specific prognoses and treatments within the so-called precision medicine. Recently, one of these studies found five diabetes subgroups with varying risks of complications and treatment responses. Here, we tackle the development and assessment of different models for classifying Type 2 Diabetes (T2DM) subtypes through machine learning approaches, with the aim of providing a performance comparison and new insights on the matter.</p><p><strong>Methods: </strong>We developed a three-stage methodology starting with the preprocessing of public databases NHANES (USA) and ENSANUT (Mexico) to construct a dataset with N = 10,077 adult diabetes patient records. We used N = 2,768 records for training/validation of models and left the remaining (N = 7,309) for testing. In the second stage, groups of observations -each one representing a T2DM subtype- were identified. We tested different clustering techniques and strategies and validated them by using internal and external clustering indices; obtaining two annotated datasets Dset A and Dset B. In the third stage, we developed different classification models assaying four algorithms, seven input-data schemes, and two validation settings on each annotated dataset. We also tested the obtained models using a majority-vote approach for classifying unseen patient records in the hold-out dataset.</p><p><strong>Results: </strong>From the independently obtained bootstrap validation for Dset A and Dset B, mean accuracies across all seven data schemes were [Formula: see text] ([Formula: see text]) and [Formula: see text] ([Formula: see text]), respectively. Best accuracies were [Formula: see text] and [Formula: see text]. Both validation setting results were consistent. For the hold-out dataset, results were consonant with most of those obtained in the literature in terms of class proportions.</p><p><strong>Conclusion: </strong>The development of machine learning systems for the classification of diabetes subtypes constitutes an important task to support physicians for fast and timely decision-making. We expect to deploy this methodology in a data analysis platform to conduct studies for identifying T2DM subtypes in patient records from hospitals.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"24"},"PeriodicalIF":4.5,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10463725/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10173698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}