Semi-supervised deep matrix factorization model for clustering multi-omics data.

IF 4.8 · Zone 2 (Medicine) · JCR Q1: COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Computer methods and programs in biomedicine Pub Date : 2025-10-08 DOI:10.1016/j.cmpb.2025.109094

Khanh Luong, Nirav Joshi, Richi Nayak

Volume 273, article 109094. Publication type: Journal Article. Open access: no.

Citations: 0

Abstract

Background and objective: Multi-omics data are inherently high-dimensional, sparse, and noisy, posing significant challenges for clustering and integration. Conventional clustering and linear dimensionality reduction methods often fail to handle noise effectively or provide interpretability, while standard non-negative matrix factorization approaches are too shallow to capture non-linear patterns. Multi-view non-negative matrix factorization enables integration of complementary views, but it remains primarily unsupervised and seldom leverages available label information.

Methods: We propose SSD-MO, a Semi-Supervised Deep Non-Negative Matrix Factorization model for Multi-Omics Data, designed to address these challenges by leveraging both labelled and unlabelled samples for enhanced data integration and clustering performance. SSD-MO combines semi-supervised learning with a multi-layer deep factorization framework, preserving local geometric structure and incorporating orthogonal and diversity constraints. Its effectiveness was validated on six multi-omics datasets from The Cancer Genome Atlas, using evaluation metrics such as clustering accuracy, normalized mutual information, and F-scores.
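The multi-layer factorization idea can be illustrated with a minimal sketch. This is not the SSD-MO algorithm: it assumes plain greedy layer-wise non-negative matrix factorization with standard multiplicative updates, and it omits the label, graph, orthogonality, and diversity terms mentioned above. The function names `nmf` and `deep_nmf`, the layer ranks, and the random toy matrix are all hypothetical choices made for illustration.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Single-layer NMF, X ~ W @ H, using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    W = rng.random((n_features, rank)) + eps
    H = rng.random((rank, n_samples)) + eps
    for _ in range(n_iter):
        # Updates only multiply by non-negative ratios, so W and H stay >= 0.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def deep_nmf(X, layer_ranks, **kwargs):
    """Greedy layer-wise factorization: X ~ W1 @ W2 @ ... @ WL @ HL.

    Each layer factorizes the representation produced by the previous one.
    """
    Ws, H = [], X
    for rank in layer_ranks:
        W, H = nmf(H, rank, **kwargs)
        Ws.append(W)
    return Ws, H

# Toy stand-in for one omics view: 50 features x 30 samples (assumed data).
rng = np.random.default_rng(42)
X = rng.random((50, 30))

Ws, H = deep_nmf(X, layer_ranks=[10, 3])

# Assign each sample to the component with the largest final-layer weight.
labels = H.argmax(axis=0)

# Relative reconstruction error of the stacked factors.
recon = Ws[0] @ Ws[1] @ H
rel_err = np.linalg.norm(X - recon) / np.linalg.norm(X)
print(labels, round(float(rel_err), 3))
```

In a multi-view setting one would typically factorize each omics view and couple the final representations; how SSD-MO does this is not specified in the abstract.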

Results: SSD-MO significantly improved clustering accuracy, increasing the F-score by 9%-24% over unsupervised baselines and by 7%-20% over semi-supervised benchmarks. Precision (64%-73%) and recall (70%-88%) values further demonstrated its robust performance across datasets.
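For readers unfamiliar with how precision, recall, and F-score apply to clustering, the sketch below shows one common convention: counting pairs of samples. The abstract does not say which definition the authors used, so the pairwise formulation, the function name `pairwise_prf`, and the toy labels are assumptions for illustration only.

```python
from itertools import combinations

def pairwise_prf(true_labels, pred_labels):
    """Pairwise precision, recall, and F-score for a clustering.

    A pair of samples is a true positive when both the reference labels
    and the predicted clusters put the two samples together.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1  # clustered together but different classes
        elif same_true:
            fn += 1  # same class but split across clusters
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

# Toy example: six samples, two classes, one sample misassigned.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]
precision, recall, f_score = pairwise_prf(true_labels, pred_labels)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
# prints: 0.571 0.667 0.615
```

Because the measure is defined on pairs, it is invariant to how cluster identifiers are numbered, which avoids having to match predicted clusters to classes first.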

Conclusion: This method provides a robust framework for multi-omics data integration and holds promise for applications in genomics and precision medicine.


Source journal

Computer Methods and Programs in Biomedicine (Engineering & Technology: Engineering, Biomedical)

CiteScore

12.30

Self-citation rate

6.60%

Articles published

601

Review time

135 days

Journal description: To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine. Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.