Identification of Clusters in a Population With Obesity Using Machine Learning: Secondary Analysis of The Maastricht Study.

IF 3.1; Zone 3, Medicine; JCR Q2, Medical Informatics

JMIR Medical Informatics Pub Date : 2025-02-05 DOI:10.2196/64479

Maik Jm Beuken, Melanie Kleynen, Susy Braun, Kees Van Berkel, Carla van der Kallen, Annemarie Koster, Hans Bosma, Tos Tjm Berendschot, Alfons Jhm Houben, Nicole Dukers-Muijrers, Joop P van den Bergh, Abraham A Kroon, Iris M Kanera

JMIR Medical Informatics, volume 13, e64479. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11840370/pdf/


Abstract

Background: Modern lifestyle risk factors, such as physical inactivity and poor nutrition, contribute to rising rates of obesity and of chronic diseases such as type 2 diabetes and heart disease. Personalized interventions in particular have been shown to be effective for long-term behavior change. Machine learning can be used to uncover insights without predefined hypotheses, revealing complex relationships and distinct population clusters. New data-driven approaches, such as the factor probabilistic distance clustering algorithm, provide opportunities to identify potentially meaningful clusters within large and complex datasets.

Objective: This study aimed to identify potential clusters and relevant variables among individuals with obesity using a data-driven and hypothesis-free machine learning approach.

Methods: We used cross-sectional data from individuals with abdominal obesity from The Maastricht Study. Data (2971 variables) included demographics, lifestyle, biomedical aspects, advanced phenotyping, and social factors (cohort 2010). The factor probabilistic distance clustering algorithm was applied to detect clusters within this high-dimensional data. To identify a subset of distinct, minimally redundant, predictive variables, we used the statistically equivalent signature algorithm. To describe the clusters, we applied measures of central tendency and variability, and we assessed the distinctiveness of the clusters on the variables that emerged, using the F test for continuous variables and the chi-square test for categorical variables at a significance level of α=.001.
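The abstract names the clustering algorithm but does not describe its mechanics. As a rough illustration only, the sketch below implements plain probabilistic distance clustering, which, as we understand it, is the building block that the factor variant combines with a dimension-reduction step. It is not the authors' pipeline. The function names `memberships` and `pd_cluster`, the toy 2-D points, and the starting centers are all assumptions made for this example. The core idea is that a point's membership probability in a cluster is inversely proportional to its distance from that cluster's center.

```python
import math

def memberships(point, centers, eps=1e-12):
    """Membership probabilities under the rule p_k * d_k = constant."""
    # Distances to every center; eps guards against division by zero.
    d = [max(math.dist(point, c), eps) for c in centers]
    inv = [1.0 / dk for dk in d]
    total = sum(inv)
    return [v / total for v in inv], d

def pd_cluster(points, init_centers, n_iter=50):
    """Plain probabilistic distance clustering (no factor step)."""
    centers = [list(c) for c in init_centers]
    k, dim = len(centers), len(points[0])
    for _ in range(n_iter):
        num = [[0.0] * dim for _ in range(k)]
        den = [0.0] * k
        for x in points:
            p, d = memberships(x, centers)
            for j in range(k):
                w = p[j] ** 2 / d[j]  # weight of point x for center j
                den[j] += w
                for a in range(dim):
                    num[j][a] += w * x[a]
        # Each center becomes the weighted mean of all points.
        centers = [[num[j][a] / den[j] for a in range(dim)] for j in range(k)]
    labels = []
    for x in points:
        p, _ = memberships(x, centers)
        labels.append(max(range(k), key=lambda j: p[j]))
    return centers, labels

# Toy data (hypothetical): two well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels = pd_cluster(points, init_centers=[(2, 3), (8, 7)])
print(labels)
```

The starting centers are deliberately placed away from the data points: a center that coincides with a point receives a near-infinite weight from it and barely moves. Real analyses on thousands of variables would also need scaling, missing-data handling, and the factor step, none of which is shown here.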

Results: We identified 3 distinct clusters (including 4128/9188, 44.93% of all data points) among individuals with obesity (n=4128). The most significant continuous variable for distinguishing cluster 1 (n=1458) from clusters 2 and 3 combined (n=2670) was lower energy intake (mean 1684, SD 393 kcal/day vs mean 2358, SD 635 kcal/day; P<.001). The most significant categorical variable was occupation (P<.001). A significantly higher proportion of participants in cluster 1 (1236/1458, 84.77%) did not work, compared with clusters 2 and 3 combined (1486/2670, 55.66%; P<.001). For cluster 2 (n=1521), the most significant continuous variable was higher energy intake (mean 2755, SD 506.2 kcal/day vs mean 1749, SD 375 kcal/day; P<.001). The most significant categorical variable was sex (P<.001). A significantly higher proportion of participants in cluster 2 (997/1521, 65.55%) were male, compared with the other 2 clusters (885/2607, 33.95%; P<.001). For cluster 3 (n=1149), the most significant continuous variable was higher overall cognitive functioning (mean 0.2349, SD 0.5702 vs mean -0.3088, SD 0.7212; P<.001), and educational level was the most significant categorical variable (P<.001). A significantly higher proportion of participants in cluster 3 (475/1149, 41.34%) had received higher vocational or university education, compared with clusters 1 and 2 combined (729/2979, 24.47%; P<.001).
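The reported comparisons can be roughly reproduced from the summary figures alone. The sketch below is a minimal plausibility check, not the authors' analysis code. It assumes that occupation can be collapsed to a 2x2 table (not working vs working) with no missing values, that the energy-intake means and SDs apply to the full cluster sizes, and that the F test was a one-way ANOVA. The function names are hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Exact upper-tail probability for 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

def f_from_summary(groups):
    """One-way ANOVA F statistic from (n, mean, sd) tuples."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Occupation, cluster 1 vs clusters 2 and 3 combined (counts from the abstract).
not_working_c1, n_c1 = 1236, 1458
not_working_rest, n_rest = 1486, 2670
chi2, p = chi_square_2x2(not_working_c1, n_c1 - not_working_c1,
                         not_working_rest, n_rest - not_working_rest)

# Energy intake (kcal/day), cluster 1 vs clusters 2 and 3 combined.
f_stat, df_b, df_w = f_from_summary([(1458, 1684, 393), (2670, 2358, 635)])

print(f"proportions: {100 * not_working_c1 / n_c1:.2f}% vs "
      f"{100 * not_working_rest / n_rest:.2f}%")
print(f"chi-square = {chi2:.1f}, below alpha .001: {p < 0.001}")
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

Both statistics are far beyond the thresholds implied by α=.001, which is consistent with the reported P<.001. With samples this large, even small differences reach significance, so the effect sizes (for example, 84.77% vs 55.66%) are the more informative numbers.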

Conclusions: This study demonstrates that a hypothesis-free and fully data-driven approach can be used to identify distinguishable participant clusters in large and complex datasets and find relevant variables that differ within populations with obesity.


Source journal: JMIR Medical Informatics (Medicine-Health Informatics)

CiteScore: 7.90

Self-citation rate: 3.10%

Articles published: 173

Review time: 12 weeks

About the journal: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It focuses on applied, translational research, with a broad readership including clinicians, CIOs, engineers, and industry and health informatics professionals. It is published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175). JMIR Med Inform has a slightly different scope (placing more emphasis on applications for clinicians and health professionals than on consumers and citizens, who are the focus of JMIR), publishes even faster, and also accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.