Unsupervised learning to identify symptom clusters in older adults undergoing chemotherapy
Erika Ramsdale, Yilin Zhou, Lisa Smith, Huiwen Xu, Rachael Tylock, Marie Flannery, Supriya Mohile, Ajay Anand
Journal of Geriatric Oncology, 16(3), Article 102222. Published 2025-03-14. DOI: 10.1016/j.jgo.2025.102222
https://www.sciencedirect.com/science/article/pii/S1879406825000384
Abstract
Introduction
Unsupervised machine learning (ML) approaches such as clustering have not been commonly applied to patient-reported data. This study describes ML methods to explore and describe patient-reported symptom trajectories in older adults receiving chemotherapy.
Materials and Methods
This secondary analysis used prospectively collected data from the GAP 70+ Trial (NCT02054741; PI: Mohile), in which patient-reported symptoms were collected at baseline (pre-chemotherapy), six weeks, three months, and six months. Complete patient-reported symptom data were available for at least one timepoint for 708/718 patients (98.6%). Correlation analysis was performed on all symptom items. Multiple clustering algorithms were applied to selected baseline symptoms as an exploratory analysis, using the gap statistic and elbow plots to estimate the optimal number of clusters for each algorithm. Silhouette scores and t-distributed stochastic neighbor embedding (t-SNE) plots were generated for each algorithm. Hierarchical agglomerative clustering was applied to symptoms at each timepoint, and the clusters generated for each timepoint were examined longitudinally using statistical measures, violin plots, and a Sankey diagram.
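To make the silhouette score used in the methods concrete, the following is a minimal standard-library sketch of how such a score can be computed. It is not the authors' code: the function name `silhouette_score`, the use of Euclidean distance, and the four toy points are all assumptions made for illustration.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""
    # Group point indices by their cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = mean(dist(points[i], points[j]) for j in own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical toy data: two tight, well-separated groups of points.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
well_separated = silhouette_score(toy_points, [0, 0, 1, 1])
mixed_up = silhouette_score(toy_points, [0, 1, 0, 1])
```

Scores near 1 indicate compact, well-separated clusters, scores near 0 indicate overlapping clusters, and negative scores suggest that points sit closer to a neighboring cluster than to their own.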
Results
Twenty-six patient-reported items were used for clustering analyses, representing symptom severity and interference. There was significant variability in how different unsupervised learning algorithms clustered the baseline symptom data. Silhouette scores ranged from −0.22 (OPTICS) to 0.16 (BIRCH). When agglomerative clustering was examined across timepoints, cluster composition was largely driven by the symptom sum score (i.e., the sum of the Likert-scale item scores). Most patients had “low” symptoms at baseline that remained low, but symptom trajectories were otherwise heterogeneous. A small number of patients had high hand-foot/neuropathy symptoms (but low other symptoms) at six weeks, and another small cluster had high mucosal toxicity at six months. Despite the specific symptom patterns in these small clusters, chemotherapy regimens varied within them.
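The two quantities behind these results, the symptom sum score and the movement of patients between clusters over time (the flows a Sankey diagram draws), can be sketched in a few lines. This is an illustrative sketch only: the function names, the patient identifiers, the "low"/"high" labels, and the choice to count only patients labeled at both of two consecutive timepoints are assumptions, not details taken from the study.

```python
from collections import Counter


def symptom_sum_score(item_scores):
    """Composite score: the plain sum of one patient's Likert-scale item scores."""
    return sum(item_scores)


def transition_counts(labels_by_timepoint):
    """Count patients moving between clusters at consecutive timepoints.

    labels_by_timepoint is an ordered list of (timepoint, {patient_id: label}).
    Assumption: a patient is counted only when labeled at both timepoints.
    """
    flows = Counter()
    pairs = zip(labels_by_timepoint, labels_by_timepoint[1:])
    for (t_from, labels_from), (t_to, labels_to) in pairs:
        for pid in labels_from.keys() & labels_to.keys():
            flows[(t_from, labels_from[pid], t_to, labels_to[pid])] += 1
    return flows


# Hypothetical cluster labels for three made-up patients.
toy_labels = [
    ("baseline", {"p1": "low", "p2": "low", "p3": "high"}),
    ("six weeks", {"p1": "low", "p2": "high", "p3": "high"}),
    ("three months", {"p1": "low", "p3": "low"}),  # p2 has no data here
]
toy_flows = transition_counts(toy_labels)
toy_score = symptom_sum_score([0, 1, 3, 2])
```

Each key in `toy_flows` describes one link of a Sankey diagram (source timepoint and cluster, target timepoint and cluster), and its count gives the link width.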
Discussion
Unsupervised machine learning techniques may help in understanding longitudinal patient-reported data such as symptoms. They permit data-driven exploration, which may uncover patterns to inform hypotheses or further analysis (e.g., outcome prediction). Results of clustering analyses should be validated through further hypothesis-driven analysis. In this analysis, it was challenging to uncover consistent symptom patterns, though the results suggest that symptom composite (sum) scores may warrant further investigation. Clinicians should understand the philosophy, strengths, and limitations of an unsupervised machine learning approach applied to patient data.
About the journal:
The Journal of Geriatric Oncology is an international, multidisciplinary journal which is focused on advancing research in the treatment and survivorship issues of older adults with cancer, as well as literature relevant to education and policy development in geriatric oncology.
The journal welcomes the submission of manuscripts in the following categories:
• Original research articles
• Review articles
• Clinical trials
• Education and training articles
• Short communications
• Perspectives
• Meeting reports
• Letters to the Editor.