Maik Jm Beuken, Melanie Kleynen, Susy Braun, Kees Van Berkel, Carla van der Kallen, Annemarie Koster, Hans Bosma, Tos Tjm Berendschot, Alfons Jhm Houben, Nicole Dukers-Muijrers, Joop P van den Bergh, Abraham A Kroon, Iris M Kanera
{"title":"Identification of Clusters in a Population With Obesity Using Machine Learning: Secondary Analysis of The Maastricht Study.","authors":"Maik Jm Beuken, Melanie Kleynen, Susy Braun, Kees Van Berkel, Carla van der Kallen, Annemarie Koster, Hans Bosma, Tos Tjm Berendschot, Alfons Jhm Houben, Nicole Dukers-Muijrers, Joop P van den Bergh, Abraham A Kroon, Iris M Kanera","doi":"10.2196/64479","DOIUrl":"https://doi.org/10.2196/64479","url":null,"abstract":"<p><strong>Background: </strong>Modern lifestyle risk factors, like physical inactivity and poor nutrition, contribute to rising rates of obesity and chronic diseases like type 2 diabetes and heart disease. Particularly personalized interventions have been shown to be effective for long-term behavior change. Machine learning can be used to uncover insights without predefined hypotheses, revealing complex relationships and distinct population clusters. New data-driven approaches, such as the factor probabilistic distance clustering algorithm, provide opportunities to identify potentially meaningful clusters within large and complex datasets.</p><p><strong>Objective: </strong>This study aimed to identify potential clusters and relevant variables among individuals with obesity using a data-driven and hypothesis-free machine learning approach.</p><p><strong>Methods: </strong>We used cross-sectional data from individuals with abdominal obesity from The Maastricht Study. Data (2971 variables) included demographics, lifestyle, biomedical aspects, advanced phenotyping, and social factors (cohort 2010). The factor probabilistic distance clustering algorithm was applied in order to detect clusters within this high-dimensional data. To identify a subset of distinct, minimally redundant, predictive variables, we used the statistically equivalent signature algorithm. 
To describe the clusters, we applied measures of central tendency and variability, and we assessed the distinctiveness of the clusters on the variables that emerged using the F test for continuous variables and the chi-square test for categorical variables at a significance level of α=.001.</p><p><strong>Results: </strong>We identified 3 distinct clusters (including 4128/9188, 44.93% of all data points) among individuals with obesity (n=4128). The most significant continuous variable for distinguishing cluster 1 (n=1458) from clusters 2 and 3 combined (n=2670) was lower energy intake (mean 1684, SD 393 kcal/day vs mean 2358, SD 635 kcal/day; P<.001). The most significant categorical variable was occupation (P<.001). A significantly higher proportion (1236/1458, 84.77%) in cluster 1 did not work compared to clusters 2 and 3 combined (1486/2670, 55.66%; P<.001). For cluster 2 (n=1521), the most significant continuous variable was higher energy intake (mean 2755, SD 506.2 kcal/day vs mean 1749, SD 375 kcal/day; P<.001). The most significant categorical variable was sex (P<.001). A significantly higher proportion (997/1521, 65.55%) in cluster 2 were male compared to the other 2 clusters (885/2607, 33.95%; P<.001). For cluster 3 (n=1149), the most significant continuous variable was overall higher cognitive functioning (mean 0.2349, SD 0.5702 vs mean -0.3088, SD 0.7212; P<.001), and educational level was the most significant categorical variable (P<.001). 
A significantly higher proportion (475/1149, 41.34%) in cluster 3 received higher vocational or university education in comparison to clusters 1","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e64479"},"PeriodicalIF":3.1,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143190464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
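The categorical comparisons reported in the abstract above can be illustrated with a small worked example. The sketch below computes a Pearson chi-square statistic for a 2x2 table built from the reported occupation counts (1236/1458 not working in cluster 1 vs 1486/2670 in clusters 2 and 3 combined). It is a minimal sketch, not the authors' analysis: the abstract does not say which chi-square variant or software was used, so the uncorrected Pearson form and the function name `chi_square_2x2` are assumptions made here for illustration.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction)
    for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


# Counts taken from the abstract (occupation: not working vs working).
not_working_c1, total_c1 = 1236, 1458        # cluster 1
not_working_rest, total_rest = 1486, 2670    # clusters 2 and 3 combined

stat = chi_square_2x2(
    not_working_c1, total_c1 - not_working_c1,
    not_working_rest, total_rest - not_working_rest,
)

# Critical value of the chi-square distribution with 1 degree of
# freedom at alpha = .001.
CRITICAL_DF1_ALPHA_001 = 10.828

print(f"share not working, cluster 1: {100 * not_working_c1 / total_c1:.2f}%")
print(f"share not working, clusters 2+3: {100 * not_working_rest / total_rest:.2f}%")
print(f"chi-square statistic: {stat:.1f}")
print("significant at alpha=.001:", stat > CRITICAL_DF1_ALPHA_001)
```

The statistic is compared against the critical value for 1 degree of freedom, which is consistent with the P<.001 reported for this comparison under the stated assumptions.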
Daniel Tawfik, Tait D Shanafelt, Mohsen Bayati, Jochen Profit
{"title":"Electronic Health Record Use Patterns Among Well-Being Survey Responders and Nonresponders: Longitudinal Observational Study.","authors":"Daniel Tawfik, Tait D Shanafelt, Mohsen Bayati, Jochen Profit","doi":"10.2196/64722","DOIUrl":"https://doi.org/10.2196/64722","url":null,"abstract":"<p><strong>Background: </strong>Physician surveys provide indispensable insights into physician experience, but the question of whether responders are representative can limit confidence in conclusions. Ubiquitously collected electronic health record (EHR) use data may improve understanding of the experiences of survey nonresponders in relation to responders, providing clues regarding their well-being.</p><p><strong>Objective: </strong>The aim of the study was to identify EHR use measures corresponding with physician survey responses and examine methods to estimate population-level survey results among physicians.</p><p><strong>Methods: </strong>This longitudinal observational study was conducted from 2019 through 2020 among academic and community primary care physicians. We quantified EHR use using vendor-derived and investigator-derived measures, quantified burnout symptoms using emotional exhaustion and interpersonal disengagement subscales of the Stanford Professional Fulfillment Index, and used an ensemble of response propensity-weighted penalized linear regressions to develop a burnout symptom prediction model.</p><p><strong>Results: </strong>Among 697 surveys from 477 physicians with a response rate of 80.5% (697/866), always responders were similar to nonresponders in gender (204/340, 60% vs 38/66, 58% women; P=.78) and age (median 50, IQR 40-60 years vs median 50, IQR 37.5-57.5 years; P=.88) but with higher clinical workload (median 121.5, IQR 58.5-184 vs median 34.5, IQR 0-115 appointments; P<.001), efficiency (median 5.2, IQR 4.0-6.2 vs median 4.3, IQR 0-5.6; P<.001), and proficiency (median 7.0, IQR 5.4-8.5 vs median 3.1, IQR 0-6.3; P<.001). 
Survey response status prediction showed an out-of-sample area under the receiver operating characteristic curve of 0.88 (95% CI 0.77-0.91). Burnout symptom prediction showed an out-of-sample area under the receiver operating characteristic curve of 0.63 (95% CI 0.57-0.70). The predicted burnout prevalence among nonresponders was 52%, higher than the observed prevalence of 28% among responders, resulting in an estimated population burnout prevalence of 31%.</p><p><strong>Conclusions: </strong>EHR use measures showed limited utility for predicting burnout symptoms but allowed discrimination between responders and nonresponders. These measures may enable qualitative interpretations of the effects of nonresponders and may inform survey response maximization efforts.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e64722"},"PeriodicalIF":3.1,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143190316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine Learning-Based Risk Factor Analysis and Prediction Model Construction for the Occurrence of Chronic Heart Failure: Health Ecologic Study.","authors":"Qian Xu, Xue Cai, Ruicong Yu, Yueyue Zheng, Guanjie Chen, Hui Sun, Tianyun Gao, Cuirong Xu, Jing Sun","doi":"10.2196/64972","DOIUrl":"https://doi.org/10.2196/64972","url":null,"abstract":"<p><strong>Background: </strong>Chronic heart failure (CHF) is a serious threat to human health, with high morbidity and mortality rates, imposing a heavy burden on the health care system and society. With the abundance of medical data and the rapid development of machine learning (ML) technologies, new opportunities are provided for in-depth investigation of the mechanisms of CHF and the construction of predictive models. The introduction of health ecology research methodology enables a comprehensive dissection of CHF risk factors from a wider range of environmental, social, and individual factors. This not only helps to identify high-risk groups at an early stage but also provides a scientific basis for the development of precise prevention and intervention strategies.</p><p><strong>Objective: </strong>This study aims to use ML to construct a predictive model of the risk of occurrence of CHF and analyze the risk of CHF from a health ecology perspective.</p><p><strong>Methods: </strong>This study sourced data from the Jackson Heart Study database. Stringent data preprocessing procedures were implemented, which included meticulous management of missing values and the standardization of data. Principal component analysis and random forest (RF) were used as feature selection techniques. Subsequently, several ML models, namely decision tree, RF, extreme gradient boosting, adaptive boosting (AdaBoost), support vector machine, naive Bayes model, multilayer perceptron, and bootstrap forest, were constructed, and their performance was evaluated. 
The effectiveness of the models was validated through internal validation using a 10-fold cross-validation approach on the training and validation sets. In addition, the performance metrics of each model, including accuracy, precision, sensitivity, F<sub>1</sub>-score, and area under the curve (AUC), were compared. After selecting the best model, we used hyperparameter optimization to construct a better model.</p><p><strong>Results: </strong>RF-selected features (21 in total) had an average root mean square error of 0.30, outperforming principal component analysis. Synthetic Minority Oversampling Technique and Edited Nearest Neighbors showed better accuracy in data balancing. The AdaBoost model was most effective with an AUC of 0.86, accuracy of 75.30%, precision of 0.86, sensitivity of 0.69, and F<sub>1</sub>-score of 0.76. Validation on the training and validation sets through 10-fold cross-validation gave an AUC of 0.97, an accuracy of 91.27%, a precision of 0.94, a sensitivity of 0.92, and an F<sub>1</sub>-score of 0.94. After random search processing, the accuracy and AUC of AdaBoost improved. Its accuracy was 77.68% and its AUC was 0.86.</p><p><strong>Conclusions: </strong>This study offered insights into CHF risk prediction. 
Future research should focus on prospective studies, diverse data, advanced techniques, longitudinal studies, and exploring factor interactions for better CHF prevention and managemen","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e64972"},"PeriodicalIF":3.1,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143071266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
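The abstract above reports precision, sensitivity, and F1-score side by side. As a reminder of how these relate, the sketch below computes the F1-score as the harmonic mean of precision and sensitivity (recall). It is an illustrative sketch only: the function name `f1_score` is defined here and is not taken from the study, and the inputs are the rounded values from the abstract, so the result can differ from the reported figure in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1-score: harmonic mean of precision and recall (sensitivity)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported for the AdaBoost model in the abstract.
reported_precision = 0.86
reported_sensitivity = 0.69

f1 = f1_score(reported_precision, reported_sensitivity)
print(f"F1 from rounded inputs: {f1:.4f}")
```

With the rounded inputs this gives a value close to the reported F1-score of 0.76; the small gap is within what rounding of precision and sensitivity to 2 decimals can produce.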
{"title":"Smart Contracts and Shared Platforms in Sustainable Health Care: Systematic Review.","authors":"Carlos Antonio Marino, Claudia Diaz Paz","doi":"10.2196/58575","DOIUrl":"https://doi.org/10.2196/58575","url":null,"abstract":"<p><strong>Background: </strong>The benefits of smart contracts (SCs) for sustainable health care are a relatively recent topic that has gathered attention given its relationship with trust and the advantages of decentralization, immutability, and traceability introduced in health care. Nevertheless, more studies need to explore the role of SCs in this sector based on the frameworks propounded in the literature that reflect business logic that has been customized, automatized, and prioritized, as well as system trust. This study addressed this lacuna.</p><p><strong>Objective: </strong>This study aimed to provide a comprehensive understanding of SCs in health care based on reviewing the frameworks propounded in the literature.</p><p><strong>Methods: </strong>A structured literature review was performed based on the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) principles. One database, Web of Science (WoS), was selected to avoid bias generated by database differences and data wrangling. A quantitative assessment of the studies based on machine learning and data reduction methodologies was complemented with a qualitative, in-depth, detailed review of the frameworks propounded in the literature.</p><p><strong>Results: </strong>A total of 70 studies, which constituted 18.7% (70/374) of the studies on this subject, met the selection criteria and were analyzed. A multiple correspondence analysis (with 74.44% of the inertia) produced 3 factors describing the advances in the topic. Two of them referred to the leading roles of SCs: (1) health care process enhancement and (2) assurance of patients' privacy protection. The first role included 6 themes, and the second one included 3 themes. 
The third factor encompassed the technical features that improve system efficiency. The in-depth review of these 3 factors and the identification of stakeholders allowed us to characterize the system trust in health care SCs. We assessed the risk of coverage bias, and good percentages of overlap were obtained: 66% (49/74) of PubMed articles were also in WoS, and 88.3% (181/205) of WoS articles also appeared in Scopus.</p><p><strong>Conclusions: </strong>This comprehensive review allows us to understand the relevance of SCs and the potentiality of their use in patient-centric health care that considers more than technical aspects. It also provides insights for further research based on specific stakeholders, locations, and behaviors.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e58575"},"PeriodicalIF":3.1,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143071269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Faisal Ghaffar, Nadine M Furtado, Imad Ali, Catherine Burns
{"title":"Diagnostic Decision-Making Variability Between Novice and Expert Optometrists for Glaucoma: Comparative Analysis to Inform AI System Design.","authors":"Faisal Ghaffar, Nadine M Furtado, Imad Ali, Catherine Burns","doi":"10.2196/63109","DOIUrl":"10.2196/63109","url":null,"abstract":"<p><strong>Background: </strong>While expert optometrists tend to rely on a deep understanding of the disease and intuitive pattern recognition, those with less experience may depend more on extensive data, comparisons, and external guidance. Understanding these variations is important for developing artificial intelligence (AI) systems that can effectively support optometrists with varying degrees of experience and minimize decision inconsistencies.</p><p><strong>Objective: </strong>The main objective of this study is to identify and analyze the variations in diagnostic decision-making approaches between novice and expert optometrists. By understanding these variations, we aim to provide guidelines for the development of AI systems that can support optometrists with varying levels of expertise. These guidelines will assist in developing AI systems for glaucoma diagnosis, ultimately enhancing the diagnostic accuracy of optometrists and minimizing inconsistencies in their decisions.</p><p><strong>Methods: </strong>We conducted in-depth interviews with 14 optometrists using within-subject design, including both novices and experts, focusing on their approaches to glaucoma diagnosis. The responses were coded and analyzed using a mixed method approach incorporating both qualitative and quantitative analysis. Statistical tests such as Mann-Whitney U and chi-square tests were used to find significance in intergroup variations. 
These findings were further supported by themes extracted through qualitative analysis, which helped to identify decision-making patterns and understand variations in their approaches.</p><p><strong>Results: </strong>Both groups showed lower concordance rates with clinical diagnosis, with experts showing almost double (7/35, 20%) concordance rates with limited data in comparison to novices (7/69, 10%), highlighting the impact of experience and data availability on clinical judgment; this rate increased to nearly 40% for both groups (experts: 5/12, 42% and novices: 8/21, 42%) when they had access to complete historical data of the patient. We also found statistically significant intergroup differences between the first visits and subsequent visits with a P value of less than .05 on the Mann-Whitney U test in many assessments. Furthermore, approaches to the exam assessment and decision differed significantly: experts emphasized comprehensive risk assessments and progression analysis, demonstrating cognitive efficiency and intuitive decision-making, while novices relied more on structured, analytical methods and external references. 
Additionally, significant variations in patient follow-up times were observed, with a P value of <.001 on the chi-square test, showing a stronger influence of experience on follow-up time decisions.</p><p><strong>Conclusions: </strong>The study highlights significant variations in the decision-making process of novice and expert optometrists in glaucoma diagnosis, with experience playing a key role in ac","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e63109"},"PeriodicalIF":3.1,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143061592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Digital Representation of Patients as Medical Digital Twins: Data-Centric Viewpoint.","authors":"Stanislas Demuth, Jérôme De Sèze, Gilles Edan, Tjalf Ziemssen, Françoise Simon, Pierre-Antoine Gourraud","doi":"10.2196/53542","DOIUrl":"10.2196/53542","url":null,"abstract":"<p><strong>Unlabelled: </strong>Precision medicine involves a paradigm shift toward personalized data-driven clinical decisions. The concept of a medical \"digital twin\" has recently become popular to designate digital representations of patients as a support for a wide range of data science applications. However, the concept is ambiguous when it comes to practical implementations. Here, we propose a medical digital twin framework with a data-centric approach. We argue that a single digital representation of patients cannot support all the data uses of digital twins for technical and regulatory reasons. Instead, we propose a data architecture leveraging three main families of digital representations: (1) multimodal dashboards integrating various raw health records at points of care to assist with perception and documentation, (2) virtual patients, which provide nonsensitive data for collective secondary uses, and (3) individual predictions that support clinical decisions. For a given patient, multiple digital representations may be generated according to the different clinical pathways the patient goes through, each tailored to balance the trade-offs associated with the respective intended uses. 
Therefore, our proposed framework conceives the medical digital twin as a data architecture leveraging several digital representations of patients along clinical pathways.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e53542"},"PeriodicalIF":3.1,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143069851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
María de la Paz Scribano Parada, Fátima González Palau, Sonia Valladares Rodríguez, Mariano Rincon, Maria José Rico Barroeta, Marta García Rodriguez, Yolanda Bueno Aguado, Ana Herrero Blanco, Estela Díaz-López, Margarita Bachiller Mayoral, Raquel Losada Durán
{"title":"Preclinical Cognitive Markers of Alzheimer Disease and Early Diagnosis Using Virtual Reality and Artificial Intelligence: Literature Review.","authors":"María de la Paz Scribano Parada, Fátima González Palau, Sonia Valladares Rodríguez, Mariano Rincon, Maria José Rico Barroeta, Marta García Rodriguez, Yolanda Bueno Aguado, Ana Herrero Blanco, Estela Díaz-López, Margarita Bachiller Mayoral, Raquel Losada Durán","doi":"10.2196/62914","DOIUrl":"10.2196/62914","url":null,"abstract":"<p><strong>Background: </strong>This review explores the potential of virtual reality (VR) and artificial intelligence (AI) to identify preclinical cognitive markers of Alzheimer disease (AD). By synthesizing recent studies, it aims to advance early diagnostic methods to detect AD before significant symptoms occur.</p><p><strong>Objective: </strong>Research emphasizes the significance of early detection in AD during the preclinical phase, which does not involve cognitive impairment but nevertheless requires reliable biomarkers. Current biomarkers face challenges, prompting the exploration of cognitive behavior indicators beyond episodic memory.</p><p><strong>Methods: </strong>Using PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, we searched Scopus, PubMed, and Google Scholar for studies on neuropsychiatric disorders utilizing conversational data.</p><p><strong>Results: </strong>Following an analysis of 38 selected articles, we highlight verbal episodic memory as a sensitive preclinical AD marker, with supporting evidence from neuroimaging and genetic profiling. Executive functions precede memory decline, while processing speed is a significant correlate. 
The potential of VR remains underexplored, and AI algorithms offer a multidimensional approach to early neurocognitive disorder diagnosis.</p><p><strong>Conclusions: </strong>Emerging technologies like VR and AI show promise for preclinical diagnostics, but thorough validation and regulation for clinical safety and efficacy are necessary. Continued technological advancements are expected to enhance early detection and management of AD.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e62914"},"PeriodicalIF":3.1,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143069865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Theresa Willem, Alessandro Wollek, Theodor Cheslerean-Boghiu, Martha Kenney, Alena Buyx
{"title":"The Social Construction of Categorical Data: Mixed Methods Approach to Assessing Data Features in Publicly Available Datasets.","authors":"Theresa Willem, Alessandro Wollek, Theodor Cheslerean-Boghiu, Martha Kenney, Alena Buyx","doi":"10.2196/59452","DOIUrl":"https://doi.org/10.2196/59452","url":null,"abstract":"<p><strong>Background: </strong>In data-sparse areas such as health care, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models' outputs. As a standard, categorical data, such as patients' gender, socioeconomic status, or skin color, are used to train models in fusion with other data types, such as medical images and text-based medical information. However, the effects of including categorical data features for model training in such data-scarce areas are underexamined, particularly regarding models intended to serve individuals equitably in a diverse population.</p><p><strong>Objective: </strong>This study aimed to explore categorical data's effects on machine learning model outputs, rooted the effects in the data collection and dataset publication processes, and proposed a mixed methods approach to examining datasets' data categories before using them for machine learning training.</p><p><strong>Methods: </strong>Against the theoretical background of the social construction of categories, we suggest a mixed methods approach to assess categorical data's utility for machine learning model training. As an example, we applied our approach to a Brazilian dermatological dataset (Dermatological and Surgical Assistance Program at the Federal University of Espírito Santo [PAD-UFES] 20). We first present an exploratory, quantitative study that assesses the effects when including or excluding each of the unique categorical data features of the PAD-UFES 20 dataset for training a transformer-based model using a data fusion algorithm. 
We then pair our quantitative analysis with a qualitative examination of the data categories based on interviews with the dataset authors.</p><p><strong>Results: </strong>Our quantitative study suggests scattered effects of including categorical data for machine learning model training across predictive classes. Our qualitative analysis gives insights into how the categorical data were collected and why they were published, explaining some of the quantitative effects that we observed. Our findings highlight the social constructedness of categorical data in publicly available datasets, meaning that the data in a category heavily depend on both how these categories are defined by the dataset creators and the sociomedico context in which the data are collected. This reveals relevant limitations of using publicly available datasets in contexts different from those of the collection of their data.</p><p><strong>Conclusions: </strong>We caution against using data features of publicly available datasets without reflection on the social construction and context dependency of their categorical data features, particularly in data-sparse areas. We conclude that social scientific, context-dependent analysis of available data features using both quantitative and qualitative methods is helpful in judging the utility of categorical data for the population for wh","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e59452"},"PeriodicalIF":3.1,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143061594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Use of the FHTHWA Index as a Novel Approach for Predicting the Incidence of Diabetes in a Japanese Population Without Diabetes: Data Analysis Study.","authors":"Jiao Wang, Jianrong Chen, Ying Liu, Jixiong Xu","doi":"10.2196/64992","DOIUrl":"10.2196/64992","url":null,"abstract":"<p><strong>Background: </strong>Many tools have been developed to predict the risk of diabetes in a population without diabetes; however, these tools have shortcomings that include the omission of race, inclusion of variables that are not readily available to patients, and low sensitivity or specificity.</p><p><strong>Objective: </strong>We aimed to develop and validate an easy, systematic index for predicting diabetes risk in the Asian population.</p><p><strong>Methods: </strong>We collected the data from the NAGALA (NAfld [nonalcoholic fatty liver disease] in the Gifu Area, Longitudinal Analysis) database. The least absolute shrinkage and selection operator model was used to select potentially relevant features. Multiple Cox proportional hazards analysis was used to develop a model based on the training set.</p><p><strong>Results: </strong>The final study population of 15464 participants had a mean age of 42 (range 18-79) years; 54.5% (8430) were men. The mean follow-up duration was 6.05 (SD 3.78) years. A total of 373 (2.41%) participants showed progression to diabetes during the follow-up period. Based on the training set, we then established a novel index (the FHTHWA index), comprising 6 parameters, to evaluate the incidence of diabetes in a population without diabetes. After multivariable adjustment, individuals in tertile 3 had a significantly higher rate of diabetes compared with those in tertile 1 (hazard ratio 32.141, 95% CI 11.545-89.476). 
Time receiver operating characteristic curve analyses showed that the FHTHWA index had high accuracy, with the area under the curve value being around 0.9 during the more than 12 years of follow-up.</p><p><strong>Conclusions: </strong>This research successfully developed a diabetes risk assessment index tailored for the Japanese population by utilizing an extensive dataset and a wide range of indices. By categorizing the diabetes risk levels among Japanese individuals, this study offers a novel predictive tool for identifying potential patients, while also delivering valuable insights into diabetes prevention strategies for the healthy Japanese populace.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e64992"},"PeriodicalIF":3.1,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143069867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Doris Yang, Doudou Zhou, Steven Cai, Ziming Gan, Michael Pencina, Paul Avillach, Tianxi Cai, Chuan Hong
{"title":"Robust Automated Harmonization of Heterogeneous Data Through Ensemble Machine Learning: Algorithm Development and Validation Study.","authors":"Doris Yang, Doudou Zhou, Steven Cai, Ziming Gan, Michael Pencina, Paul Avillach, Tianxi Cai, Chuan Hong","doi":"10.2196/54133","DOIUrl":"10.2196/54133","url":null,"abstract":"<p><strong>Background: </strong>Cohort studies contain rich clinical data across large and diverse patient populations and are a common source of observational data for clinical research. Because large scale cohort studies are both time and resource intensive, one alternative is to harmonize data from existing cohorts through multicohort studies. However, given differences in variable encoding, accurate variable harmonization is difficult.</p><p><strong>Objective: </strong>We propose SONAR (Semantic and Distribution-Based Harmonization) as a method for harmonizing variables across cohort studies to facilitate multicohort studies.</p><p><strong>Methods: </strong>SONAR used semantic learning from variable descriptions and distribution learning from study participant data. Our method learned an embedding vector for each variable and used pairwise cosine similarity to score the similarity between variables. This approach was built off 3 National Institutes of Health cohorts, including the Cardiovascular Health Study, the Multi-Ethnic Study of Atherosclerosis, and the Women's Health Initiative. We also used gold standard labels to further refine the embeddings in a supervised manner.</p><p><strong>Results: </strong>The method was evaluated using manually curated gold standard labels from the 3 National Institutes of Health cohorts. We evaluated both the intracohort and intercohort variable harmonization performance. The supervised SONAR method outperformed existing benchmark methods for almost all intracohort and intercohort comparisons using area under the curve and top-k accuracy metrics. 
Notably, SONAR was able to significantly improve harmonization of concepts that were difficult for existing semantic methods to harmonize.</p><p><strong>Conclusions: </strong>SONAR achieves accurate variable harmonization within and between cohort studies by harnessing the complementary strengths of semantic learning and variable distribution learning.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e54133"},"PeriodicalIF":3.1,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11778729/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143026011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
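The abstract above describes scoring variable pairs by the pairwise cosine similarity of their embedding vectors and evaluating with top-k accuracy. The sketch below illustrates only that scoring step. It is a minimal sketch under stated assumptions, not the SONAR implementation: the embedding values, the variable names (`sbp_visit1`, `systolic_bp`, `age_years`), and the helper `top_k_matches` are hypothetical and were invented here for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def top_k_matches(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank candidate variables by cosine similarity to the query
    embedding and return the k best (name, score) pairs."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings, for illustration only.
query_embedding = [0.9, 0.1, 0.0]
candidate_embeddings = {
    "sbp_visit1": [1.0, 0.0, 0.0],
    "systolic_bp": [0.8, 0.2, 0.1],
    "age_years": [0.0, 1.0, 0.0],
}

for name, score in top_k_matches(query_embedding, candidate_embeddings, k=2):
    print(f"{name}: {score:.3f}")
```

Under a top-k accuracy metric, a query variable would count as correctly harmonized when its gold standard match appears among the k names returned by such a ranking.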