Jiaxing Qiu, Douglas E Lake, Pavel Chernyavskiy, Teague R Henry
Title: Fast leave-one-cluster-out cross-validation using clustered network information criterion
Journal: Statistical Methods in Medical Research
Published: 2025-06-19
DOI: 10.1177/09622802251345486 (https://doi.org/10.1177/09622802251345486)
Citations: 0
Abstract
For prediction models developed on clustered data that do not account for cluster heterogeneity in model parameterization, it is crucial to use cluster-based validation to assess model generalizability on unseen clusters. This article introduces a clustered estimator of the network information criterion to approximate leave-one-cluster-out deviance for standard prediction models with twice-differentiable log-likelihood functions. The clustered network information criterion serves as a fast alternative to cluster-based cross-validation. Stone proved that the Akaike information criterion is asymptotically equivalent to leave-one-observation-out cross-validation for true parametric models with independent and identically distributed observations. Ripley noted that the network information criterion, derived from Stone's proof, is a better approximation when the model is misspecified. For clustered data, we derived the clustered network information criterion by substituting the Fisher information matrix in the network information criterion with a clustering-adjusted estimator. The clustered network information criterion imposes a greater penalty when the data exhibit stronger clustering, which allows it to better guard against over-parameterization. In a simulation study and an empirical example, we used standard regression to develop prediction models for clustered data with Gaussian or binomial responses. Compared to the commonly used Akaike information criterion and Bayesian information criterion for standard regression, the clustered network information criterion provides a much more accurate approximation to leave-one-cluster-out deviance and results in more accurate model size and variable selection, as determined by cluster-based cross-validation, especially when the data exhibit strong clustering.
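The abstract does not state the formula, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' implementation. It assumes the usual form of the network information criterion, deviance + 2 tr(J⁻¹K), where J is the observed information and K is the sum of outer products of score vectors, and it assumes the clustering adjustment amounts to summing score vectors within each cluster before taking outer products. The function names `fit_logistic` and `nic_logistic` and the simulated data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=50):
    """Standard (non-clustered) logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                          # score of the log-likelihood
        hess = X.T @ (X * (p * (1 - p))[:, None])     # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def nic_logistic(X, y, beta, clusters=None):
    """Return (criterion, penalty) with penalty = tr(J^{-1} K).

    clusters=None : K sums outer products of per-observation scores.
    clusters given: scores are first summed within each cluster
                    (ASSUMED form of the clustering adjustment).
    """
    p = sigmoid(X @ beta)
    deviance = -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    scores = X * (y - p)[:, None]                     # one score vector per row
    J = X.T @ (X * (p * (1 - p))[:, None])
    if clusters is None:
        K = scores.T @ scores
    else:
        _, idx = np.unique(clusters, return_inverse=True)
        S = np.zeros((idx.max() + 1, X.shape[1]))
        np.add.at(S, idx, scores)                     # cluster-level score sums
        K = S.T @ S
    penalty = np.trace(np.linalg.solve(J, K))
    return deviance + 2.0 * penalty, penalty

# Hypothetical clustered data: 20 clusters of 20 with random intercepts.
rng = np.random.default_rng(0)
n_clusters, m = 20, 20
n = n_clusters * m
clusters = np.repeat(np.arange(n_clusters), m)
u = rng.normal(0.0, 1.5, n_clusters)                  # cluster heterogeneity
z = rng.normal(size=n_clusters)                       # cluster-level covariate
x = rng.normal(size=n)                                # observation-level covariate
X = np.column_stack([np.ones(n), z[clusters], x])
y = (rng.random(n) < sigmoid(0.5 * x + u[clusters])).astype(float)

beta = fit_logistic(X, y)
nic_iid, pen_iid = nic_logistic(X, y, beta)
nic_clu, pen_clu = nic_logistic(X, y, beta, clusters)
nic_one, pen_one = nic_logistic(X, y, beta, np.arange(n))  # singleton clusters

print("penalty, observation-level K:", pen_iid)
print("penalty, cluster-level K:   ", pen_clu)
```

When every observation is its own cluster, the cluster-level K reduces to the observation-level K, so the two penalties coincide. Comparing `pen_clu` with `pen_iid` on data with strong within-cluster correlation is one way to see how the penalty responds to clustering.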
About the journal:
Statistical Methods in Medical Research is a peer-reviewed scholarly journal and is the leading vehicle for articles in all the main areas of medical statistics and an essential reference for all medical statisticians. This unique journal is devoted solely to statistics and medicine and aims to keep professionals abreast of the many powerful statistical techniques now available to the medical profession. This journal is a member of the Committee on Publication Ethics (COPE).