Choosing the number of factors in factor analysis with incomplete data via a novel hierarchical Bayesian information criterion

IF 1.4 | Zone 4, Computer Science | JCR Q2, STATISTICS & PROBABILITY
Jianhua Zhao, Changchun Shang, Shulan Li, Ling Xin, Philip L. H. Yu
Journal: Advances in Data Analysis and Classification. Published: 2024-03-07. DOI: 10.1007/s11634-024-00582-w (https://doi.org/10.1007/s11634-024-00582-w)

Abstract

The Bayesian information criterion (BIC), defined as the observed data log likelihood minus a penalty term based on the sample size N, is a popular model selection criterion for factor analysis with complete data. This definition has also been suggested for incomplete data. However, the penalty term based on the ‘complete’ sample size N is the same whether the data are complete or incomplete. For incomplete data, there are often only \(N_i<N\) observations for variable i, which means that using the ‘complete’ sample size N implausibly ignores the amount of missing information inherent in incomplete data. Given this observation, a novel hierarchical BIC (HBIC) criterion, denoted by HBICinc, is proposed for factor analysis with incomplete data. The novelty is that HBICinc uses only the actual amounts of observed information, namely the \(N_i\)’s, in the penalty term. Theoretically, it is shown that HBICinc is a large sample approximation of the variational Bayesian (VB) lower bound, and that BIC is a further approximation of HBICinc, which means that HBICinc shares the theoretical consistency of BIC. Experiments on synthetic and real data sets are conducted to assess the finite sample performance of HBICinc, BIC, and related criteria under various missing rates. The results show that HBICinc and BIC perform similarly when the missing rate is small, but HBICinc is more accurate when the missing rate is not small.
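
The contrast between the two penalty terms can be illustrated with a small sketch. The abstract does not state the exact form of HBICinc, so everything below is an assumption made for illustration only: the function names are hypothetical, each variable is assumed to carry q + 2 parameters (q loadings, one mean, one noise variance), and the hierarchical penalty is assumed to charge the parameters of variable i with log N_i instead of log N. It should not be read as the authors' formula.

```python
import math

def observed_counts(data):
    """Return N_i, the number of observed (non-None) entries for each variable i."""
    n_vars = len(data[0])
    return [sum(row[i] is not None for row in data) for i in range(n_vars)]

def bic_penalty(n_rows, params_per_variable):
    """BIC-style penalty: every parameter is charged log N (the 'complete' sample size)."""
    return 0.5 * sum(params_per_variable) * math.log(n_rows)

def per_variable_penalty(counts, params_per_variable):
    """Assumed hierarchical penalty: the parameters of variable i are charged log N_i."""
    return 0.5 * sum(d * math.log(n) for d, n in zip(params_per_variable, counts))

# Toy data set: 4 rows, 3 variables; None marks a missing entry.
data = [
    [1.0, 2.0, None],
    [0.5, None, 2.8],
    [1.5, 2.5, 3.0],
    [0.7, 1.9, None],
]

q = 1                                  # assumed number of factors
counts = observed_counts(data)         # N_i for each variable
d_i = [q + 2] * len(counts)            # assumed parameter count per variable

print("N_i:", counts)
print("penalty with N:  ", bic_penalty(len(data), d_i))
print("penalty with N_i:", per_variable_penalty(counts, d_i))
```

Because every N_i is at most N, the per-variable penalty in this sketch can never exceed the penalty based on N, and the gap widens as more entries go missing. That is consistent with the abstract's report that the two criteria behave similarly at small missing rates and differ at larger ones.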

Source journal
CiteScore: 3.40
Self-citation rate: 6.20%
Articles published: 45
Review time: >12 weeks
Journal description: The international journal Advances in Data Analysis and Classification (ADAC) is designed as a forum for high standard publications on research and applications concerning the extraction of knowable aspects from many types of data. It publishes articles on such topics as structural, quantitative, or statistical approaches for the analysis of data; advances in classification, clustering, and pattern recognition methods; strategies for modeling complex data and mining large data sets; methods for the extraction of knowledge from data; and applications of advanced methods in specific domains of practice. Articles illustrate how new domain-specific knowledge can be made available from data by skillful use of data analysis methods. The journal also publishes survey papers that outline and illuminate the basic ideas and techniques of special approaches.