New Developments in Missing Data Analysis
L. A. van der Ark, Jeroen K. Vermunt
Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 6(1), 1-2. Published 2010-01-20.
DOI: https://doi.org/10.1027/1614-2241/A000001
Citations: 10
Abstract
In this special issue you will find four papers on handling missing data. All papers were presented at the 2007 Fall Meeting of the Social Science Division of the Dutch Statistical Society (VVS-OR) in Tilburg, The Netherlands. Together, these four papers give an excellent overview of the state of the art in missing data analysis. Today, in virtually all fields of the social sciences, researchers are required to deal with missing data in a sophisticated way. Ignoring the problem, for example by simply removing all observations that contain missing data or by thoughtlessly applying software that makes the problem go away, may lead to seriously biased statistical results and wrong conclusions, and is no longer an option. Instead, the researcher must consider the reasons why some of the data are missing and act accordingly. Given that in the social sciences most data are obtained from respondents who responded to tests, questionnaires, surveys, or stimuli in an experimental setting, the first option that comes to mind is to approach the respondents with missing scores again, ask them the reason for their nonresponse, and ask them to respond after all. Unfortunately, this is usually not a realistic option, and the researcher must rely on statistical solutions. One way of dealing with missing data is to incorporate the mechanism that caused the missingness into the statistical model of the data. In the context of educational measurement, Goegebeur, De Boeck, and Molenberghs (2010) discuss test speededness, the phenomenon that respondents do not respond to certain items in a test or examination because of a lack of time. They clearly explain how speededness can be incorporated into the statistical model. Using this model-based approach, they show how to identify respondents whose scores were affected by speededness. An advantage of this approach is that it allows the researcher to deal with data that are not missing at random.
In some situations, it will not be possible to translate the researcher’s theories on the missingness mechanism into a statistical model, because such theories are too complex or not available. Probably the best-known strategy for dealing with missing data is to assume that the missing scores are missing at random and to conduct (multiple) imputation: replacing the missing scores in the data by plausible values. Two papers discuss imputation methods. First, Van Ginkel, Sijtsma, Van der Ark, and Vermunt (2010) investigated the occurrence of missing data and current practices of handling nonresponse in test and questionnaire data in personality psychology. They found that in the large majority of published research reporting missing data, either the handling of missing data was not discussed, cases with missing values were deleted, or ad hoc procedures were used. To improve the use of appropriate methods, they proposed using Method Two-Way for handling missing data in test and questionnaire data. Method Two-Way is a multiple imputation method that is easy to understand and to use. Simulation studies showed that, with respect to statistics often used in the analysis of test and questionnaire data, Method Two-Way yields results comparable to those obtained with technically more advanced methods. In the second paper on multiple imputation, Van Buuren (2010) discusses Fully Conditional Specification for imputing scores for missing values. Fully Conditional Specification can be regarded as a technically more advanced method, which is available in software packages such as R and SPSS. In a simulation study, Van Buuren (2010) shows that Fully Conditional Specification outperforms Method Two-Way in the computation of Cronbach’s alpha. Because the papers by Van Ginkel et al. (2010) and Van Buuren (2010) reach different conclusions with respect to Method Two-Way, we believe some editorial comments are in order to explain the different results.
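To make the comparison concrete, the following is a minimal Python sketch of a two-way style imputation followed by Cronbach's alpha. It is not taken from either paper. It assumes the commonly described form of two-way imputation, in which a missing score is replaced by person mean + item mean - overall mean (all computed from observed scores), with an optional normally distributed error term. The function names, the toy data, the residual degrees of freedom, and the simple averaging of alpha over imputed data sets are all illustrative assumptions.

```python
import random
from statistics import mean, variance


def two_way_impute(data, rng=None):
    """Fill None cells with person mean + item mean - overall mean.

    Assumes every row and every column has at least one observed score.
    If `rng` is given, a normal error with the residual standard deviation
    of the observed cells is added to each imputed value (illustrative choice).
    """
    n_items = len(data[0])
    observed = [(i, j, x) for i, row in enumerate(data)
                for j, x in enumerate(row) if x is not None]
    overall = mean(x for _, _, x in observed)
    person = [mean(x for x in row if x is not None) for row in data]
    item = [mean(row[j] for row in data if row[j] is not None)
            for j in range(n_items)]
    # Spread of the observed scores around their two-way prediction.
    resid = [x - (person[i] + item[j] - overall) for i, j, x in observed]
    sd = (sum(e * e for e in resid) / (len(resid) - 1)) ** 0.5
    filled = []
    for i, row in enumerate(data):
        new_row = []
        for j, x in enumerate(row):
            if x is None:
                x = person[i] + item[j] - overall
                if rng is not None:
                    x += rng.gauss(0.0, sd)
            new_row.append(x)
        filled.append(new_row)
    return filled


def cronbach_alpha(data):
    """Cronbach's alpha for a complete persons-by-items score matrix."""
    k = len(data[0])
    item_vars = [variance([row[j] for row in data]) for j in range(k)]
    total_var = variance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pooled_alpha(data, m=5, seed=0):
    """Average alpha over m stochastically imputed data sets (a simplification)."""
    rng = random.Random(seed)
    return mean(cronbach_alpha(two_way_impute(data, rng)) for _ in range(m))


# Hypothetical toy data: 5 respondents, 4 items, None marks a missing score.
scores = [
    [4, 5, None, 4],
    [2, None, 3, 2],
    [5, 5, 4, None],
    [1, 2, 2, 1],
    [3, 3, None, 3],
]

if __name__ == "__main__":
    print(cronbach_alpha(two_way_impute(scores)))  # deterministic fill
    print(pooled_alpha(scores, m=5, seed=1))       # five stochastic fills
```

Calling `two_way_impute(scores)` without a random generator gives a single deterministic fill, whereas `pooled_alpha` repeats the stochastic fill several times, which is closer in spirit to multiple imputation. A sketch like this can be rerun with more cells set to `None` to see how the estimate behaves as the amount of missingness grows.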
We believe that both papers are of high quality but have a different focus. First, the percentages of missing data differ between the study by Van Buuren (2010) and the studies referred to by Van Ginkel et al. (2010). On the one hand, Van Buuren (2010) compared Method Two-Way and Fully Conditional Specification using large percentages of missingness (44–78%), showing that the technically more advanced method outperforms the simple method under extreme circumstances. On the other hand, Van Ginkel et al. (2010) showed that in practice the percentage of missingness is much lower (on average, 9% of the response vectors had at least one missing observation), and referred to studies in which the percentage of missingness ranged from 1% to 20%, showing that simple and more involved methods perform similarly under typical circumstances. Moreover, with high percentages of missingness a more sophisticated Bayesian version of Method Two-Way (Van Ginkel, Van der Ark,