{"title":"Linear Factor Analytic Thurstonian Forced-Choice Models: Current Status and Issues","authors":"Markus T. Jansen, Ralf Schulze","doi":"10.1177/00131644231205011","DOIUrl":"https://doi.org/10.1177/00131644231205011","url":null,"abstract":"Thurstonian forced-choice modeling is considered to be a powerful new tool to estimate item and person parameters while simultaneously testing the model fit. This assessment approach is associated with the aim of reducing faking and other response tendencies that plague traditional self-report trait assessments. As a result of major recent methodological developments, the estimation of normative trait scores has become possible in addition to the computation of only ipsative scores. This opened up the important possibility of comparisons between individuals with forced-choice assessment procedures. With item response theory (IRT) methods, a multidimensional forced-choice (MFC) format has also been proposed to estimate individual scores. Customarily, items to assess different traits are presented in blocks, often triplets, in applications of the MFC, which is an efficient form of item presentation but also a simplification of the original models. The present study provides a comprehensive review of the present status of Thurstonian forced-choice models and their variants. Critical features of the current models, especially the block models, are identified and discussed. It is concluded that MFC modeling with item blocks is highly problematic and yields biased results. In particular, the often-recommended presentation of blocks with items that are keyed in different directions of a trait proves to be counterproductive considering the goal to reduce response tendencies. The consequences and implications of the highlighted issues are further discussed.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136069087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thinking About Sum Scores Yet Again, Maybe the Last Time, We Don’t Know, Oh No . . .: A Comment on","authors":"Keith F. Widaman, William Revelle","doi":"10.1177/00131644231205310","DOIUrl":"https://doi.org/10.1177/00131644231205310","url":null,"abstract":"The relative advantages and disadvantages of sum scores and estimated factor scores are issues of concern for substantive research in psychology. Recently, while championing estimated factor scores over sum scores, McNeish offered a trenchant rejoinder to an article by Widaman and Revelle, which had critiqued an earlier paper by McNeish and Wolf. In the recent contribution, McNeish misrepresented a number of claims by Widaman and Revelle, rendering moot his criticisms of Widaman and Revelle. Notably, McNeish chose to avoid confronting a key strength of sum scores stressed by Widaman and Revelle—the greater comparability of results across studies if sum scores are used. Instead, McNeish pivoted to present a host of simulation studies to identify relative strengths of estimated factor scores. Here, we review our prior claims and, in the process, deflect purported criticisms by McNeish. We discuss briefly issues related to simulated data and empirical data that provide evidence of strengths of each type of score. In doing so, we identified a second strength of sum scores: superior cross-validation of results across independent samples of empirical data, at least for samples of moderate size. We close with consideration of four general issues concerning sum scores and estimated factor scores that highlight the contrasts between positions offered by McNeish and by us, issues of importance when pursuing applied research in our field.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135859120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparison of Person-Fit Indices to Detect Social Desirability Bias.","authors":"Sanaz Nazari, Walter L Leite, A Corinne Huggins-Manley","doi":"10.1177/00131644221129577","DOIUrl":"10.1177/00131644221129577","url":null,"abstract":"<p><p>Social desirability bias (SDB) has been a major concern in educational and psychological assessments when measuring latent variables because it has the potential to introduce measurement error and bias in assessments. Person-fit indices can detect bias in the form of misfitted response vectors. The objective of this study was to compare the performance of 14 person-fit indices to identify SDB in simulated responses. The area under the curve (AUC) of receiver operating characteristic (ROC) curve analysis was computed to evaluate the predictive power of these statistics. The findings showed that the agreement statistic <math><mrow><mo>(</mo><mi>A</mi><mo>)</mo></mrow></math> outperformed all other person-fit indices, while the disagreement statistic <math><mrow><mo>(</mo><mi>D</mi><mo>)</mo></mrow></math>, dependability statistic <math><mrow><mo>(</mo><mi>E</mi><mo>)</mo></mrow></math>, and the number of Guttman errors <math><mrow><mo>(</mo><mi>G</mi><mo>)</mo></mrow></math> also demonstrated high AUCs to detect SDB. Recommendations for practitioners to use these fit indices are provided.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"907-928"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470160/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10208755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Rating Scale Malfunctioning With the Partial Credit Model and Generalized Partial Credit Model.","authors":"Stefanie A Wind","doi":"10.1177/00131644221116292","DOIUrl":"10.1177/00131644221116292","url":null,"abstract":"<p><p>Rating scale analysis techniques provide researchers with practical tools for examining the degree to which ordinal rating scales (e.g., Likert-type scales or performance assessment rating scales) function in psychometrically useful ways. When rating scales function as expected, researchers can interpret ratings in the intended direction (i.e., lower ratings mean \"less\" of a construct than higher ratings), distinguish between categories in the scale (i.e., each category reflects a unique level of the construct), and compare ratings across elements of the measurement instrument, such as individual items. Although researchers have used these techniques in a variety of contexts, studies are limited that systematically explore their sensitivity to problematic rating scale characteristics (i.e., \"rating scale malfunctioning\"). I used a real data analysis and a simulation study to systematically explore the sensitivity of rating scale analysis techniques based on two popular polytomous item response theory (IRT) models: the partial credit model (PCM) and the generalized partial credit model (GPCM). Overall, results indicated that both models provide valuable information about rating scale threshold ordering and precision that can help researchers understand how their rating scales are functioning and identify areas for further investigation or revision. However, there were some differences between models in their sensitivity to rating scale malfunctioning in certain conditions. Implications for research and practice are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"953-983"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470161/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10506045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Equidistant Response Options on Likert-Type Instruments: Testing the Interval Scaling Assumption Using Mplus.","authors":"Georgios Sideridis, Ioannis Tsaousis, Hanan Ghamdi","doi":"10.1177/00131644221130482","DOIUrl":"10.1177/00131644221130482","url":null,"abstract":"<p><p>The purpose of the present study was to provide the means to evaluate the \"interval-scaling\" assumption that governs the use of parametric statistics and continuous data estimators in self-report instruments that utilize Likert-type scaling. Using simulated and real data, the methodology to test for this important assumption is evaluated using the popular software Mplus 8.8. Evidence on meeting the assumption is provided using the Wald test and the equidistant index. It is suggested that routine evaluations of self-report instruments engage the present methodology so that the most appropriate estimator will be implemented when testing the construct validity of self-report instruments.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"885-906"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470166/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10357822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Position of Correct Option and Distractors Impacts Responses to Multiple-Choice Items: Evidence From a National Test.","authors":"Séverin Lions, Pablo Dartnell, Gabriela Toledo, María Inés Godoy, Nora Córdova, Daniela Jiménez, Julie Lemarié","doi":"10.1177/00131644221132335","DOIUrl":"10.1177/00131644221132335","url":null,"abstract":"<p><p>Even though the impact of the position of response options on answers to multiple-choice items has been investigated for decades, it remains debated. Research on this topic is inconclusive, perhaps because too few studies have obtained experimental data from large-sized samples in a real-world context and have manipulated the position of both correct response and distractors. Since multiple-choice tests' outcomes can be strikingly consequential and option position effects constitute a potential source of measurement error, these effects should be clarified. In this study, two experiments in which the position of correct response and distractors was carefully manipulated were performed within a Chilean national high-stakes standardized test, responded by 195,715 examinees. Results show small but clear and systematic effects of options position on examinees' responses in both experiments. They consistently indicate that a five-option item is slightly easier when the correct response is in A rather than E and when the most attractive distractor is after and far away from the correct response. They clarify and extend previous findings, showing that the appeal of all options is influenced by position. The existence and nature of a potential interference phenomenon between the options' processing are discussed, and implications for test development are considered.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"861-884"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470158/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10306861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact and Detection of Uniform Differential Item Functioning for Continuous Item Response Models.","authors":"W Holmes Finch","doi":"10.1177/00131644221111993","DOIUrl":"10.1177/00131644221111993","url":null,"abstract":"<p><p>Psychometricians have devoted much research and attention to categorical item responses, leading to the development and widespread use of item response theory for the estimation of model parameters and identification of items that do not perform in the same way for examinees from different population subgroups (e.g., differential item functioning [DIF]). With the increasing use of computer-based measurement, use of items with a continuous response modality is becoming more common. Models for use with these items have been developed and refined in recent years, but less attention has been devoted to investigating DIF for these continuous response models (CRMs). Therefore, the purpose of this simulation study was to compare the performance of three potential methods for assessing DIF for CRMs, including regression, the MIMIC model, and factor invariance testing. Study results revealed that the MIMIC model provided a combination of Type I error control and relatively high power for detecting DIF. Implications of these findings are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"929-952"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470162/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10506042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Preknowledge Cheating via Innovative Measures: A Mixture Hierarchical Model for Jointly Modeling Item Responses, Response Times, and Visual Fixation Counts.","authors":"Kaiwen Man, Jeffrey R Harring","doi":"10.1177/00131644221136142","DOIUrl":"10.1177/00131644221136142","url":null,"abstract":"<p><p>Preknowledge cheating jeopardizes the validity of inferences based on test results. Many methods have been developed to detect preknowledge cheating by jointly analyzing item responses and response times. Gaze fixations, an essential eye-tracker measure, can be utilized to help detect aberrant testing behavior with improved accuracy beyond using product and process data types in isolation. As such, this study proposes a mixture hierarchical model that integrates item responses, response times, and visual fixation counts collected from an eye-tracker (a) to detect aberrant test takers who have different levels of preknowledge and (b) to account for nuances in behavioral patterns between normally-behaved and aberrant examinees. A Bayesian approach to estimating model parameters is carried out via an MCMC algorithm. Finally, the proposed model is applied to experimental data to illustrate how the model can be used to identify test takers having preknowledge on the test items.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"1059-1080"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470163/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10525106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The NEAT Equating Via Chaining Random Forests in the Context of Small Sample Sizes: A Machine-Learning Method.","authors":"Zhehan Jiang, Yuting Han, Lingling Xu, Dexin Shi, Ren Liu, Jinying Ouyang, Fen Cai","doi":"10.1177/00131644221120899","DOIUrl":"10.1177/00131644221120899","url":null,"abstract":"<p><p>The part of responses that is absent in the nonequivalent groups with anchor test (NEAT) design can be managed to a planned missing scenario. In the context of small sample sizes, we present a machine learning (ML)-based imputation technique called chaining random forests (CRF) to perform equating tasks within the NEAT design. Specifically, seven CRF-based imputation equating methods are proposed based on different data augmentation methods. The equating performance of the proposed methods is examined through a simulation study. Five factors are considered: (a) test length (20, 30, 40, 50), (b) sample size per test form (50 versus 100), (c) ratio of common/anchor items (0.2 versus 0.3), and (d) equivalent versus nonequivalent groups taking the two forms (no mean difference versus a mean difference of 0.5), and (e) three different types of anchors (random, easy, and hard), resulting in 96 conditions. In addition, five traditional equating methods, (1) Tucker method; (2) Levine observed score method; (3) equipercentile equating method; (4) circle-arc method; and (5) concurrent calibration based on Rasch model, were also considered, plus seven CRF-based imputation equating methods for a total of 12 methods in this study. The findings suggest that benefiting from the advantages of ML techniques, CRF-based methods that incorporate the equating result of the Tucker method, such as IMP_total_Tucker, IMP_pair_Tucker, and IMP_Tucker_cirlce methods, can yield more robust and trustable estimates for the \"missingness\" in an equating task and therefore result in more accurate equated scores than other counterparts in short-length tests with small samples.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"984-1006"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470159/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10357823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized Mantel-Haenszel Estimators for Simultaneous Differential Item Functioning Tests.","authors":"Ivy Liu, Thomas Suesse, Samuel Harvey, Peter Yongqi Gu, Daniel Fernández, John Randal","doi":"10.1177/00131644221128341","DOIUrl":"10.1177/00131644221128341","url":null,"abstract":"<p><p>The Mantel-Haenszel estimator is one of the most popular techniques for measuring differential item functioning (DIF). A generalization of this estimator is applied to the context of DIF to compare items by taking the covariance of odds ratio estimators between dependent items into account. Unlike the Item Response Theory, the method does not rely on the local item independence assumption which is likely to be violated when one item provides clues about the answer of another item. Furthermore, we use these (co)variance estimators to construct a hypothesis test to assess DIF for multiple items simultaneously. A simulation study is presented to assess the performance of several tests. Finally, the use of these DIF tests is illustrated via application to two real data sets.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"1007-1032"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470165/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10506044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}