Exploring techniques to improve machine learning’s identification of at-risk students in physics classes
John Pace, John Hansen, John Stewart
Physical Review Physics Education Research, published 2024-05-31
DOI: https://doi.org/10.1103/physrevphyseducres.20.010149
Citation count: 0
Abstract
Machine learning models were constructed to predict student performance in an introductory mechanics class at a large land-grant university in the United States using data from 2061 students. Students were classified as either at risk of failing the course (earning a D or F) or not at risk (earning an A, B, or C). The models focused on variables available in the first few weeks of the class, which could allow for early interventions to help at-risk students. Multiple types of variables were used in the models: in-class variables (average homework and clicker quiz scores), institutional variables [college grade point average (GPA)], and noncognitive variables (self-efficacy). The substantial imbalance between the pass and fail rates of the course, with only about 10% of students failing, required modification to the machine learning algorithms. Decision threshold tuning and upsampling were successful in improving performance for at-risk students. Logistic regression combined with a decision threshold tuned to maximize balanced accuracy yielded the strongest classifier, with a DF accuracy of 83% and an ABC accuracy of 81%. Measures of variable importance based on changes in balanced accuracy identified homework grades, clicker grades, college GPA, and the fraction of college classes successfully completed as the most important variables in predicting success in introductory physics. Noncognitive variables added little predictive power to the models. Classification models that performed nearly as well as the best models built on the full set of variables could be constructed from very few variables (homework average, clicker scores, and college GPA) using straightforward-to-implement algorithms, suggesting that these techniques may be fairly easy to apply in many physics classes.
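The decision threshold tuning mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `balanced_accuracy` and `tune_threshold`, the threshold grid, and the synthetic probabilities are all assumptions made for illustration. It assumes some classifier (for example, logistic regression) has already produced a predicted probability of being at risk for each student, and it picks the cutoff that maximizes balanced accuracy, taken here as the mean of the DF accuracy and the ABC accuracy.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Return (balanced accuracy, DF accuracy, ABC accuracy).

    Labels: 1 = at risk (D or F), 0 = not at risk (A, B, or C).
    Assumes both classes appear at least once in y_true.
    """
    at_risk = [p for t, p in zip(y_true, y_pred) if t == 1]
    not_at_risk = [p for t, p in zip(y_true, y_pred) if t == 0]
    df_acc = sum(1 for p in at_risk if p == 1) / len(at_risk)
    abc_acc = sum(1 for p in not_at_risk if p == 0) / len(not_at_risk)
    return (df_acc + abc_acc) / 2, df_acc, abc_acc


def classify(p_at_risk, threshold):
    """Flag a student as at risk when the predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in p_at_risk]


def tune_threshold(y_true, p_at_risk, grid=None):
    """Scan a grid of thresholds and return the one with the highest balanced accuracy."""
    grid = grid or [i / 100 for i in range(1, 100)]  # hypothetical grid: 0.01 .. 0.99
    return max(grid, key=lambda th: balanced_accuracy(y_true, classify(p_at_risk, th))[0])


if __name__ == "__main__":
    # Synthetic, made-up data: 10% at-risk students, mirroring only the class
    # imbalance described in the abstract, not the study's actual data.
    random.seed(0)
    y = [1] * 100 + [0] * 900
    p = [random.betavariate(3, 5) for _ in range(100)] + \
        [random.betavariate(1, 9) for _ in range(900)]

    best = tune_threshold(y, p)
    print("default 0.5:", balanced_accuracy(y, classify(p, 0.5)))
    print("tuned", best, ":", balanced_accuracy(y, classify(p, best)))
```

In practice the threshold would be tuned on training or validation data and then evaluated on held-out students, since choosing and scoring the threshold on the same data overstates performance.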
About the journal:
PRPER covers all educational levels, from elementary through graduate education. All topics in experimental and theoretical physics education research are accepted, including, but not limited to:
Educational policy
Instructional strategies and materials development
Research methodology
Epistemology, attitudes, and beliefs
Learning environment
Scientific reasoning and problem solving
Diversity and inclusion
Learning theory
Student participation
Faculty and teacher professional development