{"title":"通过主动学习引导问题难度估算的标记工作","authors":"Arthur Thuy, Ekaterina Loginova, Dries F. Benoit","doi":"arxiv-2409.09258","DOIUrl":null,"url":null,"abstract":"In recent years, there has been a surge in research on Question Difficulty\nEstimation (QDE) using natural language processing techniques.\nTransformer-based neural networks achieve state-of-the-art performance,\nprimarily through supervised methods but with an isolated study in unsupervised\nlearning. While supervised methods focus on predictive performance, they\nrequire abundant labeled data. On the other hand, unsupervised methods do not\nrequire labeled data but rely on a different evaluation metric that is also\ncomputationally expensive in practice. This work bridges the research gap by\nexploring active learning for QDE, a supervised human-in-the-loop approach\nstriving to minimize the labeling efforts while matching the performance of\nstate-of-the-art models. The active learning process iteratively trains on a\nlabeled subset, acquiring labels from human experts only for the most\ninformative unlabeled data points. Furthermore, we propose a novel acquisition\nfunction PowerVariance to add the most informative samples to the labeled set,\na regression extension to the PowerBALD function popular in classification. We\nemploy DistilBERT for QDE and identify informative samples by applying Monte\nCarlo dropout to capture epistemic uncertainty in unlabeled samples. The\nexperiments demonstrate that active learning with PowerVariance acquisition\nachieves a performance close to fully supervised models after labeling only 10%\nof the training data. The proposed methodology promotes the responsible use of\neducational resources, makes QDE tools more accessible to course instructors,\nand is promising for other applications such as personalized support systems\nand question-answering tools.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Active Learning to Guide Labeling Efforts for Question Difficulty Estimation\",\"authors\":\"Arthur Thuy, Ekaterina Loginova, Dries F. Benoit\",\"doi\":\"arxiv-2409.09258\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, there has been a surge in research on Question Difficulty\\nEstimation (QDE) using natural language processing techniques.\\nTransformer-based neural networks achieve state-of-the-art performance,\\nprimarily through supervised methods but with an isolated study in unsupervised\\nlearning. While supervised methods focus on predictive performance, they\\nrequire abundant labeled data. On the other hand, unsupervised methods do not\\nrequire labeled data but rely on a different evaluation metric that is also\\ncomputationally expensive in practice. This work bridges the research gap by\\nexploring active learning for QDE, a supervised human-in-the-loop approach\\nstriving to minimize the labeling efforts while matching the performance of\\nstate-of-the-art models. The active learning process iteratively trains on a\\nlabeled subset, acquiring labels from human experts only for the most\\ninformative unlabeled data points. Furthermore, we propose a novel acquisition\\nfunction PowerVariance to add the most informative samples to the labeled set,\\na regression extension to the PowerBALD function popular in classification. 
We\\nemploy DistilBERT for QDE and identify informative samples by applying Monte\\nCarlo dropout to capture epistemic uncertainty in unlabeled samples. The\\nexperiments demonstrate that active learning with PowerVariance acquisition\\nachieves a performance close to fully supervised models after labeling only 10%\\nof the training data. The proposed methodology promotes the responsible use of\\neducational resources, makes QDE tools more accessible to course instructors,\\nand is promising for other applications such as personalized support systems\\nand question-answering tools.\",\"PeriodicalId\":501340,\"journal\":{\"name\":\"arXiv - STAT - Machine Learning\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - STAT - Machine Learning\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.09258\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09258","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Active Learning to Guide Labeling Efforts for Question Difficulty Estimation
In recent years, there has been a surge in research on Question Difficulty Estimation (QDE) using natural language processing techniques. Transformer-based neural networks achieve state-of-the-art performance, primarily through supervised methods, with only an isolated study on unsupervised learning. While supervised methods focus on predictive performance, they require abundant labeled data. Unsupervised methods, on the other hand, do not require labeled data, but they rely on a different evaluation metric that is also computationally expensive in practice. This work bridges the research gap by exploring active learning for QDE, a supervised human-in-the-loop approach that strives to minimize labeling effort while matching the performance of state-of-the-art models. The active learning process iteratively trains on a labeled subset, acquiring labels from human experts only for the most informative unlabeled data points.
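As a minimal sketch of such a pool-based loop (not the authors' implementation), the procedure below retrains a regressor on the current labeled pool each round and asks the oracle, a human expert in the paper's setting, to label only the highest-scoring unlabeled questions. The helper names (train_fn, score_fn, oracle) and the round and batch sizes are hypothetical placeholders.

```python
import numpy as np

def active_learning_loop(labeled, unlabeled, oracle, train_fn, score_fn,
                         n_rounds=10, batch_size=50):
    """Pool-based active learning loop (illustrative sketch).

    labeled   : list of (question_text, difficulty) pairs
    unlabeled : list of question_text strings
    oracle    : callable returning the difficulty label for one question
                (a human expert in the paper's setting)
    train_fn  : callable fitting a difficulty regressor on the labeled pairs
    score_fn  : callable returning one informativeness score per unlabeled item
    """
    for _ in range(n_rounds):
        model = train_fn(labeled)                        # retrain on current labels
        scores = np.asarray(score_fn(model, unlabeled))  # e.g. PowerVariance scores
        chosen = np.argsort(scores)[-batch_size:]        # most informative items
        for idx in sorted(chosen, reverse=True):         # pop from the back first
            question = unlabeled.pop(idx)
            labeled.append((question, oracle(question))) # expert labels only these
    return train_fn(labeled)
```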
Furthermore, we propose a novel acquisition function, PowerVariance, a regression extension of the PowerBALD function popular in classification, which adds the most informative samples to the labeled set.
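The abstract does not spell out the PowerVariance formula; the sketch below assumes it follows the same stochastic-acquisition recipe as PowerBALD, i.e., selecting a batch with probability proportional to the per-point score raised to a power, with the MC-dropout predictive variance (described next) taking the place of the BALD score. The temperature beta and the Gumbel-top-k formulation are illustrative assumptions.

```python
import numpy as np

def power_variance_batch(variances, batch_size, beta=1.0, rng=None):
    """Select a batch via a PowerBALD-style stochastic rule applied to variances.

    Assumed form (not taken from the paper): perturb beta * log(variance) with
    Gumbel noise and take the top-k, which is equivalent to sampling without
    replacement with probabilities proportional to variance ** beta.
    """
    rng = np.random.default_rng(rng)
    variances = np.asarray(variances, dtype=float)
    gumbel = rng.gumbel(size=variances.shape)          # stochastic perturbation
    perturbed = beta * np.log(variances + 1e-12) + gumbel
    return np.argsort(perturbed)[-batch_size:]         # indices of items to label
```

Under this reading, a larger beta concentrates the batch on the highest-variance questions, while a smaller beta spreads selection more evenly across the pool, which is the usual motivation for power-family acquisition functions.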
We employ DistilBERT for QDE and identify informative samples by applying Monte Carlo dropout to capture epistemic uncertainty in unlabeled samples.
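A minimal sketch of that uncertainty estimate, assuming the Hugging Face DistilBERT checkpoint and a single regression output head; the checkpoint name and number of stochastic passes are illustrative choices, not the paper's settings. Dropout is kept active at inference time, and the variance of the predictions across passes serves as the epistemic-uncertainty score fed to the acquisition function.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def mc_dropout_variance(texts, model_name="distilbert-base-uncased", n_passes=20):
    """Per-question epistemic uncertainty via Monte Carlo dropout (sketch)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
    model.train()  # keep dropout layers active at inference time
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        preds = torch.stack(
            [model(**inputs).logits.squeeze(-1) for _ in range(n_passes)]
        )                                # shape: (n_passes, n_texts)
    return preds.var(dim=0)              # variance across stochastic passes
```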
The experiments demonstrate that active learning with PowerVariance acquisition achieves performance close to that of fully supervised models after labeling only 10% of the training data. The proposed methodology promotes the responsible use of educational resources, makes QDE tools more accessible to course instructors, and is promising for other applications such as personalized support systems and question-answering tools.