Large Language Model Augmented Exercise Retrieval for Personalized Language Learning
Austin Xu, Will Monroe, K. Bicknell
International Conference on Learning Analytics and Knowledge, pp. 284-294. Published 2024-02-08.
DOI: 10.1145/3636555.3636883
Citations: 1
Large Language Model Augmented Exercise Retrieval for Personalized Language Learning

Abstract
We study the problem of zero-shot exercise retrieval in the context of online language learning, to give learners the ability to explicitly request personalized exercises via natural language. Using real-world data collected from language learners, we observe that vector similarity approaches poorly capture the relationship between exercise content and the language that learners use to express what they want to learn. This semantic gap between queries and content dramatically reduces the effectiveness of general-purpose retrieval models pretrained on large scale information retrieval datasets like MS MARCO. We leverage the generative capabilities of large language models to bridge the gap by synthesizing hypothetical exercises based on the learner's input, which are then used to search for relevant exercises. Our approach, which we call mHyER, overcomes three challenges: (1) lack of relevance labels for training, (2) unrestricted learner input content, and (3) low semantic similarity between input and retrieval candidates. mHyER outperforms several strong baselines on two novel benchmarks created from crowdsourced data and publicly available data.
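The core idea the abstract describes — have an LLM synthesize hypothetical exercises from the learner's free-form request, then retrieve real exercises by embedding similarity to those hypotheticals — can be sketched as follows. This is an illustrative toy only: the bag-of-words embedding and the canned `generate_hypothetical_exercises` stub stand in for the trained sentence encoder and the LLM; none of the function names or scoring details come from the paper.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a sparse term-count vector.
    A real system would use a trained sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def generate_hypothetical_exercises(learner_query, n=2):
    """Placeholder for the LLM call: the method synthesizes
    exercise-like text from the learner's request, so that query and
    candidates live in the same 'exercise' style of language.
    Here we fake the generation with a template."""
    return [f"exercise practicing {learner_query}"] * n

def retrieve(learner_query, exercise_bank, k=1):
    """Rank candidate exercises by average similarity to the
    hypothetical exercises, then return the top k."""
    hyps = [embed(h) for h in generate_hypothetical_exercises(learner_query)]
    scored = []
    for ex in exercise_bank:
        e = embed(ex)
        score = sum(cosine(h, e) for h in hyps) / len(hyps)
        scored.append((score, ex))
    scored.sort(reverse=True)
    return [ex for _, ex in scored[:k]]
```

The point of generating hypotheticals rather than embedding the raw query is to close the semantic gap the abstract mentions: a learner request ("I want to order food in Spanish") looks nothing like exercise content, but an LLM-generated exercise for that request does.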