Driving content recommendations by building a knowledge base using weak supervision and transfer learning

S. Deb
Proceedings of the 13th ACM Conference on Recommender Systems (RecSys '19), September 10, 2019.
DOI: 10.1145/3298689.3346963

Abstract

With 2.2 million subscribers and two hundred million content views, Chegg is a centralized hub where students come to get help with writing, science, math, and other educational needs. To improve students' learning outcomes, we present them with personalized content. Student needs are unique, shaped by their learning style, studying environment, and many other factors, and most students engage with only a subset of the products and content available at Chegg. To recommend personalized content to students, we have developed a generalized machine learning pipeline that handles training data generation and model building for a wide range of problems. We build a knowledge base with a hierarchy of concepts and associate student-generated content, such as chat-room data, equations, chemical formulae, and reviews, with concepts in the knowledge base.

Collecting training data to generate the different parts of the knowledge base is a key bottleneck in developing NLP models: employing subject matter experts to provide annotations is prohibitively expensive. Instead, we use weak supervision and active learning techniques, with tools such as Snorkel [2], an open-source project from Stanford, to make training data generation dramatically easier. With these methods, training data is generated using broad-stroke filters and high-precision rules, and the rules are modeled probabilistically to incorporate dependencies between them. Features for classification tasks are generated using transfer learning [1] from language models. We explored several language models; the best performance came from sentence embeddings built with skip-thought vectors, which predict the previous and the next sentence.

The generated structured information is then used to improve product features and enhance the recommendations made to students. In this presentation I will discuss efficient methods for tagging content with categories drawn from a knowledge base. Using this information, we provide relevant content recommendations to students coming to Chegg for online tutoring, studying flashcards, and practicing problems.
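To make the weak-supervision step concrete, the sketch below expresses a broad-stroke filter and a high-precision rule as labeling functions and merges their noisy votes. The labeling functions, label names, and the simple majority-vote combiner are hypothetical illustrations, not Chegg's pipeline; Snorkel replaces the majority vote with a probabilistic label model that also learns rule accuracies and dependencies.

```python
from collections import Counter

ABSTAIN, CHEMISTRY, ALGEBRA = -1, 0, 1

# Broad-stroke filter: any mention of a common chemical formula suggests chemistry.
def lf_chemical_formula(text):
    return CHEMISTRY if any(tok in text for tok in ("H2O", "NaCl", "CO2")) else ABSTAIN

# High-precision rule: an explicit "solve for x" phrasing indicates algebra.
def lf_solve_for_x(text):
    return ALGEBRA if "solve for x" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_chemical_formula, lf_solve_for_x]

def weak_label(text):
    """Combine noisy labeling-function votes by majority vote.

    A simplified stand-in for a probabilistic label model, which would
    additionally weight each rule by its estimated accuracy.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

print(weak_label("Balance the equation: CH4 + O2 -> CO2 + H2O"))  # → 0 (CHEMISTRY)
print(weak_label("Solve for x: 2x + 3 = 7"))                      # → 1 (ALGEBRA)
```

Labeled examples produced this way can then train a discriminative classifier over the language-model features.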
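One way to picture the tagging step described above is nearest-concept matching in embedding space: embed a piece of content and assign it the knowledge-base concept whose embedding it is closest to. The concept names and three-dimensional toy vectors below are hypothetical; in the work described here, embeddings would come from a language model such as skip-thought sentence vectors.

```python
import math

# Hypothetical concept hierarchy with toy embedding vectors.
CONCEPT_EMBEDDINGS = {
    "math/algebra": [0.9, 0.1, 0.0],
    "math/calculus": [0.7, 0.3, 0.1],
    "chemistry/stoichiometry": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def tag_content(content_embedding):
    """Tag content with the knowledge-base concept whose embedding is closest."""
    return max(CONCEPT_EMBEDDINGS,
               key=lambda c: cosine(content_embedding, CONCEPT_EMBEDDINGS[c]))

# A flashcard whose embedding leans heavily toward the chemistry axis.
print(tag_content([0.1, 0.1, 0.8]))  # → chemistry/stoichiometry
```

Once content carries concept tags, recommendation reduces to surfacing items whose tags match the concepts a student is currently studying.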