Incorporating intent propensities in personalized next best action recommendation

Yuxi Zhang, Kexin Xie
{"title":"Incorporating intent propensities in personalized next best action recommendation","authors":"Yuxi Zhang, Kexin Xie","doi":"10.1145/3298689.3346962","DOIUrl":null,"url":null,"abstract":"Next best action (NBA) is a technique that is widely considered as the best practice in modern personalized marketing. It takes users' unique characteristics into consideration and recommends next actions that help users progress towards business goals as quickly and smoothly as possible. Many NBA engines are built with rules handcrafted by marketers based on experience or gut feelings. It is not effective. In this proposal, we show our machine learning based approach for such a real-time recommendation engine, detail our design choices, and discuss evaluation techniques. In practice, there are several key challenges to consider. (a) It needs to be able to deal with historical feedback that is typically incomplete and skewed towards a small set of actions; (b) Actions are typically dynamic. They can be added or removed anytime due to seasonal changes or shifts in business strategies; (c) The optimization objective is typically complex. It usually consists of reaching a set of target events or moving users to more preferred stages. The engine needs to account for all these aspects. Standard classification or regression models are not suitable to use, because only bandit feedback is available and sampling bias presented in historical data can not be handled properly. Conventional multi-armed bandit model can address some of the challenges. But it lacks the ability to model multiple objectives. We present a propensity variant hybrid contextual multi-armed bandit model (PV-MAB) that can address all three challenges. PV-MAB consists of two components: an intent propensity model (I-Prop) and a hybrid contextual MAB (H-Bandit). H-Bandit can be considered as a multi-policy contextual MAB, where we model different aspects of user engagement separately and cater the policies to each unique characteristic. I-Prop leverages user intent signals to target different users toward specific goals that are most relevant to them. It acts as a policy selector, to inform H-Bandit to choose the best strategy for different users at different points in the journey. I-Prop is trained separately with features extracted from user profile affinities and past behaviors. To illustrate this design, we will focus our discussion on how to incorporate two common distinct objectives in H-bandit. The first one is to target and drive users to reach a small set of high-value goals (e.g. purchase, become superfan), called goal-oriented policy. The second is to promote progression into more advanced stages in a consumer journey (e.g. from login to complete profile). We call it stage-advancement policy. In the goal-oriented policy, we reward reaching the goals accordingly, and use classification predictor as kernel function to predict the probabilities for achieving those goals. In the stage-advancement policy, we use the progression of stages as reward. Customers can move forward in their journey, skip a few stages or go back to previous stages doing more research or re-evaluation. The reward strategy is designed in the way that we reward higher for bigger positive stage progression and not reward zero or negative stage progression. Both policies incorporate Thompson Sampling with Gaussian kernel for better exploration. 
One big difference between our hybrid model and regular contextual bandit model, is that besides context information, we also mix user profile affinities in the model. It tells us the user intent and interest, and how their typical journey path looks like. With these special features, our model is able to recommend different actions for users that shows different interests (i.e. football ticket purchase v.s. jersey purchase). Similarly, for fast shoppers who usually skip a few stages, our model recommends actions that quickly triggers goal achievement; while for research type of users, the model offers actions that move them gradually towards next stages. This hybrid strategy provides us with better understanding of user intent and behaviors, so as to make more personalized recommendations. We designed a time-sensitive rolling evaluation mechanism for offline evaluation of the system with various hyperparameters that simulate behaviors in practice. Despite the lack of online evaluation, our strategy allows researchers and prospects to gain confidence through bounded expected performance. Evaluated on real-world data, we observed about 120% of reward gain, with an overall confidence of around 0.95. A big portion of the improvement is contributed by the goal-oriented policy. It well demonstrated the discovery functionality of the intent propensity model.","PeriodicalId":215384,"journal":{"name":"Proceedings of the 13th ACM Conference on Recommender Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 13th ACM Conference on Recommender Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3298689.3346962","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Next best action (NBA) is a technique widely considered the best practice in modern personalized marketing. It takes users' unique characteristics into account and recommends next actions that help users progress toward business goals as quickly and smoothly as possible. Many NBA engines are built with rules handcrafted by marketers based on experience or gut feeling, which is not effective. In this proposal, we present our machine-learning-based approach to such a real-time recommendation engine, detail our design choices, and discuss evaluation techniques.

In practice, there are several key challenges to consider: (a) the engine needs to handle historical feedback that is typically incomplete and skewed toward a small set of actions; (b) actions are typically dynamic and can be added or removed at any time due to seasonal changes or shifts in business strategy; (c) the optimization objective is typically complex, usually consisting of reaching a set of target events or moving users to more preferred stages. The engine needs to account for all of these aspects. Standard classification or regression models are not suitable, because only bandit feedback is available and the sampling bias present in historical data cannot be handled properly. A conventional multi-armed bandit model can address some of these challenges, but it lacks the ability to model multiple objectives.

We present a propensity-variant hybrid contextual multi-armed bandit model (PV-MAB) that addresses all three challenges. PV-MAB consists of two components: an intent propensity model (I-Prop) and a hybrid contextual MAB (H-Bandit). H-Bandit can be considered a multi-policy contextual MAB, where we model different aspects of user engagement separately and tailor the policies to each unique characteristic. I-Prop leverages user intent signals to target different users toward the specific goals that are most relevant to them. It acts as a policy selector, informing H-Bandit which strategy is best for different users at different points in the journey. I-Prop is trained separately with features extracted from user profile affinities and past behaviors.

To illustrate this design, we focus our discussion on how to incorporate two common, distinct objectives in H-Bandit. The first is to target and drive users to reach a small set of high-value goals (e.g., purchase, become a superfan); we call this the goal-oriented policy. The second is to promote progression into more advanced stages of the consumer journey (e.g., from login to completed profile); we call this the stage-advancement policy. In the goal-oriented policy, we reward reaching the goals accordingly and use a classification predictor as the kernel function to predict the probabilities of achieving those goals. In the stage-advancement policy, we use the progression of stages as the reward. Customers can move forward in their journey, skip a few stages, or go back to previous stages to do more research or re-evaluation. The reward strategy is designed so that larger positive stage progressions receive higher rewards, while zero or negative progressions receive none. Both policies incorporate Thompson Sampling with a Gaussian kernel for better exploration.
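To make the policy structure concrete, below is a minimal, non-contextual sketch of the idea in Python: an intent model routes each user to either a goal-oriented or a stage-advancement policy, and each policy explores with Gaussian Thompson Sampling. The names (GaussianThompsonPolicy, PVMAB, stage_advancement_reward), the conjugate-Gaussian update, and the simple stage-progression reward are illustrative assumptions only; the system described above is contextual and uses a classification predictor as the kernel for the goal-oriented policy.

```python
# Minimal sketch of the PV-MAB idea. All names are illustrative assumptions;
# this is not the authors' implementation and omits context features and the
# classification-predictor kernel for brevity.
import numpy as np


class GaussianThompsonPolicy:
    """Per-action Thompson Sampling with a Gaussian posterior over mean reward."""

    def __init__(self, actions, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
        self.mean = {a: prior_mean for a in actions}
        self.var = {a: prior_var for a in actions}
        self.noise_var = noise_var

    def add_action(self, action, prior_mean=0.0, prior_var=1.0):
        # Actions are dynamic (challenge b): a new action simply starts at the prior.
        self.mean[action] = prior_mean
        self.var[action] = prior_var

    def select(self):
        # Sample a plausible mean reward for each action and pick the best sample.
        samples = {a: np.random.normal(m, np.sqrt(self.var[a]))
                   for a, m in self.mean.items()}
        return max(samples, key=samples.get)

    def update(self, action, reward):
        # Conjugate Gaussian update of the chosen action's posterior.
        prior_mean, prior_var = self.mean[action], self.var[action]
        post_var = 1.0 / (1.0 / prior_var + 1.0 / self.noise_var)
        self.mean[action] = post_var * (prior_mean / prior_var + reward / self.noise_var)
        self.var[action] = post_var


def stage_advancement_reward(prev_stage, new_stage):
    # Larger positive stage progressions earn more; zero or negative earn nothing.
    return max(new_stage - prev_stage, 0)


class PVMAB:
    """I-Prop acts as a policy selector; H-Bandit keeps one policy per objective."""

    def __init__(self, goal_policy, stage_policy, intent_model):
        self.policies = {"goal": goal_policy, "stage": stage_policy}
        self.intent_model = intent_model  # assumed: user_features -> "goal" or "stage"

    def recommend(self, user_features):
        key = self.intent_model(user_features)
        return key, self.policies[key].select()
```

A dummy intent model (for example, `lambda f: "goal" if f.get("fast_shopper") else "stage"`) is enough to exercise the loop: call `recommend`, observe whether a goal was reached or a stage changed, convert that outcome into a reward (for the stage policy, via `stage_advancement_reward`), and feed it back through `update`.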
One big difference between our hybrid model and a regular contextual bandit model is that, besides context information, we also mix user profile affinities into the model. These affinities tell us the user's intent and interests, and what their typical journey path looks like. With these additional features, our model is able to recommend different actions to users who show different interests (e.g., football ticket purchase vs. jersey purchase). Similarly, for fast shoppers who usually skip a few stages, our model recommends actions that quickly trigger goal achievement, while for research-oriented users, the model offers actions that move them gradually toward the next stages. This hybrid strategy gives us a better understanding of user intent and behavior, allowing more personalized recommendations.

We designed a time-sensitive rolling evaluation mechanism for offline evaluation of the system under various hyperparameters that simulate behaviors in practice. Despite the lack of online evaluation, this strategy allows researchers and prospects to gain confidence through bounded expected performance. Evaluated on real-world data, we observed about 120% reward gain, with an overall confidence of around 0.95. A large portion of the improvement is contributed by the goal-oriented policy, which demonstrates the discovery capability of the intent propensity model.
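The abstract does not spell out the rolling evaluation procedure, so the following is a hedged sketch of one common way a time-sensitive offline evaluation can be organized: replay logged interactions in time order, score only the events where the policy's recommendation matches the logged action, and report the average reward per time window. The function name, the log schema, and the matching-based replay estimator are assumptions for illustration, not the authors' exact mechanism; the sketch reuses the PVMAB class from above.

```python
# Illustrative rolling replay evaluation. The log schema
# (timestamp, user_features, logged_action, reward), the weekly window, and
# the matching-based estimator are assumptions, not the authors' exact setup.
from collections import defaultdict


def rolling_replay_evaluation(logs, bandit, window_seconds=7 * 24 * 3600):
    """Replay logged interactions in time order and report per-window mean reward."""
    window_rewards = defaultdict(list)

    for ts, user_features, logged_action, reward in sorted(logs, key=lambda r: r[0]):
        window = int(ts // window_seconds)
        policy_key, chosen = bandit.recommend(user_features)
        if chosen == logged_action:
            # Only matched recommendations contribute a reward sample, keeping the
            # estimate faithful to the logged (bandit) feedback.
            window_rewards[window].append(reward)
            bandit.policies[policy_key].update(chosen, reward)  # learn as time rolls on

    return {w: sum(r) / len(r) for w, r in sorted(window_rewards.items())}
```

Comparing the per-window averages against a logging-policy baseline is one way to obtain the kind of bounded expected-performance estimate described above.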