Online Learning for Personalized Room-Level Thermal Control: A Multi-Armed Bandit Framework

Parisa Mansourifard, F. Jazizadeh, B. Krishnamachari, B. Becerik-Gerber
{"title":"Online Learning for Personalized Room-Level Thermal Control: A Multi-Armed Bandit Framework","authors":"Parisa Mansourifard, F. Jazizadeh, B. Krishnamachari, B. Becerik-Gerber","doi":"10.1145/2528282.2528296","DOIUrl":null,"url":null,"abstract":"We consider the problem of automatically learning the optimal thermal control in a room in order to maximize the expected average satisfaction among occupants providing stochastic feedback on their comfort through a participatory sensing application. Not assuming any prior knowledge or modeling of user comfort, we first apply the classic UCB1 online learning policy for multi-armed bandits (MAB), that combines exploration (testing out certain temperatures to understand better the user preferences) with exploitation (spending more time setting temperatures that maximize average-satisfaction) for the case when the total occupancy is constant. When occupancy is time-varying, the number of possible scenarios (i.e., which particular set of occupants are present in the room) becomes exponentially large, posing a combinatorial challenge. However, we show that LLR, a recently-developed combinatorial MAB online learning algorithm that requires recording and computation of only a polynomial number of quantities can be applied to this setting, yielding a regret (cumulative gap in average satisfaction with respect to a distribution aware genie) that grows only polynomially in the number of users, and logarithmically with time. This in turn indicates that difference in unit-time satisfaction obtained by the learning policy compared to the optimal tends to 0. We quantify the performance of these online learning algorithms using real data collected from users of a participatory sensing iPhone app in a multi-occupancy room in an office building in Southern California.","PeriodicalId":184274,"journal":{"name":"Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2528282.2528296","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14

Abstract

We consider the problem of automatically learning the optimal thermal control in a room in order to maximize the expected average satisfaction among occupants who provide stochastic feedback on their comfort through a participatory sensing application. Assuming no prior knowledge or model of user comfort, we first apply the classic UCB1 online learning policy for multi-armed bandits (MAB), which combines exploration (testing out certain temperatures to better understand user preferences) with exploitation (spending more time at temperatures that maximize average satisfaction), for the case when total occupancy is constant. When occupancy is time-varying, the number of possible scenarios (i.e., which particular set of occupants is present in the room) becomes exponentially large, posing a combinatorial challenge. However, we show that LLR, a recently developed combinatorial MAB online learning algorithm that requires recording and computing only a polynomial number of quantities, can be applied to this setting, yielding a regret (the cumulative gap in average satisfaction with respect to a distribution-aware genie) that grows only polynomially in the number of users and logarithmically with time. This in turn indicates that the difference in per-unit-time satisfaction between the learning policy and the optimal policy tends to 0. We quantify the performance of these online learning algorithms using real data collected from users of a participatory sensing iPhone app in a multi-occupancy room in an office building in Southern California.
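The constant-occupancy case maps directly onto a standard stochastic bandit: each candidate temperature setpoint is an arm, and the reward at each step is the (stochastic) average satisfaction reported by the occupants present. The following is a minimal sketch of UCB1 applied this way; the discretized setpoints, the `occupant_feedback` simulator, and the horizon are illustrative assumptions, not the paper's experimental setup.

```python
import math
import random

def ucb1_thermal_control(setpoints, occupant_feedback, horizon):
    """UCB1 over a discrete set of temperature setpoints.

    occupant_feedback(temp) returns a reward in [0, 1], e.g. the fraction
    of present occupants reporting they are comfortable at this step.
    """
    n = [0] * len(setpoints)        # times each setpoint has been tried
    mean = [0.0] * len(setpoints)   # empirical mean satisfaction per setpoint

    for t in range(1, horizon + 1):
        if t <= len(setpoints):
            i = t - 1               # initialization: try each setpoint once
        else:
            # UCB1 index: empirical mean plus exploration bonus
            i = max(range(len(setpoints)),
                    key=lambda j: mean[j] + math.sqrt(2 * math.log(t) / n[j]))
        r = occupant_feedback(setpoints[i])
        n[i] += 1
        mean[i] += (r - mean[i]) / n[i]   # incremental mean update
    # Return the setpoint with the highest empirical satisfaction
    return setpoints[max(range(len(setpoints)), key=lambda j: mean[j])]

# Hypothetical simulator (not from the paper): each occupant votes
# "comfortable" with a probability that peaks near 22.5 C.
def occupant_feedback(temp, n_occupants=4):
    votes = [random.random() < max(0.0, 1.0 - abs(temp - 22.5) / 3.0)
             for _ in range(n_occupants)]
    return sum(votes) / n_occupants

best = ucb1_thermal_control(setpoints=[20.0, 21.0, 22.0, 23.0, 24.0],
                            occupant_feedback=occupant_feedback,
                            horizon=2000)
print("learned setpoint:", best)
```

For the time-varying-occupancy case, LLR avoids treating every subset of occupants as a separate arm; roughly, it keeps statistics at the level of individual components (here, per user-setpoint pair) rather than per scenario, which is what keeps the bookkeeping polynomial in the number of users, as the abstract notes.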