{"title":"Reinforcement learning algorithms for solving classification problems","authors":"M. Wiering, H. V. Hasselt, Auke-Dirk Pietersma, Lambert Schomaker","doi":"10.1109/ADPRL.2011.5967372","DOIUrl":"https://doi.org/10.1109/ADPRL.2011.5967372","url":null,"abstract":"We describe a new framework for applying reinforcement learning (RL) algorithms to solve classification tasks by letting an agent act on the inputs and learn value functions. This paper describes how classification problems can be modeled using classification Markov decision processes and introduces the Max-Min ACLA algorithm, an extension of the novel RL algorithm called actor-critic learning automaton (ACLA). Experiments are performed using 8 datasets from the UCI repository, where our RL method is combined with multi-layer perceptrons that serve as function approximators. The RL method is compared to conventional multi-layer perceptrons and support vector machines and the results show that our method slightly outperforms the multi-layer perceptron and performs equally well as the support vector machine. Finally, many possible extensions are described to our basic method, so that much future research can be done to make the proposed method even better.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130840259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Global optimal strategies of a class of finite-horizon continuous-time nonaffine nonlinear zero-sum game using a new iteration algorithm","authors":"Xin Zhang, Huaguang Zhang, Lili Cui, Yanhong Luo","doi":"10.1109/ADPRL.2011.5967360","DOIUrl":"https://doi.org/10.1109/ADPRL.2011.5967360","url":null,"abstract":"In this paper we ami to solve the global optimal strategies of a class of finite-horizon continuous-time nonaffine nonlinear zero-sum game. The idea is to use a iterative algorithm to obtain the saddle point. The iterative algorithm is between two sequences which are a sequence of linear quadratic zero-sum game and a sequence of Riccati differential equation. The necessary conditions of global optimal strategies are established. A simulation example is given to illustrate the perfoermance of the proposed approach.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127159721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tree-based variable selection for dimensionality reduction of large-scale control systems","authors":"A. Castelletti, S. Galelli, Marcello Restelli, R. Soncini-Sessa","doi":"10.1109/ADPRL.2011.5967387","DOIUrl":"https://doi.org/10.1109/ADPRL.2011.5967387","url":null,"abstract":"This paper is about dimensionality reduction by variable selection in high-dimensional real-world control problems, where designing controllers by conventional means is either impractical or results in poor performance.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114389785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Higher order Q-Learning","authors":"Ashley D. Edwards, W. Pottenger","doi":"10.1109/ADPRL.2011.5967385","DOIUrl":"https://doi.org/10.1109/ADPRL.2011.5967385","url":null,"abstract":"Higher order learning is a statistical relational learning framework in which relationships between different instances of the same class are leveraged (Ganiz, Lytkin and Pottenger, 2009). Learning can be supervised or unsupervised. In contrast, reinforcement learning (Q-Learning) is a technique for learning in an unknown state space. Action selection is often based on a greedy, or epsilon greedy approach. The problem with this approach is that there is often a large amount of initial exploration before convergence. In this article we introduce a novel approach to this problem that treats a state space as a collection of data from which latent information can be extrapolated. From this data, we classify actions as leading to a high reward or low reward, and formulate behaviors based on this information. We provide experimental evidence that this technique drastically reduces the amount of exploration required in the initial stages of learning. We evaluate our algorithm in a well-known reinforcement learning domain, grid-world.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116072186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Active exploration by searching for experiments that falsify the computed control policy","authors":"R. Fonteneau, S. Murphy, L. Wehenkel, D. Ernst","doi":"10.1109/ADPRL.2011.5967364","DOIUrl":"https://doi.org/10.1109/ADPRL.2011.5967364","url":null,"abstract":"We propose a strategy for experiment selection - in the context of reinforcement learning - based on the idea that the most interesting experiments to carry out at some stage are those that are the most liable to falsify the current hypothesis about the optimal control policy. We cast this idea in a context where a policy learning algorithm and a model identification method are given a priori. Experiments are selected if, using the learnt environment model, they are predicted to yield a revision of the learnt control policy. Algorithms and simulation results are provided for a deterministic system with discrete action space. They show that the proposed approach is promising.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124790807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online adaptive learning of optimal control solutions using integral reinforcement learning","authors":"K. Vamvoudakis, D. Vrabie, F. Lewis","doi":"10.1109/ADPRL.2011.5967359","DOIUrl":"https://doi.org/10.1109/ADPRL.2011.5967359","url":null,"abstract":"In this paper we introduce an online algorithm that uses integral reinforcement knowledge for learning the continuous-time optimal control solution for nonlinear systems with infinite horizon costs and partial knowledge of the system dynamics. This algorithm is a data based approach to the solution of the Hamilton-Jacobi-Bellman equation and it does not require explicit knowledge on the system's drift dynamics. The adaptive algorithm is based on policy iteration, and it is implemented on an actor/critic structure. Both actor and critic neural networks are adapted simultaneously a persistence of excitation condition is required to guarantee convergence of the critic to the actual optimal value function. Novel tuning algorithms are given for both critic and actor networks, with extra terms in the actor tuning law being required to guarantee closed-loop dynamical stability. The convergence to the optimal controller is proven, and stability of the system is also guaranteed. Simulation examples support the theoretical result.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129789256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application of reinforcement learning-based algorithms in CO2 allowance and electricity markets","authors":"V. Nanduri","doi":"10.1109/ADPRL.2011.5967367","DOIUrl":"https://doi.org/10.1109/ADPRL.2011.5967367","url":null,"abstract":"Climate change is one of the most important challenges faced by the world this century. In the U.S., the electric power industry is the largest emitter of CO2, contributing to the climate crisis. Federal emissions control bills in the form of cap-and-trade programs are currently idling in the U.S. Congress. In the mean time, ten states in the northeastern U.S. have adopted a regional cap-and-trade program to reduce CO2 levels and also to increase investments in cleaner technologies. Many of the states in which the cap-and-trade programs are active operate under a restructured market paradigm, where generators compete to supply power. This research presents a bi-level game-theoretic model to capture competition between generators in cap-and-trade markets and restructured electricity markets. The solution to the game-theoretic model is obtained using a reinforcement learning based algorithm.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128236428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-building semi-Markov adaptive critics","authors":"A. Gosavi, S. Murray, Jiaqiao Hu","doi":"10.1109/ADPRL.2011.5967374","DOIUrl":"https://doi.org/10.1109/ADPRL.2011.5967374","url":null,"abstract":"Adaptive or actor critics are a class of reinforcement learning (RL) or approximate dynamic programming (ADP) algorithms in which one searches over stochastic policies in order to determine the optimal deterministic policy. Classically, these algorithms have been studied for Markov decision processes (MDPs) in the context of model-free updates in which transition probabilities are avoided altogether. A model-free version for the semi-MDP (SMDP) for discounted reward in which the transition time of each transition can be a random variable was proposed in Gosavi [1]. In this paper, we propose a variant in which the transition probability model is built simultaneously with the value function and action-probability functions. While our new algorithm does not require the transition probabilities apriori, it generates them along with the estimation of the value function and the action-probability functions required in adaptive critics. Model-building and model-based versions of algorithms have numerous advantages in contrast to their model-free counterparts. In particular, they are more stable and may require less training. However the additional steps of building the model may require increased storage in the computer's memory. In addition to enumerating potential application areas for our algorithm, we will analyze the advantages and disadvantages of model building.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134134197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feedback controller parameterizations for Reinforcement Learning","authors":"John W. Roberts, I. Manchester, Russ Tedrake","doi":"10.1109/ADPRL.2011.5967370","DOIUrl":"https://doi.org/10.1109/ADPRL.2011.5967370","url":null,"abstract":"Reinforcement Learning offers a very general framework for learning controllers, but its effectiveness is closely tied to the controller parameterization used. Especially when learning feedback controllers for weakly stable systems, ineffective parameterizations can result in unstable controllers and poor performance both in terms of learning convergence and in the cost of the resulting policy. In this paper we explore four linear controller parameterizations in the context of REINFORCE, applying them to the control of a reaching task with a linearized flexible manipulator. We find that some natural but naive parameterizations perform very poorly, while the Youla Parameterization (a popular parameterization from the controls literature) offers a number of robustness and performance advantages.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125012485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Information space receding horizon control","authors":"S. Chakravorty, R. Erwin","doi":"10.1109/ADPRL.2011.5967362","DOIUrl":"https://doi.org/10.1109/ADPRL.2011.5967362","url":null,"abstract":"In this paper, we present a receding horizon solution to the problem of optimal sensor scheduling problem. The optimal sensor scheduling problem can be posed as a Partially Observed Markov Decision Process (POMDP) whose solution is given by an Information Space (I-space) Dynamic Programming (DP) problem. We present a simulation based stochastic optimization technique that, combined with a receding horizon approach, obviates the need to solve the computationally intractable I-space DP problem. The technique is tested on a simple sensor scheduling problem where a sensor has to choose among the measurements of N dynamical systems such that the information regarding the aggregate system is maximized over an infinite horizon.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126365196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}