Qlearning
Machine learning and data mining 

Machine learning venues


Qlearning is a modelfree reinforcement learning technique. Specifically, Qlearning can be used to find an optimal actionselection policy for any given (finite) Markov decision process (MDP). It works by learning an actionvalue function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. A policy is a rule that the agent follows in selecting actions, given the state it is in. When such an actionvalue function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of Qlearning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. Additionally, Qlearning can handle problems with stochastic transitions and rewards, without requiring any adaptations. It has been proven that for any finite MDP, Qlearning eventually finds an optimal policy, in the sense that the expected value of the total reward return over all successive steps, starting from the current state, is the maximum achievable.
Algorithm
The problem model consists of an agent, states and a set of actions per state . By performing an action , the agent can move from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total reward. It does this by learning which action is optimal for each state. The action that is optimal for each state is the action that has the highest longterm reward. This reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state, where the weight for a step from a state steps into the future is calculated as . Here, is a number between 0 and 1 () called the discount factor and trades off the importance of sooner versus later rewards. may also be interpreted as the likelihood to succeed (or survive) at every step .
The algorithm therefore has a function that calculates the Quantity of a stateaction combination:
Before learning has started, returns an (arbitrary) fixed value, chosen by the designer. Then, each time the agent selects an action, and observes a reward and a new state that may depend on both the previous state and the selected action, is updated. The core of the algorithm is a simple value iteration update. It assumes the old value and makes a correction based on the new information.
where is the reward observed after performing in , and where () is the learning rate (may be the same for all pairs).
An episode of the algorithm ends when state is a final state (or, "absorbing state"). However, Qlearning can also learn in nonepisodic tasks. If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.
Note that for all final states , is never updated but is set to the reward value . In most cases, can be taken to be equal to zero.
Influence of variables on the algorithm
Learning rate
The learning rate or step size determines to what extent the newly acquired information will override the old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information. In fully deterministic environments, a learning rate of is optimal. When the problem is stochastic, the algorithm still converges under some technical conditions on the learning rate, that require it to decrease to zero. In practice, often a constant learning rate is used, such as for all .^{[1]}
Discount factor
The discount factor determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or shortsighted) by only considering current rewards, while a factor approaching 1 will make it strive for a longterm high reward. If the discount factor meets or exceeds 1, the action values may diverge. For , without a terminal state, or if the agent never reaches one, all environment histories will be infinitely long, and utilities with additive, undiscounted rewards will generally be infinite.^{[2]} Even with a discount factor only slightly lower than 1, the Qfunction learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network.^{[3]} In that case, it is known that starting with a lower discount factor and increasing it towards its final value yields accelerated learning.^{[4]}
Initial conditions (Q_{0})
Since Qlearning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions",^{[5]} can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternative, thus increasing their choice probability. Recently, it was suggested that the first reward could be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value of . This will allow immediate learning in case of fixed deterministic rewards. Surprisingly, this resettingofinitialconditions (RIC) approach seems to be consistent with human behaviour in repeated binary choice experiments.^{[6]}
Implementation
Qlearning at its simplest uses tables to store data. This very quickly loses viability with increasing sizes of state/action space of the system it is monitoring/controlling. One answer to this problem is to use an (adapted) artificial neural network as a function approximator, as demonstrated by Tesauro in his Backgammon playing temporal difference learning research.^{[7]}
More generally, Qlearning can be combined with function approximation.^{[8]} This makes it possible to apply the algorithm to larger problems, even when the state space is continuous, and therefore infinitely large. Additionally, it may speed up learning in finite problems, due to the fact that the algorithm can generalize earlier experiences to previously unseen states.
Early study
Qlearning was first introduced by Watkins^{[9]} in 1989. The convergence proof was presented later by Watkins and Dayan^{[10]} in 1992.
Variants
A recent application of Qlearning to deep learning, by Google DeepMind, titled "deep reinforcement learning" or "deep Qnetworks", has been successful at playing some Atari 2600 games at expert human levels. Preliminary results were presented in 2014, with a paper published in February 2015 in Nature.^{[11]}
Because the maximum approximated action value is used in the Qlearning update, in noisy environments Qlearning can sometimes overestimate the actions values, slowing the learning. A recent variant called Double Qlearning was proposed to correct this. ^{[12]} This algorithm was later combined with deep learning, as in the DQN algorithm (see above), resulting in Double DQN which was shown to outperform the original DQN algorithm. ^{[13]}
Delayed Qlearning is an alternative implementation of the online Qlearning algorithm, with Probably approximately correct learning (PAC).^{[14]}
Greedy GQ is a variant of Qlearning to use in combination with (linear) function approximation.^{[15]} The advantage of Greedy GQ is that convergence guarantees can be given even when function approximation is used to estimate the action values.
Qlearning may suffer from slow rate of convergence, especially when the discount factor is close to one.^{[16]} Speedy Qlearning, a new variant of Qlearning algorithm, deals with this problem and achieves a probably same rate of convergence as modelbased methods such as value iteration.^{[17]}
See also
 Reinforcement learning
 Temporal difference learning
 SARSA
 Iterated prisoner's dilemma
 Game theory
 Fitted Q iteration algorithm
References
 ↑ Reinforcement Learning: An Introduction. Richard Sutton and Andrew Barto. MIT Press, 1998.
 ↑ Stuart J. Russell; Peter Norvig (2010). Artificial Intelligence: A Modern Approach (Third ed.). Prentice Hall. p. 649. ISBN 9780136042594.
 ↑ Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. ICML, pages 30–37, 1995
 ↑ FrançoisLavet Vincent, Raphael Fonteneau, Damien Ernst. "How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies". NIPS, Deep RL workshop 2015.
 ↑ http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html
 ↑ Shteingart, H; Neiman, T; Loewenstein, Y (May 2013). "The Role of First Impression in Operant Learning". J Exp Psychol Gen. 142 (2): 476–88. doi:10.1037/a0029550. PMID 22924882.
 ↑ Tesauro, Gerald (March 1995). "Temporal Difference Learning and TDGammon". Communications of the ACM. 38 (3). doi:10.1145/203330.203343. Retrieved 20100208.
 ↑ Hado van Hasselt. Reinforcement Learning in Continuous State and Action Spaces. In: Reinforcement Learning: State of the Art, Springer, pages 207251, 2012
 ↑ Watkins, C.J.C.H., (1989), Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
 ↑ Watkins and Dayan, C.J.C.H., (1992), 'Qlearning.Machine Learning'
 ↑ Mnih, Volodymyr; et al. (2015). "Humanlevel control through deep reinforcement learning" (PDF). 518: 529–533.
 ↑ van Hasselt, Hado (2011). "Double Qlearning" (PDF). Advances in Neural Information Processing Systems. 23: 2613–2622.
 ↑ van Hasselt, Hado; Guez, Arthur; Silver, David (2015). "Deep reinforcement learning with double Qlearning" (PDF). AAAI Conference on Artificial Intelligence: 2094–2100.
 ↑ Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. Pac modelfree reinforcement learning. In Proc. 22nd ICML 2006, pages 881–888, 2006.
 ↑ Hamid Maei, and Csaba Szepesvári, Shalabh Bhatnagar and Richard Sutton. Toward offpolicy learning control with function approximation. In proceedings of the 27th International Conference on Machine Learning, pages 719726, 2010.
 ↑ Csaba Szepesva ́ri. The asymptotic convergencerate of Qlearning. Advances in Neural Information Processing Systems 10, Denver, Colorado, USA, 1997.
 ↑ Gheshlaghi Azar, Mohammad; Munos, Remi; Ghavamzadeh, Mohammad; Kappen, Hilbert J. (2011). "Speedy QLearning" (PDF). Advances in Neural Information Processing Systems. 24: 2411–2419.
External links
 Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.
 Strehl, Li, Wiewiora, Langford, Littman (2006). PAC modelfree reinforcement learning
 Reinforcement Learning: An Introduction by Richard Sutton and Andrew S. Barto, an online textbook. See "6.5 QLearning: OffPolicy TD Control".
 Piqle: a Generic Java Platform for Reinforcement Learning
 Reinforcement Learning Maze, a demonstration of guiding an ant through a maze using Qlearning.
 Qlearning work by Gerald Tesauro
 Qlearning work by Tesauro Citeseer Link
 Qlearning algorithm implemented in processing.org language
 Solution for the pole balancing problem with Q(lambda) / SARSA(lambda) and the fourier basis in javascript