Reinforcement Learning: Introduction to RL with K-Bandit Problem
Reinforcement learning (RL) is learning what to do in different situations so as to maximize a numerical reward. What distinguishes RL from other types of learning is that the learner is not told which actions to take; instead, the learner must discover which action, or series of actions, yields the most cumulative reward over some time period. For example, a master chess player makes a move. The choice is informed by planning, anticipating the possible moves and countermoves of the opponent, and by judging the desirability of particular positions on the board.
To further understand the fundamentals of reinforcement learning, we will look at a simplified setting in which the state / environment does not change with actions or time. This setting can be explored using a simple version of the k-armed bandit problem.
K-Bandit Problem:
The k-armed bandit problem is named by analogy with a slot machine ("one-armed bandit"), except that instead of one lever you have k levers, and pulling each lever yields a reward (tickets). Through repeated actions (lever pulls) you aim to maximize your earnings by concentrating your actions on the best lever.
In our k-armed bandit problem, each of the k actions has an expected or mean reward given that the action is selected, also called the value of that action, denoted by $q_*(a)$. However, we never know the true value of an action, only an estimate, called the action-value estimate and denoted by $Q_t(a)$. By repeatedly choosing an action we update our estimate, and the goal of learning is to bring the estimate close to the true value. How do we update our estimate each time we perform an action?
Update rule: \(\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,[\text{Target} - \text{OldEstimate}]\)
Here, the expression $[\text{Target} - \text{OldEstimate}]$ is the error in the estimate, and it is reduced by taking a step toward the "Target", i.e., the reward received at that instant. Note that we also need to assume or initialize the estimate of each action at $t=0$.
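As a minimal sketch (the function and variable names here are illustrative, not from any particular library), the update rule looks like this:

```python
def update_estimate(old_estimate, target, step_size):
    """Incremental update: move the estimate toward the target by step_size."""
    return old_estimate + step_size * (target - old_estimate)

# Example: initial estimate 0.0, observed reward 1.0, step size 0.1
q = 0.0
q = update_estimate(q, target=1.0, step_size=0.1)  # q is now 0.1
```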
Now that we have an estimated value for each action at any time $t$, how can we use these estimates to guide our action selection, i.e., which lever to pull? The simplest action selection rule is to select one of the actions with the highest estimated value; this is the greedy action, and choosing it is called exploiting. If instead you select one of the non-greedy actions, in order to improve your estimates of their values, then you are exploring. To maximize reward over the long run, it is best to balance exploration and exploitation.
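Greedy selection can be sketched as below (breaking ties randomly so that equally valued actions are equally likely to be picked; names are illustrative):

```python
import numpy as np

def greedy_action(q_estimates):
    """Select an action with the highest estimated value, breaking ties at random."""
    q_estimates = np.asarray(q_estimates)
    best = np.flatnonzero(q_estimates == q_estimates.max())
    return int(np.random.choice(best))

print(greedy_action([0.2, 0.5, 0.5]))  # prints 1 or 2 with equal probability
```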
Methods for estimating action-value:
As seen in the update rule, the StepSize has a large impact on how the action-value is updated: whether we give more weight to recent rewards, to past rewards, or equal weight to every reward. To see this, the update rule can be expanded as follows:
\[\begin{aligned} Q_{n+1}= & Q_n+\alpha\left[R_n-Q_n\right] \\ = & \alpha R_n+(1-\alpha) Q_n \\ = & \alpha R_n+(1-\alpha)\left[\alpha R_{n-1}+(1-\alpha) Q_{n-1}\right] \\ = & \alpha R_n+(1-\alpha) \alpha R_{n-1}+(1-\alpha)^2 Q_{n-1} \\ = & \alpha R_n+(1-\alpha) \alpha R_{n-1}+(1-\alpha)^2 \alpha R_{n-2}+ \\ & \cdots+(1-\alpha)^{n-1} \alpha R_1+(1-\alpha)^n Q_1 \\ = & (1-\alpha)^n Q_1+\sum_{i=1}^n \alpha(1-\alpha)^{n-i} R_i . \end{aligned}\]

Note: the weight associated with reward $R_{i}$ depends on how many rewards ago, $n-i$, it was observed. Assuming $n$ is large, if $\alpha$ is held constant then $(1-\alpha)^{n-i}$ tends to zero for old rewards, meaning rewards from long ago contribute almost nothing. If instead $\alpha$ decreases with time as $1/n$, every reward receives equal weight, so even rewards from long ago retain some influence. Whether a constant or a decreasing $\alpha$ is better depends on the reinforcement learning problem.
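A quick numerical check of this expansion (with made-up rewards) confirms that the closed-form weighted sum equals the result of applying the incremental update repeatedly:

```python
import numpy as np

alpha, q1 = 0.1, 0.0
rewards = np.array([1.0, 0.0, 2.0, 1.5, 0.5])   # R_1 ... R_n (made-up values)
n = len(rewards)

# Iterative form: Q_{n+1} = Q_n + alpha * (R_n - Q_n)
q = q1
for r in rewards:
    q += alpha * (r - q)

# Closed form: (1-alpha)^n Q_1 + sum_i alpha (1-alpha)^(n-i) R_i
weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
closed = (1 - alpha) ** n * q1 + np.sum(weights * rewards)

print(np.isclose(q, closed))  # True
```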
Tracking a stationary problem: A stationary problem is one in which the reward probabilities do not change over time, e.g., evaluating which medicine is best among k choices. Since the effect of a medicine does not change appreciably over time, its reward for treatment can be considered fixed. In this case, choosing a decreasing step size of $1/n$ (where $n$ is the number of times an action has been selected) gives the reward at every time step equal weight; this is the sample-average method.
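A minimal sketch of the sample-average update for a single action (names are illustrative):

```python
def sample_average_update(q, n, reward):
    """Update the running average after observing a new reward.
    q: current estimate, n: number of times this action has now been selected."""
    return q + (reward - q) / n

q, n = 0.0, 0
for reward in [1.0, 0.0, 2.0]:      # made-up rewards for one action
    n += 1
    q = sample_average_update(q, n, reward)
print(q)  # 1.0, the plain average of the observed rewards
```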
Tracking a non-stationary problem: In reinforcement learning we mostly encounter problems that are effectively non-stationary, i.e., situations where the reward associated with a particular action changes over time. In such cases it makes sense to give more weight to recent rewards than to long-past rewards when estimating action-values. One of the most popular ways of doing this is to use a constant step-size parameter.
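The sketch below (with a made-up drifting reward) illustrates why a constant step size tends to track a non-stationary target better than the sample average:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 0.0
q_avg, q_const, alpha = 0.0, 0.0, 0.1

for t in range(1, 2001):
    true_value += rng.normal(0, 0.05)          # the action's true value drifts
    reward = true_value + rng.normal(0, 1)     # noisy reward
    q_avg += (reward - q_avg) / t              # sample average (step size 1/n)
    q_const += alpha * (reward - q_const)      # constant step size

print(abs(q_avg - true_value), abs(q_const - true_value))
# The constant-alpha estimate typically stays much closer to the drifting value.
```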
Balancing between exploration and exploitation:
Greedy actions always exploit current knowledge to maximize immediate reward; they spend no time sampling apparently inferior actions to see if they might really be better. In non-stationary problems there is a possibility that non-greedy actions become better than the greedy one over time, and since non-stationary problems are more common in the real world, it becomes necessary to explore. There are many sophisticated methods for balancing exploration and exploitation; two simpler ones are described below.
Near-greedy action selection ($\epsilon$-greedy): In this approach we behave greedily most of the time, but every once in a while, with small probability $\epsilon$, we instead select randomly from among all actions with equal probability, independently of the action-value estimates.
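An $\epsilon$-greedy selection rule can be sketched as follows (names are illustrative):

```python
import numpy as np

def epsilon_greedy_action(q_estimates, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action; otherwise pick greedily."""
    rng = rng or np.random.default_rng()
    q_estimates = np.asarray(q_estimates)
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))
    best = np.flatnonzero(q_estimates == q_estimates.max())
    return int(rng.choice(best))

action = epsilon_greedy_action([0.2, 0.5, 0.1], epsilon=0.1)
```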
Optimistic Initial Values: Any method in reinforcement learning depends to some extent on the initial action-value estimates, $Q_1(a)$. In the long run this bias fades away as the number of steps increases (the $(1-\alpha)^n Q_1$ term above), but it can be exploited to force exploration during the initial steps, even under greedy action selection. E.g., suppose there are 5 actions, each initialized with $Q(0)=5$, every pull returning $R=1$, and $\alpha = 1/2$; then $Q(1) = 5 + \tfrac{1}{2}(1-5) = 3$. The greedy rule will now choose a different action, even though the first action received a reward of +1.
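The worked example above can be reproduced in a few lines (assumed setup: 5 actions, estimates initialized to 5, constant step size 1/2, and every pull returning reward +1):

```python
import numpy as np

q = np.full(5, 5.0)       # optimistic initial estimates
alpha = 0.5

for _ in range(5):
    a = np.argmax(q)              # purely greedy selection
    reward = 1.0                  # every action happens to return +1
    q[a] += alpha * (reward - q[a])
    print(a, q)

# Each update drops the chosen action's estimate below the untried ones (5 -> 3),
# so the greedy learner is pushed to try every action at least once.
```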
Upper-Confidence-Bound action selection: $\epsilon$-greedy action selection forces the non-greedy actions to be tried, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. Another approach is to select actions according to their potential for being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates.
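Concretely, the standard UCB rule selects
\[A_t \doteq \underset{a}{\operatorname{argmax}}\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right],\]
where $N_t(a)$ is the number of times action $a$ has been selected prior to time $t$, and $c > 0$ controls the degree of exploration.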
The idea behind upper-confidence-bound (UCB) action selection is that the square-root term measures the uncertainty, or variance, in the estimate of action $a$'s value. The quantity being maximized is therefore a sort of upper bound on the possible true value of action $a$, with $c$ determining the confidence level.
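A minimal sketch of UCB selection (names are illustrative; untried actions are selected first, since their uncertainty is effectively infinite):

```python
import numpy as np

def ucb_action(q_estimates, counts, t, c=2.0):
    """Pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a))."""
    q_estimates = np.asarray(q_estimates, dtype=float)
    counts = np.asarray(counts, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])                 # try every action at least once
    ucb = q_estimates + c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(ucb))

action = ucb_action(q_estimates=[0.4, 0.6], counts=[10, 3], t=13)
```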
In the k-armed bandit problem we received only evaluative feedback for each action; in real problems, however, actions also have an associative effect: choosing different actions leads to different situations and thus affects future rewards. To handle this we need the Markov Decision Process (MDP), a formalization of sequential decision making.