Reinforcement Learning Policy Methods
1. Concepts of RL
Policy: A mapping from states to actions. A rule for choosing an action at each time step, designed to maximize long-term reward.
Value function: A mapping from states to total reward. The total reward the agent can expect to accumulate in the future, starting from that state (assuming it makes optimal decisions from then on).
Q-function: A mapping from state-action pairs to total reward. The total reward the agent can expect after taking a given action in a given state (and acting optimally afterwards).
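Written out in one standard convention (assuming a discount factor $\gamma \in [0, 1)$, which the notes leave implicit in “long-term reward”), for a policy $\pi$:
\[V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \;\middle|\; s_0 = s\right], \qquad Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \;\middle|\; s_0 = s, a_0 = a\right]\]
The “optimal decisions” versions are $V^*(s) = \max_\pi V^\pi(s)$ and $Q^*(s, a) = \max_\pi Q^\pi(s, a)$.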
2. Principle: Bellman Equation
- Value function optimality: the optimal value of a state is the best achievable expected immediate reward plus discounted value of the next state,
\[V^*(s) = \max_a \, \mathbb{E}\left[r + \gamma V^*(s') \mid s, a\right]\]
where $s'$ is the next state, $r$ the immediate reward, and $\gamma \in [0, 1)$ the discount factor.
- Q-function optimality:
\[Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a\right]\]
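Not part of the notes, but as a concrete illustration: the Q-function optimality equation can be applied repeatedly as a fixed-point (value-iteration) backup on a small tabular MDP. The arrays `P` and `R` below are made-up transition and reward tables.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iters=500):
    """Repeatedly apply the Bellman optimality backup on a tabular MDP.
    P[a, s, s']: transition probability; R[a, s]: expected immediate reward."""
    V = np.zeros(P.shape[1])
    for _ in range(n_iters):
        Q = R + gamma * (P @ V)      # Q(s, a) = E[r + gamma * V(s') | s, a]
        V = Q.max(axis=0)            # V(s) = max_a Q(s, a)
    return V, Q.argmax(axis=0)       # optimal values and a greedy policy

# Tiny 2-state, 2-action example with made-up numbers
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transitions under action 0
              [[0.1, 0.9], [0.7, 0.3]]])   # transitions under action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_star, pi_star = value_iteration(P, R)
```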
3. Deep Q-Learning
- Strategy: approximate the Q-function with a neural network $Q(s, a; \theta)$ with parameters $\theta$.
- Objective: make $Q(s, a; \theta)$ satisfy the Bellman optimality equation on observed transitions.
- Let $y_t$ be the target built from one step of “play”:
\[y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)\]
- Adjust the parameters $\theta$ to make the squared error small (SGD):
\[\left(y_t - Q(s_t, a_t; \theta)\right)^2\]
- Conduct SGD using backpropagation, treating the target $y_t$ as a constant:
\[\theta \leftarrow \theta - \eta \, \nabla_\theta \left(y_t - Q(s_t, a_t; \theta)\right)^2\]
- Replay buffer: store past transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ and train on minibatches sampled at random from the buffer. This helps to:
  - Prevent “forgetting” how to play early parts of a game
  - Remove correlations between nearby state transitions
  - Prevent cycling behavior due to the target changing
- Learning takes place only when expectations are violated, i.e. when the error $y_t - Q(s_t, a_t; \theta)$ is nonzero; the receipt of the reward itself does not cause changes.
- Automatic differentiation: the gradients of the squared error with respect to $\theta$ are collected automatically by backpropagation through the Q-network (a minimal sketch of these pieces follows below).
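A minimal sketch of the pieces above (not from the notes): a linear Q-function stands in for the deep network, and random transitions stand in for actual game play; `ReplayBuffer`, `td_update`, and all numerical values are illustrative.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-size store of past transitions; sampling minibatches at random
    breaks correlations between nearby transitions and keeps old experience."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def q_values(theta, s):
    """Linear stand-in for the Q-network: Q(s, a; theta) = theta[a] . s."""
    return theta @ s                          # shape: (n_actions,)

def td_update(theta, batch, gamma=0.99, lr=0.01):
    """One SGD step on the squared error (y_t - Q(s_t, a_t; theta))^2,
    treating the target y_t as a constant."""
    grad = np.zeros_like(theta)
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * q_values(theta, s_next).max()
        td_error = q_values(theta, s)[a] - y  # zero error => no parameter change
        grad[a] += td_error * s               # gradient of 0.5 * td_error^2 wrt theta[a]
    return theta - lr * grad / len(batch)

# Demo with random transitions standing in for game play
rng = np.random.default_rng(0)
n_actions, state_dim = 3, 4
theta = np.zeros((n_actions, state_dim))
buffer = ReplayBuffer(capacity=10_000)
for _ in range(500):
    s, s_next = rng.normal(size=state_dim), rng.normal(size=state_dim)
    buffer.push(s, rng.integers(n_actions), rng.normal(), s_next, False)
    if len(buffer) >= 32:
        theta = td_update(theta, buffer.sample(32))
```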
4. Multi-Armed Bandits
- The rewards are independent and noisy
- Arm k has expected payoff $\mu_k$ with variance $\sigma^2_k$ on each pull
- Each time step, pull an arm and observe the resulting reward
- If an arm is played often enough, its mean reward can be estimated by the sample average of its observed rewards (simulated in the sketch below)
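A short simulation sketch (not from the notes; the exploration rule is not specified above, so an $\varepsilon$-greedy rule is used here purely to generate pulls, and all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(mu, sigma, n_steps=10_000, eps=0.1):
    """Play a K-armed Gaussian bandit; arm k pays N(mu[k], sigma[k]^2).
    Each arm's mean reward is estimated by an incremental sample average."""
    K = len(mu)
    counts = np.zeros(K)        # how often each arm has been pulled
    means = np.zeros(K)         # running sample-mean reward per arm
    for _ in range(n_steps):
        # With probability eps pull a random arm, otherwise the best estimate
        k = rng.integers(K) if rng.random() < eps else int(means.argmax())
        r = rng.normal(mu[k], sigma[k])            # independent, noisy reward
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]     # incremental mean update
    return means, counts

estimates, pulls = run_bandit(mu=[0.2, 0.5, 0.8], sigma=[1.0, 1.0, 1.0])
```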
5. Policy Iteration
- Policy evaluation: compute the value function $V^\pi$ of the current policy $\pi$ by solving its Bellman equation,
\[V^\pi(s) = \mathbb{E}_\pi\left[r + \gamma V^\pi(s') \mid s\right]\]
- Policy improvement: act greedily with respect to $V^\pi$,
\[\pi'(s) = \arg\max_a \, \mathbb{E}\left[r + \gamma V^\pi(s') \mid s, a\right]\]
- Alternate the two steps until the policy no longer changes (a tabular sketch follows below).
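A tabular sketch (not from the notes), using the same hypothetical `P[a, s, s']` / `R[a, s]` conventions as the value-iteration example in section 2; evaluation solves the linear Bellman system exactly rather than iterating:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate exact policy evaluation and greedy policy improvement
    on a tabular MDP until the policy stops changing."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi for V^pi
        P_pi = P[policy, np.arange(n_states)]       # rows follow the current policy
        R_pi = R[policy, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. the one-step lookahead
        Q = R + gamma * (P @ V)                     # shape (n_actions, n_states)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V                        # converged
        policy = new_policy
```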
6. Policy Gradient Methods
- Parameterize the policy as $\pi_\theta$: the policy is a probability distribution $\pi_\theta(a \mid s)$ over actions $a$ given state $s$, with parameters $\theta$.
- Objective function: the expected reward $J(\theta) = \mathbb{E}_\theta[R]$, which is maximized by gradient ascent rather than minimized as a loss.
Let $\tau$ be a trajectory:
$(s_0, a_0) \rightarrow (s_1, r_1, a_1) \rightarrow (s_2, r_2, a_2) \rightarrow \cdots \rightarrow (s_T, r_T, a_T) \rightarrow s_{T+1}$, where $s_{T+1}$ is a terminal state.
The objective function $J(\theta)$ is defined as
\[J(\theta) = \mathbb{E}_\theta[R(\tau)] = \mathbb{E}_\theta\left[\sum_{t=1}^T r_t\right]\]
- Calculating the gradient:
Using the Markov property, write $\mathbb{E}_\theta[R(\tau)]$ as
\[\mathbb{E}_\theta[R(\tau)] = \int p(\tau \mid \theta)\, R(\tau)\, d\tau\]
where the trajectory probability factorizes as
\[p(\tau \mid \theta) = \prod_{t=0}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1}, r_{t+1} \mid s_t, a_t)\]
Since the transition probabilities $p(s_{t+1}, r_{t+1} \mid s_t, a_t)$ do not depend on $\theta$, it follows that
\[\nabla_\theta \log p(\tau \mid \theta) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) = \sum_{t=0}^T \dfrac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\]
Now we use the log-derivative trick to compute the gradient of the objective:
\[\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_\theta[R(\tau)] = \nabla_\theta \int R(\tau)\, p(\tau \mid \theta)\, d\tau = \int R(\tau)\, \nabla_\theta p(\tau \mid \theta)\, d\tau\]
\[= \int R(\tau)\, \dfrac{\nabla_\theta p(\tau \mid \theta)}{p(\tau \mid \theta)}\, p(\tau \mid \theta)\, d\tau = \mathbb{E}_\theta\left[R(\tau)\, \nabla_\theta \log p(\tau \mid \theta)\right]\]
- Approximating the gradient:
Since $\nabla_\theta J(\theta)$ is an expectation, it can be approximated by sampling $N$ trajectories $\tau^{(1)}, \dots, \tau^{(N)}$ from the current policy:
\[\begin{aligned} \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N R(\tau^{(i)}) \nabla_\theta \log p(\tau^{(i)} \mid \theta) \\ &= \frac{1}{N} \sum_{i=1}^N R(\tau^{(i)}) \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a^{(i)}_t \mid s^{(i)}_t) \\ &\equiv \widehat{\nabla_\theta J(\theta)} \end{aligned}\]
The policy gradient algorithm is then the gradient-ascent update
\[\theta \leftarrow \theta + \eta \widehat{\nabla_\theta J(\theta)}\]
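A minimal REINFORCE-style sketch of this update (not from the notes): a tabular softmax policy on a made-up 5-state corridor task, where action 1 moves right and reaching the last state pays reward 1; the task, the batch size, and $\eta = 0.1$ are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, goal = 5, 2, 4

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode(theta, max_steps=20):
    """Sample one trajectory tau from the softmax policy pi_theta(a | s)."""
    s, traj, R = 0, [], 0.0
    for _ in range(max_steps):
        a = rng.choice(n_actions, p=softmax(theta[s]))
        s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
        traj.append((s, a))
        R += 1.0 if s_next == goal else 0.0
        s = s_next
        if s == goal:
            break
    return traj, R                       # R = R(tau), the total reward

def reinforce_step(theta, n_episodes=16, eta=0.1):
    """theta <- theta + eta * (1/N) sum_i R(tau_i) * grad log p(tau_i | theta)."""
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        traj, R = run_episode(theta)
        for s, a in traj:
            g = -softmax(theta[s])       # grad of log softmax: one-hot(a) - pi
            g[a] += 1.0
            grad[s] += R * g
    return theta + eta * grad / n_episodes

theta = np.zeros((n_states, n_actions))  # policy parameters
for _ in range(200):
    theta = reinforce_step(theta)        # policy learns to move right
```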