Reinforcement Learning Policy Methods

1. Concepts of RL

Policy: A mapping from states to actions. The rule the agent follows to choose an action at each time step, designed to maximize the long-term reward.

Value function: A mapping from states to total reward. The total reward the agent can expect to accumulate in the future, starting from that state (assuming it acts optimally).

Q-function: A mapping from state–action pairs to total reward. The total reward the agent can expect to accumulate after taking that action in that state (and acting optimally thereafter).

2. Principle: Bellman Equation

  • Value function optimality:
\[v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]\]
  • Q-function optimality:
\[Q_*(s,a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} Q_*(S_{t+1}, a') \mid S_t = s, A_t = a]\]
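
The optimality equation can be turned directly into an algorithm by repeatedly applying its right-hand side as an update (value iteration). A minimal sketch on a hypothetical two-state, two-action MDP; the transition and reward arrays are invented for illustration:

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions (all numbers invented for illustration).
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: v(s) <- max_a E[R + gamma * v(S')]
    q = R + gamma * (P @ v)        # q[s, a]
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print("v*:", v)
print("greedy policy:", q.argmax(axis=1))
```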

[Figure: Q-learning]
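
Tabular Q-learning applies the Q-function optimality equation sample by sample: after observing $(s, a, r, s')$, move $Q(s,a)$ toward the bootstrapped target $r + \gamma \max_{a'} Q(s', a')$. A minimal sketch, assuming a Gym-style environment interface (`reset()` / `step(a)`):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    Assumes a Gym-style env: reset() -> state, step(a) -> (state, reward, done, info).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = Q[s].argmax()
            s_next, r, done, _ = env.step(a)
            # Bootstrapped target: r + gamma * max_a' Q(s', a') (zero at terminal states).
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```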

3. Deep Q-Learning

  • Strategy:

    • Objective:
    \[Q(s, a; \theta) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta) \mid S_t = s, A_t = a]\]
    • Let $y_t$ be the target from one step of “play”, computed with the frozen parameters $\theta_{old}$:
    \[y_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \theta_{old})\]
    • Adjust the parameters $\theta$ to make the squared error small (SGD):
    \[(y_t - Q(s, a; \theta))^2\]
    • Conduct SGD using backpropagation (a sketch of this update appears at the end of this section):
    \[\theta \leftarrow \theta + \eta (y_t - Q(s,a;\theta)) \nabla_\theta Q(s,a;\theta)\]
  • Replay buffer (minimal sketch below):

    • Prevent “forgetting” how to play early parts of a game
    • Remove correlations between nearby state transitions
    • Prevent cycling behavior caused by the target changing as $\theta$ is updated
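
A replay buffer is typically just a bounded FIFO of transitions from which minibatches are sampled uniformly at random; a minimal sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between nearby transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```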

Learning takes place when expectations are violated: the update is driven by the error $y_t - Q(s,a;\theta)$, not by the receipt of the reward itself.

  • Automatic differentiation: the gradient $\nabla_\theta Q(s,a;\theta)$ is collected by backpropagation rather than derived by hand
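
A minimal sketch of the update above, applied to a minibatch sampled from the replay buffer. A linear approximator $Q(s,a;\theta) = \theta_a^\top \phi(s)$ stands in for the deep network so that $\nabla_\theta Q$ is simply the feature vector (with a deep network, backpropagation would supply this gradient); the featurizer `features` and the frozen parameters `theta_old` are assumptions of the sketch:

```python
import numpy as np

def dqn_style_update(theta, theta_old, batch, features, n_actions,
                     eta=1e-3, gamma=0.99):
    """One SGD pass of  theta <- theta + eta * (y - Q) * grad_theta Q  over a minibatch.

    A linear approximator Q(s, a; theta) = theta[a] . features(s) stands in for the
    deep network, so grad_theta Q is just the feature vector; `features` is an
    assumed user-supplied state featurizer, theta has shape (n_actions, d).
    """
    for s, a, r, s_next, done in batch:
        phi, phi_next = features(s), features(s_next)
        q_next = 0.0 if done else max(theta_old[b] @ phi_next for b in range(n_actions))
        y = r + gamma * q_next              # frozen-parameter target y_t
        td_error = y - theta[a] @ phi       # (y_t - Q(s, a; theta))
        theta[a] += eta * td_error * phi    # gradient step on the squared error
    return theta
```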

4. Multi-Armed Bandits

  • The rewards are independent and noisy
  • Arm k has expected payoff $\mu_k$ with variance $\sigma^2_k$ on each pull
  • Each time step, pull an arm and observe the resulting reward
  • If each arm is played often enough, the mean reward of each arm can be estimated (see the sketch below)
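
A minimal simulation of this setup with an $\epsilon$-greedy rule for choosing which arm to pull; the arm means and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])        # hypothetical expected payoffs mu_k
sigma = 1.0                           # common noise level for illustration
epsilon, T = 0.1, 10_000

counts = np.zeros(len(mu))
means = np.zeros(len(mu))             # running estimate of each arm's mean reward

for t in range(T):
    # Epsilon-greedy: usually pull the arm with the best estimate, sometimes explore.
    k = rng.integers(len(mu)) if rng.random() < epsilon else means.argmax()
    reward = rng.normal(mu[k], sigma)  # independent, noisy reward
    counts[k] += 1
    means[k] += (reward - means[k]) / counts[k]   # incremental mean update

print("estimated means:", means.round(3))
print("pull counts:", counts)
```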


5. Policy Iteration

  • Policy evaluation:

[Figure: policy evaluation]

  • Policy improvement:

[Figure: policy improvement]
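
A minimal sketch of both steps for a tabular MDP described by arrays `P[s, a, s']`, `R[s, a]`, and discount `gamma` (the same representation assumed in the value-iteration sketch above): evaluation iterates the Bellman expectation backup until it converges to $v^\pi$, and improvement makes the policy greedy with respect to that value.

```python
import numpy as np

def policy_evaluation(pi, P, R, gamma, tol=1e-8):
    """Iterate the Bellman expectation backup until v converges to v^pi.

    pi[s] is the (deterministic) action taken in state s;
    P[s, a, s'] and R[s, a] describe the MDP.
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        v_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ v
                          for s in range(n_states)])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def policy_improvement(v, P, R, gamma):
    """Return the policy that is greedy with respect to v."""
    q = R + gamma * (P @ v)            # q[s, a]
    return q.argmax(axis=1)

def policy_iteration(P, R, gamma):
    pi = np.zeros(P.shape[0], dtype=int)
    while True:
        v = policy_evaluation(pi, P, R, gamma)
        pi_new = policy_improvement(v, P, R, gamma)
        if np.array_equal(pi_new, pi):  # stable policy => optimal
            return pi, v
        pi = pi_new
```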

6. Policy Gradient Methods

  • Parameterize the policy: $\pi_\theta(s)$

    The policy is a probability distribution $\pi_\theta(a \vert s)$ over actions $a$ given state $s$

  • Objective function: the expected reward $\mathcal{J}(\theta) = \mathbb{E}[R]$

    Let $\tau$ be a trajectory sequence:

    $(s_0,a_0) \rightarrow (s_1,r_1,a_1) \rightarrow (s_2,r_2,a_2) \rightarrow \cdots \rightarrow (s_T,r_T,a_T) \rightarrow s_{T+1}$, where $s_{T+1}$ is a terminal state.

    The objective function $\mathcal{J}(\theta)$ is defined as:

    \[\mathcal{J}(\theta) = \mathbb{E}_\theta[R(\tau)] = \mathbb{E}_\theta\left[\sum_{t=1}^T r_t\right]\]
  • Calculating the gradient:

    Using the Markov property to factor the trajectory distribution, write $\mathbb{E}_\theta[R(\tau)]$ as

    \[\mathbb{E}_\theta[R(\tau)] = \int p(\tau \mid \theta)\, R(\tau)\, d\tau\]

    \[p(\tau \mid \theta) = \prod_{t=0}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1}, r_{t+1} \mid s_t, a_t)\]

    Since the transition probabilities $p(s_{t+1}, r_{t+1} \mid s_t, a_t)$ do not depend on $\theta$, it follows that

    \[\nabla_\theta \log p(\tau \mid \theta) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) = \sum_{t=0}^T \dfrac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\]

    Now we use the log-derivative (likelihood-ratio) trick:

    \[\begin{aligned} \nabla_\theta \mathcal{J}(\theta) = \nabla_\theta \mathbb{E}_\theta[R(\tau)] &= \nabla_\theta \int R(\tau)\, p(\tau \mid \theta)\, d\tau = \int R(\tau)\, \nabla_\theta p(\tau \mid \theta)\, d\tau \\ &= \int R(\tau)\, \dfrac{\nabla_\theta p(\tau \mid \theta)}{p(\tau \mid \theta)}\, p(\tau \mid \theta)\, d\tau = \mathbb{E}_\theta \left[ R(\tau)\, \nabla_\theta \log p(\tau \mid \theta) \right] \end{aligned}\]
  • Approximating the gradient:

    Since the gradient is an expectation, it can be approximated by sampling $N$ trajectories:

    \[\begin{aligned} \nabla_\theta \mathcal{J}(\theta) &\approx \frac{1}{N} \sum_{i=1}^N R(\tau^{(i)}) \nabla_\theta \log p(\tau^{(i)} \mid \theta) \\ &= \frac{1}{N} \sum_{i=1}^N R(\tau^{(i)}) \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a^{(i)}_t \mid s^{(i)}_t) \\ &\equiv \widehat{\nabla_\theta \mathcal{J}(\theta)} \end{aligned}\]

    The policy gradient algorithm is then

    \[\theta \leftarrow \theta + \eta \widehat{\nabla_\theta \mathcal{J}(\theta)}\]
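
Putting the pieces together gives the basic REINFORCE algorithm. A minimal sketch with a softmax policy over a tabular state space; the Gym-style environment interface and the state/action counts are assumptions:

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(a | s) as a softmax over per-(state, action) preferences theta[s]."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def reinforce(env, n_states, n_actions, episodes=2000, eta=0.01):
    """Basic policy gradient: theta <- theta + eta * R(tau) * grad log p(tau | theta).

    Assumes a Gym-style env: reset() -> state, step(a) -> (state, reward, done, info).
    """
    theta = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        traj = []                                  # (s_t, a_t, r_{t+1}) triples
        while not done:
            p = softmax_policy(theta, s)
            a = np.random.choice(n_actions, p=p)
            s_next, r, done, _ = env.step(a)
            traj.append((s, a, r))
            s = s_next
        R_tau = sum(r for _, _, r in traj)         # total return R(tau)
        # For a softmax policy, grad_theta[s] log pi(a | s) = one_hot(a) - pi(. | s).
        for s_t, a_t, _ in traj:
            p = softmax_policy(theta, s_t)
            grad_log = -p
            grad_log[a_t] += 1.0
            theta[s_t] += eta * R_tau * grad_log
    return theta
```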