5 Mar 2024

Deep Reinforcement Learning 03 (Intro to RL)


Keypoints:

  • Definitions
  • The goal of Reinforcement Learning
  • Algorithms

Definitions

  • Agent: The learner or decision maker that interacts with the environment.
  • Environment: The external system with which the agent interacts.
  • State: A representation of the environment at a given time.
  • Action: A move made by the agent.
  • Reward: A scalar feedback signal that indicates which states and actions are good or bad. $r(s_t, a_t)$
    • Delayed reward: Rewards may arrive long after the actions that caused them; this delayed credit assignment is the core challenge in RL.
  • Policy: A mapping from observations (or states) to actions, i.e. $\pi_{\theta}(a_t \mid o_t)$.
  • Value function: A prediction of future rewards.
  • Observation: The data received by the agent from the environment.
    • Fully observable: The agent observes all relevant information.
      • $\pi_{\theta}(a_t \mid s_t)$, where we use $s_t$ to represent the state because the agent can observe all the information.
    • Partially observable: The agent observes only some relevant information.
  • Markov Decision Process (MDP): defined by the states $s$, actions $a$, reward function $r(s,a)$, and transition probabilities $p(s' \mid s, a)$.
  • Markov chain: $M = (S, T)$ where $S$ is the state space and $T$ is the transition matrix, i.e. $P(s_{t+1} \mid s_t)$, the probability of transitioning from state $s_t$ to state $s_{t+1}$ (a small sampling sketch follows this list).
  • Partially observable MDP (POMDP): extends the MDP with an observation space and an emission probability $p(o_t \mid s_t)$; the agent chooses actions from observations $o_t$ rather than from the full state $s_t$.
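
A minimal sketch of sampling a trajectory from a Markov chain $M = (S, T)$ as defined above; the 3-state transition matrix and the 10-step horizon are made up purely for illustration:

```python
import numpy as np

# Toy transition matrix for a 3-state Markov chain (values are illustrative only).
# T[i, j] = P(s_{t+1} = j | s_t = i), so each row sums to 1.
T = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
])

rng = np.random.default_rng(0)
s = 0                                  # initial state s_1
states = [s]
for _ in range(10):                    # roll the chain forward 10 steps
    s = rng.choice(len(T), p=T[s])     # sample s_{t+1} ~ P(. | s_t)
    states.append(int(s))
print(states)
```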

The goal of Reinforcement Learning

In short: find the policy parameters that maximize the expected total reward, $\theta^* = \arg \max_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_t r(s_t, a_t)\right]$.
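
Written out over whole trajectories (a standard expansion under the MDP definitions above, not copied from the missing figure), the trajectory distribution induced by the policy factorizes as

$$
p_{\theta}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t), \qquad
\theta^* = \arg \max_{\theta} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big]
$$

so the expectation in the objective is taken under the distribution over trajectories that the policy itself induces.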

Algorithms

  • Basic outline:
    • Generate samples (run the policy)
    • Estimate the return / fit a model
      • $J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_t r_t\right] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_t r_t^{(i)}$
    • Improve the policy
      • $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$
    • Repeat
    • In other words: run your policy to collect trajectories, estimate the return, and then improve the policy (a minimal sketch of this loop is given after this list).
  • RL by backprop:
    • Generate samples
    • Estimate the return
      • Learn $f_{\phi}$ such that $s_{t+1} \approx f_{\phi}(s_t, a_t)$
    • Improve the policy
      • Backprop through $f_{\phi}$ and $r$ to train $\pi_{\theta}(s_t)=a_t$ (a model-based sketch follows this list)
  • Q-function:
    • $Q^{\pi}(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_{\theta}}[r(s_{t'}, a_{t'}) \mid s_t, a_t]$: the expected total reward from taking action $a_t$ in state $s_t$ and then following $\pi_{\theta}$.
      • Value function: $V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi_{\theta}}[Q^{\pi}(s_t, a_t)]$
      • $\mathbb{E}_{s_1 \sim p(s_1)}[V^{\pi}(s_1)]$ is the RL objective
    • Using Q-functions and value functions: e.g. improve the policy by acting greedily with respect to $Q^{\pi}$, or increase the probability of actions for which $Q^{\pi}(s,a) > V^{\pi}(s)$ (a small tabular sketch follows this list).
  • Types of RL algo: $\theta^* = \arg \max_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\sum_t r(s_t,a_t)]$
    • Policy gradients: direct optimization of the policy
    • Value-based: estimate the value function or Q-function of the optimal policy (no explicit policy)
    • Actor-critic: estimate value function or Q-function of the current policy and use it to improve the policy
    • Model-based: estimate the transition model and use it to…
      • Plan
      • Improve the policy
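
As a concrete (and deliberately tiny) instance of the basic outline, here is a REINFORCE-style sketch on a made-up one-step problem; the softmax policy, reward values, step size $\alpha$, and batch size $N$ are all illustrative assumptions, not from the notes:

```python
import numpy as np

# Minimal sketch of the basic loop: 1) generate samples, 2) estimate the return
# J(theta), 3) improve the policy with theta <- theta + alpha * grad J(theta).
rng = np.random.default_rng(0)
nA = 3
true_reward = np.array([1.0, 2.0, 0.5])     # unknown to the agent (toy values)
theta = np.zeros(nA)                        # softmax policy parameters
alpha, N = 0.1, 64                          # step size, samples per iteration

def policy(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

for it in range(200):
    probs = policy(theta)
    # 1) generate samples: run the policy
    actions = rng.choice(nA, size=N, p=probs)
    rewards = true_reward[actions] + 0.1 * rng.standard_normal(N)
    # 2) estimate the return: J(theta) ~= (1/N) sum_i r_i
    J_hat = rewards.mean()
    # 3) improve the policy: REINFORCE gradient estimate
    #    grad J ~= (1/N) sum_i grad log pi(a_i) * r_i
    grad = np.zeros(nA)
    for a, r in zip(actions, rewards):
        grad_log_pi = -probs.copy()
        grad_log_pi[a] += 1.0
        grad += grad_log_pi * r
    theta += alpha * grad / N

print("learned policy:", policy(theta))     # should favour action 1
```

The three numbered comments line up with the three steps of the outline; in the full setting the return is summed over a trajectory rather than a single step.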
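
A sketch of the "RL by backprop" recipe under toy assumptions (made-up dynamics and data, a quadratic reward, and PyTorch as the autodiff library): first fit $f_{\phi}$ on transitions, then backpropagate the predicted return through $f_{\phi}$ and $r$ into a deterministic policy $\pi_{\theta}$:

```python
import torch
import torch.nn as nn

S, A, H = 3, 1, 5                       # state dim, action dim, rollout horizon

f_phi = nn.Sequential(nn.Linear(S + A, 64), nn.Tanh(), nn.Linear(64, S))
pi_theta = nn.Sequential(nn.Linear(S, 64), nn.Tanh(), nn.Linear(64, A))

def reward(s, a):
    # assumed differentiable reward: stay near the origin with small actions
    return -(s ** 2).sum(-1) - 0.1 * (a ** 2).sum(-1)

# 1) fit the model on (s, a, s') samples (random data here, for illustration only)
s, a = torch.randn(256, S), torch.randn(256, A)
s_next = s + 0.1 * a.repeat(1, S) + 0.01 * torch.randn(256, S)   # fake "true" dynamics
model_opt = torch.optim.Adam(f_phi.parameters(), lr=1e-3)
for _ in range(200):
    model_opt.zero_grad()
    loss = ((f_phi(torch.cat([s, a], -1)) - s_next) ** 2).mean()
    loss.backward()
    model_opt.step()

# 2) improve the policy by backprop through f_phi and r
#    (only pi_theta's parameters are updated; gradients still flow *through* f_phi)
policy_opt = torch.optim.Adam(pi_theta.parameters(), lr=1e-3)
for _ in range(100):
    policy_opt.zero_grad()
    st = torch.randn(64, S)             # batch of start states
    total_r = 0.0
    for _ in range(H):                  # unroll the learned model for H steps
        at = pi_theta(st)
        total_r = total_r + reward(st, at)
        st = f_phi(torch.cat([st, at], -1))
    (-total_r.mean()).backward()        # gradient ascent on predicted return
    policy_opt.step()
```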
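
Finally, a small tabular sketch of the Q-function and value-function ideas: evaluate $Q^{\pi}$ and $V^{\pi}$ for a fixed policy, then improve the policy greedily with respect to $Q^{\pi}$. The 2-state MDP and the discounted, infinite-horizon recursion are illustrative simplifications of the finite-horizon definition above:

```python
import numpy as np

nS, nA, gamma = 2, 2, 0.9
P = np.zeros((nS, nA, nS))              # P[s, a, s'] = p(s' | s, a), toy values
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.9, 0.1]; P[1, 1] = [0.1, 0.9]
R = np.array([[0.0, 1.0],               # R[s, a] = r(s, a), toy values
              [0.5, 2.0]])

pi = np.ones((nS, nA)) / nA             # current policy: uniform over actions

# Policy evaluation: iterate Q(s,a) <- r(s,a) + gamma * E_{s'}[ V(s') ]
Q = np.zeros((nS, nA))
for _ in range(500):
    V = (pi * Q).sum(axis=1)            # V^pi(s) = E_{a ~ pi}[ Q^pi(s, a) ]
    Q = R + gamma * P @ V               # backup through the transition model

# Policy improvement: act greedily with respect to Q^pi
pi_new = np.eye(nA)[Q.argmax(axis=1)]
print("Q^pi:\n", Q)
print("improved (greedy) policy:\n", pi_new)
```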
