5 Mar 2024
Deep Reinforcement Learning 03 (Intro to RL)

Keypoints:
- Definitions
- The goal of Reinforcement Learning
- Algorithms
Definitions

- Agent: The learner or decision maker that interacts with the environment.
- Environment: The external system with which the agent interacts.
- State: A representation of the environment at a given time.
- Action: A move made by the agent.
- Reward: A scalar feedback signal that indicates which states and actions are good or bad. $r(s_t, a_t)$
- Delayed reward: The reward may arrive much later rather than immediately. This is the core challenge in RL.
- Policy: A mapping from observations (or states) to actions, i.e. $\pi_{\theta}(a_t | o_t)$.
- Value function: A prediction of future rewards.
- Observation: The data received by the agent from the environment.
- Fully observable: The agent observes all relevant information.
- $\pi_{\theta}(a_t | s_t)$, where we use $s_t$ to represent the state because the agent can observe all the information.
- Partially observable: The agent observes only some relevant information.
- Markov Decision Process (MDP): defined by states $s$, actions $a$, rewards $r(s, a)$, and transition probabilities $p(s' | s, a)$.
- Markov chain: $M = (S, T)$, where $S$ is the state space and $T$ is the transition matrix, i.e. $P(s_{t+1} | s_t)$, the probability of transitioning from state $s_t$ to state $s_{t+1}$ (see the sketch after this list).
- Partially observable MDP (POMDP): adds observations $o$ and an emission distribution $p(o_t | s_t)$; the agent must choose actions from $o_t$ rather than $s_t$.
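
A minimal Python/NumPy sketch of these objects, not from the lecture: the 3-state transition matrix, the 2 actions, and the Dirichlet-sampled $p(s'|s,a)$ and Gaussian $r(s,a)$ are all made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Markov chain M = (S, T): 3 states, T[s, s'] = P(s_{t+1} = s' | s_t = s).
S = 3
T = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])

# Adding actions and rewards turns this into a (toy) MDP.
A = 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # p(s' | s, a)
R = rng.normal(size=(S, A))                  # r(s, a)

def sample_chain(s0, horizon):
    """Roll the Markov chain forward by sampling s_{t+1} ~ T[s_t, :]."""
    states = [s0]
    for _ in range(horizon):
        states.append(rng.choice(S, p=T[states[-1]]))
    return states

print(sample_chain(s0=0, horizon=5))
```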

The goal of Reinforcement Learning
- Find policy parameters that maximize the expected total reward: $\theta^* = \arg \max_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\sum_t r(s_t, a_t)]$, equivalently $\mathbb{E}_{s_1 \sim p(s_1)}[V^{\pi}(s_1)]$.
Algorithms
- Basic outline:
- Generate samples (run the policy)
- Estimate the return / fit a model
- $J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}[\sum_t r_t] \approx \frac{1}{N} \sum_{i=1}^N \sum_t r_t^{(i)}$
- Improve the policy
- $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$
- Repeat
- That is to say: "run your policy, get a trajectory, estimate the return, and then improve the policy" (a minimal sketch of this loop follows below).
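
A sketch of this loop with a tabular softmax policy and the REINFORCE gradient estimator. The toy MDP (random `P`, `R`), the horizon, and the choice of estimator are assumptions for illustration; the outline above does not fix any of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 4, 2, 10

# Made-up tabular MDP, only here so the loop has something to run on.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # r(s,a)

theta = np.zeros((n_states, n_actions))  # softmax policy parameters

def pi(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def rollout():
    """Step 1: generate a sample trajectory by running the policy."""
    s, traj = rng.integers(n_states), []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi(s))
        traj.append((s, a, R[s, a]))
        s = rng.choice(n_states, p=P[s, a])
    return traj

alpha, N = 0.1, 20
for _ in range(100):
    grad, J = np.zeros_like(theta), 0.0
    for _ in range(N):
        traj = rollout()
        ret = sum(r for _, _, r in traj)   # step 2: estimate the return
        J += ret / N
        for s, a, _ in traj:               # REINFORCE: grad log pi(a|s) * return
            g = -pi(s)
            g[a] += 1.0
            grad[s] += g * ret / N
    theta += alpha * grad                  # step 3: improve the policy
print("estimated J(theta):", J)
```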
- RL by backprop (a sketch follows this list):
- Generate samples
- Estimate the return
- Learn $f_{\phi}$ such that $s_{t+1} \approx f_{\phi}(s_t, a_t)$
- Improve the policy
- Backprop through $f_{\phi}$ and $r$ to train $\pi_{\theta}(s_t)=a_t$
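
A hedged PyTorch sketch of this idea. The two-layer networks, the quadratic reward, and the fake $(s, a, s')$ transitions are all illustrative assumptions standing in for real collected samples: fit $f_{\phi}$ by regression, then roll the policy out through the learned model and backpropagate the summed reward into $\pi_{\theta}$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, action_dim, H = 3, 1, 5   # illustrative sizes

# s_{t+1} ≈ f_phi(s_t, a_t): a small learned dynamics model.
f_phi = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                      nn.Linear(64, state_dim))
# Deterministic policy a_t = pi_theta(s_t).
pi_theta = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                         nn.Linear(64, action_dim))

def reward(s, a):
    # Made-up reward: stay near the origin while using small actions.
    return -(s.pow(2).sum(-1) + 0.1 * a.pow(2).sum(-1))

# Generate samples / fit the model: fake transitions from pretend dynamics
# stand in for data gathered by running the policy.
s, a = torch.randn(256, state_dim), torch.randn(256, action_dim)
s_next = s + 0.1 * a
model_opt = torch.optim.Adam(f_phi.parameters(), lr=1e-3)
for _ in range(500):
    loss = (f_phi(torch.cat([s, a], dim=-1)) - s_next).pow(2).mean()
    model_opt.zero_grad(); loss.backward(); model_opt.step()

# Improve the policy: backprop the total reward of an imagined rollout
# through f_phi and r into pi_theta (the model itself stays fixed).
for p in f_phi.parameters():
    p.requires_grad_(False)
policy_opt = torch.optim.Adam(pi_theta.parameters(), lr=1e-3)
for _ in range(200):
    st = torch.randn(32, state_dim)              # batch of start states
    total = 0.0
    for _ in range(H):
        at = pi_theta(st)
        total = total + reward(st, at).mean()
        st = f_phi(torch.cat([st, at], dim=-1))  # imagined next state
    policy_opt.zero_grad()
    (-total).backward()                          # maximize total reward
    policy_opt.step()
```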
- Q-function:
- $Q^{\pi}(s_t, a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_{\theta}}[r(s_{t'}, a_{t'}) | s_t, a_t]$: the total expected reward from taking action $a_t$ in state $s_t$ and then following $\pi$.
- Value function: $V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi_{\theta}}[Q^{\pi}(s_t, a_t)]$
- $\mathbb{E}_{s_1 \sim p(s_1)}[V^{\pi}(s_1)]$ is the RL objective
- Using Q-functions and value functions (a Monte Carlo estimate is sketched below)
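
To make the definitions concrete, a small Monte Carlo sketch: the toy MDP and the uniform policy are made-up assumptions. $Q^{\pi}(s_0, a_0)$ is estimated by averaging the total reward of rollouts that start from $(s_0, a_0)$ and then follow $\pi$; $V^{\pi}(s)$ is the policy-weighted average of $Q^{\pi}(s, \cdot)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, T_horizon = 4, 2, 10

# Illustrative MDP and a fixed uniform policy pi(a|s).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # r(s,a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)

def mc_Q(s0, a0, n_rollouts=2000):
    """Monte Carlo estimate of Q^pi(s0, a0): average total reward from (s0, a0)."""
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret = s0, a0, 0.0
        for _ in range(T_horizon):
            ret += R[s, a]
            s = rng.choice(n_states, p=P[s, a])
            a = rng.choice(n_actions, p=pi[s])
        total += ret
    return total / n_rollouts

# V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
V0 = sum(pi[0, a] * mc_Q(0, a) for a in range(n_actions))
print("V^pi(s=0) ≈", V0)
```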
- Types of RL algorithms: $\theta^* = \arg \max_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\sum_t r(s_t, a_t)]$
- Policy gradients: direct optimization of the policy
- Value-based: estimate the value function or Q-function of the optimal policy, with no explicit policy (see the Q-learning sketch below)
- Actor-critic: estimate value function or Q-function of the current policy and use it to improve the policy
- Model-based: estimate the transition model and use it for planning or for improving the policy (e.g., by backprop through the model, as above)
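
As one concrete member of the value-based family, a sketch of tabular Q-learning on a made-up MDP. The discount factor $\gamma$ and the epsilon-greedy exploration are additions for illustration (the notes above use a finite horizon and do not fix an exploration scheme); the point is that no explicit policy is stored, acting greedily with respect to $Q$ recovers it.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.1

# Made-up MDP used only as a simulator; the learner never reads P or R directly.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

# Value-based RL: estimate Q* from sampled transitions.
Q = np.zeros((n_states, n_actions))
s = rng.integers(n_states)
for _ in range(20000):
    # epsilon-greedy action selection for exploration
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=P[s, a])
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("greedy policy:", Q.argmax(axis=1))   # the policy is implicit in Q
```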