5 Mar 2024
Deep Reinforcement Learning 03 (Intro to RL)

Keypoints:
- Definitions
- The goal of Reinforcement Learning
- Algorithms
Definitions

- Agent: The learner or decision maker that interacts with the environment.
- Environment: The external system with which the agent interacts.
- State: A representation of the environment at a given time.
- Action: A move made by the agent.
- Reward: A scalar feedback signal that indicates which states and actions are good or bad. $r(s_t, a_t)$
- Delayed reward: The reward may arrive much later rather than immediately. This is the core challenge in RL.
- Policy: A mapping from observations (or states) to actions, i.e. $\pi_{\theta}(a_t | o_t)$.
- Value function: A prediction of future rewards.
- Observation: The data received by the agent from the environment.
- Fully observable: The agent observes all relevant information.
- $\pi_{\theta}(a_t | s_t)$, where we use $s_t$ to represent the state because the agent can observe all the information.
- Partially observable: The agent observes only some relevant information.
- Markov Decision Process (MDP): defined by states $s$, actions $a$, rewards $r(s, a)$, and transition probabilities $p(s' | s, a)$.
- Markov chain: $M = (S, T)$, where $S$ is the state space and $T$ is the transition matrix, i.e. $P(s_{t+1} | s_t)$, the probability of transitioning from state $s_t$ to state $s_{t+1}$ (see the sketch after this list).
- Partially observable MDP (POMDP): adds observations $o$ and an emission distribution $p(o_t | s_t)$; the agent must choose actions from $o_t$ rather than $s_t$.
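
A minimal Python/NumPy sketch of these objects, not from the lecture: the 3-state transition matrix, the 2 actions, and the Dirichlet-sampled $p(s'|s,a)$ and Gaussian $r(s,a)$ are all made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Markov chain M = (S, T): 3 states, T[s, s'] = P(s_{t+1} = s' | s_t = s).
S = 3
T = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])

# Adding actions and rewards turns this into a (toy) MDP.
A = 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # p(s' | s, a)
R = rng.normal(size=(S, A))                  # r(s, a)

def sample_chain(s0, horizon):
    """Roll the Markov chain forward by sampling s_{t+1} ~ T[s_t, :]."""
    states = [s0]
    for _ in range(horizon):
        states.append(rng.choice(S, p=T[states[-1]]))
    return states

print(sample_chain(s0=0, horizon=5))
```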

The goal of Reinforcement Learning
- Find policy parameters that maximize the expected total reward: $\theta^* = \arg \max_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\sum_t r(s_t, a_t)]$, equivalently $\mathbb{E}_{s_1 \sim p(s_1)}[V^{\pi}(s_1)]$.
Algorithms
- Basic outline:
- Generate samples (run the policy)
- Estimate the return / fit a model
- $J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}[\sum_t r_t] \approx \frac{1}{N} \sum_{i=1}^N \sum_t r_t^{(i)}$
- Improve the policy
- $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$
- Repeat
- That is to say: "run your policy, get a trajectory, estimate the return, and then improve the policy" (a minimal sketch of this loop follows below).
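
A sketch of this loop with a tabular softmax policy and the REINFORCE gradient estimator. The toy MDP (random `P`, `R`), the horizon, and the choice of estimator are assumptions for illustration; the outline above does not fix any of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 4, 2, 10

# Made-up tabular MDP, only here so the loop has something to run on.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # r(s,a)

theta = np.zeros((n_states, n_actions))  # softmax policy parameters

def pi(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def rollout():
    """Step 1: generate a sample trajectory by running the policy."""
    s, traj = rng.integers(n_states), []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi(s))
        traj.append((s, a, R[s, a]))
        s = rng.choice(n_states, p=P[s, a])
    return traj

alpha, N = 0.1, 20
for _ in range(100):
    grad, J = np.zeros_like(theta), 0.0
    for _ in range(N):
        traj = rollout()
        ret = sum(r for _, _, r in traj)   # step 2: estimate the return
        J += ret / N
        for s, a, _ in traj:               # REINFORCE: grad log pi(a|s) * return
            g = -pi(s)
            g[a] += 1.0
            grad[s] += g * ret / N
    theta += alpha * grad                  # step 3: improve the policy
print("estimated J(theta):", J)
```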
- RL by backprop (a sketch follows this list):
- Generate samples
- Estimate the return
- Learn $f_{\phi}$ such that $s_{t+1} \approx f_{\phi}(s_t, a_t)$
- Improve the policy
- Backprop through $f_{\phi}$ and $r$ to train $\pi_{\theta}(s_t)=a_t$
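
A hedged PyTorch sketch of this idea. The two-layer networks, the quadratic reward, and the fake $(s, a, s')$ transitions are all illustrative assumptions standing in for real collected samples: fit $f_{\phi}$ by regression, then roll the policy out through the learned model and backpropagate the summed reward into $\pi_{\theta}$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, action_dim, H = 3, 1, 5   # illustrative sizes

# s_{t+1} ≈ f_phi(s_t, a_t): a small learned dynamics model.
f_phi = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                      nn.Linear(64, state_dim))
# Deterministic policy a_t = pi_theta(s_t).
pi_theta = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                         nn.Linear(64, action_dim))

def reward(s, a):
    # Made-up reward: stay near the origin while using small actions.
    return -(s.pow(2).sum(-1) + 0.1 * a.pow(2).sum(-1))

# Generate samples / fit the model: fake transitions from pretend dynamics
# stand in for data gathered by running the policy.
s, a = torch.randn(256, state_dim), torch.randn(256, action_dim)
s_next = s + 0.1 * a
model_opt = torch.optim.Adam(f_phi.parameters(), lr=1e-3)
for _ in range(500):
    loss = (f_phi(torch.cat([s, a], dim=-1)) - s_next).pow(2).mean()
    model_opt.zero_grad(); loss.backward(); model_opt.step()

# Improve the policy: backprop the total reward of an imagined rollout
# through f_phi and r into pi_theta (the model itself stays fixed).
for p in f_phi.parameters():
    p.requires_grad_(False)
policy_opt = torch.optim.Adam(pi_theta.parameters(), lr=1e-3)
for _ in range(200):
    st = torch.randn(32, state_dim)              # batch of start states
    total = 0.0
    for _ in range(H):
        at = pi_theta(st)
        total = total + reward(st, at).mean()
        st = f_phi(torch.cat([st, at], dim=-1))  # imagined next state
    policy_opt.zero_grad()
    (-total).backward()                          # maximize total reward
    policy_opt.step()
```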
- Q-function:
- $Q^{\pi}(s_t, a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_{\theta}}[r(s_{t'}, a_{t'}) | s_t, a_t]$: the total expected reward from taking action $a_t$ in state $s_t$ and then following $\pi$.
- Value function: $V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi_{\theta}}[Q^{\pi}(s_t, a_t)]$
- $\mathbb{E}_{s_1 \sim p(s_1)}[V^{\pi}(s_1)]$ is the RL objective
- Using Q-functions and value functions (a Monte Carlo estimate is sketched below)
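
To make the definitions concrete, a small Monte Carlo sketch: the toy MDP and the uniform policy are made-up assumptions. $Q^{\pi}(s_0, a_0)$ is estimated by averaging the total reward of rollouts that start from $(s_0, a_0)$ and then follow $\pi$; $V^{\pi}(s)$ is the policy-weighted average of $Q^{\pi}(s, \cdot)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, T_horizon = 4, 2, 10

# Illustrative MDP and a fixed uniform policy pi(a|s).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # r(s,a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)

def mc_Q(s0, a0, n_rollouts=2000):
    """Monte Carlo estimate of Q^pi(s0, a0): average total reward from (s0, a0)."""
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret = s0, a0, 0.0
        for _ in range(T_horizon):
            ret += R[s, a]
            s = rng.choice(n_states, p=P[s, a])
            a = rng.choice(n_actions, p=pi[s])
        total += ret
    return total / n_rollouts

# V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
V0 = sum(pi[0, a] * mc_Q(0, a) for a in range(n_actions))
print("V^pi(s=0) ≈", V0)
```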
- Types of RL algorithms: $\theta^* = \arg \max_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\sum_t r(s_t, a_t)]$
- Policy gradients: direct optimization of the policy
- Value-based: estimate the value function or Q-function of the optimal policy, with no explicit policy (see the Q-learning sketch below)
- Actor-critic: estimate value function or Q-function of the current policy and use it to improve the policy
- Model-based: estimate the transition model and use it for planning or for improving the policy (e.g., by backprop through the model, as above)
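
As one concrete member of the value-based family, a sketch of tabular Q-learning on a made-up MDP. The discount factor $\gamma$ and the epsilon-greedy exploration are additions for illustration (the notes above use a finite horizon and do not fix an exploration scheme); the point is that no explicit policy is stored, acting greedily with respect to $Q$ recovers it.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.1

# Made-up MDP used only as a simulator; the learner never reads P or R directly.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

# Value-based RL: estimate Q* from sampled transitions.
Q = np.zeros((n_states, n_actions))
s = rng.integers(n_states)
for _ in range(20000):
    # epsilon-greedy action selection for exploration
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=P[s, a])
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("greedy policy:", Q.argmax(axis=1))   # the policy is implicit in Q
```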