- State $s_t$: state of the world at time $t$.
- Observation $o_t$: the agent’s observation at time $t$. In most real-world settings, only observations are available, not the full state.
- Action $a_t$: the decision taken at time $t$.
- Trajectory $\tau$: a sequence of states/observations and actions.
- Reward function $r(s,a)$: the reward received for taking action $a$ in state $s$.
For simplicity, we usually model the next state as depending only on the current state and action: $p(s_{t+1} | s_t, a_t)$. In other words, given $s_t$ and $a_t$, the next state is independent of earlier states such as $s_{t-1}$ (the Markov property).
- Policy $\pi_{\theta}(a_t | s_t)$: models the distribution over actions given state $s_t$. If only observations are available, we usually need a history of observations: $\pi_{\theta}(a_t|o_{t-m}, \dots, o_t)$.
- The probability of a trajectory: $p_{\theta}(\tau) = p(s_1, a_1, \dots, s_T, a_T) = p(s_1)\prod_{t=1}^T \pi_{\theta}(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$.
- The goal of reinforcement learning is to find parameters $\theta$ that maximize the expected sum of rewards: $\max_{\theta} E_{\tau \sim p_\theta(\tau)}\big[\sum_{t=1}^T r(s_t, a_t)\big]$.
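
To make the trajectory distribution and the objective concrete, here is a minimal Monte Carlo sketch on a toy tabular MDP. The environment, the uniform policy, and all numbers are made up purely for illustration; none of them come from the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP (hypothetical, for illustration only): 3 states, 2 actions.
num_states, num_actions, T = 3, 2, 10
p_s1 = np.array([1.0, 0.0, 0.0])                    # initial distribution p(s_1)
P = rng.dirichlet(np.ones(num_states),              # transitions p(s_{t+1} | s_t, a_t)
                  size=(num_states, num_actions))
R = rng.normal(size=(num_states, num_actions))      # reward r(s, a)

def policy(s):
    """Sample a_t ~ pi_theta(a_t | s_t); uniform here just for the sketch."""
    return rng.integers(num_actions)

def sample_trajectory():
    """Roll out tau = (s_1, a_1, ..., s_T, a_T) and return sum_t r(s_t, a_t)."""
    s = rng.choice(num_states, p=p_s1)          # s_1 ~ p(s_1)
    total = 0.0
    for t in range(T):
        a = policy(s)                           # a_t ~ pi_theta(a_t | s_t)
        total += R[s, a]                        # accumulate r(s_t, a_t)
        s = rng.choice(num_states, p=P[s, a])   # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return total

# Monte Carlo estimate of E_{tau ~ p_theta}[ sum_t r(s_t, a_t) ]
returns = [sample_trajectory() for _ in range(5000)]
print("estimated objective:", np.mean(returns))
```

Maximizing this estimate over $\theta$ (here the policy has no parameters) is exactly the objective above.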
Two more definitions are commonly used:
- Value function $V^{\pi}(s)$: the expected future reward starting from state $s$ and following policy $\pi$.
- Q-function $Q^\pi(s,a)$: the expected future reward starting from state $s$, taking action $a$, then following $\pi$.
- Their relationship: $V^\pi(s)=\sum_a \pi(a|s)Q^\pi(s,a)$.
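
As a quick sanity check of this relationship, the sketch below computes $V^\pi$ from a tabular $Q^\pi$ and policy; the numbers are arbitrary and only illustrate taking the expectation of $Q^\pi$ over the policy's actions.

```python
import numpy as np

# Hypothetical tabular example: 3 states, 2 actions. Values are made up.
Q = np.array([[1.0, 0.5],    # Q^pi(s, a), e.g. produced by policy evaluation
              [0.2, 0.8],
              [0.0, 1.5]])
pi = np.array([[0.7, 0.3],   # pi(a | s), each row sums to 1
               [0.5, 0.5],
               [0.1, 0.9]])

# V^pi(s) = sum_a pi(a | s) Q^pi(s, a): expectation of Q under the policy
V = (pi * Q).sum(axis=1)
print(V)   # -> [0.85 0.5  1.35]
```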