TD Prediction (Policy Evaluation)
Sarsa: On-Policy TD Control
Q-learning: Off-Policy TD Control
R-learning for Undiscounted Continuing Tasks