Q-learning algorithm
Introduction
It uses a table of utility values (Q-values) for each state and action.
The best part of Q-Learning: it is guaranteed to converge to an optimal policy (given enough exploration).
What's Q?
Q is the value function the algorithm learns: Q[s, a] represents the value of taking action a in state s, i.e.
immediate reward + discounted future rewards
- Short term rewards: Daily return
- Long term rewards: cumulative return
How to use Q?
The optimal policy: π(s) = argmax_a Q[s, a]
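A minimal sketch of acting greedily from the table, assuming the Q-table is stored as a NumPy array indexed by discretized state and action (the sizes below are illustrative):

```python
import numpy as np

# Hypothetical Q-table: one row per discretized state, one column per action
Q = np.zeros((100, 3))  # e.g. 100 states, 3 actions (SELL, HOLD, BUY)

def optimal_action(Q, s):
    """Greedy policy: pick the action with the highest Q-value for state s."""
    return int(np.argmax(Q[s]))
```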
Update Rule
Q'[s, a] = (1 - α) · Q[s, a] + α · improved estimate

where

improved estimate = r + γ · later rewards
later rewards = Q[s', argmax_a'(Q[s', a'])]

Putting it together:

Q'[s, a] = (1 - α) · Q[s, a] + α · (r + γ · Q[s', argmax_a'(Q[s', a'])])

- α: learning rate, in [0, 1.0]
- γ: discount rate, in [0, 1.0]
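A sketch of this update under the same NumPy Q-table assumption; `alpha` and `gamma` correspond to α and γ above:

```python
import numpy as np

def update_q(Q, s, a, s_prime, r, alpha=0.2, gamma=0.9):
    """One Q-Learning update for the experience tuple <s, a, s', r>."""
    later_rewards = Q[s_prime, np.argmax(Q[s_prime])]   # same as max over a' of Q[s', a']
    improved_estimate = r + gamma * later_rewards
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * improved_estimate
    return Q
```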
State
Factors that can be used as state:
- Adjusted close/SMA
- Bollinger Band Value
- P/E Ratio
- Holding stock
- Return since entry
Creating the state
- State is an integer
- discretize each factor
- combine all factors into a single integer (see the sketch below)
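A minimal sketch of combining discretized factors into one state integer; the digit-per-factor encoding below assumes each factor has at most 10 bins:

```python
def combine_factors(bins):
    """Combine discretized factors (each an integer in 0-9) into one state integer.

    Example: bins [5, 2, 9] -> state 529.
    """
    state = 0
    for b in bins:
        state = state * 10 + b
    return state
```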
Discretizing
Convert a real number into an integer bin index, so that similar values map to the same state.
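One possible way to pick the bin boundaries, assuming the factor's in-sample values are available as an array (a sketch, not the only approach):

```python
import numpy as np

def make_thresholds(values, steps=10):
    """Sort the in-sample values of one factor and pick evenly spaced bin boundaries."""
    values = np.sort(np.asarray(values))
    stepsize = len(values) // steps
    return np.array([values[(i + 1) * stepsize - 1] for i in range(steps)])

def discretize(x, thresholds):
    """Map a real value to an integer bin index in [0, steps - 1]."""
    return int(min(np.searchsorted(thresholds, x), len(thresholds) - 1))
```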
Summary
Q-Learning is a model-free algorithm: it does not need to know the transition matrix T or the reward function R.
Build a model
- Define states, actions, rewards
- Choose in-sample training period.
- Iterate: update the Q-table over the training data
- Backtest on the training data; repeat until the policy converges
Steps:
- Init Q table
- Observe s
- Execute a, observe s' and r
- Update Q with <s, a, s', r>, then repeat from the observation step (a full loop is sketched below)
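A sketch of the whole training loop, assuming hypothetical `env_reset`/`env_step` helpers that wrap the in-sample trading simulation and return discretized states:

```python
import numpy as np

def train_q(num_states, num_actions, num_episodes, env_reset, env_step,
            alpha=0.2, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning training loop.

    env_reset() -> initial state, and env_step(a) -> (s_prime, r, done),
    are assumed wrappers around the in-sample trading simulation.
    """
    Q = np.zeros((num_states, num_actions))        # init Q table
    for _ in range(num_episodes):
        s = env_reset()                            # observe s
        done = False
        while not done:
            if np.random.rand() < epsilon:         # explore occasionally
                a = np.random.randint(num_actions)
            else:                                  # otherwise act greedily
                a = int(np.argmax(Q[s]))
            s_prime, r, done = env_step(a)         # execute a, observe s' and r
            # update Q with <s, a, s', r>
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_prime]))
            s = s_prime
    return Q
```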
Testing a model
- Backtest on later data.
Dyna-Q
Dyna-Q builds up a transition matrix T and a reward matrix R to speed up convergence of Q-Learning.
Real-world interactions are expensive, so after each real interaction we hallucinate many additional simulated interactions (e.g., 100 rounds).
Learning T
T[s, a, s'] = probability that taking action a in state s leads to s'
- Init T_count[s, a, s'] = 0.00001 (a small value, so no transition has probability exactly zero)
- While executing, observe s, a, s'
- Increment T_count[s, a, s']
- T[s, a, s'] = T_count[s, a, s'] / Σ_i T_count[s, a, i]
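A sketch of the count-based estimate of T, with illustrative array sizes:

```python
import numpy as np

num_states, num_actions = 100, 3
# Init T_count = 0.00001 so that no transition has probability exactly zero
T_count = np.full((num_states, num_actions, num_states), 0.00001)

def observe_transition(s, a, s_prime):
    """While executing the real policy, count each observed (s, a, s')."""
    T_count[s, a, s_prime] += 1

def transition_probs(s, a):
    """T[s, a, :] = normalized counts = probability of each next state s'."""
    return T_count[s, a] / T_count[s, a].sum()
```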
Learning R
r = the immediate reward observed after taking action a in state s.
R[s, a] = the expected reward for (s, a), learned as a running average:
R'[s, a] = (1 - α) · R[s, a] + α · r
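A sketch of that running-average update (the same α-blending as the Q update):

```python
def update_r(R, s, a, r, alpha=0.2):
    """Blend the observed immediate reward r into the expected reward R[s, a]."""
    R[s, a] = (1 - alpha) * R[s, a] + alpha * r
    return R
```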
Dyna-Q Algorithm
After each real experience <s, a, s', r>:
- Update T_count[s, a, s']
- Update R[s, a]
Then hallucinate additional experiences (e.g., 100 per real step):
- s = random
- a = random
- s' = infer from T[]
- r = R[s, a]
- Update Q with <s, a, s', r>
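A sketch of the hallucination loop, reusing the `T_count` and `R` arrays from the sketches above:

```python
import numpy as np

def dyna_hallucinate(Q, T_count, R, rounds=100, alpha=0.2, gamma=0.9):
    """Dyna-Q planning step: replay `rounds` simulated experiences drawn from the
    learned T_count and R models (array shapes as in the sketches above)."""
    num_states, num_actions, _ = T_count.shape
    for _ in range(rounds):
        s = np.random.randint(num_states)               # s = random
        a = np.random.randint(num_actions)              # a = random
        probs = T_count[s, a] / T_count[s, a].sum()     # infer s' from T
        s_prime = np.random.choice(num_states, p=probs)
        r = R[s, a]                                     # r = R[s, a]
        # update Q with <s, a, s', r>
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_prime]))
    return Q
```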