
Q-learning algorithm

Introduction

Q-learning keeps a utility table of Q-values, indexed by state and action.

The best part of Q-learning: it is guaranteed to converge to an optimal policy (given enough exploration).

What's Q?

Q is the value function that the algorithm computes: $Q[s,a]$ measures the value of taking action $a$ in state $s$.

$Q[s,a]$ = immediate reward + discounted future rewards

  • Short-term reward: daily return
  • Long-term reward: cumulative future return

How to use Q?

$\Pi(s) = \arg\max_a Q[s,a]$

The optimal policy:

$\Pi^*(s) = \arg\max_a Q^*[s,a]$
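With a tabular Q, applying the policy is just an argmax over the action dimension. A minimal sketch with NumPy, assuming Q is a 2-D array indexed by [state, action] (the table shape and action names are illustrative):

```python
import numpy as np

# Hypothetical Q-table: 10 discretized states x 3 actions (e.g. BUY, HOLD, SELL)
Q = np.random.random((10, 3))

def policy(s):
    """Greedy policy: the action with the highest Q-value in state s."""
    return int(np.argmax(Q[s]))

# Greedy action for every state at once
greedy_actions = np.argmax(Q, axis=1)
```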

Update Rule

$Q'[s,a] = (1 - \alpha) \, Q[s,a] + \alpha \cdot (\text{improved estimate})$

where improved estimate

$= r + \gamma \cdot (\text{later rewards})$

$= r + \gamma \cdot Q[s', \arg\max_{a'} Q[s', a']]$

$\alpha$: learning rate, in $[0, 1]$

$\gamma$: discount rate, in $[0, 1]$

A reward $i$ steps in the future is weighted by $\gamma^i$, so a small $\gamma$ makes the learner favor near-term rewards.
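The update rule maps directly to one line of array code. A sketch, assuming a NumPy Q-table and the hyperparameters above (the function name `q_update` is illustrative):

```python
import numpy as np

alpha = 0.2   # learning rate
gamma = 0.9   # discount rate

def q_update(Q, s, a, s_prime, r):
    """Blend the old estimate with the improved estimate r + gamma * Q[s', argmax_a' Q[s', a']]."""
    improved_estimate = r + gamma * Q[s_prime, np.argmax(Q[s_prime])]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * improved_estimate
    return Q
```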

State

Factors that can be used as state:

  • Adjusted close/SMA
  • Bollinger Band Value
  • P/E Ratio
  • Holding stock
  • Return since entry

Creating the state

  • The state is a single integer
  • Discretize each factor
  • Combine all the discretized factors into one number (see the sketch below)
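One common way to combine factors is to treat each discretized factor's bin index as a digit of the state integer. A sketch, assuming three factors each discretized into 10 bins (0-9); any other bijective encoding works just as well:

```python
def combine_state(x1, x2, x3):
    """Stack three single-digit factor bins into one integer state, e.g. (2, 5, 9) -> 259."""
    return x1 * 100 + x2 * 10 + x3

# e.g. adjusted-close/SMA bin = 2, Bollinger Band bin = 5, P/E bin = 9
state = combine_state(2, 5, 9)
print(state)  # 259
```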

Discretizing

Convert a real-valued factor into an integer, e.g., by binning its observed range into a fixed number of buckets.
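A simple approach is to sort the in-sample values of a factor and use evenly spaced quantiles as bin thresholds. A sketch with NumPy; the bin count and the placeholder data are assumptions:

```python
import numpy as np

def make_thresholds(values, steps=10):
    """Quantile-based bin edges: each bin holds roughly the same number of samples."""
    values = np.sort(values)
    step_size = len(values) // steps
    return values[step_size::step_size][:steps - 1]

def discretize(x, thresholds):
    """Map a real value to an integer bin index in [0, steps - 1]."""
    return int(np.searchsorted(thresholds, x))

data = np.random.randn(1000)            # placeholder factor values
thresholds = make_thresholds(data)      # 9 thresholds -> 10 bins
print(discretize(0.3, thresholds))      # an integer between 0 and 9
```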

Summary

Q-learning is a model-free algorithm: it does not need to know the transition matrix $T$ or the reward function $R$.

Build a model

  • Define states, actions, and rewards
  • Choose an in-sample training period
  • Iterate: update the Q-table over the training data
  • Backtest; repeat until performance converges

Steps:

  1. Initialize the Q-table

  2. Observe $s$

  3. Execute $a$, observe $s'$ and $r$

  4. Update Q with $\langle s, a, s', r \rangle$, then repeat from step 2
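Putting the steps together, the training loop might look like the sketch below. It assumes a toy environment exposing `env.reset() -> s` and `env.step(a) -> (s', r, done)`, and uses an epsilon-greedy exploration rule, which is an illustrative choice rather than part of the notes above:

```python
import numpy as np

def train(env, num_states, num_actions, episodes=500,
          alpha=0.2, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    Assumes a toy environment with env.reset() -> s and env.step(a) -> (s', r, done).
    """
    Q = np.zeros((num_states, num_actions))              # 1. init Q-table
    for _ in range(episodes):
        s = env.reset()                                   # 2. observe s
        done = False
        while not done:
            # choose a: explore with probability epsilon, otherwise act greedily
            if np.random.random() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_prime, r, done = env.step(a)                # 3. execute a, observe s', r
            # 4. update Q with <s, a, s', r>
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_prime]))
            s = s_prime
    return Q
```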

Testing a model

  • Backtest on later, out-of-sample data.

Dyna-Q

Dyna-Q builds up a transition matrix $T$ and a reward matrix $R$ so that simulated experience can speed up convergence of Q-learning.

Interacting with the real world is expensive, so after each real interaction we "hallucinate" many additional simulated interactions (around 100 rounds) using the learned model.

Learning T

$T[s,a,s']$ = probability that taking action $a$ in state $s$ leads to state $s'$

Initialize $T_c[\,] = 0.00001$ (a small count, so no transition probability is ever exactly zero)

While executing, observe $s, a, s'$

Increment $T_c[s,a,s']$

$T[s,a,s'] = T_c[s,a,s'] \, / \, \sum_i T_c[s,a,i]$
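In code, learning $T$ amounts to keeping a 3-D count array and normalizing over destination states. A minimal sketch with NumPy (the state and action counts are placeholders):

```python
import numpy as np

num_states, num_actions = 100, 3
T_c = np.full((num_states, num_actions, num_states), 0.00001)  # small prior counts

def observe_transition(s, a, s_prime):
    """Record one real transition s, a -> s'."""
    T_c[s, a, s_prime] += 1

def transition_probs(s, a):
    """T[s, a, :] = counts normalized over all destination states."""
    return T_c[s, a] / T_c[s, a].sum()
```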

Learning R

$R'[s,a] = (1 - \alpha) \, R[s,a] + \alpha \cdot r$

$r$ = immediate reward.

$R[s,a]$ = expected reward for taking action $a$ in state $s$.

Dyna-Q Algorithm

Update $T[s,a,s']$ (as above)

Update $R[s,a]$ (as above)

Then hallucinate an experience:

  • $s$ = random state
  • $a$ = random action
  • $s'$ = inferred (sampled) from $T[\,]$
  • $r$ = $R[s,a]$

Update Q with $\langle s, a, s', r \rangle$
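A sketch of the hallucination loop, assuming the Q-update rule from earlier, a normalized transition matrix $T$, and an expected-reward table $R$ (the array shapes and the 100-round default follow the notes above but are otherwise assumptions):

```python
import numpy as np

def hallucinate(Q, T, R, rounds=100, alpha=0.2, gamma=0.9):
    """Dyna-Q planning step: replay simulated experience drawn from the learned model.

    Q: (num_states, num_actions) Q-table
    T: (num_states, num_actions, num_states) transition probabilities
    R: (num_states, num_actions) expected rewards
    """
    num_states, num_actions = Q.shape
    for _ in range(rounds):
        s = np.random.randint(num_states)                  # s = random
        a = np.random.randint(num_actions)                 # a = random
        s_prime = np.random.choice(num_states, p=T[s, a])  # s' sampled from T
        r = R[s, a]                                        # r = R[s, a]
        # same update rule as for real experience, with <s, a, s', r>
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_prime]))
    return Q
```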
