Spencer Lyon

Busoniu, De Schutter, Babuska (2010) (Approximate dynamic programming and reinforcement learning)


This paper provides an overview of various solution concepts that belong to the families of approximate dynamic programming and reinforcement learning.

I will discuss an alternative representation of a dynamic programming problem that is often used in this literature and provide an overview of some solution methods that are common with this representation.

Basics: Q-functions

In economics, the standard representation of a dynamic programming problem is to define a recursive function V that maps from the state space into the real line. Today I will call V the V-function. The V-function describes the value of being in a particular state.

Consider an alternative function Q that maps from the state-action space into the real line. I will call this the Q-function. The Q-function describes the value of being in a particular state and choosing a particular action.

As we do with the V-function, we can express the Q-function recursively using Bellman’s principle of optimality. The recursive form of the Q-function is


$$Q(s, x) = E_{s' \sim f(s, x, \cdot)}\left[ r(s, x, s') + \beta \max_{x'} Q(s', x') \right]$$

We can make some basic comparisons between Q and V by studying their recursive representations:

  • Parametric representations (i.e. splines or other interpolants) of Q are more expensive than representations of V because Q is defined over the state-action space instead of just the state space.
  • Computing the maximization on the right-hand side of the Q equation is easier than the max on the right-hand side of the V equation. This is because the expectation is outside the max for Q, but inside the max for V (see the V-function equation below).
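
For reference, the recursive form of the V-function in the same notation, with the expectation inside the max, is

$$V(s) = \max_{x} \; E_{s' \sim f(s, x, \cdot)}\left[ r(s, x, s') + \beta V(s') \right]$$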

The Bellman operator associated with the Q-function is a contraction mapping under the same conditions under which the Bellman operator associated with V is a contraction mapping. When Q’s operator is a contraction mapping, the operator’s unique fixed point Q* is the optimal Q-function.

The optimal policy can be easily computed from Q*: $h^* (s) = \underset{x}{\text{argmax}} Q^* (s, x)$

The optimal V-function can be obtained from Q*: $V^*(s) = \underset{x}{\max} Q^* (s, x)$.
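
As a concrete illustration, here is a minimal NumPy sketch of these two operations for a Q* stored as a table with one row per state and one column per action. The array Q_star below is a randomly generated stand-in, not a solved model.

```python
import numpy as np

# Hypothetical tabular Q*: rows index states, columns index actions
n_states, n_actions = 5, 3
rng = np.random.default_rng(0)
Q_star = rng.random((n_states, n_actions))  # stand-in for a solved Q function

# Optimal policy: for each state, the index of the action maximizing Q*(s, x)
h_star = Q_star.argmax(axis=1)

# Optimal value function: V*(s) = max_x Q*(s, x)
V_star = Q_star.max(axis=1)
```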

Computing Q

We see that if we can find Q*, we can get the value and policy functions we care about as economists. I want to briefly describe a few flavors of algorithms that can be used to compute Q*.

Q-learning

Q-learning is classified as an online, model-free algorithm. This means you can update your guess for Q* whenever new data becomes available, and you never need an explicit model of the transition probabilities, only observed transitions. Economists would classify this as a simulation algorithm.

Here are the basics:

  • Start with any initial guess for the Q function
  • Then on each iteration i:
    • Given the state-action pair at iteration i, $(s_i, x_i)$, the observed period-i return $r_i$, and the next state $s_{i+1}$
    • … update your guess for Q using:
      $$Q_{i+1}(s_i, x_i) = (1 - \alpha_i) Q_i(s_i, x_i) + \alpha_i \left( r_i + \beta \max_{x'} Q_i(s_{i+1}, x') \right)$$

Notice that the right-hand side is a convex combination of the current guess of the Q function and the right-hand side of the Q-function operator, evaluated at the single observed transition. This is very similar to standard value function iteration, but has the main advantage of not requiring any expectations to be computed.

This can be applied to any model for which you have a transition function that defines the state today as a function of the state and action yesterday and a random innovation from a distribution you can sample. Many economic models fit into this framework.
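
To make this concrete, here is a minimal sketch of tabular Q-learning for a model of this kind. The pieces supplied by the user, a `step(s, x, rng)` function returning a sampled `(reward, next_state)` pair, the discrete state and action counts, and the epsilon-greedy exploration and step-size schedules, are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def q_learning(step, n_states, n_actions, beta=0.95, n_iter=100_000,
               epsilon=0.1, seed=0):
    """Tabular Q-learning given a samplable transition function.

    `step(s, x, rng)` is assumed to return `(r, s_next)`: the realized
    reward and next state after choosing action `x` in state `s`.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))  # any initial guess works
    s = rng.integers(n_states)           # arbitrary starting state

    for i in range(1, n_iter + 1):
        # epsilon-greedy choice: explore occasionally, otherwise act greedily
        if rng.random() < epsilon:
            x = rng.integers(n_actions)
        else:
            x = Q[s].argmax()

        r, s_next = step(s, x, rng)      # one simulated transition
        alpha = 1.0 / i                  # a simple diminishing step size

        # convex combination of the old guess and the sampled Bellman target
        target = r + beta * Q[s_next].max()
        Q[s, x] = (1 - alpha) * Q[s, x] + alpha * target

        s = s_next

    return Q
```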

When the action space is continuous, you need to make a modest alteration of the basic algorithm to ensure asymptotic convergence to Q*.

Approximate Q-iteration

One strength of this literature is the number of different ways of reducing the dimensionality of a dynamic programming problem.

Approximate Q-iteration is one example of this type of algorithm. Here, instead of operating on the Q-function directly, we operate on a parameterized approximation of the Q-function.

This can be done using approximations that are linear in the coefficients (splines, Chebyshev polynomials) or non-linear parametric approximations (neural nets).

We’ll focus on the linearly parametrized version here. We will form an approximation $\hat{Q}(s, x) = \sum_{l=1}^n \phi_l(s, x) \theta_l = \Phi(s, x)\theta$.

We will now talk about a variant of the Q-learning algorithm we previously discussed that uses gradient information to update the coefficient vector.

Here’s the basic idea:

  • Start with any guess of coefficients θ
  • Then on each iteration i:
    • Given the state-action pair at iteration i, $(s_i, x_i)$, the observed period-i return $r_i$, and the next state $s_{i+1}$
    • … update your guess for θ using:
      $$\theta_{i+1} = \theta_i + \alpha_i \left[r_i + \beta \max_{x'} \hat{Q}_i(s_{i+1}, x') - \hat{Q}_i(s_i, x_i)\right] \frac{\partial}{\partial \theta_i} \hat{Q}_i(s_i, x_i)$$
    • In the linear parameterization we are working with, this becomes
      $$\theta_{i+1} = \theta_i + \alpha_i \left[ r_i + \beta \max_{x'} \left( \Phi(s_{i+1}, x') \theta_i \right) - \Phi(s_i, x_i) \theta_i \right] \Phi(s_i, x_i)$$

As this is a Q-learning algorithm, the same computational benefits arise here, but now we additionally have the benefit of a potentially vastly reduced computational problem, depending on the size of the state-action space and the choice of basis functions.
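
Here is a minimal sketch of the linearly parameterized update above. The feature map `phi(s, x)` (returning the vector $\Phi(s, x)$) and the finite grid of candidate actions used for the max are hypothetical ingredients chosen for illustration.

```python
def gradient_q_update(theta, phi, s, x, r, s_next, actions, alpha, beta):
    """One gradient-style update for the approximation Q_hat(s, x) = phi(s, x) @ theta.

    `theta` and the vectors returned by `phi(s, x)` are assumed to be 1-D
    NumPy arrays, and `actions` is a finite grid of candidate actions used
    to compute the max on the right-hand side.
    """
    # max over candidate actions of the current approximate Q at the next state
    q_next = max(phi(s_next, xp) @ theta for xp in actions)

    # temporal-difference term: sampled Bellman target minus current estimate
    td = r + beta * q_next - phi(s, x) @ theta

    # for a linear parameterization, the gradient of Q_hat w.r.t. theta is phi(s, x)
    return theta + alpha * td * phi(s, x)
```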