Konda and Tsitsiklis (2003) (Actor-critic Algorithms)
Background
When a Markov decision process (MDP) is formulated as a dynamic programming problem, the reinforcement learning literature proposes two classic classes of algorithms to solve it. Let’s briefly review these two types of algorithms and point out the strengths and weaknesses of each.
1: Actor only methods
We can think of an actor as a fictitious character that operates on a policy rule.
When I talk about the performance of an actor, I mean the value of following its policy.
These methods are often implemented by estimating the gradient of the performance of an actor using simulation.
There are two main issues with these policy-gradient-style algorithms:
- Gradient estimators can have high variance (see the sketch after this list)
- Each time the policy changes, a new gradient has to be estimated from scratch, so nothing is learned from past data.
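To make the variance point concrete, here is a minimal score-function (REINFORCE-style) sketch on a toy two-state MDP with a softmax policy. Everything here (the MDP, the payoffs, the function names) is an illustrative assumption of mine, not something from the paper; the point is only that independent simulation batches give noticeably different gradient estimates, and none of the simulated data carries over once the policy parameters change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (purely illustrative, not from the paper).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] transition probabilities
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],                 # R[s, a] one-period reward
              [0.0, 2.0]])

def policy(theta, s):
    """Softmax policy over the two actions in state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_policy(theta, s, a):
    """Gradient of log pi(a|s) with respect to theta (same shape as theta)."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g

def reinforce_estimate(theta, episodes=20, T=30, gamma=0.95):
    """Score-function estimate of the gradient of expected discounted return."""
    g_hat = np.zeros_like(theta)
    for _ in range(episodes):
        s, score, ret = 0, np.zeros_like(theta), 0.0
        for t in range(T):
            a = rng.choice(2, p=policy(theta, s))
            score += grad_log_policy(theta, s, a)
            ret += (gamma ** t) * R[s, a]
            s = rng.choice(2, p=P[s, a])
        g_hat += score * ret
    return g_hat / episodes

theta = np.zeros((2, 2))
# Independent batches give noticeably different estimates (high variance), and the
# simulated data cannot be reused once theta changes.
for _ in range(3):
    print(reinforce_estimate(theta))
```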
2: Critic only methods
We can think of a critic as operating on either a Q or a V value function.
These methods rely exclusively on value function approximation and try to learn the solution to Bellman’s equation.
The main issues with critic only methods are:
- They are indirect in that they do not try to optimize directly over the policy space
- Even with an accurate approximation of the value function, it is difficult to guarantee the near-optimality of the policy derived from it.
Main idea
This paper suggests two actor-critic hybrid methods that aim to maintain the best features of each algorithm, while overcoming the shortcomings mentioned above.
The main idea behind the algorithms is that the critic uses a linearly parameterized approximation scheme and simulation to learn a value function. The actor then uses this learned information to update the parameters of the policy function in a direction of performance improvement.
Aside to tie back to econ: This feels like modified policy iteration or Howard’s improvement algorithm, but it is different in a few ways:
- There is a learning element to these algorithms, which means we don’t have to compute expectations explicitly.
- We will be learning Q functions, which give the value of being in a state and taking a particular feasible action (instead of the V function, which gives the value of being in a state and then acting optimally).
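For concreteness, in the familiar discounted setting (the paper itself works with an average-value criterion), the two objects are related by the usual identities below, where g is the flow payoff, β the discount factor, p the transition kernel, and U(x) the feasible action set; the notation is mine, not the paper’s.

```latex
Q(x, u) = g(x, u) + \beta \sum_{x'} p(x' \mid x, u)\, V(x'),
\qquad
V(x) = \max_{u \in U(x)} Q(x, u).
```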
Algorithms
The presentation is very technical and relies on assumptions that aren’t necessarily applicable to the models we write down, so I won’t follow the paper exactly as it was written. Instead, I will sketch the algorithm and explain the key insight the authors have that makes the algorithm tractable.
Setup
We will represent the critic using three variables:
- A coefficient vector of length m that describes a linear parametrization of Q in terms of basis functions.
- A scalar α that represents the average value of following the actor’s policy
- A vector of length m that represents Sutton’s eligibility trace. This vector is used to form a bridge between fixed point methods and Monte Carlo methods
The actor is represented by a suitably parameterized policy function.
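As a purely illustrative sketch of the bookkeeping (the names and shapes are my assumptions, with m basis functions for the critic and an n-dimensional policy parameter for the actor):

```python
import numpy as np
from dataclasses import dataclass, field

m = 8   # number of basis functions for the critic (illustrative)
n = 8   # dimension of the actor's policy parameter

@dataclass
class Critic:
    r: np.ndarray = field(default_factory=lambda: np.zeros(m))  # coefficients: Q(x, u) ~ r @ phi(x, u)
    alpha: float = 0.0                                          # running estimate of the average value of the policy
    z: np.ndarray = field(default_factory=lambda: np.zeros(m))  # Sutton's eligibility trace

@dataclass
class Actor:
    theta: np.ndarray = field(default_factory=lambda: np.zeros(n))  # parameters of the policy pi_theta(u | x)
```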
Algo
In order to understand the algorithm, I need to provide two related definitions. A temporal difference is the difference between the current approximation of a variable and a realization of that variable. In other words it is the error in our approximation for a particular sample.
We say we update parameters or approximations using a temporal difference if the new approximation is the sum of the current approximation and a scaled temporal difference.
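For the Q coefficients, the relevant temporal difference in the paper’s average-value formulation looks roughly like the following (in my notation, with flow payoff g, a linear critic, and α the current estimate of the average value); a “temporal difference update” of some approximation is then new = old + step × d_k, possibly scaled by the eligibility trace.

```latex
Q_r(x, u) = r^\top \phi(x, u),
\qquad
d_k = g(x_k, u_k) - \alpha_k + Q_{r_k}(x_{k+1}, u_{k+1}) - Q_{r_k}(x_k, u_k).
```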
The algorithm roughly proceeds as follows:
- Initialize the actor and critic
- Perform one-step updates of the actor and critic as follows:
- Because we learn Q, we enter a time period with a particular state and action in hand
- We simulate the next state given the current state and action, and then the actor’s policy dictates a new action at that state
- The critic’s parameters are updated according to:
- Average value of policy: temporal difference update (using flow implied by state and action)
- Coefficient vector for Q: temporal difference update, scaled by eligibility trace
- Eligibility trace: decay the old trace (by a factor λ) and add the basis-function vector evaluated at the new state-action pair
- The actor’s parameter vector is updated with a gradient-style step (scaled to keep the step size under control) that uses the critic’s current estimate of Q (see the sketch after this list)
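Putting the pieces together, here is a rough, self-contained sketch of the loop on the same toy MDP as in the earlier sketch (repeated here so this block stands alone). The one-hot features, step sizes, trace-decay parameter, and the unscaled actor step are my simplifications, not the paper’s exact scheme; the sign of the actor step is written for value maximization, whereas the paper minimizes average cost.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same illustrative 2-state, 2-action MDP as before.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])   # P[s, a, s']
G = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # flow payoff g(s, a)

def policy(theta, s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_policy(theta, s, a):
    grad = np.zeros_like(theta)
    grad[s] = -policy(theta, s)
    grad[s, a] += 1.0
    return grad

def phi(s, a):
    """One-hot features over state-action pairs, so Q(s, a) ~ r @ phi(s, a)."""
    f = np.zeros(4)
    f[2 * s + a] = 1.0
    return f

# Critic: coefficients r, average value alpha, eligibility trace z.  Actor: theta.
r, alpha, z = np.zeros(4), 0.0, np.zeros(4)
theta = np.zeros((2, 2))
gamma_step, beta_step, lam = 0.05, 0.01, 0.5

s, a = 0, 0
for k in range(5000):
    s_next = rng.choice(2, p=P[s, a])                # simulate the next state given (s, a) in hand
    a_next = rng.choice(2, p=policy(theta, s_next))  # the actor dictates the next action

    # Temporal difference: flow - average value + Q(next pair) - Q(current pair).
    d = G[s, a] - alpha + r @ phi(s_next, a_next) - r @ phi(s, a)

    alpha += gamma_step * (G[s, a] - alpha)          # temporal-difference-style update of the average value
    r += gamma_step * d * z                          # TD update of the Q coefficients, scaled by the trace
    z = lam * z + phi(s_next, a_next)                # decay the old trace, add the new feature vector

    # Actor: gradient-style step using the critic's current Q estimate (ascent, since we maximize value).
    theta += beta_step * (r @ phi(s_next, a_next)) * grad_log_policy(theta, s_next, a_next)

    s, a = s_next, a_next

print("estimated average value:", alpha)
print("policy in state 0:", policy(theta, 0), " policy in state 1:", policy(theta, 1))
```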
The key insight the authors have that makes this algorithm tractable is the following:
The actor only has to update a small number of parameters relative to the number of states. So the critic doesn’t need to approximate Q over its entire domain; it only needs to learn the projection of Q onto the low-dimensional subspace spanned by basis functions tied to the actor’s parametrization (the gradients of the log-policy with respect to the actor’s parameters).
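In symbols (notation mine: ψ_θ is the gradient of the log-policy with respect to the actor’s parameters, η_θ the stationary state-action distribution under the current policy, and ᾱ(θ) its average value), the gradient the actor needs is

```latex
\nabla_\theta \bar{\alpha}(\theta)
  = \sum_{x, u} \eta_\theta(x, u)\, Q_\theta(x, u)\, \psi_\theta(x, u),
\qquad
\psi_\theta(x, u) = \nabla_\theta \ln \pi_\theta(u \mid x),
```

so only the inner products of Q with the n functions ψ¹_θ, …, ψⁿ_θ matter. The critic therefore only has to learn the projection of Q onto the subspace spanned by those functions, which is why the paper requires the critic’s basis functions to span that subspace.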