
7.3 Barto's approach: the ASE-ACE combination

Barto, Sutton and Anderson (Barto et al., 1983) have formulated `reinforcement learning' as a learning strategy which does not need a set of examples provided by a `teacher.' The system described by Barto explores the space of alternative input-output mappings and uses an evaluative feedback (reinforcement signal) on the consequences of the control signal (network output) on the environment. It has been shown that such reinforcement learning algorithms implement an on-line, incremental approximation to the dynamic programming method for optimal control; they are also called `heuristic' dynamic programming (Werbos, 1990).

The basic building blocks in the Barto network are an Associative Search Element (ASE), which uses a stochastic method to determine the correct relation between input and output, and an Adaptive Critic Element (ACE), which learns to give a correct prediction of future reward or punishment (Figure 7.2). The external reinforcement signal r can be generated by a special sensor (for example a collision sensor of a mobile robot) or be derived from the state vector. For example, in control applications, where the state s of a system should remain in a certain part A of the control space, reinforcement is given by:

    r(t) =  0   if s(t) is in A,
           -1   otherwise.
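A minimal Python sketch of such a reinforcement signal, assuming the admissible region is given by simple per-variable bounds (the function and parameter names are illustrative, not part of the original formulation):

    import numpy as np

    def external_reinforcement(s, lower, upper):
        """Return 0 while the state stays inside the admissible region, -1 on failure."""
        inside = np.all((s >= lower) & (s <= upper))
        return 0.0 if inside else -1.0

    # Example: the second state variable has drifted out of its admissible range.
    r = external_reinforcement(np.array([0.3, 2.9]),
                               lower=np.array([-2.4, -2.4]),
                               upper=np.array([2.4, 2.4]))   # r = -1.0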

7.3.1 Associative search

In its most elementary form the ASE gives a binary output value o(t) ∈ {0, 1} as a stochastic function of an input vector. The total input of the ASE is, similar to the neuron presented in chapter 2, the weighted sum of the inputs, with the exception that the bias input in this case is a stochastic variable N with a zero-mean normal distribution:

    s(t) = Σ_j w_j x_j(t) + N(t).

The activation function is a threshold such that

    o(t) = 1   if s(t) > 0,
           0   otherwise.
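A rough Python sketch of this stochastic output (the noise standard deviation and the names used here are illustrative assumptions):

    import numpy as np

    def ase_output(w, x, noise_std=0.1, rng=None):
        """Binary ASE output: weighted input sum plus zero-mean Gaussian noise, thresholded at 0."""
        rng = np.random.default_rng() if rng is None else rng
        s = float(w @ x) + rng.normal(0.0, noise_std)   # s(t) = sum_j w_j x_j(t) + N(t)
        return 1 if s > 0 else 0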


Figure 7.2: Architecture of a reinforcement learning scheme with critic element.

For updating the weights, a Hebbian type of learning rule is used. However, the update is weighted with the reinforcement signal r(t), and an `eligibility' e_j is used instead of the product o(t) x_j(t) of input and output:

    Δw_j(t) = α r(t) e_j(t),

where α is a learning factor. The eligibility e_j is given by

    e_j(t + 1) = δ e_j(t) + (1 - δ) o(t) x_j(t),

with δ the decay rate of the eligibility. The eligibility acts as a kind of memory: e_j is high if the signals from the input state unit j and the output unit are correlated over some time.
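A sketch of one ASE learning step under these two equations; the learning factor α, the decay rate δ and the function interface are illustrative choices:

    def ase_update(w, e, x, o, r, alpha=0.5, delta=0.9):
        """One ASE step: reinforcement-weighted Hebbian update, then eligibility decay.

        r is the reinforcement signal: the external r(t), or, more usually,
        the internal signal from the ACE (see below)."""
        w = w + alpha * r * e                    # Δw_j(t) = α r(t) e_j(t)
        e = delta * e + (1.0 - delta) * o * x    # e_j(t+1) = δ e_j(t) + (1-δ) o(t) x_j(t)
        return w, e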

Using r(t) directly in this update rule has the disadvantage that learning only takes place when there is an external reinforcement signal. Instead of r(t), usually a continuous internal reinforcement signal r̂(t), given by the ACE, is used.

Barto and Anandan (Barto & Anandan, 1985) proved convergence for the case of a single binary output unit and a set of linearly independent input patterns. In control applications, the input vector is the (n-dimensional) state vector s of the system. In order to obtain a linearly independent set of patterns, often a `decoder' is used, which divides the range of each of the input variables s_i into a number of intervals. The aim is to divide the input (state) space into a number of disjunct subspaces (or `boxes' as they are called by Barto). The input vector can therefore only be in one subspace at a time. The decoder converts the input vector into a binary valued vector x with only one element equal to one, indicating which subspace is currently visited. It has been shown (Kröse & Dam, 1992) that instead of an a priori quantisation of the input space, a self-organising quantisation, based on methods described in this chapter, results in a better performance.
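A sketch of such an a priori decoder, assuming each state variable comes with a list of interval boundaries (the names and the ordering of the boxes are illustrative choices):

    import numpy as np

    def decode(s, boundaries):
        """Convert a state vector into a one-hot 'box' vector over all threshold combinations."""
        idx = [int(np.searchsorted(b, v)) for v, b in zip(s, boundaries)]  # interval per variable
        sizes = [len(b) + 1 for b in boundaries]                           # intervals per variable
        x = np.zeros(int(np.prod(sizes)))
        x[np.ravel_multi_index(idx, sizes)] = 1.0                          # single active subspace
        return x

    # Example: two state variables, each split into three intervals -> 9 boxes.
    x = decode([0.2, -1.3], boundaries=[[-1.0, 1.0], [-1.0, 1.0]])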

7.3.2 Adaptive critic

The Adaptive Critic Element (ACE, or `evaluation network') is basically the same as the critic described earlier. An error signal is derived from the temporal difference of two successive predictions (in this case denoted by p) and is used for training the ACE:

    r̂(t) = r(t) + γ p(t) - p(t - 1).


p(t) is implemented as a series of `weights' C_j to the ACE such that

    p(t) = C_k                                                        (7.14)

if the system is in state k at time t, denoted by x_k = 1. The function is learned by adjusting the C_j's according to a `delta-rule' with an error signal given by r̂(t):

    ΔC_j(t) = β r̂(t) h_j(t),

where β is the learning parameter and h_j(t) indicates the `trace' of neuron x_j:

    h_j(t) = λ h_j(t - 1) + (1 - λ) x_j(t - 1).

This trace acts as a low-pass filter: the credit assigned to state j increases while state j is active and decays exponentially after the activity of j has expired.
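A sketch of one combined ACE step under these equations, with illustrative values for the discount γ, the learning parameter β and the trace decay λ (the function name and interface are assumptions):

    def ace_step(C, h, x_prev, x, r, gamma=0.95, beta=0.5, lam=0.8):
        """One ACE step: internal reinforcement from two successive predictions, then delta-rule."""
        p_prev = float(C @ x_prev)               # p(t-1); equals C_k of the previously visited box
        p      = float(C @ x)                    # p(t)
        r_hat  = r + gamma * p - p_prev          # r̂(t) = r(t) + γ p(t) - p(t-1)
        C      = C + beta * r_hat * h            # ΔC_j(t) = β r̂(t) h_j(t)
        h      = lam * h + (1.0 - lam) * x_prev  # trace of recently active state units
        return C, h, r_hat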

If r̂(t) is positive, the action of the system has resulted in a higher evaluation value, whereas a negative r̂(t) indicates a deterioration of the system. r̂(t) can thus be considered as an internal reinforcement signal.

7.3.3 The cart-pole system

A classical example is the cart-pole balancing system (see Figure 7.3). Here, a controller must control the cart in such a way that the pole always stands up straight. The controller applies a `left' or `right' force of fixed magnitude to the cart at discrete time intervals. The model has four state variables:

    x   the position of the cart on the track,
    θ   the angle of the pole with the vertical,
    ẋ   the cart velocity, and
    θ̇   the angular velocity of the pole.

Furthermore, a set of parameters specify the pole length and mass, cart mass, coefficients of friction between the cart and the track and at the hinge between the pole and the cart, the control force magnitude, and the force due to gravity. The state space is partitioned on the basis of the following quantisation thresholds:

1. x: ±0.8, ±2.4 m,

2. θ: 0, ±1, ±6, ±12°,

3. ẋ: ±0.5, ±∞ m/s,

4. θ̇: ±50, ±∞ °/s.

This yields 3 × 6 × 3 × 3 = 162 regions, corresponding to all of the combinations of the intervals.

The decoder output is a 162-dimensional vector. A negative reinforcement signal is provided

when the state vector gets out of the admissible range, i.e. when |x| > 2.4 m or |θ| > 12°. The system has proved to solve the problem in about 75 learning steps.
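A sketch of the cart-pole decoder implied by these thresholds, returning one of the 162 boxes or signalling failure; the exact ordering of the boxes is an arbitrary choice made here:

    import numpy as np

    X_T     = [-0.8, 0.8]                      # cart position thresholds (m), 3 intervals
    THETA_T = [-6.0, -1.0, 0.0, 1.0, 6.0]      # pole angle thresholds (degrees), 6 intervals
    XDOT_T  = [-0.5, 0.5]                      # cart velocity thresholds (m/s), 3 intervals
    TDOT_T  = [-50.0, 50.0]                    # pole angular velocity thresholds (deg/s), 3 intervals

    def cartpole_box(x, theta, x_dot, theta_dot):
        """Return the box index (0..161), or None when the state has left the admissible range."""
        if abs(x) > 2.4 or abs(theta) > 12.0:
            return None                                   # failure: external reinforcement is -1
        ix  = int(np.searchsorted(X_T, x))                # 0..2
        it  = int(np.searchsorted(THETA_T, theta))        # 0..5
        ixd = int(np.searchsorted(XDOT_T, x_dot))         # 0..2
        itd = int(np.searchsorted(TDOT_T, theta_dot))     # 0..2
        return ((ix * 6 + it) * 3 + ixd) * 3 + itd        # 3 * 6 * 3 * 3 = 162 boxes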


Figure 7.3: The cart-pole system.