
5.3 Boltzmann machines

The Boltzmann machine, as first described by Ackley, Hinton and Sejnowski (Ackley et al., 1985), can be seen as an extension of Hopfield networks to include hidden units, and with a stochastic instead of a deterministic update rule. The weights are still symmetric. The operation of the network is based on the physics principle of annealing. This is a process whereby a material is heated and then cooled very, very slowly to

a freezing point. As a result, the crystal lattice will be highly ordered, without any impurities, such that the system is in a state of very low energy. In the Boltzmann machine this system is mimicked by changing the deterministic update of equation (5.2) into a stochastic update, in which a neuron $k$ becomes active with a probability $p$:

$$p(y_k \leftarrow +1) = \frac{1}{1 + e^{-\Delta E_k / T}}$$

where $T$ is a parameter comparable with the (synthetic) temperature of the system and $\Delta E_k$ is the energy gap between the active and inactive states of unit $k$. This stochastic activation function is not to be confused with neurons having a sigmoid deterministic activation function.
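As an illustration (not part of the original text), the stochastic update rule can be sketched in Python; the function names and the way the energy gap is passed in are assumptions made for this example:

```python
import math
import random

def activation_probability(delta_E, T):
    """Probability p = 1 / (1 + exp(-delta_E / T)) that a unit becomes active."""
    return 1.0 / (1.0 + math.exp(-delta_E / T))

def stochastic_update(delta_E, T, rng=random):
    """Stochastic update rule: set the unit to 1 with probability p, else 0.

    At high T the probability stays near 1/2 (nearly random behaviour);
    as T drops, the rule approaches the deterministic threshold update.
    """
    return 1 if rng.random() < activation_probability(delta_E, T) else 0
```

Note how the temperature controls the sharpness of the decision: a large positive energy gap almost certainly activates the unit at low $T$, but barely biases it at high $T$.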

In accordance with a physical system obeying a Boltzmann distribution, the network will eventually reach `thermal equilibrium' and the relative probability of two global states $\alpha$ and $\beta$ will follow the Boltzmann distribution

$$\frac{P_\alpha}{P_\beta} = e^{-(E_\alpha - E_\beta)/T}$$

where $P_\alpha$ is the probability of being in the $\alpha$th global state, and $E_\alpha$ is the energy of that state. Note that at thermal equilibrium the units still change state, but the probability of finding the network in any global state remains constant.
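The distribution over global states can be illustrated with a short sketch; normalising $e^{-E_\alpha/T}$ over an explicit list of states is an assumption of this example (the text itself only fixes the ratio of two state probabilities):

```python
import math

def boltzmann_probabilities(energies, T):
    """Probability of each global state under the Boltzmann distribution.

    Each P_alpha is proportional to exp(-E_alpha / T), so the ratio of
    two state probabilities is exp(-(E_alpha - E_beta) / T).
    """
    weights = [math.exp(-E / T) for E in energies]
    Z = sum(weights)  # normalising constant (partition function)
    return [w / Z for w in weights]
```

Lowering $T$ concentrates the probability mass on the low-energy states, which is the bias the annealing schedule exploits.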

At low temperatures there is a strong bias in favour of states with low energy, but the time required to reach equilibrium may be long. At higher temperatures the bias is not so favourable, but equilibrium is reached faster. A good way to beat this trade-off is to start at a high temperature and gradually reduce it. At high temperatures, the network will ignore small energy differences and will rapidly approach equilibrium; as the temperature is lowered, it begins to respond to smaller energy differences and settles into one of the better minima found at the higher temperature.

Like multi-layer perceptrons, the Boltzmann machine consists of a non-empty set of visible units and a possibly empty set of hidden units. Here, however, the units are binary-valued and are updated stochastically and asynchronously. The simplicity of the Boltzmann distribution leads to a simple learning procedure which adjusts the weights so as to use the hidden units in an optimal way (Ackley et al., 1985). This algorithm works as follows.

First, the input and output vectors are clamped. The network is then annealed until it approaches thermal equilibrium. It then runs for a fixed time at equilibrium, and each connection measures the fraction of the time during which both the units it connects are active. This is repeated for all input-output pairs so that each connection can measure $\langle y_j y_k \rangle^{\text{clamped}}$, the expected probability, averaged over all cases, that units $j$ and $k$ are simultaneously active at thermal equilibrium when the input and output vectors are clamped.


Similarly, $\langle y_j y_k \rangle^{\text{free}}$ is measured when the output units are not clamped but determined by the network.

In order to determine optimal weights in the network, an error function must be determined.

Now, the probability $P^{\text{free}}(\mathbf{y}^p)$ that the visible units are in state $\mathbf{y}^p$ when the system is running freely can be measured. Also, the desired probability $P^{\text{clamped}}(\mathbf{y}^p)$ that the visible units are in state $\mathbf{y}^p$ is determined by clamping the visible units and letting the network run. Now, if the weights in the network are correctly set, both probabilities are equal to each other, and the error $E$ in the network must be 0. Otherwise, the error must have a positive value; as a measure, the `asymmetric divergence' or `Kullback information' is used:

$$E = \sum_p P^{\text{clamped}}(\mathbf{y}^p)\,\log\frac{P^{\text{clamped}}(\mathbf{y}^p)}{P^{\text{free}}(\mathbf{y}^p)} \qquad (5.13)$$
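A minimal sketch of the asymmetric divergence, assuming the two distributions are given as dictionaries mapping visible states to probabilities (a representation chosen for this example):

```python
import math

def asymmetric_divergence(p_clamped, p_free):
    """Kullback information E = sum_p P^clamped(y^p) log(P^clamped(y^p)/P^free(y^p)).

    E is zero exactly when the two distributions coincide, and positive
    otherwise; states with zero clamped probability contribute nothing.
    """
    return sum(pc * math.log(pc / p_free[s])
               for s, pc in p_clamped.items() if pc > 0.0)
```

Note that the measure is asymmetric: swapping the two distributions generally gives a different value, which is why the clamped ("desired") distribution appears in the leading factor.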

Now, in order to minimise $E$ using gradient descent, we must change the weights according to

$$\Delta w_{jk} = -\gamma \frac{\partial E}{\partial w_{jk}} \qquad (5.14)$$

It is not difficult to show that

$$\frac{\partial E}{\partial w_{jk}} = -\frac{1}{T}\left[\langle y_j y_k \rangle^{\text{clamped}} - \langle y_j y_k \rangle^{\text{free}}\right] \qquad (5.15)$$

Therefore, each weight is updated by

$$\Delta w_{jk} = \gamma\left[\langle y_j y_k \rangle^{\text{clamped}} - \langle y_j y_k \rangle^{\text{free}}\right] \qquad (5.16)$$
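The two-phase learning procedure can be sketched as follows; estimating the co-activation statistics from equilibrium samples and applying the weight update are shown with hypothetical helper names, and the Gibbs sampling that produces the samples is assumed to happen elsewhere:

```python
def co_activation(samples):
    """Estimate <y_j y_k>: the fraction of equilibrium samples in which
    units j and k are simultaneously active (samples are 0/1 vectors)."""
    n = len(samples[0])
    corr = [[0.0] * n for _ in range(n)]
    for y in samples:
        for j in range(n):
            for k in range(n):
                corr[j][k] += y[j] * y[k]
    return [[c / len(samples) for c in row] for row in corr]

def boltzmann_weight_update(W, corr_clamped, corr_free, gamma=0.1):
    """Apply Delta w_jk = gamma * (<y_j y_k>^clamped - <y_j y_k>^free) in place."""
    n = len(W)
    for j in range(n):
        for k in range(n):
            W[j][k] += gamma * (corr_clamped[j][k] - corr_free[j][k])
    return W
```

The update is purely local: each weight needs only the two co-activation statistics of the units it connects, one measured with the outputs clamped and one with the network running freely.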


6 Self-Organising Networks

In the previous chapters we discussed a number of networks which were trained to perform a mapping $F: \mathbb{R}^n \to \mathbb{R}^m$ by presenting the network `examples' $(\mathbf{x}^p, \mathbf{d}^p)$ with $\mathbf{d}^p = F(\mathbf{x}^p)$ of this mapping. However, problems exist where such training data, consisting of input and desired output pairs, are not available, but where the only information is provided by a set of input patterns $\mathbf{x}^p$. In these cases the relevant information has to be found within the (redundant) training samples $\mathbf{x}^p$. Some examples of such problems are:

- clustering: the input data may be grouped in `clusters' and the data processing system has to find these inherent clusters in the input data;
- vector quantisation: this problem occurs when a continuous space has to be discretised. The input of the system is the $n$-dimensional vector $\mathbf{x}$, the output is a discrete representation of the input space. The system has to find an optimal discretisation of the input space;
- dimensionality reduction: the input data are grouped in a subspace which has a lower dimensionality than the dimensionality of the data. The system has to learn an optimal mapping, such that most of the variance in the input data is preserved in the output data;
- feature extraction: the system has to extract features from the input signal. This often means a dimensionality reduction as described above.

In this chapter we discuss a number of neuro-computational approaches for these kinds of

problems. Training is done without the presence of an external teacher. The unsupervised weight adapting algorithms are usually based on some form of global competition between the neurons.

There are very many types of self-organising networks, applicable to a wide area of problems. One of the most basic schemes is competitive learning as proposed by Rumelhart and Zipser (Rumelhart & Zipser, 1985). A very similar network, but with different emergent properties, is the topology-conserving map devised by Kohonen. Other self-organising networks are ART, proposed by Carpenter and Grossberg, and Fukushima's cognitron (Fukushima, 1975, 1988).
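As a hedged illustration of the competitive-learning idea mentioned above (a minimal winner-take-all rule chosen for this sketch, not necessarily the exact variant developed later in the chapter):

```python
def competitive_step(weights, x, eta=0.1):
    """One winner-take-all competitive learning step.

    weights: a list of weight vectors, one per output neuron.
    The neuron whose weight vector lies closest to the input x wins the
    competition, and only the winner's weights move towards x:
        w <- w + eta * (x - w).
    Returns the index of the winning neuron.
    """
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    winner = min(range(len(weights)), key=lambda i: dist2(weights[i]))
    weights[winner] = [wi + eta * (xi - wi)
                       for wi, xi in zip(weights[winner], x)]
    return winner
```

Repeated over many input patterns, the weight vectors drift towards the centres of the clusters in the input data, which is how such a network discovers structure without an external teacher.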