Appendix H Formal Models for Artificial Intelligence Methods: Mathematical Model of Neural Network Learning
When we discussed neural networks in Chap. 11, we introduced the basic model of their learning, namely the backpropagation method. In the second section of this appendix we present a formal justification of the principles of this method. In the first section we introduce basic notions of mathematical analysis [121, 122, 237] that are used for this justification.
H.1 Selected Notions of Mathematical Analysis
Firstly, let us introduce the notions of a vector space and a normed vector space.

Definition H.1 Let V be a nonempty set closed under an addition operation +, and let K be a field. Let · be an external operation of left-hand side multiplication, i.e. a mapping from K × V to V whose result for a pair (a, w) ∈ K × V is denoted a · w, briefly aw.

A vector space is a structure consisting of the set V, the field K, and the operations +, ·, which fulfils the following conditions.

• The set V with the operation + is an Abelian group.
• ∀a, b ∈ K, w ∈ V : a(bw) = (ab)w.
• ∀a, b ∈ K, w ∈ V : (a + b)w = aw + bw.
• ∀a ∈ K, w, u ∈ V : a(w + u) = aw + au.
• ∀w ∈ V : 1 · w = w, where 1 is the identity element of multiplication in K.
Definition H.2 Let X be a vector space over a field K. A norm on X is a mapping ‖·‖ : X −→ R₊ fulfilling the following conditions.

• ∀x ∈ X : ‖x‖ = 0 ⇔ x = 0, where 0 is the zero vector in X.
• ∀x ∈ X, λ ∈ K : ‖λx‖ = |λ| · ‖x‖.
• ∀x, y ∈ X : ‖x + y‖ ≤ ‖x‖ + ‖y‖.
© Springer International Publishing Switzerland 2016
M. Flasiński, Introduction to Artificial Intelligence, DOI 10.1007/978-3-319-40022-8
Definition H.3 Let ‖·‖ be a norm on a vector space X. A pair (X, ‖·‖) is called a normed vector space.

Further on, we assume X is a normed vector space.
Now, we can define the directional derivative, the partial derivative, and the gradient. Let U ⊂ X be an open subset of X.
Definition H.4 Let f : U −→ R be a function, let a ∈ U, and let v ∈ X be a vector. If there exists a limit of the difference quotient

lim_{t→0} ( f(a + tv) − f(a) ) / t,

then this limit is called a directional derivative of the function f along the vector v at the point a, denoted ∂_v f(a).
Let X = Rⁿ, and let the vectors e_1 = (1, 0, 0, ..., 0), e_2 = (0, 1, 0, ..., 0), ..., e_n = (0, 0, ..., 0, 1) constitute the canonical basis for the space X. Let U ⊂ X be an open subset of X.
Definition H.5 If there exist directional derivatives ∂_{e_1} f(a), ∂_{e_2} f(a), ..., ∂_{e_n} f(a) of a function f : U −→ R along the vectors of the canonical basis e_1, e_2, ..., e_n, then they are called partial derivatives of the function f at the point a, denoted

∂f/∂x_1 (a), ∂f/∂x_2 (a), ..., ∂f/∂x_n (a).

Let f : U −→ R be a function, where the set U ⊂ Rⁿ is an open set. Let us assume that there exist the partial derivatives ∂f/∂x_1 (a), ∂f/∂x_2 (a), ..., ∂f/∂x_n (a) at the point a ∈ U.

Definition H.6 A vector

∇f(a) = ( ∂f/∂x_1 (a), ∂f/∂x_2 (a), ..., ∂f/∂x_n (a) )

is called a gradient of the function f at the point a.

Theorem H.1 At a given point, a directional derivative has the maximum absolute value in the direction of the gradient vector. Thus, a function increases (or decreases) most rapidly in the gradient direction.
We will make use of this property in the next section.
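The property stated in Theorem H.1 can be checked numerically. The following sketch (an illustration added here; the example function f(x, y) = x² + 3y² and all numerical values are assumptions, not from the text) compares directional derivatives, computed via the difference quotient of Definition H.4, along many unit vectors:

```python
import numpy as np

def f(p):
    # Example function f(x, y) = x^2 + 3y^2 (an illustrative assumption).
    x, y = p
    return x**2 + 3 * y**2

def directional_derivative(f, a, v, h=1e-6):
    # Difference quotient from Definition H.4: (f(a + h*v) - f(a)) / h.
    return (f(a + h * v) - f(a)) / h

a = np.array([1.0, 2.0])
grad = np.array([2 * a[0], 6 * a[1]])        # analytic gradient (2x, 6y)
g_unit = grad / np.linalg.norm(grad)

# Sample unit directions on the circle; per Theorem H.1 the directional
# derivative should be largest along grad / |grad|.
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
derivs = [directional_derivative(f, a, np.array([np.cos(t), np.sin(t)]))
          for t in angles]
best = np.array([np.cos(angles[np.argmax(derivs)]),
                 np.sin(angles[np.argmax(derivs)])])

print(bool(np.allclose(best, g_unit, atol=0.02)))  # best direction ≈ gradient direction
```

The maximum value found this way also approximates ‖∇f(a)‖, since ∂_v f(a) = ⟨∇f(a), v⟩ for unit vectors v.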
H.2 Backpropagation Learning of Neural Networks
In this section we introduce a formalization of the backpropagation method of neural network learning [252], which was presented in an intuitive way in Chap. 11. Firstly, let us discuss its general idea.
We train a neural network, i.e., we modify its weights, in order to minimize an error function of the classification of vectors belonging to the training set. All the weights of the neural network are variables of this function. Let us denote this function by E(W), where W = (W_1, W_2, ..., W_N) is the vector of weights of all the neurons. At the j-th step of a learning process we have an error E(W(j)), briefly E(j). This error will be minimized with the method of steepest descent, which can be defined in the following way:
W(j + 1) = W(j) − α ∇E(W(j)),     (H.1)

where α is a learning-rate coefficient and

∇E(W(j)) = ( ∂E(j)/∂W_1(j), ∂E(j)/∂W_2(j), ..., ∂E(j)/∂W_N(j) )

is a gradient of the function E.

Now, let us introduce denotations according to those used in Chap. 11. N^(r)(k) denotes the k-th neuron of the r-th layer. Let us assume that a network consists of L layers, and the r-th layer consists of M_r neurons. The output signal of the k-th neuron of the r-th layer at the j-th step of learning is denoted by y^(r)(k)(j). The input signal at the i-th input of the k-th neuron of the r-th layer at the j-th step of learning is denoted by X_i^(r)(k)(j), and the corresponding weight is denoted by W_i^(r)(k)(j).
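As an illustration of rule (H.1), the following sketch runs steepest descent on a toy error function E(W) = (W_1 − 3)² + (W_2 + 1)²; the function, the starting point, and the value of α are illustrative assumptions, not taken from the text:

```python
import numpy as np

def grad_E(W):
    # Gradient of the assumed toy error E(W) = (W1 - 3)^2 + (W2 + 1)^2,
    # whose minimum lies at W = (3, -1).
    return np.array([2 * (W[0] - 3), 2 * (W[1] + 1)])

alpha = 0.1                          # illustrative learning-rate coefficient
W = np.zeros(2)                      # W(0)
for j in range(100):
    W = W - alpha * grad_E(W)        # (H.1): W(j+1) = W(j) - alpha * grad E(W(j))

print(np.round(W, 4))                # converges toward the minimum (3, -1)
```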
Let us define the function E as a mean squared error function at the output of the network, i.e.

E(j) = (1/2) Σ_{m=1}^{M_L} ( u^(m)(j) − y^(L)(m)(j) )²,     (H.2)

where u^(m)(j) is a required output signal for the m-th neuron of the L-th layer at the j-th step.
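For instance, the error (H.2) for a hypothetical output layer with M_L = 3 neurons can be computed as follows (all numerical values are illustrative assumptions):

```python
import numpy as np

# Required (target) outputs u^(m)(j) and actual outputs y^(L)(m)(j)
# of an assumed 3-neuron output layer at one learning step j.
u = np.array([1.0, 0.0, 1.0])
y = np.array([0.8, 0.1, 0.6])

# (H.2): E(j) = 1/2 * sum over m of (u^(m)(j) - y^(L)(m)(j))^2
E = 0.5 * np.sum((u - y) ** 2)
print(E)   # 0.5 * (0.04 + 0.01 + 0.16) = 0.105
```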
First of all, let us define a formula for the value of the i-th weight of the k-th neuron of the r-th layer at the (j + 1)-th step of learning. From formulas (H.1) and (H.2) we obtain

W_i^(r)(k)(j + 1) = W_i^(r)(k)(j) − α · ∂E(j)/∂W_i^(r)(k)(j)

= W_i^(r)(k)(j) − α · ∂E(j)/∂v^(r)(k)(j) · ∂v^(r)(k)(j)/∂W_i^(r)(k)(j)     (H.3)

= W_i^(r)(k)(j) − α · ∂E(j)/∂v^(r)(k)(j) · X_i^(r)(k)(j),

where v^(r)(k)(j) denotes the weighted sum of the inputs of the neuron N^(r)(k) at the j-th step.
Now, let us introduce the following denotation in the formula (H.3):

δ^(r)(k)(j) = − ∂E(j)/∂v^(r)(k)(j).     (H.4)
Then, we obtain the following formula:

W_i^(r)(k)(j + 1) = W_i^(r)(k)(j) + α δ^(r)(k)(j) X_i^(r)(k)(j).     (H.5)
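Update rule (H.5) can be sketched for a single output-layer neuron. The sigmoid activation, the value of α, all numerical data, and the expression δ = (u − y) f′(v), which follows from (H.2) and (H.4) for an output neuron, are assumptions of this illustration:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

alpha = 0.5                          # illustrative learning-rate coefficient
X = np.array([0.3, 0.7, 1.0])        # inputs X_i^(r)(k)(j); last entry: bias input
W = np.array([0.1, -0.2, 0.05])      # initial weights W_i^(r)(k)(j)
u = 1.0                              # required output u^(m)(j)

for j in range(2000):
    v = W @ X                        # weighted sum of inputs
    y = sigmoid(v)                   # neuron output
    # For an output neuron with the MSE error (H.2): delta = (u - y) * f'(v),
    # and for the sigmoid f'(v) = y * (1 - y).
    delta = (u - y) * y * (1 - y)
    W = W + alpha * delta * X        # (H.5): W(j+1) = W(j) + alpha*delta*X

print(abs(u - sigmoid(W @ X)) < 0.1)  # the output approaches the target
```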
The formula (H.5) is analogous to the formula (11.16) in Sect. 11.2, which includes a description of the backpropagation method. At the end of our considerations, we should derive a formula for δ^(r)(k)(j). Let us determine it, firstly, for neurons of the input layer and the hidden layers.
δ^(r)(k)(j) = − ∂E(j)/∂v^(r)(k)(j) = − Σ_m ∂E(j)/∂v^(r+1)(m)(j) · ∂v^(r+1)(m)(j)/∂v^(r)(k)(j).     (H.6)
By applying the formula (H.4) and making use of the formula (11.1) introduced in Sect. 11.2, we receive

δ^(r)(k)(j) = Σ_m δ^(r+1)(m)(j) · ∂( Σ_i W_i^(r+1)(m)(j) X_i^(r+1)(m)(j) ) / ∂v^(r)(k)(j).
From the formula (11.12) introduced in Sect. 11.2 we find that X_i^(r+1)(m)(j) = y^(r)(i)(j). Thus,

δ^(r)(k)(j) = Σ_m δ^(r+1)(m)(j) · ∂( Σ_i W_i^(r+1)(m)(j) y^(r)(i)(j) ) / ∂v^(r)(k)(j).
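In the spirit of (H.6), the δ of a hidden-layer neuron is assembled from the δ values of the next layer through the connecting weights, scaled by the derivative of the activation. The sigmoid activation and all numerical values below are illustrative assumptions, not taken from the text:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Assumed sizes: layer r has 2 neurons, layer r+1 has 3 neurons.
v_r = np.array([0.2, -0.4])                  # weighted sums v^(r)(k)(j)
delta_next = np.array([0.1, -0.05, 0.02])    # deltas delta^(r+1)(m)(j)
W_next = np.array([[0.3, -0.1],              # W_next[m, k] = weight on the input
                   [0.5,  0.2],              # of neuron m of layer r+1 that comes
                   [-0.4, 0.6]])             # from neuron k of layer r

y_r = sigmoid(v_r)
f_prime = y_r * (1.0 - y_r)                  # sigmoid'(v) = y * (1 - y)

# Chain rule of (H.6) with X_i^(r+1)(m) = y^(r)(i):
# delta^(r)(k) = f'(v^(r)(k)) * sum_m delta^(r+1)(m) * W_k^(r+1)(m)
delta_r = f_prime * (W_next.T @ delta_next)
print(delta_r.shape)                         # one delta per neuron of layer r
```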