
Chapter 9. Sequential Decision Problems: Dynamic Programming Formulation

The sequential decision problems discussed in the last three chapters were analyzed by variational methods, i.e., the necessary conditions for optimality were obtained by comparing the optimal decision with decisions in a small neighborhood of the optimum. Dynamic programming (DP) is a technique which compares the optimal decision with all the other decisions. This global comparison therefore leads to optimality conditions which are sufficient. The main advantage of DP, besides the fact that it gives sufficient conditions, is that it permits very general problem formulations which do not require differentiability or convexity conditions, or even the restriction to a finite-dimensional state space. The only disadvantage, which unfortunately often rules out its use, is that DP can easily give rise to enormous computational requirements. In the first section we develop the main recursion equation of DP for discrete-time problems. The second section deals with the continuous-time problem. Some general remarks and bibliographical references are collected in the final section.

9.1 Discrete-time DP

We consider a problem formulation similar to that of Chapter VI. However, for notational convenience we neglect final conditions and state-space constraints.

Maximize
$$\sum_{i=0}^{N-1} f^0(i, x(i), u(i)) + \Phi(x(N))$$
subject to
$$\begin{aligned}
\text{dynamics:}\quad & x(i+1) = f(i, x(i), u(i)), & i &= 0, 1, \ldots, N-1,\\
\text{initial condition:}\quad & x(0) = x_0, &&\\
\text{control constraint:}\quad & u(i) \in \Omega(i), & i &= 0, 1, \ldots, N-1.
\end{aligned} \tag{9.1}$$

In (9.1), the state $x(i)$ and the control $u(i)$ belong to arbitrary sets $X$ and $U$ respectively. $X$ and $U$ may be finite sets, or finite-dimensional vector spaces (as in the previous chapters), or even infinite-dimensional spaces. $x_0 \in X$ is fixed. The $\Omega(i)$ are fixed subsets of $U$. Finally, $f^0(i,\cdot,\cdot): X \times U \to \mathbb{R}$, $\Phi: X \to \mathbb{R}$, and $f(i,\cdot,\cdot): X \times U \to X$ are fixed functions.

The main idea underlying DP involves embedding the optimal control problem (9.1), in which the system starts in state $x_0$ at time 0, into a family of optimal control problems with the same dynamics, objective function, and control constraint as in (9.1), but with different initial states and initial times. More precisely, for each $x \in X$ and each $k$ between 0 and $N-1$, consider the following problem:

Maximize
$$\sum_{i=k}^{N-1} f^0(i, x(i), u(i)) + \Phi(x(N)),$$
subject to
$$\begin{aligned}
\text{dynamics:}\quad & x(i+1) = f(i, x(i), u(i)), & i &= k, k+1, \ldots, N-1,\\
\text{initial condition:}\quad & x(k) = x, &&\\
\text{control constraint:}\quad & u(i) \in \Omega(i), & i &= k, k+1, \ldots, N-1.
\end{aligned} \tag{9.2}$$

Since the initial time $k$ and initial state $x$ are the only parameters in the problem above, we will sometimes use the index $(9.2)_{k,x}$ to distinguish between different problems. We begin with an elementary but crucial observation.

Lemma 1: Suppose $u^*(k), \ldots, u^*(N-1)$ is an optimal control for $(9.2)_{k,x}$, and let $x^*(k) = x, x^*(k+1), \ldots, x^*(N)$ be the corresponding optimal trajectory. Then for any $\ell$, $k \le \ell \le N-1$, $u^*(\ell), \ldots, u^*(N-1)$ is an optimal control for $(9.2)_{\ell, x^*(\ell)}$.

Proof: Suppose not. Then there exists a control $\hat{u}(\ell), \hat{u}(\ell+1), \ldots, \hat{u}(N-1)$, with corresponding trajectory $\hat{x}(\ell) = x^*(\ell), \hat{x}(\ell+1), \ldots, \hat{x}(N)$, such that
$$\sum_{i=\ell}^{N-1} f^0(i, \hat{x}(i), \hat{u}(i)) + \Phi(\hat{x}(N)) > \sum_{i=\ell}^{N-1} f^0(i, x^*(i), u^*(i)) + \Phi(x^*(N)). \tag{9.3}$$

But then consider the control $\tilde{u}(k), \ldots, \tilde{u}(N-1)$ with
$$\tilde{u}(i) = \begin{cases} u^*(i), & i = k, \ldots, \ell-1,\\ \hat{u}(i), & i = \ell, \ldots, N-1, \end{cases}$$
whose corresponding trajectory, starting in state $x$ at time $k$, is $\tilde{x}(k), \ldots, \tilde{x}(N)$, where
$$\tilde{x}(i) = \begin{cases} x^*(i), & i = k, \ldots, \ell,\\ \hat{x}(i), & i = \ell+1, \ldots, N. \end{cases}$$

The value of the objective function corresponding to this control for the problem $(9.2)_{k,x}$ is
$$\sum_{i=k}^{N-1} f^0(i, \tilde{x}(i), \tilde{u}(i)) + \Phi(\tilde{x}(N)) = \sum_{i=k}^{\ell-1} f^0(i, x^*(i), u^*(i)) + \sum_{i=\ell}^{N-1} f^0(i, \hat{x}(i), \hat{u}(i)) + \Phi(\hat{x}(N)) > \sum_{i=k}^{N-1} f^0(i, x^*(i), u^*(i)) + \Phi(x^*(N)),$$
by (9.3), so that $u^*(k), \ldots, u^*(N-1)$ cannot be optimal for $(9.2)_{k,x}$, contradicting the hypothesis. ♦
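To make the "global comparison" behind Lemma 1 concrete, here is a minimal Python sketch (an illustration, not part of the original text) that solves $(9.2)_{k,x}$ by brute-force enumeration when $X$ and $U$ are finite. The names `N`, `f0`, `f`, `Phi`, and `Omega` are hypothetical stand-ins for the fixed data of (9.1).

```python
from itertools import product

def value_by_enumeration(k, x, N, f0, f, Phi, Omega):
    """Compute V(k, x) by comparing EVERY admissible control sequence --
    the global comparison that DP formalizes.  Assumes each Omega(i) is a
    finite iterable of controls; f0, f, Phi are the data of (9.1)."""
    best_value, best_controls = float("-inf"), None
    # every choice u(k), ..., u(N-1) with u(i) in Omega(i)
    for controls in product(*(Omega(i) for i in range(k, N))):
        total, state = 0.0, x
        for i, u in zip(range(k, N), controls):
            total += f0(i, state, u)     # running term f^0(i, x(i), u(i))
            state = f(i, state, u)       # dynamics x(i+1) = f(i, x(i), u(i))
        total += Phi(state)              # terminal term Phi(x(N))
        if total > best_value:
            best_value, best_controls = total, controls
    return best_value, best_controls
```

With such a function, Lemma 1 can be checked numerically: if the returned sequence is optimal for $(9.2)_{k,x}$ with trajectory $x^*(\cdot)$, then for each $\ell$ its tail $u^*(\ell), \ldots, u^*(N-1)$ should attain the value returned by `value_by_enumeration` started at $(\ell, x^*(\ell))$.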
From now on we assume that an optimal solution to $(9.2)_{k,x}$ exists for all $0 \le k \le N-1$ and all $x \in X$. Let $V(k, x)$ be the maximum value of $(9.2)_{k,x}$. We call $V$ the (maximum) value function.

Theorem 1: Define $V(N, \cdot)$ by $V(N, x) = \Phi(x)$. Then $V(k, x)$ satisfies the backward recursion equation
$$V(k, x) = \max\{f^0(k, x, u) + V(k+1, f(k, x, u)) \mid u \in \Omega(k)\}, \quad 0 \le k \le N-1. \tag{9.4}$$

Proof: Let $x \in X$, let $u^*(k), \ldots, u^*(N-1)$ be an optimal control for $(9.2)_{k,x}$, and let $x^*(k) = x, \ldots, x^*(N)$ be the corresponding trajectory. Let $u(k), \ldots, u(N-1)$ be any other admissible control, with corresponding trajectory $x(k) = x, \ldots, x(N)$. We have
$$\sum_{i=k}^{N-1} f^0(i, x^*(i), u^*(i)) + \Phi(x^*(N)) \ge \sum_{i=k}^{N-1} f^0(i, x(i), u(i)) + \Phi(x(N)). \tag{9.5}$$
By Lemma 1, the left-hand side of (9.5) is equal to
$$f^0(k, x, u^*(k)) + V(k+1, f(k, x, u^*(k))).$$
On the other hand, by the definition of $V$ we have
$$\sum_{i=k}^{N-1} f^0(i, x(i), u(i)) + \Phi(x(N)) = f^0(k, x, u(k)) + \left\{\sum_{i=k+1}^{N-1} f^0(i, x(i), u(i)) + \Phi(x(N))\right\} \le f^0(k, x, u(k)) + V(k+1, f(k, x, u(k))),$$
with equality if and only if $u(k+1), \ldots, u(N-1)$ is optimal for $(9.2)_{k+1, x(k+1)}$. Combining these two facts, we get
$$f^0(k, x, u^*(k)) + V(k+1, f(k, x, u^*(k))) \ge f^0(k, x, u(k)) + V(k+1, f(k, x, u(k))) \quad \text{for all } u(k) \in \Omega(k),$$
which is equivalent to (9.4). ♦

Corollary 1: Let $u(k), \ldots, u(N-1)$ be any control for the problem $(9.2)_{k,x}$ and let $x(k) = x, \ldots, x(N)$ be the corresponding trajectory. Then
$$V(\ell, x(\ell)) \le f^0(\ell, x(\ell), u(\ell)) + V(\ell+1, f(\ell, x(\ell), u(\ell))), \quad k \le \ell \le N-1,$$
and equality holds for all $k \le \ell \le N-1$ if and only if the control is optimal for $(9.2)_{k,x}$.

Corollary 2: For $k = 0, 1, \ldots, N-1$, let $\psi(k, \cdot): X \to \Omega(k)$ be such that
$$f^0(k, x, \psi(k, x)) + V(k+1, f(k, x, \psi(k, x))) = \max\{f^0(k, x, u) + V(k+1, f(k, x, u)) \mid u \in \Omega(k)\}.$$
Then $\psi(k, \cdot)$, $k = 0, \ldots, N-1$, is an optimal feedback control, i.e., for any $k, x$ the control $u^*(k), \ldots, u^*(N-1)$ defined by $u^*(\ell) = \psi(\ell, x^*(\ell))$, $k \le \ell \le N-1$, where
$$x^*(\ell+1) = f(\ell, x^*(\ell), \psi(\ell, x^*(\ell))), \quad k \le \ell \le N-1, \qquad x^*(k) = x,$$
is optimal for $(9.2)_{k,x}$.

Remark: Theorem 1 and Corollary 2 are the main results of DP. The recursion equation (9.4) allows us to compute the value function, and in evaluating the maximum in (9.4) we also obtain the optimal feedback control. Note that this feedback control is optimal for all initial conditions. However, unless we can find a "closed-form" analytic solution to (9.4), the DP formulation may necessitate a prohibitive amount of computation, since we would have to compute and store the values of $V$ and $\psi$ for all $k$ and $x$. For instance, suppose $N = 10$ and the state space $X$ is a finite set with 20 elements. Then we have to compute and store $10 \times 20$ values of $V$, which is a reasonable amount. But now suppose $X = \mathbb{R}^n$ and we approximate each dimension of $x$ by 20 values. Then for $N = 10$ we have to compute and store $10 \times 20^n$ values of $V$. For $n = 3$ this number is 80,000, and for $n = 5$ it is 32,000,000, which is quite impractical for existing computers. This "curse of dimensionality" seriously limits the applicability of DP to problems where we cannot solve (9.4) analytically. ♦

Exercise 1: An instructor is preparing to lead his class on a long hike. He assumes that each person can carry up to $W$ pounds in his knapsack. There are $N$ possible items to choose from. Each unit of item $i$ weighs $w_i$ pounds. The instructor assigns a number $U_i > 0$ to each unit of item $i$; these numbers represent the relative utility of that item during the hike. How many units of each item should be placed in each knapsack so as to maximize total utility? Formulate this problem by DP.
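To make recursion (9.4), Corollary 2, and the storage estimate in the Remark concrete, the following minimal Python sketch (again an illustration, not part of the original text) computes $V$ and $\psi$ for a finite state space. As before, `f0`, `f`, `Phi`, and `Omega` are hypothetical stand-ins for the data of (9.1); in addition, `f` is assumed to map $X$ into $X$, so every lookup $V(k+1, f(k, x, u))$ is defined.

```python
def solve_dp(N, X, f0, f, Phi, Omega):
    """Backward recursion (9.4): returns the value function V and the
    optimal feedback control psi of Corollary 2 as dictionaries keyed by
    (k, x).  Each holds on the order of N*|X| entries, the storage count
    discussed in the Remark.  Assumes X is a finite set of hashable states
    closed under the dynamics f, and each Omega(k) is finite and nonempty."""
    V = {(N, x): Phi(x) for x in X}          # boundary condition V(N, x) = Phi(x)
    psi = {}
    for k in range(N - 1, -1, -1):           # k = N-1, ..., 0
        for x in X:
            # maximize f^0(k, x, u) + V(k+1, f(k, x, u)) over u in Omega(k)
            best_u = max(Omega(k),
                         key=lambda u: f0(k, x, u) + V[(k + 1, f(k, x, u))])
            V[(k, x)] = f0(k, x, best_u) + V[(k + 1, f(k, x, best_u))]
            psi[(k, x)] = best_u             # optimal feedback control
    return V, psi
```

The double loop performs roughly $N \cdot |X| \cdot \max_k |\Omega(k)|$ evaluations and stores $N \cdot |X|$ values each of $V$ and $\psi$; replacing $X$ by a grid with 20 points per dimension of $\mathbb{R}^n$ gives exactly the $10 \times 20^n$ count in the Remark.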

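As a hint toward Exercise 1, the following sketch shows one possible DP formulation; the modeling choices are our own assumptions, not necessarily the intended ones. Take the stage $k$ to be the item index, the state $x$ to be the remaining integer capacity, and the control $u$ the number of units of item $k$ packed, so that the admissible control set depends on the state, a minor extension of (9.1). Integer weights are assumed.

```python
def pack_knapsack(W, weights, utilities):
    """One possible DP formulation of Exercise 1 (a sketch): stage k = item
    index, state x = remaining capacity in pounds, control u = number of
    units of item k packed, constrained by u * weights[k] <= x."""
    N = len(weights)
    V = {(N, x): 0 for x in range(W + 1)}    # no utility once items run out
    psi = {}
    for k in range(N - 1, -1, -1):
        for x in range(W + 1):
            # u ranges over 0, 1, ..., floor(x / w_k)
            best_u = max(range(x // weights[k] + 1),
                         key=lambda u: utilities[k] * u
                                       + V[(k + 1, x - weights[k] * u)])
            V[(k, x)] = utilities[k] * best_u + V[(k + 1, x - weights[k] * best_u)]
            psi[(k, x)] = best_u
    return V[(0, W)], psi
```

For example, `pack_knapsack(10, [3, 4], [5, 7])` returns a maximal utility of 17, attained by packing two units of the first item and one unit of the second.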
9.2 Continuous-time DP