
Operations Research Letters 27 (2000) 119–126
www.elsevier.com/locate/dsw

Constrained Markovian decision processes: the dynamic programming approach ☆

A.B. Piunovskiy a,∗, X. Mao b

a Department of Mathematical Science, Division of Statistics and Operational Research, M and O Building, University of Liverpool, Liverpool L69 7ZL, UK
b Strathclyde University, Glasgow, UK

Received 1 April 1999; received in revised form 1 May 2000

☆ This work was performed under a grant of the Royal Society, UK.
∗ Corresponding author. E-mail address: piunov@liverpool.ac.uk (A.B. Piunovskiy).

Abstract
We consider semicontinuous controlled Markov models in discrete time with total expected losses. Only control strategies which meet a set of given constraint inequalities are admissible. One has to build an optimal admissible strategy. The main result consists in the constructive development of an optimal strategy with the help of the dynamic programming method. The model studied covers the case of a finite horizon and the case of a homogeneous discounted model with different discount factors. © 2000 Elsevier Science B.V. All rights reserved.
Keywords: Markovian decision processes; Constrained optimization; Dynamic programming; Penalty functions


1. Introduction

Constrained Markov decision processes have been studied by different authors during the last 15 years. From the formal point of view, such problems can be reformulated as linear programs on spaces of measures, and the convex analytic approach makes it possible to formulate necessary and sufficient conditions of optimality (Kuhn–Tucker Theorem), to establish the form of optimal strategies, and so on. The corresponding results can be found, for instance, in the monographs by Altman [1], Borkar [3], and Piunovskiy [6], and in the papers by Feinberg and Shwartz [5] and by Piunovskiy [7].

The present article is devoted to the dynamic programming approach to constrained problems. The main idea is similar to the penalty function method; as a result, we obtain an unconstrained optimization model with deterministic transitions in which the Bellman principle is valid. If the problem were originally deterministic then the basic idea would be to include the accumulated cost corresponding to the constraints in the state, and to assign an immediate cost of infinity if that part of the state exceeds the given bound on the constraint. In the stochastic case, this approach becomes a bit more complicated since we must also remember the current probability distribution on the state space. It should be emphasized that such an approach can be developed only using the results
obtained earlier with the help of the convex analytic approach (Theorem 1 below). We consider here the case of total expected losses, which covers the finite-horizon model and the homogeneous discounted model (with different discount factors) as particular cases.

2. Model description and auxiliary results
Let us consider the controlled model Z = {X, A, p}, where X is the Borel state space, A is the action space (a metric compact), and p_t(dy|x, a) is the continuous transition probability, that is, ∫_X c(y) p_t(dy|x, a) is a continuous function of (x, a) for each continuous c(·). As usual, a control strategy π is a sequence of measurable stochastic kernels π_t(da|h_{t−1}) on A, where h_{t−1} = (x_0, a_1, x_1, …, a_{t−1}, x_{t−1}). A strategy is called Markov if it is of the form π_t(da|h_{t−1}) = π_t^m(da|x_{t−1}) and is called stationary if π_t(da|h_{t−1}) = π^s(da|x_{t−1}). The initial probability distribution P_0(dx) ∈ P(X) is assumed to be fixed. Here and further, P(X) is the space of all probability measures on a Borel space X equipped with the weak topology.

As is known, each strategy π defines a unique probability measure P^π on the trajectory space H_∞ = X × (A × X)^∞. The detailed description of these constructions can be found in [3,6] and in the monographs by Bertsekas and Shreve [2] and by Dynkin and Yushkevich [4]. The integral with respect to the measure P^π is denoted by E^π.
The traditional optimal control problem consists of the minimization of the following functional:

R(π) = E^π [ Σ_{t=1}^∞ β_0^{t−1} r_t(x_{t−1}, a_t) ] → min,                (1)

where r_t(·) is a cost function and β_0 > 0 is a discount factor. Problem (1) was investigated in [2–4].
Such functionals are usually called total expected discounted losses, as distinct from the average expected losses of the type

lim_{T→∞} (1/T) E^π [ Σ_{t=1}^T r(x_{t−1}, a_t) ] → min.                (2)

Problem (2) was also studied in [3,4], but the present article is devoted to models with total expected losses. (However, see the Conclusion, where the average losses are mentioned.) If there are no costs in (1) beyond time T then one usually puts β_0 = 1 (the case of a finite horizon). If the cost function r and the transition probability p do not depend on the time and β_0 ∈ (0,1) then we deal with the homogeneous discounted model.

Let us assume that r_t(x, a) is a lower-semicontinuous lower-bounded function and the transition probability p_t(dy|x, a) is continuous. Suppose that lower-semicontinuous lower-bounded functions s_t^n(x, a) are given, as well as discount factors β_n > 0 and real numbers d^n, n = 1, 2, …, N. A strategy π is called admissible if the inequalities

S^n(π) = E^π [ Σ_{t=1}^∞ β_n^{t−1} s_t^n(x_{t−1}, a_t) ] ≤ d^n,   n = 1, 2, …, N                (3)

are satisfied. In what follows, expressions (1) and (3) are assumed to be well defined. To be more specific, we study either the model with a finite horizon, or the case of the discounted model, β_n ∈ (0,1), n = 0, 1, …, N. One must build an optimal admissible strategy; in other words, one must solve problem (1) under constraints (3). Such problems were investigated in [1,3,5,6]. It should be noted that the strongest results were obtained for the case of the homogeneous discounted model with a common discount factor. The homogeneous model with different discount factors was studied in detail only for finite sets X and A [5].

In what follows, we shall need the following known result.
Theorem 1. If there exists at least one admissible strategy then there exists a solution of problem (1), (3) defined by a Markov strategy. If the model is homogeneous and β_0 = β_1 = ⋯ = β_N ∈ (0,1) then the class of stationary strategies is sufficient in problem (1), (3).
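As a concrete illustration of the objects in (1) and (3), the following sketch evaluates R(π) and S^n(π) for a fixed stationary randomized strategy in a finite-state, finite-action special case of the model (Theorem 1 states that such strategies suffice in the homogeneous case with a common discount factor). The function name evaluate and all numerical data are illustrative assumptions and not part of the paper.

import numpy as np

# Sketch: R(pi) and S^n(pi) for a stationary randomized strategy pi(a|x)
# in a finite-state, finite-action special case of the model above.
def evaluate(P0, p, r, s_list, pi, betas):
    """P0: (|X|,) initial distribution; p: (|X|,|A|,|X|) transition kernel;
    r: (|X|,|A|) cost; s_list: list of (|X|,|A|) constraint costs;
    pi: (|X|,|A|) strategy; betas = [beta_0, beta_1, ..., beta_N], all < 1."""
    P_pi = np.einsum('xa,xay->xy', pi, p)            # transition matrix under pi
    def expected_total(cost, beta):
        c_pi = np.einsum('xa,xa->x', pi, cost)       # expected one-step cost
        v = np.linalg.solve(np.eye(len(P0)) - beta * P_pi, c_pi)   # v = c + beta P v
        return float(P0 @ v)                         # E^pi [ sum beta^{t-1} cost ]
    return expected_total(r, betas[0]), [expected_total(s, b)
                                         for s, b in zip(s_list, betas[1:])]

# Hypothetical data: two states, two actions, one constraint (N = 1).
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(2), size=(2, 2))           # p[x, a, :] sums to one
r, s = rng.uniform(size=(2, 2)), rng.uniform(size=(2, 2))
pi = np.full((2, 2), 0.5)                            # play each action with probability 1/2
print(evaluate(np.array([1.0, 0.0]), p, r, [s], pi, [0.9, 0.95]))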
In principle, one can build the solution of the constrained problem with the help of the Lagrange multipliers technique [6]. But it seems more convenient to use the dynamic programming approach. The description of that approach presented in Section 3 is the main result of this article.

It will be convenient to assume everywhere further that s_t(·) ≥ 0 and r_t(·) ≥ 0. Obviously, in the cases of a finite horizon and of a discounted model with β_n ∈ (0,1), n = 0, 1, …, N, this assumption does not decrease the generality. Besides, one can include the time t into the state, x̃ = (x, t), and obtain a homogeneous model in which all the functions and the transition probability do not depend on the time.

Remark 1. Theorem 1 was proved in the book [6] assuming that X is compact. But all the proofs can be generalized with the help of the results by Schäl [8]; see the review [7] as well. Models with a finite or countable set X were considered in [1,3].
Example. Let us consider the one-channel Markov queueing system with losses. Put X = {0, 1}, where x_t = 0 (x_t = 1) means that the system is free (busy) at the time moment t, and A = {0, 1}, where a_t = 0 (a_t = 1) means that the system provides less intensive (more intensive) servicing on the interval (t−1, t]. The initial probability P_0(1) of the system being busy is known. The transition probability at step t is given by the formula

p_t(y|x, a) = { p         if x = 0, y = 1;
                1 − p     if x = 0, y = 0;
                q^a       if x = 1, y = 0;
                1 − q^a   if x = 1, y = 1. }

Here p is the probability of a customer arriving in the interval (t−1, t]; q^0 (q^1) is the probability of the end of the service between the time moment t−1 and the moment t for the less (more) intensive regime, 0 < q^0 < q^1. Lastly, e^a is the cost of servicing in the corresponding regime on the interval (t−1, t], e^1 > e^0 > 0, and c > 0 is the penalty caused by the loss of an order, which is paid only if a customer arrives at the busy system and is rejected. We have to minimize the service consumption under the constraint that the penalty for the loss of requests is no bigger than d. Therefore, we put

r(x, a) = e^a,   s(x, a) = x p c,   N = 1

and investigate the discounted model (1), (3) with the single discount factor β_0 = β_1 = β. The complete solution of this problem is presented later.
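A direct encoding of this example as arrays may help the reader follow the computations of Sections 3 and 4; the numerical values of p, q^0, q^1, e^0, e^1, c, β and d used below are illustrative assumptions only.

import numpy as np

# The queueing example as concrete data (illustrative parameter values).
p, q = 0.3, {0: 0.4, 1: 0.8}                   # arrival and service-completion probabilities, q^0 < q^1
e = {0: 1.0, 1: 3.0}                           # servicing costs, e^1 > e^0 > 0
c, beta, d = 10.0, 0.9, 10.0                   # loss penalty, discount factor, constraint bound

# Transition kernel P[a][x, y] = p_t(y | x, a) from the table above.
P = {a: np.array([[1 - p, p],
                  [q[a], 1 - q[a]]]) for a in (0, 1)}
r = lambda x, a: e[a]                          # r(x, a) = e^a
s = lambda x, a: x * p * c                     # s(x, a) = x p c

assert all(np.allclose(P[a].sum(axis=1), 1.0) for a in (0, 1))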

3. Dynamic programming approach

The main concept of this approach is close to the penalty function method. A new deterministic model will be built in which the losses equal "+∞" for non-admissible strategies. If a strategy is admissible then the value of the main functional (1) does not change.

The state in the new deterministic model is the pair (P_t, W_t), where P_t and W_t are the probability distribution on X and the expected accumulated vector of losses associated with the functions s_t(·). The action ã_t = ã_t(d(x, a)) at the instant t is a probability measure on X × A; ã_t ∈ Ã = P(X × A). If the state is x̃ = (P, W) then only those actions ã are available for which the projection on X (the marginal) coincides with P:

Ã(P) = {ã ∈ Ã: ã(dx × A) = P(dx)}

is the space of available actions. The dynamics of the new model are defined by the relations

P_t(dx) = P_t(ã_t)(dx) = ∫_{X×A} ã_t(d(y, a)) p_t(dx|y, a),

W_t^n = W_t^n(W_{t−1}^n, ã_t) = W_{t−1}^n + β_n^{t−1} ∫_{X×A} s_t^n(x, a) ã_t(d(x, a)),

n = 1, 2, …, N,   t = 1, 2, …                (4)

under the given initial conditions P_0 and W_0 = 0.

Since s_t(·) ≥ 0 and r_t(·) ≥ 0, the variable W_t does not decrease in the new model under every control strategy π̃ = {ã_t}_{t=1}^∞. We have the model Z̃ = (X̃, Ã, p̃), where X̃ = P(X) × R^N, Ã = P(X × A), and the transition probability p̃_t(dỹ|x̃, ã) is concentrated at the unique point defined by expressions (4) (to put it differently, the model Z̃ is deterministic). Let π̃ = {ã_t}_{t=1}^∞ be a deterministic programmed control strategy in the new model (all its elements are equipped with the tilde). The loss functional is defined by the formula

R̃(π̃) = Σ_{t=1}^∞ R̃_t(P_{t−1}, W_{t−1}, ã_t) → inf,                (5)

where

R̃_t(P, W, ã) = { β_0^{t−1} ∫_{X×A} r_t(x, a) ã(d(x, a))   if W ≤ d;
                  +∞                                        in other cases. }
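The construction (4)–(5) is easy to mimic numerically when X and A are finite; the sketch below performs one transition of the deterministic model Z̃ and returns the stage loss R̃_t. The function lifted_step and its arguments are illustrative assumptions, not part of the original text.

import numpy as np

# One step of the lifted deterministic model Z~ for finite X and A.
def lifted_step(P, W, a_tilde, p, r, s_list, betas, d, t):
    """P: (|X|,) distribution; W: (N,) accumulated losses; a_tilde: (|X|,|A|)
    joint measure on X x A with X-marginal P; p: (|X|,|A|,|X|) kernel;
    r and s_list[n]: (|X|,|A|) costs; betas = [beta_0,...,beta_N]; d: (N,) bounds; t >= 1."""
    assert np.allclose(a_tilde.sum(axis=1), P), "a_tilde must belong to A~(P)"
    P_next = np.einsum('xa,xay->y', a_tilde, p)                        # first relation in (4)
    W_next = np.array([W[n] + betas[n + 1] ** (t - 1) * np.sum(a_tilde * s_list[n])
                       for n in range(len(s_list))])                   # second relation in (4)
    stage_loss = (betas[0] ** (t - 1) * np.sum(a_tilde * r)
                  if np.all(W <= d) else np.inf)                       # R~_t(P, W, a_tilde)
    return P_next, W_next, stage_loss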


Obviously, there exists a trivial 1–1 correspondence between deterministic programmed strategies π̃ and Markov strategies π^m in the initial model:

π̃ ↔ π^m:   ã_t(d(x, a)) = π_t^m(da|x) P_{t−1}(dx).

Remark 2. Since the model Z̃ is deterministic and the initial state (P_0, 0) is fixed, every deterministic feedback control in the model Z̃ can be presented as a deterministic programmed strategy ã_1, ã_2, …: all the elements of the sequence

(P_0, W_0), ã_1, (P_1, W_1), ã_2, …

are defined uniquely.

According to the construction, if π̃ ↔ π^m then

R̃(π̃) = { R(π^m)   if S(π^m) ≤ d;
          +∞        in other cases. }

Therefore, it is sufficient to solve problem (5) (to build an optimal deterministic programmed strategy π̃): the corresponding Markov control strategy π^m is a solution of problem (1) and (3) in the initial model. Theorem 1 is used here. If there are no admissible strategies in the initial model then inf_{π̃} R̃(π̃) = +∞.
Assume that all the functions s_t^n(·), n = 1, 2, …, N, are finite and continuous. Then the mappings P_t, W_t in (4) are continuous. Besides, Ã(P) is compact for each P [8] and the reflection P → Ã(P) is quasicontinuous, that is, for each convergent sequence lim_{i→∞} P_i = P, the sequence of arbitrarily chosen points ã_i ∈ Ã(P_i) has a limit point ã_∞ ∈ Ã(P) [4]. Hence, the model Z̃ is semicontinuous [4].
Example. In the queueing system described in Section 2 we have

Ã(P) = {ã ∈ Ã = P(X × A): ã(1,0) + ã(1,1) = P(1), ã(0,0) + ã(0,1) = P(0)}.

That is, if we have the probability distribution P on X then an action ã is available if and only if the marginal of ã coincides with P. Clearly, P(0) + P(1) = 1, so we can omit the parameter P(0) everywhere. The dynamics (4) take the form

P_t(1) = P(ã_t)(1) = p P_{t−1}(0) + ã_t(1,0)(1 − q^0) + ã_t(1,1)(1 − q^1)
       = p − [ã_t(1,0) + ã_t(1,1)][p − 1 + q^0] + ã_t(1,1)(q^0 − q^1),

W_t = W_t(W_{t−1}, ã_t) = W_{t−1} + β^{t−1} p c [ã_t(1,0) + ã_t(1,1)].

The loss function is equal to

R̃_t(P(1), W, ã) = { β^{t−1} {[ã(0,0) + ã(1,0)] e^0 + [ã(0,1) + ã(1,1)] e^1}   if W ≤ d;
                     +∞                                                         if W > d. }

If π̃ = {ã_t}_{t=1}^∞ is a deterministic programmed strategy then the control

ã_t = {ã_t(0,0), ã_t(0,1), ã_t(1,0), ã_t(1,1)}

corresponds to the Markovian control law π_t^m(a|x) at step t of the following form:

π_t^m(0|0) = ã_t(0,0) / [ã_t(0,0) + ã_t(0,1)],    π_t^m(1|0) = ã_t(0,1) / [ã_t(0,0) + ã_t(0,1)],

π_t^m(0|1) = ã_t(1,0) / [ã_t(1,0) + ã_t(1,1)],    π_t^m(1|1) = ã_t(1,1) / [ã_t(1,0) + ã_t(1,1)].

Here

ã_t(0,0) + ã_t(0,1) = P_{t−1}(0),    ã_t(1,0) + ã_t(1,1) = P_{t−1}(1),

and we put π_t^m(0|0) = 0 and π_t^m(1|0) = 1 if P_{t−1}(0) = 0, and π_t^m(0|1) = 0 and π_t^m(1|1) = 1 if P_{t−1}(1) = 0. The complete solution of this example is presented in Section 4.
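For this two-state example the correspondence between ã_t and π_t^m is a two-line computation; the following sketch (with the degenerate cases resolved by the convention above) is only an illustration.

import numpy as np

# Converting a joint action a_tilde on {0,1} x {0,1} into the Markov control
# law pi^m(a|x), and back, following the formulas above.
def to_markov(a_tilde):
    pi = np.empty((2, 2))
    for x in (0, 1):
        mass = a_tilde[x].sum()                               # equals P_{t-1}(x)
        pi[x] = a_tilde[x] / mass if mass > 0 else (0.0, 1.0)  # convention when P_{t-1}(x) = 0
    return pi                                                 # pi[x, a] = pi_t^m(a | x)

def to_joint(pi, P_prev):
    return pi * P_prev[:, None]                               # a_tilde(x, a) = pi_t^m(a|x) P_{t-1}(x)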
Now we return to the general model (4) and (5). The Bellman equation in this situation is of the standard form:

v_t(P, W) = inf_{ã ∈ Ã(P)} {R̃_{t+1}(P, W, ã) + v_{t+1}(P_{t+1}(ã), W_{t+1}(W, ã))},   t = 0, 1, …                (6)

It has a solution in the class of lower-semicontinuous lower-bounded functions. We are interested only in the minimal nonnegative solution, which can be obtained by successive approximations [2]:

v^0 ≡ 0,    v^{k+1} = U ∘ v^k,

where

[U ∘ w]_t(P, W) ≜ inf_{ã ∈ Ã(P)} {R̃_{t+1}(P, W, ã) + w_{t+1}(P_{t+1}(ã), W_{t+1}(W, ã))},

and U is called the Bellman operator. The limit function v^∞ = lim_{k→∞} v^k exists and coincides with the solution of interest, that is, the Bellman function. If a control ã*_t(P, W) provides the infimum in (6) then the feedback strategy π̃*_t = ã*_t(P_t, W_t) is optimal in the model Z̃. Notice that such a strategy (a measurable mapping ã*_t(P, W)) exists since v_t(·) is a lower-semicontinuous lower-bounded function and Ã(P) is compact.

Therefore, problem (5) has a solution in the form of a feedback control law and in the form of a deterministic programmed strategy (see Remark 2).

If we deal with the model with a finite horizon T then the sequence v^k converges in a finite number of steps: v^{T+1} = v^∞.

Let us consider the homogeneous case, when all the cost functions and the transition probability do not depend on the time. We introduce the new variable

d̃_t = (d̃_t^1, d̃_t^2, …, d̃_t^N),    d̃_t^n ≜ (d^n − W_t^n) / β_n^t.

The component d̃_t^n equals the expected loss of type n which is admissible on the remaining interval {t+1, t+2, …}. The loss function at one step can be rewritten in the form

r̂(P, d̃, ã) = { ∫_{X×A} r(x, a) ã(d(x, a))   if d̃ ≥ 0;
                +∞                            in other cases. }

We deal with the standard discounted model

R̂(π̃) = Σ_{t=1}^∞ β_0^{t−1} r̂(P_{t−1}, d̃_{t−1}, ã_t) → inf,                (7)

where the homogeneous dynamic equation for P_t is given by (4) and the dynamics of the component d̃_t^n can be defined by the following equation:

d̃_t^n = D^n(d̃_{t−1}, ã_t) = (1/β_n) [ d̃_{t−1}^n − ∫_{X×A} s^n(x, a) ã_t(d(x, a)) ].

Initial values are d̃_0^n = d^n. The Bellman equation for problem (7) is of the form

v̂(P, d̃) = inf_{ã ∈ Ã(P)} {r̂(P, d̃, ã) + β_0 v̂(P(ã), D(d̃, ã))}                (8)

and we are interested in its minimal nonnegative solution. Eq. (8) can also be solved by the successive approximations method. It is simpler than Eq. (6) since the time dependence is absent.

In actual practice, it is often easy to build the domain Ḡ where v̂(P, d̃) = +∞ (there are no admissible strategies). One can show that Ḡ is an open set. Let

G = {(P, d̃) ∈ P(X) × R^N: v̂(P, d̃) < +∞},    Ḡ = {P(X) × R^N} \ G.

Then for every pair (P, d̃) ∈ G there exists ã ∈ Ã(P) such that (P(ã), D(d̃, ã)) ∈ G. If the function r(·) is bounded then the equation

v̂(P, d̃) = inf_{ã ∈ Ã(P): (P(ã), D(d̃, ã)) ∈ G} {r̂(P, d̃, ã) + β_0 v̂(P(ã), D(d̃, ã))}                (9)

has a unique lower-semicontinuous uniformly bounded solution on G. (The Bellman operator in the right-hand side is a contraction in the space of lower-semicontinuous bounded functions on G.) It should be emphasized that this solution, extended by infinity on Ḡ, provides the minimal nonnegative solution of Eq. (6) with the help of the formula

v_t(P, W) = β_0^t v̂( P, ( (d^1 − W^1)/β_1^t, (d^2 − W^2)/β_2^t, …, (d^N − W^N)/β_N^t ) ).

Eq. (8) cannot have any other bounded solutions on G. On the contrary, if v is a solution of (6) then v + c is also a solution of Eq. (6) for every constant c.
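As a very rough numerical illustration of the successive approximations for Eq. (8), the sketch below discretizes the state (P(1), d̃) of the queueing example, parametrizes the action by the conditional probabilities ã(1|0) and ã(1|1), and replaces "+∞" by a large constant (cf. the remark at the end of the paper). Grid sizes, parameter values and the truncation of d̃ are all assumptions made only for this illustration.

import numpy as np
from scipy.interpolate import RegularGridInterpolator

p, q0, q1 = 0.3, 0.4, 0.8                 # illustrative data for the queueing example
e0, e1, c, beta = 1.0, 3.0, 10.0, 0.9
BIG = 1e6                                  # stands in for "+infinity"
g = beta * p / (1 - beta)

P_grid = np.linspace(0.0, 1.0, 41)         # grid for P(1)
d_max = 1.5 * p * c * (g + 1.0) / (1 - beta + beta * q0 + beta * p)
d_grid = np.linspace(0.0, d_max, 61)       # grid for d~ (the value saturates for large d~)
U = np.linspace(0.0, 1.0, 11)              # grid for the conditionals a~(1|0), a~(1|1)

# Stage cost, next P(1) and next d~ for every combination (P(1), d~, a~(1|0), a~(1|1)).
P1, DT, U0, U1 = np.meshgrid(P_grid, d_grid, U, U, indexing='ij')
cost = e0 + (e1 - e0) * ((1 - P1) * U0 + P1 * U1)                    # integral of r against a~
P_next = p * (1 - P1) + P1 * ((1 - U1) * (1 - q0) + U1 * (1 - q1))   # P(a~)(1)
d_next = (DT - P1 * p * c) / beta                                    # D(d~, a~), independent of a~
feasible = d_next >= 0                                               # otherwise r^ = +infinity next step
points = np.stack([P_next.ravel(), np.clip(d_next, 0.0, d_max).ravel()], axis=1)

v = np.zeros((P_grid.size, d_grid.size))                             # v^0 = 0
for _ in range(300):                                                 # v^{k+1} = U o v^k
    cont = RegularGridInterpolator((P_grid, d_grid), v)(points).reshape(P1.shape)
    v = np.min(cost + beta * np.where(feasible, cont, BIG), axis=(2, 3))
# Entries of v of the order of BIG indicate points (P(1), d~) outside G.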

4. Solving the example

In this section, we present the complete solution of the example described in Section 2. In what follows, we use the notation g ≜ βp/(1 − β).

Clearly, the minimal penalty for the loss of requests corresponds to the case in which action 1 is always taken. Hence, admissible strategies exist if and only if
the initial distribution P and the bound d satisfy the inequality

d ≥ pc(g + P(1)) / (1 − β + βq^1 + βp),

which determines the closed domain G ⊂ P(X) × R^1.
Eq. (9) has the form

v̂(P, d̃) = inf_{ã ∈ Ã(P): (P(ã), D(d̃, ã)) ∈ G} { e^0 [ã(0,0) + ã(1,0)] + e^1 [ã(0,1) + ã(1,1)] + β v̂(P(ã), D(d̃, ã)) },                (10)

where

P(ã)(1) = p − [ã(1,0) + ã(1,1)][p − 1 + q^0] + ã(1,1)(q^0 − q^1),

D(d̃, ã) = ( d̃ − [ã(1,0) + ã(1,1)] p c ) / β.                (11)

The multifunction Ã(P) was defined in the previous section. As before, the variable d̃ denotes the expected penalty for the loss of requests which is admissible on the remaining time interval.
It is convenient to introduce the conditional probabilities

ã(1|0) = ã(0,1) / [ã(0,0) + ã(0,1)],    ã(1|1) = ã(1,1) / [ã(1,0) + ã(1,1)],

which are the two independent parameters of the action ã. (If the denominator P(0) or P(1) equals zero then the corresponding fraction equals an arbitrary number, say zero.) Notice that if π̃ = {ã_t}_{t=1}^∞ is a deterministic programmed strategy then the corresponding Markov strategy in the initial model has the form π_t^m(a|x) = ã_t(a|x). Since the operators P and D do not depend on ã(1|0), the optimal values of ã(0,1) and ã(1|0) are zero, and thus P(0) = ã(0,0). Therefore, Eq. (10) takes the following form:
v̂(P, d̃) = inf_{max{0, 1−γ_1(P,d̃)} ≤ ã(1|1) ≤ 1} { e^0 + P(1) ã(1|1)(e^1 − e^0)
            + β v̂( p − P(1)[p − 1 + q^0 + ã(1|1)(q^1 − q^0)], (d̃ − P(1)pc)/β ) },                (12)

where

γ_1(P, d̃) = [ d̃(1 − β + βq^1 + βp) − pc(g + P(1)) ] / [ βP(1)(q^1 − q^0)pc ].

We omit the unnecessary parameter P(0) = 1 − P(1) here. If P(1) = 0 then γ_1(P, d̃) = +∞; this concerns also γ_0 defined later.

The unique continuous solution of Eq. (12), uniformly bounded on G, is described in what follows.

(i) If d̃ ≥ pc(g + P(1)) / (1 − β + βq^0 + βp) then v̂(P, d̃) = e^0/(1 − β). In this case the unique value of ã(1|1) providing the minimum in the right-hand side of (12) is zero.
Recall that according to (11) the dynamics equations for P_t(1) and d̃_t look as follows:

P_t(1) = p − P_{t−1}(1)[p − 1 + q^0 + ã_t(1|1)(q^1 − q^0)],

d̃_t = ( d̃_{t−1} − P_{t−1}(1)pc ) / β.                (13)

In the case considered, if the inequality

d̃_0 = d ≥ pc(g + P_0(1)) / (1 − β + βq^0 + βp)                (14)

is satisfied at the initial step then one must always choose the action ã_t(1|1) ≡ 0, and inequality (14) is satisfied at every instant t for d̃_t and P_t(1). In this connection, the constraint in the initial problem is not essential and the solution of problem (1) and (3) coincides with the solution of the unconstrained problem (1).
(ii) As was established at the beginning of Section 4, if d̃ < pc(g + P(1)) / (1 − β + βq^1 + βp) then v̂(P, d̃) = +∞ and all the actions are equivalent (no control strategy is admissible). If the inequality

d̃_0 = d < pc(g + P_0(1)) / (1 − β + βq^1 + βp)                (15)

is satisfied at the initial step then

d̃_t < pc(g + P_t(1)) / (1 − β + βq^1 + βp)

at every instant t ≥ 1 independently of the actions ã_1, ã_2, …, ã_t ∈ Ã. In the case considered, there are no admissible strategies at all since the constraint is too restrictive.
(iii) If

pc(g + P(1)) / (1 − β + βq^1 + βp) ≤ d̃ ≤ pc(g + P(1)) / (1 − β + βq^0 + βp)


then

v̂(P, d̃) = e^0/(1 − β) + (e^1 − e^0)(g + P(1)) / [β(q^1 − q^0)] − d̃(e^1 − e^0)(1 − β + βq^0 + βp) / [βpc(q^1 − q^0)].

In this case one can choose an arbitrary action 0 ≤ ã(1|1) ≤ 1 in the non-empty interval

1 − γ_1(P, d̃) ≤ ã(1|1) ≤ γ_0(P, d̃),                (16)

where

γ_0(P, d̃) = [ pc(g + P(1)) − d̃(1 − β + βq^0 + βp) ] / [ βP(1)(q^1 − q^0)pc ] ≥ 0;

γ_1(P, d̃) ≥ 0 was defined earlier.

Suppose that

pc(g + P_t(1)) / (1 − β + βq^1 + βp) ≤ d̃_t ≤ pc(g + P_t(1)) / (1 − β + βq^0 + βp)                (17)

at the initial moment t = 0. (Note that the value d̃_0 = d is given.) One can prove (by induction on t = 0, 1, …) that the action ã_t(1|1) may be chosen arbitrarily from the interval

max{0, 1 − γ_1(P_{t−1}, d̃_{t−1})} ≤ ã_t(1|1) ≤ min{1, γ_0(P_{t−1}, d̃_{t−1})}                (18)

at every epoch t (see (16)), and inequalities (17) are satisfied at each step t. If both inequalities in (17) are strict at step t − 1 then the left-hand side of (18) is strictly less than the right-hand side, and there are many different control strategies providing the solution of the initial problem. One of those strategies is of particular interest:

ã*_t(1|1) ≡ ã* = pc(g + P_0(1)) / [β(q^1 − q^0)d] − (1 − β + βq^0 + βp) / [β(q^1 − q^0)].                (19)

It meets inequalities (17) for d̃_t and P_t(1) at each step; relations (18) are satisfied as well. The proof can be carried out by induction based on the explicit formulae

P_t(1) = p(1 − Δ^t)/(1 − Δ) + Δ^t P_0(1),

d̃_t = p^2 c / [(1 − Δ)(1 − β)] − p^2 c Δ^t / [(1 − Δ)(1 − βΔ)] + pc Δ^t P_0(1) / (1 − βΔ),

which follow from (13); here Δ ≜ (d − pc(g + P_0(1))) / (βd). Hence, in the case considered, the stationary control strategy ã defined by the conditional probabilities

ã^s(1|1) = ã*,    ã^s(1|0) = 0

is optimal. (It corresponds to the strategy π^s(1|0) = 0, π^s(1|1) = ã* in the initial model.) One can show that there are no other optimal stationary control strategies.

Thus, the solution of the example described in Section 2 looks as follows. In case (17) one should always select the action a = 0 if the system is free; if the system is occupied then the probability of the more intensive regime must be equal to expression (19). In case (14), the constraint is inessential and one has always to choose the less intensive regime a = 0. In case (15), there are no admissible strategies.
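The closed-form solution above is easy to check numerically; the sketch below classifies an instance into the cases (14), (15) and (17), computes ã* from (19), and iterates (13) to confirm that (17) holds along the trajectory and that the constraint value equals d. All parameter values are illustrative.

import numpy as np

p, q0, q1 = 0.3, 0.4, 0.8                 # illustrative data
e0, e1, c, beta = 1.0, 3.0, 10.0, 0.9
P0, d = 0.5, 10.0

g = beta * p / (1 - beta)
lo = p * c * (g + P0) / (1 - beta + beta * q1 + beta * p)   # threshold of case (15)
hi = p * c * (g + P0) / (1 - beta + beta * q0 + beta * p)   # threshold of case (14)

if d < lo:
    print("case (15): no admissible strategies")
elif d >= hi:
    print("case (14): the constraint is not binding; always take a = 0")
else:
    a_star = (p * c * (g + P0) / (beta * (q1 - q0) * d)
              - (1 - beta + beta * q0 + beta * p) / (beta * (q1 - q0)))   # formula (19)
    print("case (17): a~* =", round(a_star, 4))
    P1, dt, R, S = P0, d, 0.0, 0.0
    for t in range(1, 201):                                   # 200 steps: beta^200 is negligible
        R += beta ** (t - 1) * (e0 + P1 * a_star * (e1 - e0))  # expected service cost at step t
        S += beta ** (t - 1) * p * c * P1                      # expected loss penalty at step t
        dt = (dt - P1 * p * c) / beta                          # second relation in (13)
        P1 = p - P1 * (p - 1 + q0 + a_star * (q1 - q0))        # first relation in (13)
        assert p * c * (g + P1) / (1 - beta + beta * q1 + beta * p) - 1e-9 <= dt \
               <= p * c * (g + P1) / (1 - beta + beta * q0 + beta * p) + 1e-9   # inequality (17)
    print("service cost R =", round(R, 4), " constraint value S =", round(S, 4), " bound d =", d)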
5. Conclusion
The dynamic programming approach can also be used if the constraint inequalities must be satisfied almost surely. For example, if we consider the model with the finite horizon

E^π [ Σ_{t=1}^T r_t(x_{t−1}, a_t) ] → min,

Σ_{t=1}^T s_t^n(x_{t−1}, a_t) ≤ d^n,   P^π-a.s.,   n = 1, 2, …, N,

then the state in the new model is the pair (x, W) ∈ X × R^N, the action a remains as in the original model, and the Bellman equation similar to (6) can be rewritten in the form:

ṽ_T(x, W) = Σ_{n=1}^N I{W^n > d^n} × "+∞",

ṽ_t(x, W) = inf_{a ∈ A} { r_{t+1}(x, a) + ∫_X p_{t+1}(dy|x, a) ṽ_{t+1}(y, W + s_{t+1}(x, a)) },   t = 0, 1, …, T − 1.

The initial value of W_0 is zero.
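A direct backward-induction sketch of this construction for finite X and A is given below; the accumulated value W is carried in the state and "+∞" is replaced by a large constant. The horizon, kernel, costs and bound are purely illustrative assumptions.

import numpy as np
from functools import lru_cache

T, BIG = 5, 1e9                               # horizon and the stand-in for "+infinity"
X, A = (0, 1), (0, 1)
p = np.array([[[0.7, 0.3], [0.7, 0.3]],       # p[x, a, y]: hypothetical kernel
              [[0.4, 0.6], [1.0, 0.0]]])
r = np.array([[1.0, 3.0], [1.0, 3.0]])        # r(x, a)
s = np.array([[0.0, 0.0], [3.0, 3.0]])        # single constraint cost s(x, a)  (N = 1)
d = 9.0                                       # almost-sure bound on the accumulated cost

@lru_cache(maxsize=None)
def v(t, x, W):
    """v~_t(x, W): minimal remaining expected cost given the accumulated value W."""
    if t == T:
        return 0.0 if W <= d else BIG         # terminal penalty I{W > d} * "+infinity"
    return min(r[x, a] + sum(p[x, a, y] * v(t + 1, y, float(W + s[x, a])) for y in X)
               for a in A)

print([float(v(0, x, 0.0)) for x in X])       # optimal values starting from W_0 = 0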


Now consider problem (2) of minimizing the average expected losses in a homogeneous model. As is known, this problem can be reduced to the homogeneous discounted one if there exists a minorant [4]. In this case, the corresponding constrained problem can also be reduced to the constrained discounted problem stated in Section 2. (The details can be found in [6].) So, the dynamic programming approach can be suitable also for problems with the average expected losses.

Note that the dynamic programming approach (in problems with total expected losses) makes it possible to build all optimal deterministic programmed strategies for the auxiliary model Z̃. The strategy π̃ is optimal in Z̃ if and only if, for all t = 1, 2, …, the action ã_t provides the infimum in (6) at the current values P = P_{t−1} and W = W_{t−1}. To put it differently, the presented method allows one to construct all Markov control strategies which are optimal in the initial constrained problem. This is confirmed by the example solved in Section 4.

Lastly, it should be emphasized that if the function r is bounded then the symbol +∞ in the definition of the loss R̃ in the model Z̃ can be replaced by a sufficiently large finite constant.

Acknowledgements

The main idea of this paper was discussed with S. Gaubert of INRIA (Paris). The authors are also thankful to the editor and to the anonymous referee for constructive comments which helped to improve this article.
References

[1] E. Altman, Constrained Markov Decision Processes, Chapman & Hall/CRC, Boca Raton, 1999.
[2] D.P. Bertsekas, S.E. Shreve, Stochastic Optimal Control, Academic Press, New York–San Francisco–London, 1978.
[3] V.S. Borkar, Topics in Controlled Markov Chains, Vol. 240, Longman Scientific and Technical, England, 1991.
[4] E.B. Dynkin, A.A. Yushkevich, Controlled Markov Processes and their Applications, Springer, New York, 1979.
[5] E.A. Feinberg, A. Shwartz, Constrained Markov decision models with discounted rewards, Math. Oper. Res. 20 (1995) 302–320.
[6] A.B. Piunovskiy, Optimal Control of Random Sequences in Problems with Constraints, Kluwer Academic Publishers, Dordrecht, 1997.
[7] A.B. Piunovskiy, Controlled random sequences: the convex analytic approach and constrained problems, Russ. Math. Surveys 6 (1998) 1233–1293.
[8] M. Schäl, On dynamic programming: compactness of the space of policies, Stochastic Process. Appl. 3 (1975) 345–364.