
9.2 STATISTICS COLLECTION FROM REPLICATIONS

Recall that each replication produces a random history (sample path), from which various statistics are computed (see Section 3.10). These statistics are estimates of various parameters of interest (probabilities, means, variances, etc.). More formally, suppose we are interested in a parameter y of the system (e.g., the mean flow time of jobs through a workstation or their blocking probability on trying to enter a finite buffer). The simulation will then be programmed to produce an estimator, Ŷ, for the true but unknown parameter, y, which evaluates to some estimate, Ŷ = ŷ. Note carefully the distinct meanings of the related entities y, Ŷ, and ŷ:

y is a deterministic but unknown parameter (possibly vector-valued).
Ŷ is a variate (random variable) estimator of y.
ŷ is some realization of Ŷ.

For each replication r, the estimator Ŷ yields a separate estimate, Ŷ(r) = ŷ(r). Furthermore, Ŷ(r) is some function of the history of replication r. For example, suppose that y is the (unknown) mean of flow times through a workstation. An estimator Ŷ of y might then be chosen as the sample mean (see Section 3.10) of job flow times, {X_1(r), ..., X_n(r)}, where X_j(r) is the j-th job flow time observed during replication r and n is the number of jobs that departed from the workstation. In short, Ŷ is the sample mean X̄(r) observed in the course of replication r. As another example, suppose that y is the (unknown) machine utilization. An estimator Ŷ of y is chosen as the fraction of time the workstation is busy processing in the course of replication r. In short, Ŷ is the time average of the corresponding history {Y_t(r) : A(r) ≤ t ≤ B(r)}, where

Y_t(r) = 1, if the machine is busy
         0, if the machine is idle

is the indicator function of the machine status at time t during replication r, and [A(r), B(r)] is the time interval of replication r.
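As an illustration, the time average of such an indicator history can be computed directly from its change points. The sketch below is our own illustrative choice of representation (a list of (time, state) change points), not a construct from the text:

```python
# Time-average utilization from a piecewise-constant busy/idle history.
# `events` is a hypothetical representation: (time, state) change points,
# with state 1 = busy and 0 = idle; [a, b] plays the role of [A(r), B(r)].

def time_average(events, a, b):
    """Time average of the indicator history {Y_t : a <= t <= b}."""
    total = 0.0
    # accumulate state * duration over each constant segment
    for (t0, state), (t1, _) in zip(events, events[1:]):
        total += state * (t1 - t0)
    # last segment runs from the final change point to b
    t_last, s_last = events[-1]
    total += s_last * (b - t_last)
    return total / (b - a)

# Machine busy on [0, 3) and [5, 9), idle otherwise, over [0, 10]:
history = [(0.0, 1), (3.0, 0), (5.0, 1), (9.0, 0)]
print(time_average(history, 0.0, 10.0))  # 0.7
```

The busy fraction here is (3 + 4)/10 = 0.7, which is the estimate ŷ(r) this replication would contribute.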

9.2.1 STATISTICS COLLECTION USING INDEPENDENT REPLICATIONS

Although output analysis can extract information from a single replication, in a typical situation a simulation program is run n times with independent initial random number generator (RNG) seeds (see Chapter 4). These simulation runs constitute n independent replications that produce a random sample {ŷ(1), ..., ŷ(n)} of estimates of y, drawn from the underlying estimator Ŷ; recall that "random" means here that the sample is drawn from iid (independent and identically distributed) experiments. Output analysis would then use this random sample to form a pooled estimate for y, based on all n replications, in order to increase the statistical reliability of estimating y (see Section 9.3). Examples include the sample mean, sample variance, and confidence intervals.
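A minimal sketch of this scheme, assuming a toy model in which each replication merely draws exponential flow times (the model, seed values, and parameters are illustrative, not from the text):

```python
import random

# Each replication r uses its own independently seeded RNG and returns a
# sample-mean estimate y_hat(r); distinct seeds stand in for the independent
# RNG streams of Chapter 4. The seeds and the true mean 4.0 are arbitrary.

def replicate(seed, n_jobs=1000, mean_flow=4.0):
    rng = random.Random(seed)
    flows = [rng.expovariate(1.0 / mean_flow) for _ in range(n_jobs)]
    return sum(flows) / len(flows)          # y_hat(r) for this replication

# n = 5 independent replications -> random sample {y_hat(1), ..., y_hat(5)}
estimates = [replicate(seed) for seed in (11, 23, 37, 51, 73)]
pooled = sum(estimates) / len(estimates)    # pooled estimate of y
```

Because each replication owns its RNG, the five estimates are iid draws from the estimator Ŷ, and the pooled mean is more reliable than any single estimate.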

As an example, consider again the sequence of job flow times generated in replication r, and denote the j-th job flow time by X_j(r) and its realization by x_j(r). Suppose the estimator of the mean flow time is the sample mean of the flow times in each replication, yielding the estimates

x̄(r) = (1/l(r)) Σ_{j=1}^{l(r)} x_j(r),    r = 1, ..., n,

where l(r) is the number of flow time observations recorded in replication r. Note carefully that within each replication r, the individual flow times X_j(r), j = 1, ..., l(r), are generally dependent, whereas the sample means across replications, X̄(r), r = 1, ..., n, are independent. This independence is due to the fact that distinct replications using independent streams of random numbers are statistically independent, and consequently, so are the statistics computed from them. As we shall see, independent estimators are necessary in order to

take advantage of the central limit theorem (see Section 3.8.5) to obtain the distribution of pooled estimators and construct from them confidence intervals for the true parameter to be estimated (see Section 9.4).
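The CLT-based construction can be sketched as follows; the 95% z-quantile 1.96 and the sample estimates below are illustrative assumptions, not values from the text:

```python
import math
import statistics

# A normal-approximation confidence interval for y, built from n iid
# replication estimates y_hat(1), ..., y_hat(n).

def confidence_interval(estimates, z=1.96):
    """Return (lower, upper) for an approximate 100(1 - alpha)% CI."""
    n = len(estimates)
    mean = statistics.mean(estimates)     # pooled estimate of y
    s = statistics.stdev(estimates)       # sample std dev (n - 1 divisor)
    half_width = z * s / math.sqrt(n)     # CLT-based half-width
    return mean - half_width, mean + half_width

y_hats = [3.9, 4.2, 4.0, 4.3, 3.8, 4.1]   # hypothetical estimates
lo, hi = confidence_interval(y_hats)
```

The interval is centered at the pooled mean; its width shrinks like 1/sqrt(n) as more independent replications are added.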

9.2.2 STATISTICS COLLECTION USING REGENERATION POINTS AND BATCH MEANS

In the previous section we assumed that each replication yields precisely one estimate for the statistic under consideration. For example, each replication of a workstation model was assumed to yield one estimate for the mean flow time, one estimate for the probability of blocking, and so on. However, replications can be time consuming. In addition, each replication warm-up period represents additional overhead, if not outright "wastage."

One obvious way to speed up estimation is to collect several estimates for the same statistic from any given replication. Note that such an estimation strategy would automatically reduce "warm-up" overhead. We would like these multiple estimates to be drawn from a sequence of estimators that are independent (or approximately so), both across replications (as is the case for independent replications), as well as within replications. In fact, if we can obtain independent estimates within replications, it would suffice to run just one replication, albeit a long one. To this end, one must identify when estimators within replications are indeed independent or approximately so.

A case in point is the class of regenerative (renewal) processes (see Section 3.9.3). Recall that such processes have specific time points (usually random), T_1, T_2, ..., such that the partial process histories over the regenerative (renewal) intervals [T_j, T_{j+1}) are iid. Consequently, statistical estimates collected over distinct regenerative intervals are also iid. In simple cases, one can actually identify such renewal intervals. For example, in a queueing system with independent interarrival times and service times independent of arrivals, the random arrival times of jobs at an idle system are regenerative points within a replication. Generally, however, identifying theoretically justified regeneration points in a complex system is difficult or impossible for all practical purposes.
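For the simple queueing example above, such regeneration points can be detected during simulation. The sketch below assumes a hypothetical single-server FIFO queue with exponential interarrival and service times; all names and parameter values are our own:

```python
import random

# An arrival that finds the system empty is a regeneration point: the
# process history restarts probabilistically from scratch at that instant.

def simulate_arrivals_to_empty(n_arrivals=20, lam=1.0, mu=1.5, seed=7):
    rng = random.Random(seed)
    t = 0.0                 # current arrival time
    server_free_at = 0.0    # departure time of the last queued job
    regeneration_points = []
    for _ in range(n_arrivals):
        t += rng.expovariate(lam)            # next arrival
        if t >= server_free_at:              # arrival finds the system empty
            regeneration_points.append(t)
            server_free_at = t               # service starts immediately
        server_free_at += rng.expovariate(mu)  # FIFO: add this job's service
    return regeneration_points

cycle_starts = simulate_arrivals_to_empty()
```

Statistics accumulated between consecutive entries of `cycle_starts` would then form iid cycle estimates.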

A practical way of collecting multiple estimates from a single replication is the batch means method. As the name suggests, the main idea is to group observations into batches, and collect one estimate from each batch, provided the batch means are iid or approximately so. Note carefully that if the estimates are markedly dependent, then the central limit theorem (see Section 3.8.5) cannot be justifiably invoked to construct confidence intervals for the true parameters, based on pooled estimators. In batch means estimation, a replication generates a system history (sample path realization) from which statistical estimates are formed depending on whether the time index of the history is discrete or continuous.

Suppose the observed history is a discrete sample {x_1, ..., x_n}, for example, the flow times of n jobs. In this case, the total of n observations is divided into m batches of size k each (n = mk). This results in m batches (subsamples)

{x_11, ..., x_1k}, {x_21, ..., x_2k}, ..., {x_m1, ..., x_mk}.

From each batch j = 1, ..., m, a separate estimate ŷ_j is formed from the k observations {x_j1, ..., x_jk} of that batch only. The replication thus yields a set of estimates {ŷ_1, ..., ŷ_m}. The batch size, k, is selected to be large enough so as to ensure that the corresponding estimators are iid or approximately so.
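A minimal sketch of the discrete case, where each batch estimate is the batch mean (the sample data are arbitrary illustrative flow times):

```python
# Split a discrete sample of n = m * k observations into m batches of size k
# and compute one estimate (the batch mean) per batch.

def batch_means(sample, m):
    n = len(sample)
    if n % m != 0:
        raise ValueError("sample size must be a multiple of the batch count")
    k = n // m                              # batch size
    return [sum(sample[j * k:(j + 1) * k]) / k for j in range(m)]

# n = 8 observations, m = 4 batches of size k = 2:
data = [5.0, 7.0, 6.0, 8.0, 4.0, 6.0, 9.0, 3.0]
print(batch_means(data, m=4))  # [6.0, 7.0, 5.0, 6.0]
```

In practice k would be far larger than 2, chosen so that adjacent batch means are approximately independent.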

Suppose the observed history is a continuous sample {x_t : t ∈ [a, b]}, for example, the number of jobs in the buffer during time interval [a, b]. In this case, the interval [a, b] is divided into m batches of length c = (b − a)/m each, resulting in m subhistories

{x_t : t ∈ [a, a + c]}, {x_t : t ∈ [a + c, a + 2c]}, ..., {x_t : t ∈ [a + (m − 1)c, b]}.

From each batch j = 1, ..., m, a separate estimate ŷ_j is formed from the observations {x_t : t ∈ [a + (j − 1)c, a + jc]} of that batch only. The replication thus yields again a set of estimates {ŷ_1, ..., ŷ_m}. The batch length, c, is selected to be long enough so as to ensure that the corresponding estimators are iid or approximately so.
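The continuous case can be sketched similarly, with each batch estimate taken as the time average of the history over one subinterval. The piecewise-constant change-point representation below is our own illustrative assumption:

```python
# Batch means for a continuous-time history {x_t : t in [a, b]}, stored as
# (time, value) change points of a piecewise-constant function.

def value_at(history, t):
    """Value of the piecewise-constant history at time t."""
    v = history[0][1]
    for tc, vc in history:
        if tc <= t:
            v = vc
        else:
            break
    return v

def continuous_batch_means(history, a, b, m):
    c = (b - a) / m                          # batch length
    means = []
    for j in range(m):
        lo, hi = a + j * c, a + (j + 1) * c
        # change points strictly inside this batch, plus its endpoints
        pts = [lo] + [t for t, _ in history if lo < t < hi] + [hi]
        area = sum(value_at(history, t0) * (t1 - t0)
                   for t0, t1 in zip(pts, pts[1:]))
        means.append(area / c)               # time average over the batch
    return means

# Buffer holds 2 jobs on [0, 4), 0 jobs on [4, 8]; m = 2 batches of length 4:
hist = [(0.0, 2), (4.0, 0)]
print(continuous_batch_means(hist, 0.0, 8.0, 2))  # [2.0, 0.0]
```

As with the discrete case, the batch length c would in practice be much longer relative to the history's correlation time.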