
11.7 MODELING MACHINE FAILURES

Process stoppages are unavoidable in manufacturing and service systems. In particular, machine failures of various kinds constitute an important source of idleness and variability in such systems. When modeling systems subject to failures, the modeler is usually interested in the long-run probabilities of down states, which translate operationally into the long-range fraction of time spent in those states. Certainly, efficient operation requires that downtimes be minimized, since these represent loss of production time. Simulation can help the modeler understand the impact of failures on system performance.

A stochastic process that models failures is characterized either by the statistical properties of the time intervals separating consecutive failures, or by the counting process that keeps track of the number of failures within a given period of time. These two characterizations are mathematically equivalent. However, the former characterization is commonly used to generate failures in simulation runs. In fact, failures can be modeled as a specialized arrival process of high priority, whose transactions simply preempt the server.
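The equivalence of the two characterizations can be illustrated with a minimal Python sketch. It is not part of the Arena model; the function names and the 50-hour mean are illustrative assumptions. Interfailure intervals are accumulated into failure epochs, and the counting process is then read off those epochs.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def failure_epochs(interfailure_times):
    """Cumulative sums of the interfailure intervals give the failure epochs."""
    return list(accumulate(interfailure_times))


def failures_by(t, epochs):
    """Counting process N(t): number of failures that occurred in [0, t]."""
    return bisect_right(epochs, t)


# Illustrative run: exponential interfailure times with an assumed mean of 50 hours.
rng = random.Random(42)
gaps = [rng.expovariate(1 / 50.0) for _ in range(5)]
epochs = failure_epochs(gaps)
print(epochs)
print(failures_by(100.0, epochs))
```

Going the other way, differencing consecutive epochs recovers the intervals, which is why either description determines the other.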

A typical failure scenario at a single workstation unfolds as follows. After a period of normal operation (uptime), a failure event occurs, the workstation stops its processing, and then experiences a downtime while undergoing a repair. (Sometimes an additional delay for repair setup needs to be modeled, but this delay can usually be absorbed into a longer repair time.) Failures that occur while machines are actually processing jobs are called operation dependent. However, the new breed of highly computerized machines may fail at any time, regardless of machine status. Such failures are called operation independent, and may include machine malfunctions, startups, cleanups, adjustments, and other delays specific to a particular system. The two fundamental machine states, Idle and Busy, must then be augmented by additional failure and stoppage states, such as Down, Cleaning, Adjustment, and so on.
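The distinction between the two failure types can be made concrete with a small Python sketch, written under the assumption that an operation-dependent uptime is consumed only while the machine is busy, whereas an operation-independent uptime is consumed by the clock. The function names and the busy intervals are hypothetical.

```python
def failure_time_operation_independent(start, uptime):
    """The uptime clock runs regardless of machine status."""
    return start + uptime


def failure_time_operation_dependent(busy_intervals, uptime):
    """The uptime clock runs only during busy periods.

    busy_intervals: sorted, non-overlapping (begin, end) pairs.
    Returns the clock time at which accumulated busy time reaches `uptime`,
    or None if the machine is never busy long enough to fail.
    """
    remaining = uptime
    for begin, end in busy_intervals:
        length = end - begin
        if remaining <= length:
            return begin + remaining
        remaining -= length
    return None


busy = [(0, 4), (6, 10), (15, 20)]  # illustrative busy periods
print(failure_time_operation_independent(0, 6))   # 6
print(failure_time_operation_dependent(busy, 6))  # 8
```

With the same uptime of 6 time units, the operation-dependent failure arrives later (at time 8), because the idle period between times 4 and 6 does not age the machine.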

Arena admits any number of user-defined states in addition to the built-in ones (called auto-states). In particular, an Arena resource has four auto-states: Idle, Busy, Failed, and Inactive, and at any point in time a resource is in one of these states.

A transition from one state to another is caused by an event occurrence. For instance, a failure-arrival event automatically causes the machine to undergo a transition to the Failed state. On the other hand, user-defined states and their transitions can be programmed entirely at the modeler's discretion. We are usually interested in the long-run probabilities of these states; such probabilities can be estimated via the Frequency option in the Statistic module.
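The kind of estimate that a state frequency statistic reports can be sketched in Python as the fraction of elapsed time spent in each state. This is an illustration only, not Arena's implementation, and the transition log below is hypothetical.

```python
from collections import defaultdict


def state_fractions(transitions, end_time):
    """Time-weighted fraction of [first transition, end_time] spent in each state.

    transitions: list of (time, state) pairs sorted by time; each state is
    held until the next transition (or until end_time for the last one).
    """
    totals = defaultdict(float)
    following = transitions[1:] + [(end_time, None)]
    for (t, state), (t_next, _) in zip(transitions, following):
        totals[state] += t_next - t
    horizon = end_time - transitions[0][0]
    return {state: duration / horizon for state, duration in totals.items()}


log = [(0, "Idle"), (2, "Busy"), (7, "Failed"), (9, "Busy")]  # hypothetical log
print(state_fractions(log, end_time=10))
```

For this log the resource is Idle for 2 time units, Busy for 6, and Failed for 2 out of 10, so the estimated state probabilities are 0.2, 0.6, and 0.2, respectively.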

The modeler can modulate the capacity of a resource over time by defining a time-varying capacity level in the Schedule module (see Section 5.8 for more details). In particular, to model forced downtimes (e.g., preventive maintenance), a resource can be inactivated by setting its capacity to zero, in which case the Inactive auto-state will have a non-zero probability.
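As a rough illustration of why a zero-capacity period yields a non-zero Inactive probability, the following Python sketch computes the fraction of a repeating schedule cycle during which capacity is zero. The schedule values are hypothetical, and the sketch assumes the cycle repeats indefinitely and that no capacity change is delayed by a busy resource.

```python
def inactive_fraction(schedule):
    """Fraction of one schedule cycle spent at zero capacity.

    schedule: list of (duration, capacity) pairs making up one cycle.
    """
    total = sum(duration for duration, _ in schedule)
    inactive = sum(duration for duration, capacity in schedule if capacity == 0)
    return inactive / total


# Hypothetical 8-hour shift with a half-hour preventive maintenance stop.
shift = [(4.0, 1), (0.5, 0), (3.5, 1)]  # (hours, capacity)
print(inactive_fraction(shift))  # 0.0625
```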

Suppose that Filling Process in the packaging line model of Section 11.3 fails randomly and that it needs an adjustment after every 250 departures from the workstation. Assume that uptimes (times between a repair completion and the next failure, or time to failure) are exponentially distributed with a mean of 50 hours, while repair times are uniformly distributed between 1.5 and 3 hours. Also, the aforementioned adjustment time is uniformly distributed between 10 and 25 minutes. Assume further that Packing Process can also experience random mechanical failures, and that its downtimes are triangularly distributed with a minimum of 75 minutes, a maximum of 2 hours, and a mode of 90 minutes. The corresponding uptimes are exponentially distributed with a mean of 25 hours. Finally, assume that random failures occur only while the machines are busy (operation-dependent failures). We shall refer to the modified packaging line model as the failure-modified model.
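A quick back-of-the-envelope calculation helps in judging these parameters before running the model. The Python sketch below computes the mean downtimes from the stated distributions and the expected repair time incurred per hour of busy time. The variable names are hypothetical, and the ratio is a rough figure that ignores adjustments, blocking, and starvation.

```python
def uniform_mean(low, high):
    """Mean of a uniform distribution on [low, high]."""
    return (low + high) / 2


def triangular_mean(low, mode, high):
    """Mean of a triangular distribution with the given minimum, mode, and maximum."""
    return (low + mode + high) / 3


filler_repair_h = uniform_mean(1.5, 3.0)            # 2.25 hours
adjustment_min = uniform_mean(10, 25)               # 17.5 minutes
packer_repair_min = triangular_mean(75, 90, 120)    # 95 minutes

# Operation-dependent failures: uptime accrues only while busy, so the ratio
# mean downtime / mean uptime is the expected repair time per busy hour.
filler_repair_per_busy_hour = filler_repair_h / 50.0
packer_repair_per_busy_hour = (packer_repair_min / 60.0) / 25.0

print(filler_repair_h, adjustment_min, packer_repair_min)
print(filler_repair_per_busy_hour, packer_repair_per_busy_hour)
```

Under these assumptions, Filling Process incurs about 0.045 hours of repair per busy hour and Packing Process about 0.063 hours, so the random failures of Packing Process are relatively more costly per unit of work performed.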

Figure 11.25 illustrates how failures in the failure-modified model are specified in a dialog spreadsheet for the Failure module from the Advanced Process template panel. The Name column in Figure 11.25 specifies failure names, while the Type column selects the type of failure arrivals: Time (for time-based arrivals) or Count (for count-based arrivals). For instance, random mechanical failures are normally time based, since their arrivals call for interfailure time specification. In contrast, if a machine requires a cleanup action whenever it completes processing a prescribed number of units, then the cleanup stoppage is count based. Note that in Figure 11.25, the failure/stoppage named Adjustment is declared to be count-based.

Figure 11.25 Dialog spreadsheet of the Failure module.

For time-based failures, the Up Time column specifies the time interval to the next failure (after a repair completion), while the Up Time Units column specifies the corresponding time unit. Similarly, for count-based failures, the Count column specifies the number of entity departures to the next failure (250 for the Adjustment failure). Note that unused column entries are shaded depending on failure type.

The Down Time column specifies the length of downtimes, while the Down Time Units column specifies the corresponding time unit. Finally, the Uptime in this State only column is used for time-based failures to specify the state in which the resource must be for the failure to occur. For example, the time-based failures in Figure 11.25 can only occur when the underlying resources are in the Busy state.
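The column structure just described can be summarized as a plain Python record; this is only a mnemonic sketch, not an Arena interface. The expression strings follow Arena's EXPO, UNIF, and TRIA conventions but are our assumption of how the entries might be typed, since the figure is not reproduced here, and the failure name Random Failures_P for Packing Process is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureRow:
    """One row of a Failure-module-like spreadsheet (illustrative only)."""
    name: str
    type: str                               # "Time" or "Count"
    down_time: str
    down_time_units: str
    up_time: Optional[str] = None           # time-based failures only
    up_time_units: Optional[str] = None     # time-based failures only
    count: Optional[int] = None             # count-based failures only
    uptime_in_state: Optional[str] = None   # time-based failures only

    def validate(self) -> None:
        """Check that only the columns relevant to the failure type are filled."""
        if self.type == "Time":
            if self.up_time is None or self.up_time_units is None:
                raise ValueError(f"{self.name}: time-based failure needs an up time")
            if self.count is not None:
                raise ValueError(f"{self.name}: Count is unused for time-based failures")
        elif self.type == "Count":
            if self.count is None or self.count <= 0:
                raise ValueError(f"{self.name}: count-based failure needs a positive count")
            if self.up_time is not None or self.uptime_in_state is not None:
                raise ValueError(f"{self.name}: Up Time columns are unused for count-based failures")
        else:
            raise ValueError(f"{self.name}: unknown type {self.type!r}")


FAILURES = [
    FailureRow("Random Failures_F", "Time", "UNIF(1.5, 3)", "Hours",
               up_time="EXPO(50)", up_time_units="Hours", uptime_in_state="Busy"),
    FailureRow("Adjustment", "Count", "UNIF(10, 25)", "Minutes", count=250),
    # Hypothetical name for the Packing Process failure.
    FailureRow("Random Failures_P", "Time", "TRIA(75, 90, 120)", "Minutes",
               up_time="EXPO(25)", up_time_units="Hours", uptime_in_state="Busy"),
]

for row in FAILURES:
    row.validate()
```

The validation step mirrors the shading of unused columns in the dialog spreadsheet: a count-based row carries no uptime entries, and a time-based row carries no count.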

Arena provides a mechanism for defining resource states and for linking them to failures/stoppages in the form of the StateSet spreadsheet module from the Advanced Process template panel. Figure 11.26 illustrates the use of the StateSet module for the packaging line model of Section 11.3. The left dialog spreadsheet of Figure 11.26 specifies a state set name (under the Name column) and the number of associated states (under the States column). For example, the first row specifies a state set, called Filling States, for the Filler resource with 4 states (the button labeled 4 rows). Clicking that button pops up the dialog box to the right, which displays detailed state information. There, the column State Name displays user-defined or auto-state names, while the column AutoState or Failure indicates the association of each state name with the corresponding auto-state or user-defined failure name.

Finally, Figure 11.27 illustrates for the failure-modified model how the Resource module is used to associate resource states with failures and the action to be taken on failure occurrences. The bottom dialog spreadsheet of Figure 11.27 specifies in the first row that resource Filler (column Name) has 2 types of failures (the button labeled 2 rows in column Failures). Clicking that button pops up the dialog spreadsheet at the top of Figure 11.27, which shows that resource Filler has two failures, called Random Failures_F and Adjustment (column Failure Name). The Failure Rule column specifies the requisite action to be taken on the unit entity (or entities) being processed when a time-based failure occurs (it does not apply to count-based failures). Actions are specified via options, the most common of which follow:

Figure 11.26 Dialog spreadsheets of the StateSet module (left) and resource Filler states (right).


Figure 11.27 Resource dialog spreadsheet (bottom) and Failures dialog spreadsheet (top) in the failure-modified packaging line model.

The Preempt option starts a downtime by suspending the resource immediately on failure arrival, so that the remaining processing of the current unit entity will resume once the downtime is over.

The Wait option allows the current unit entity to finish processing, after which the resource is suspended and downtime begins.

The Ignore option starts the downtime after the current unit entity finishes processing. However, only that portion of the downtime following the current unit entity completion is recorded (in contrast, the Wait option records the full downtime).
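The three options can be compared on a single job with the following Python sketch. It is our reading of the descriptions above, not Arena code: in particular, the Ignore branch assumes that the downtime clock starts at failure arrival, so that only the portion remaining after the job completes is experienced. The function and field names are hypothetical.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    job_completion: float   # time at which the current unit entity finishes
    downtime_start: float   # time at which the resource actually goes down
    downtime_end: float     # time at which the resource is available again


def apply_failure_rule(rule, job_start, service_time, failure_time, downtime):
    """Resolve one failure that arrives while a job is in process."""
    normal_completion = job_start + service_time
    if not (job_start <= failure_time < normal_completion):
        raise ValueError("failure must arrive while the job is in process")
    if rule == "Preempt":
        # Suspend at once; the job resumes after the downtime and finishes late.
        return Outcome(normal_completion + downtime, failure_time, failure_time + downtime)
    if rule == "Wait":
        # The job finishes first; the full downtime follows.
        return Outcome(normal_completion, normal_completion, normal_completion + downtime)
    if rule == "Ignore":
        # The job finishes first; only the leftover part of the downtime remains.
        end = max(normal_completion, failure_time + downtime)
        return Outcome(normal_completion, normal_completion, end)
    raise ValueError(f"unknown rule {rule!r}")


# A job starts at time 0 and needs 10 time units; a failure with an
# 8-unit downtime arrives at time 4.
for rule in ("Preempt", "Wait", "Ignore"):
    print(rule, apply_failure_rule(rule, 0, 10, 4, 8))
```

In this example the job finishes at time 18 under Preempt and at time 10 under the other two options, while the resource becomes available again at time 12 under Preempt and Ignore but only at time 18 under Wait.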

In our example, failure Random Failures_F will apply the Preempt option, while failure Adjustment, being count-based, will formally apply the Wait option, even though any other option could have been selected (recall that options do not apply to count-based failures).

The state probabilities of each resource can be obtained by requesting the collection of Frequency statistics in the Statistic module with the State option selected in its Frequency Type column. Figure 11.28 displays the resulting Frequencies report for the failure-modified model.

It is instructive to compare the performance of the original packaging line model of Section 11.3 with its failure-modified version. Everything else being the same, we expect the failure-modified model to show poorer performance than the original model. Indeed, a comparison of Figure 11.28 to Figure 11.12 reveals that the resources downstream of the Filler resource in the failure-modified model experience markedly increased idleness. The explanation of this outcome is straightforward. Since the Filler resource in the failure-modified model is shut down by random failures, it cannot produce as much as in the original model. Consequently, downstream processes are more frequently starved, leading to overall increased idleness in the system. In a similar vein, Figure 11.29 displays the resultant User Specified report for the failure-modified model.

Figure 11.28 Frequencies report for the failure-modified packaging line model.

A comparison of Figure 11.29 with Figure 11.11 reveals additional evidence that the failure-modified model performs at a lower level than the original one. Because Filling Process has increased idleness in the failure-modified model, its mean interdeparture time increases to about 12.5 seconds (as opposed to 8 seconds in the original model), and concomitantly, its system throughput is only about 0.08 units per second (as opposed to 0.125 units per second in the original model). Thus, this example amply demonstrates the deleterious effects of machine failures and stoppages on system performance measures. The economic consequences of the resultant system performance must necessarily follow suit.
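The throughput figures quoted above are simply the reciprocals of the mean interdeparture times, as the following short Python check shows. The variable names are ours, and the relative loss is derived from the rounded values in the text.

```python
def throughput(mean_interdeparture_seconds):
    """Long-run throughput (units per second) as the reciprocal of the mean interdeparture time."""
    return 1.0 / mean_interdeparture_seconds


original = throughput(8.0)          # 0.125 units per second
failure_modified = throughput(12.5)  # 0.08 units per second
relative_loss = 1.0 - failure_modified / original

print(original, failure_modified, round(relative_loss, 2))
```

On these rounded figures, failures and stoppages cost the line roughly 36 percent of its original throughput.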

The impact of machine failures on system performance and the reduction of this impact by placing buffers between machines have been extensively studied. For further discussions on the subject, see Buzacott and Shanthikumar (1993), Gershwin (1994), Altiok (1997), and Papadopoulos et al. (1993). Analysis of manufacturing networks with infinite buffers is found in Whitt (1983).