RELIABILITY ANALYSIS OF REPAIR TIME DATA

by

BURHANUDDIN MOHD ABOOBAIDER

Thesis submitted in fulfillment of the requirements for the degree of

Master of Science



ABSTRACT

Troubleshooting time may vary with the background characteristics of the machines and the repair crews. Reliability analysis of troubleshooting time is important to ensure that operating units achieve maximum performance over time with minimum breakdowns. Identifying the risk factors provides meaningful information on the causes of lost time, which can be used to evaluate maintenance activities and coordination effectiveness. The present study examines these potential risk factors and provides the inputs needed to improve maintenance operation performance. The study proposes an alternative way of assessing dynamic repair crew performance, based on quantitative measures of the progress of their repair time. Once repair time patterns are detected, the service provider can give the repair crews valid feedback on how current results compare with expected results. This can lead the team towards breakthrough performance. The case study uses a data set of 1169 air conditioner maintenance records from 2001 at the University Science Malaysia (USM) main campus in Penang. The sample consists of repair time data, background characteristics of the technicians and some information on the air conditioner type, i.e., split units and window units. Repair time is modelled with reliability analysis using both non-parametric and semi-parametric approaches. The estimates of the reliability function for the repair time problem are critically examined and the major findings are highlighted in this study. The Kaplan-Meier product-limit method is used to estimate the reliability functions with stratified and competing risks models. The reliability estimates enable us to classify the technicians into three major groups based on their respective troubleshooting time performance. The Cox proportional hazards model is fitted to examine the relationships between repair time and various risk factors of interest. This study shows how to use the model to measure the risk ratios of repair time delay for the risk factors, i.e., the technicians’ experience, qualifications and ages. A key finding from the reliability estimates is that experienced technicians reduce repair time. The results also show that a larger proportion of moderate and complex problems can be resolved by technically qualified technicians. The results can be used as a benchmark for developing quality services and products and for enhancing competitiveness. The present study sets the direction for future work on developing models for reliability analysis of delays in breakdown maintenance. The Statistical Analysis System (SAS) package and Microsoft Excel were used in this study.



CHAPTER 1 INTRODUCTION

1.1 Background

In general, failure time, or downtime, can be defined as the total amount of time the equipment is out of service owing to a failure, from the moment it fails until the moment it is fully repaired and operating again. Repair activities take place between the failure time and the time the item is reassembled. Repair is defined as restoring the item by replacing parts or by sorting out which parts have failed. Downtime of a machine in production results in capacity loss, poor product quality and customer dissatisfaction. How often such downtimes occur usually depends on the effectiveness of the reliability programs executed by the organization.

International variations in production lines, environmental factors and technology, along with the uniqueness of product construction, make international construction comparisons onerous but not impossible. Such comparisons can provide solutions and approaches that may lead to performance improvements in the operation of repairable and non-repairable systems across the globe. Today, the time-based quality concept is very important to customers and is widely applied, as an interdisciplinary concept, to products, systems, processes and services in various fields. The performance of a maintenance process in bringing a machine back up is very difficult to measure accurately. The outcome of a service process is inherently much more inconsistent in quality than that of its manufacturing counterparts, and the service process is difficult and expensive to control.

The changing world of maintenance is discussed by Moubray (1997). Over the past twenty years, maintenance has changed, perhaps more than any other management discipline. This is due to huge increases in the number and variety of physical assets with complex designs, which must be maintained throughout the world.

From its beginning in the 1930s, the evolution of maintenance can be traced through three generations. The first generation covers the period up to World War II, when industries were not highly mechanized and downtime was not a serious concern. Most equipment was also simple and designed for specific missions only, which made it easy to repair. Most maintenance activities involved only cleaning, servicing and lubrication routines. Things changed dramatically in the second generation, during World War II. Wartime pressures increased the demand for goods of all kinds while the supply of industrial manpower dropped sharply, which led to an increase in mechanization. By the 1950s, equipment had become more numerous and more complex in design, and any failure would have a substantial impact on the operation, which made downtime an important issue. This led to the idea that equipment failures should be prevented, and this in turn led to the growth of maintenance planning and control systems.

The third generation began in the mid-seventies, when the pace of change in the industrial sectors increased. Most equipment became capable of multipurpose and multitasking activities, which makes repair activities more sophisticated. The rapid growth of mechanization and automation means that reliability and availability have become key issues in sectors as diverse as health care, data processing and telecommunications, and they remain so today. A major challenge today is that maintenance is going global, with virtual factory concepts being widely introduced. Tremendous innovations in networking and information communication technology make these challenges achievable in the real world. Advanced techniques that implement backup and standby strategies, as well as computer simulation and automatic monitoring, are widely used. To survive in the global market, a maintenance crew should be able to repair equipment remotely using factory automation tools such as teamstation, LANDesk, remote power management, PCAnywhere and other remote control devices.

The cost of maintenance itself is still rising along with technology, both in absolute terms and as a proportion of total expenditure. In some industries, operating cost is one of the highest spending elements in production. Thus, achieving a low production cost includes not only the study of techniques, but also the decision on which items are worthwhile to prioritize in the respective functional groups of the organization. Moubray (1997) also discussed some new developments in maintenance. In addition, the following important points are highlighted:

i) Designing equipment with a much greater emphasis on reliability such as introducing backup and standby strategies;

ii) Team work and flexibility to optimize the maintenance crew performance. (See Burhanuddin (2003a));



iii) Decision support tools, such as hazards function studies, failure modes and effects analysis. (See Burhanuddin (2003b));

iv) Expert systems, such as automatic condition monitoring and remote maintenance management control.

Since then, more thorough analyses have been developed, such as failure complexity studies, root cause failure analysis, response time analysis, repair time analysis and delay time analysis. This study covers repair time analysis using statistical reliability theory and methodology. Details of the repair time framework are given in Chapter 3.

1.2 Troubleshooting Work Flow

The reliability of a machine has been defined by Bentley (1993) as the probability that the machine continues to meet the specifications, over a lifetime, subject to given environmental conditions. In contrast, the unreliability of a machine can be defined as the probability that the machine fails to meet the specifications, over a lifetime, subject to given environmental conditions. Both reliability and unreliability vary with machine lifetime intervals. The time a given machine spends in a failed state is not constant, but varies randomly with the circumstances.
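The two definitions above are complementary, which can be written compactly. The notation below is an assumption introduced only for illustration: T denotes the time until the machine first fails to meet the specifications, R(t) the reliability and F(t) the unreliability at time t.

```latex
R(t) = P(T > t), \qquad F(t) = P(T \le t) = 1 - R(t), \qquad t \ge 0
```

Under this notation, R(t) can only stay level or decrease as t grows, while F(t) moves in the opposite direction, which is one way to read the statement that both quantities vary with the lifetime interval.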

Analysis can be simplified by separating the failure interval into two major parts, i.e. response time and repair time. When users purchase machines from a manufacturer and install them, only the initial cost is invested. Beyond the warranty period, the user also has to pay all the costs of system failures throughout the lifetime of the system. Thus, repair activities during downtime should take as little time as possible over the machine lifetime. The present study proposes the analysis of troubleshooting time using reliability measures.
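The separation of a failure interval into response time and repair time can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function name, the three timestamps and the job order values are hypothetical, and response time is assumed here to run from the breakdown report until the technician starts work.

```python
from datetime import datetime


def split_downtime(reported, work_started, restored):
    """Split one failure interval into (response, repair) time in hours.

    Assumption: response time runs from the report to the start of work,
    and repair time from the start of work until the unit is restored.
    """
    response = (work_started - reported).total_seconds() / 3600
    repair = (restored - work_started).total_seconds() / 3600
    return response, repair


# Hypothetical job order: reported 08:00, work starts 09:30, restored 11:00.
reported = datetime(2001, 3, 5, 8, 0)
started = datetime(2001, 3, 5, 9, 30)
restored = datetime(2001, 3, 5, 11, 0)

response_h, repair_h = split_downtime(reported, started, restored)
downtime_h = response_h + repair_h  # total failure interval in hours
```

Keeping the two parts separate means that the repair time alone can be passed on to the reliability analysis, without mixing in delays that occur before the technician arrives.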

In practice, if any machine fails, it will have an impact on the operating envelope of plant systems and may lead to product quality problems. According to Moubray (1997), the initial clarification of the root causes in troubleshooting activities is usually the most difficult part of the investigation. For example, in some cases the root cause of a failure may go undetected for days, weeks or even months. The second difficulty is isolating the specific point where defects or deviations occur. As a result, the troubleshooting process must evaluate multiple process areas until the source of the problem can be isolated with certainty and repaired. Capacity restrictions or loss of production capacity are another ideal application of troubleshooting activities. A logical step-by-step approach can be used, combined with its verification testing methods, to isolate the complexity and maturity of the manufacturing maintenance process. All these processes lengthen repair time, and the delays can be thought of as the direct or indirect influence of either internal or external risk factors. These factors can be associated with the dependent variable, the repair time interval.

A thorough investigation may require further analysis of one or more of the production areas that precede the suspect process. Reliability analysis of past repair times can be used to estimate and forecast a strategy for facing a similar kind of failure in the future. Machine failures and troubleshooting activities in the maintenance field are categorized as repetitive measures, where customers report any breakdown issue to the service providers. Service providers assign the job orders to the technicians. Technicians take the job orders, plan the work and repair the machines until they are operating again. The repeated, case-by-case nested loop of the job flow is shown in Figure 1.1. Sometimes a follow-up is needed, as a machine may suffer wear, tear and repair, and may not return to its original state within the desired time.

Figure 1.1. Job Flow Chart

[Figure placeholder: flow chart with the nodes Start, Job Orders, Perform Troubleshooting, Solve, Follow Up, Close Order and Waiting Order, with yes/no (y/n) branch labels.]

1.3 How Downtime Affects Operations

Downtime affects the production capability of physical assets by reducing output, increasing operating costs and interfering with customer service. The primary function of most machines in industry is connected in some way with the need to earn revenue or to support revenue earning activities. Failures which affect the primary functions of these assets affect the revenue earning capability of the organization. Moubray (1997) explained that the effects of downtime are much greater than the cost of repairing the failures. This is also true for equipment in service industries such as an education centre, a commercial centre, a bank and even an entertainment centre.

For example, if the air conditioning fails in a university’s classroom, students will walk out, which causes lost time and reflects on the reputation of the university’s facilities. The same applies to a bank when customers walk out. This causes business loss and also affects the bank’s reputation. If the lights fail at a ballgame, fans tend to demand the return of their money. The same applies if projectors fail at the movies.

Moubray (1997) gave five ways in which downtime can affect operations:

i) Customer service, which reflects the reputation of the service providers. Poor service may cause customers to take their business elsewhere;

ii) Total output, where the production people have to work extra time to recover the volume, or sales are lost if the plant is already fully loaded;

iii) Product quality, where failure causes materials to deteriorate. Products which do not reach certain quality specifications have to be either rejected or recycled;

iv) Increased operating costs, in addition to the direct cost of repair, through increased use of energy or through switching to a more expensive alternative process; and

v) Sales, where the prices of the end products have to be increased to recover the higher production loss.



1.4 Objectives

The objectives of the present study are:

i) To provide alternative ways of computing dynamic performance measures suggested by Ahmed (2002) based on repair time data;

ii) To conduct some repair time analysis using the collective approach failure analysis proposed by Choy et al. (1996);

iii) To investigate some relationships among the risk factors in air conditioning repair work in USM using hazard models. These relationships can be obtained using statistical reliability models to identify some significant characteristics as a basis for pushing man and machine to the optimal limit;

iv) To demonstrate the usage of non-parametric and semi-parametric approaches for the estimation of the pattern of failure occurrences.
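As an illustration of the non-parametric approach named in objective iv, the following is a minimal sketch of a product-limit (Kaplan-Meier) estimate. It is written in Python rather than the SAS package used in the study, and the function name, the data values and the censoring flags are all hypothetical. The "event" is assumed to be the completion of a repair, so the estimate at time t is the probability that a repair takes longer than t.

```python
from collections import Counter


def kaplan_meier(durations, completed):
    """Product-limit estimate as [(t, S(t))] at each time with a completion.

    durations: repair times; completed: True if the repair was observed to
    finish at that time, False if the record is censored (assumed coding).
    """
    events = Counter(t for t, c in zip(durations, completed) if c)
    exits = Counter(durations)      # everyone leaving the risk set at t
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d > 0:
            surv *= 1.0 - d / at_risk   # product-limit step
            curve.append((t, surv))
        at_risk -= exits[t]             # drop completed and censored records
    return curve


# Hypothetical repair times in hours; False marks a censored record.
hours = [1, 2, 2, 3, 4]
done = [True, True, False, True, True]
curve = kaplan_meier(hours, done)
```

Computing one such curve per technician or per group is one way to compare troubleshooting performance, since a curve that drops faster indicates that repairs tend to finish sooner.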

1.5 Thesis Layout

This thesis is divided into five chapters. The first chapter is an introduction to the thesis. Chapter two provides a brief insight into the existing models for maintenance. Background research on proactive and reactive maintenance is reviewed, together with a few important reliability and availability measures.



Chapter three discusses troubleshooting assessment for most engineering components and elements. Background research on delay in troubleshooting activities is highlighted, with some extensions to non-parametric and semi-parametric approaches.

Chapter four covers the details of the case study analysis. This chapter describes some findings on repair activities under the influence of the risk factors, namely the age, qualification and experience of the technicians. Non-parametric reliability estimation can be used as a set of self-monitoring tools for the repair crew to implement a breakthrough system, which contributes to information sharing and team work. A demonstration of semi-parametric estimation shows how to predict the risk ratio using the proportional hazards model. The estimates can serve as a guideline for policies and decision-making processes, such as setting hiring targets and the type of training for the technical staff.
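For readers unfamiliar with the term, the risk ratio mentioned above has a standard form in the proportional hazards model. The notation is an assumption introduced here for illustration: h_0(t) is an unspecified baseline hazard, x_1, ..., x_p are covariates such as age, qualification and experience, and the beta coefficients are estimated from the data.

```latex
h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \cdots + \beta_p x_p),
\qquad
\frac{h(t \mid x_j + 1)}{h(t \mid x_j)} = \exp(\beta_j)
```

The ratio on the right compares two records that differ by one unit in covariate x_j with all other covariates held fixed, and it does not depend on t. How the covariates are coded in the case study is described in the thesis itself and is not assumed here.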

Chapter five discusses the overall conclusion and the significance of the present study. This chapter also describes some limitations of the present study and gives suggestions for possible modifications and additions that could be incorporated for improvement in future research.

