3.8.2 Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm was originally developed by Dempster et al. (1977) for finding the maximum-likelihood estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values (Bilmes, 1998). To demonstrate this algorithm using our data, we start with the general notation of the EM algorithm given by Bilmes (1998). Assume data $y$ is observed, called the incomplete data, and let $x$ denote the hidden or missing values. A complete dataset $z = (x, y)$ is assumed to exist, with the joint density function

$$p(z \mid \theta) = p(x, y \mid \theta) = p(x \mid y, \theta)\, p(y \mid \theta) \tag{3-30}$$

Here we assume both $x$ and $y$ are continuous. This joint density function describes the joint relationship between the missing and observed values. With this density function, we can define a likelihood function of the dataset $z = (x, y)$ as

$$L(\theta \mid z) = L(\theta \mid x, y) = p(x, y \mid \theta) \tag{3-31}$$

which is called the complete-data likelihood. The EM algorithm consists of two steps. The first is to find the expected value of the complete-data log-likelihood $\log p(x, y \mid \theta)$ with respect to the unknown state $x$, given the observed data $y$ and the current parameter estimate $\theta^{(j)}$. We define this expected log-likelihood as

$$Q(\theta, \theta^{(j)}) = E\left[\log p(x, y \mid \theta) \mid y, \theta^{(j)}\right] \tag{3-32}$$

where

$$E\left[\log p(x, y \mid \theta) \mid y, \theta^{(j)}\right] = \int_{x \in X} \log p(x, y \mid \theta)\, p(x \mid y, \theta^{(j)})\, dx \tag{3-33}$$

and where $X$ is the complete set of values $x$ can take. The second step is to find the maximum of the expectation computed in the first step, that is, $\theta^{(j+1)} = \arg\max_{\theta} Q(\theta, \theta^{(j)})$. These two steps are repeated until $|Q^{(j+1)} - Q^{(j)}| \le tol$, where $tol$ is a pre-set tolerance level.

Applying this to our case study, given the observed values, the complete-data likelihood has a form in which $p(y_i \mid x_i, Y_{i-1}, \theta)$ describes the observation values and $P(x_i \mid x_{i-1}, \theta)$ describes the hidden states. Note that $X_i$ is a discrete random variable in our case, so the integral in equation 3-33 becomes a summation. To estimate the parameters from one life cycle of data using the EM approach, the algorithm is as follows.

Start: set an initial estimate of $\theta$.

E-step: calculate the conditional expectation

$$Q(\theta, \theta^{(j)}) = \sum_{i=1}^{T} E\left[\log\left\{p(y_i \mid x_i, Y_{i-1}, \theta)\, P(x_i \mid x_{i-1}, \theta)\right\} \mid Y_i, \theta^{(j)}\right]$$

$$= \sum_{i=1}^{T} \sum_{x_i = 1}^{N} \sum_{x_{i-1} = 1}^{N} \log\left[p(y_i \mid x_i, Y_{i-1}, \theta_1)\, P(x_i \mid x_{i-1}, \theta_2)\right] P(x_i \mid Y_i, \theta^{(j)})\, P(x_{i-1} \mid Y_{i-1}, \theta^{(j)})$$

where $\theta = (\theta_1, \theta_2)$ is the set of unknown parameters, $T$ is the number of monitoring checks during the cycle, $N$ is the number of states, $y_i$ is the condition monitoring reading at time $t_i$, and $Y_{i-1} = \{y_1, y_2, \ldots, y_{i-1}\}$.

M-step: maximize $Q(\theta, \theta^{(j)})$ to obtain the next estimate, $\theta^{(j+1)} = \arg\max_{\theta} Q(\theta, \theta^{(j)})$.

Similarly, if we have $m$ life cycles, the algorithm becomes the following (cycles are indexed by $k$ to avoid confusion with the EM iteration index $j$).

Start: set an initial estimate of $\theta$.

E-step: calculate the conditional expectation

$$Q(\theta, \theta^{(j)}) = \sum_{k=1}^{m} \sum_{i=1}^{T_k} E\left[\log\left\{p(y_{ki} \mid x_{ki}, Y_{k,i-1}, \theta)\, P(x_{ki} \mid x_{k,i-1}, \theta)\right\} \mid Y_{ki}, \theta^{(j)}\right]$$

$$= \sum_{k=1}^{m} \sum_{i=1}^{T_k} \sum_{x_{ki} = 1}^{N} \sum_{x_{k,i-1} = 1}^{N} \log\left[p(y_{ki} \mid x_{ki}, Y_{k,i-1}, \theta_1)\, P(x_{ki} \mid x_{k,i-1}, \theta_2)\right] P(x_{ki} \mid Y_{ki}, \theta^{(j)})\, P(x_{k,i-1} \mid Y_{k,i-1}, \theta^{(j)})$$

where $m$ is the number of complete life cycles, $\theta = (\theta_1, \theta_2)$ is the set of unknown parameters, $T_k$ is the number of monitoring checks of the $k$th cycle, $y_{ki}$ is the condition monitoring reading at $t_i$ for the $k$th cycle and the $i$th monitoring, $x_{ki}$ and $x_{k,i-1}$ are the states of the system at $t_i$ and $t_{i-1}$ for the $k$th cycle respectively, and $Y_{k,i-1} = \{y_{k1}, y_{k2}, \ldots, y_{k,i-1}\}$.

M-step: maximize $Q(\theta, \theta^{(j)})$ to obtain the next estimate, $\theta^{(j+1)} = \arg\max_{\theta} Q(\theta, \theta^{(j)})$.
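To make the two steps concrete, the following is a minimal Python sketch of this E-step/M-step loop for a hypothetical three-state deterioration chain (good, defective, failed) with Gaussian condition-monitoring readings. The transition structure, the emission model and its state means, the simulated readings, and all function names are illustrative assumptions rather than the exact model of this chapter; the filtered probabilities $P(x_i \mid Y_i, \theta^{(j)})$ come from a standard forward recursion, and the M-step is delegated to a general-purpose optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def trans_matrix(lam1, lam2):
    """One-step transition matrix of the assumed 3-state chain:
    state 1 (good) -> 2 (defective) at rate lam1, 2 -> 3 (failed) at lam2."""
    p12 = 1.0 - np.exp(-lam1)
    p23 = 1.0 - np.exp(-lam2)
    return np.array([[1 - p12, p12,     0.0],
                     [0.0,     1 - p23, p23],
                     [0.0,     0.0,     1.0]])

def emission(y, mu, sigma):
    """Assumed Gaussian density p(y_i | x_i) with a state-dependent mean."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def filtered_probs(y, theta):
    """Forward recursion for the filtered probabilities P(x_i | Y_i, theta)."""
    lam1, lam2, mu2, sigma = np.abs(np.asarray(theta, dtype=float))
    A = trans_matrix(lam1, lam2)
    mu = np.array([0.0, mu2, 2.0 * mu2])   # illustrative state means
    f = np.zeros((len(y), 3))
    prior = np.array([1.0, 0.0, 0.0])      # system assumed as-new at t_0
    for i, yi in enumerate(y):
        pred = prior @ A                   # P(x_i | Y_{i-1}, theta)
        post = pred * emission(yi, mu, sigma)
        f[i] = post / (post.sum() + 1e-300)
        prior = f[i]
    return f

def Q(theta, theta_j, y):
    """E-step objective: expected complete-data log-likelihood, with the
    hidden-state weights taken from the current estimate theta_j."""
    lam1, lam2, mu2, sigma = np.abs(np.asarray(theta, dtype=float))  # crude positivity guard
    A = trans_matrix(lam1, lam2)
    mu = np.array([0.0, mu2, 2.0 * mu2])
    f = filtered_probs(y, theta_j)
    total = 0.0
    for i in range(1, len(y)):             # the i = 0 term (known initial state) is omitted
        for xi in range(3):
            for xp in range(3):
                w = f[i, xi] * f[i - 1, xp]
                if w > 0.0 and A[xp, xi] > 0.0:
                    total += w * (np.log(emission(y[i], mu[xi], sigma)) + np.log(A[xp, xi]))
    return total

def em_step(theta_j, y):
    """M-step: theta^{(j+1)} = argmax_theta Q(theta, theta^{(j)})."""
    res = minimize(lambda th: -Q(th, theta_j, y), theta_j, method="Nelder-Mead")
    return res.x

# Iterate the E- and M-steps on one cycle of made-up readings until
# |Q^{(j+1)} - Q^{(j)}| <= tol, the stopping rule used in the text.
y = np.array([0.05, 0.1, 0.2, 0.7, 0.9, 1.4, 1.8])
theta, tol, q_old = np.array([0.05, 0.025, 1.0, 0.3]), 1e-6, -np.inf
for _ in range(50):
    theta = em_step(theta, y)
    q_new = Q(theta, theta, y)
    if abs(q_new - q_old) <= tol:
        break
    q_old = q_new
print("final estimate:", theta)
```

The weights in this sketch are products of filtered marginals, $P(x_i \mid Y_i, \theta^{(j)})\, P(x_{i-1} \mid Y_{i-1}, \theta^{(j)})$, mirroring the E-step above; extending it to $m$ cycles simply sums the same objective over the cycles, as in the multi-cycle algorithm.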
Applying this to our simulated data without using the failure information, we obtained the results in Table 3-4 below.

Parameter           Estimated value    True value
$\hat{\sigma}_1$    0.2033             0.2176
$\hat{\alpha}$      3.1886             4.0
$\hat{b}$           0.8013             0.8
$\hat{\lambda}_1$   0.0239             0.05
$\hat{\lambda}_2$   not available      0.025

Table 3-4: The estimated parameters and true values

We expected the EM approach to perform better since, in theory, we cannot observe the hidden state, and the algorithm is guaranteed to converge to a local maximum of the likelihood function (Dempster et al., 1977). Surprisingly, the EM procedure explained above produces almost the same result as the ordinary likelihood function without the failure information; see Table 3-2. It is believed that the complete-data likelihood function in the E-step is not enough to describe the data. Therefore, failure information was added to the function for the E-step, as follows:

$$Q(\theta, \theta^{(j)}) = \sum_{i=1}^{T} E\left[\log\left\{p(y_i \mid x_i, Y_{i-1}, \theta)\, P(x_i \mid x_{i-1}, \theta)\right\} \mid Y_i, \theta^{(j)}\right] + \log P(X_{T+1} = 3 \mid Y_T, \theta) \tag{3-34}$$

where $t_{T+1}$ is the failure point. When the same procedure was reapplied to the $m$ sets of data, it generated a set of new estimates, shown in Table 3-5 below.

Parameter           Estimated value    True value
$\hat{\sigma}_1$    0.2032             0.2176
$\hat{\alpha}$      3.1886             4.0
$\hat{b}$           0.8013             0.8
$\hat{\lambda}_1$   0.0239             0.05
$\hat{\lambda}_2$   0.016              0.025

Table 3-5: The estimated parameters and true values with the modified $Q(\theta, \theta^{(j)})$

Again, the EM procedure produces almost the same results as the ordinary likelihood function with the failure information. This is because the approach taken to formulate the EM algorithm is almost the same as that for the ordinary likelihood function, except that in equation 3-34 each term is weighted by the conditional probabilities of the hidden states under the current estimate $\theta^{(j)}$ to assist the optimisation calculation. We were able to estimate $\hat{\lambda}_1$ simply because we had the information about when the time $l_1$ began and ended. In contrast, we were unable to estimate $\hat{\lambda}_2$ because we did not have enough information concerning when the time $l_2$ ended. The failure information, however, provides new data on the ending of the time $l_2$, allowing the parameter $\hat{\lambda}_2$ to be estimated. In practice, failure information is difficult to obtain because of preventive replacements, but expert judgements (Zhang, 2004) can be used to overcome this problem.
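Continuing the earlier sketch, the hypothetical function below shows one way the failure term of equation 3-34 could be appended to the E-step objective: the last filtered state distribution is propagated one step under the assumed transition matrix, and the log-probability of the failed state (state 3 in the sketch) is added. The state numbering and the one-step prediction are assumptions for illustration, not the exact construction used here.

```python
EPS = 1e-12  # numerical floor to keep the logarithm finite

def Q_with_failure(theta, theta_j, y):
    """Sketch of equation 3-34: Q(theta, theta_j) plus the log-probability
    that the system has reached the failed state at the failure point t_{T+1}."""
    lam1, lam2 = np.abs(np.asarray(theta, dtype=float))[:2]
    A = trans_matrix(lam1, lam2)
    # One-step prediction of the state at the failure time, propagated
    # from the last filtered distribution P(x_T | Y_T, theta).
    p_fail = (filtered_probs(y, theta)[-1] @ A)[2]
    return Q(theta, theta_j, y) + np.log(p_fail + EPS)

# Usage: swap Q_with_failure for Q inside em_step whenever the monitored
# cycle ended in an observed failure rather than a preventive removal.
```

The extra term contributes information about when the second sojourn time ends, which is what allows the final parameter in Table 3-5 to be estimated.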

3.9 Goodness-of