Maximum Likelihood Programming in R

  Marco R. Steenbergen

  

Department of Political Science

University of North Carolina, Chapel Hill

  January 2006

  Contents

  1 Introduction
  2 Syntactic Structure
    2.1 Declaring the Log-Likelihood Function
    2.2 Optimizing the Log-Likelihood
  3 Output
  4 Obtaining Standard Errors
  5 Test Statistics and Output Control

  1 Introduction

The programming language R is rapidly gaining ground among political methodologists. A major reason is that R is a flexible and versatile language, which makes it easy to program new routines. In addition, R algorithms are generally very precise.

  R is well-suited for programming your own maximum likelihood routines. Indeed, there are several procedures for optimizing likelihood functions. Here I shall focus on the optim command, which implements the BFGS and L-BFGS-B algorithms, among others. (The optim command also includes Nelder-Mead, conjugate gradients, and simulated annealing algorithms; other optimization routines include optimize, nlm, and constrOptim. These procedures are not discussed here.) Optimization through optim is relatively straightforward, since it is usually not necessary to provide analytic first and second derivatives. The command is also flexible, as likelihood functions can be declared in general terms instead of being defined in terms of a specific data set.

  2 Syntactic Structure

Estimating likelihood functions entails a two-step process. First, one declares

the log-likelihood function, which is done in general terms. Then one optimizes

the log-likelihood function, which is done in terms of a particular data set. The

log-likelihood function and optimization command may be typed interactively

into the R command window or they may be contained in a text file. I would

recommend saving log-likelihood functions into a text file, especially if you plan

on using them frequently.

  2.1 Declaring the Log-Likelihood Function

  

The log-likelihood function is declared as an R function. For our purposes, this function takes at least two arguments. First, it requires a vector of parameters. Second, it requires at least one data object. Note that other arguments can be added if they are necessary. The data object is a generic placeholder for data. In the optim command, specific data are substituted for this placeholder.

  After the arguments are declared, the actual log-likelihood is expressed and demarcated by {}. Thus, we have the following syntax:

name<-function(pars,object){
  declarations
  logl<-loglikelihood function
  return(-logl)
}

Here name is the name of the log-likelihood function, pars is the name of the parameter vector, and object is the name of the generic data object. The instructions placed between brackets define the log-likelihood function. At a minimum, there should be two elements here: (1) the declaration of the log-likelihood function, which is named logl, and (2) the return of negative one times the log-likelihood. (We ask for −1 × l because the optim command minimizes a function by default; minimization of −l is the same as maximization of l, which is what we want.) In addition, it may be necessary to make other declarations. These may include partitioning a parameter vector or declaring temporary variables that figure in the log-likelihood function. The application of this syntax will be clarified using several examples.

  

  Example 1: Consider the Poisson log-likelihood function, which is given by

  l = Σ_i y_i ln(µ) − nµ − Σ_i ln(y_i!)

Since the last term does not include the parameter, µ, it can be safely ignored. Thus, the kernel of the log-likelihood function is

  l = Σ_i y_i ln(µ) − nµ

We can program this function using the following syntax:

poisson.lik<-function(mu,y){
  n<-length(y)
  logl<-sum(y)*log(mu)-n*mu
  return(-logl)
}

  

Here poisson.lik is the name of the log-likelihood function; this name will be used in the optim command. The “vector” of parameters is called mu; this is not really a vector since there is only one parameter that needs to be estimated. Further, y is the placeholder for the data. Since the log-likelihood function requires knowledge of the sample size, we obtain this using n<-length(y) (note that nrow(y) would be NULL if y is a plain vector rather than a one-column matrix). The expression for logl contains the kernel of the log-likelihood function. Finally, we ask R to return -1 times the log-likelihood function.
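  As a quick check of this function (a minimal sketch; the simulated data y.sim and the rate of 4 are only illustrative and not part of the original example), the estimate returned by optim should sit very close to the sample mean, which is the analytic MLE of a Poisson rate:

y.sim <- rpois(200, lambda = 4)   # illustrative Poisson draws
fit <- optim(1, poisson.lik, y = y.sim, method = "BFGS")
fit$par       # numerical MLE of mu
mean(y.sim)   # analytic MLE of mu; the two should nearly coincide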

  

  Example 2: Imagine that we have a sample that was drawn from a normal distribution with unknown mean, µ, and variance, σ². The objective is to estimate these parameters. The normal log-likelihood function is given by

  l = −.5 n ln(2π) − .5 n ln(σ²) − (1/(2σ²)) Σ_i (y_i − µ)²

We can program this function in the following way:

normal.lik1<-function(theta,y){
  mu<-theta[1]
  sigma2<-theta[2]
  n<-length(y)
  logl<- -.5*n*log(2*pi) -.5*n*log(sigma2) - (1/(2*sigma2))*sum((y-mu)**2)
  return(-logl)
}

  

Here theta is a vector containing the two parameters of interest. We declare the elements of this vector in the first two lines of the bracketed part of the program. Specifically, the first element (theta[1]) is equal to µ, while the second element (theta[2]) is equal to σ². The remainder of the program sets the sample size, specifies the log-likelihood function, and asks R to return the negative of this function.

  Note that the normal log-likelihood function may also be written as

  l = −n ln(σ) + Σ_i ln[φ(z_i)]

where z_i = (y_i − µ)/σ. This can be programmed using

normal.lik2<-function(theta,y){
  mu<-theta[1]
  sigma<-theta[2]
  n<-length(y)
  z<-(y-mu)/sigma
  logl<- -n*log(sigma) + sum(log(dnorm(z)))
  return(-logl)
}

where dnorm is R's standard normal density function. Here we estimate σ rather than σ², but it is easy to move back and forth between these parameterizations.
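  As a small check of normal.lik1 (again only a sketch; the simulated sample below is made up), the fitted values should be close to the sample mean and the biased sample variance, which are the analytic MLEs. optim may print warnings if it briefly tries a negative variance during the search, but it should settle near these values:

y.sim <- rnorm(500, mean = 5, sd = 2)   # illustrative data
fit <- optim(c(0, 1), normal.lik1, y = y.sim, method = "BFGS")
fit$par                                          # numerical MLEs of mu and sigma^2
c(mean(y.sim), mean((y.sim - mean(y.sim))^2))    # analytic MLEs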

  2.2 Optimizing the Log-Likelihood

  

Once the log-likelihood function has been declared, the optim command can be invoked. The minimal specification of this command is

optim(starting values, log-likelihood, data)

Here starting values is a vector of starting values, log-likelihood is the name of the log-likelihood function that you seek to maximize, and data declares the data for the estimation. This specification causes R to use the Nelder-Mead algorithm. If you want to use the BFGS algorithm you should include the method="BFGS" option. For the L-BFGS-B algorithm you should declare method="L-BFGS-B". The current specification does not produce standard errors. A procedure for obtaining standard errors will be discussed later in this report. (There are many other options for the optim command. For a detailed description see, for example, http://jsekhon.fas.harvard.edu/stats/html/optim.html.)

  

Example 3: Imagine that we have a vector data that consists of draws from

a Poisson distribution with unknown µ. We seek to estimate this parameter and

have already declared the log-likelihood function as poisson.lik. Estimation

using the BFGS algorithm now commences as follows:

optim(1,poisson.lik,y=data,method="BFGS")

Here 1 is the starting value for the algorithm. Since the log-likelihood function

refers to generic data objects as y, it is important that the vector data is equated

with y.

  Example 4: Given a vector of data, y, the parameters of the normal distribution can be estimated using

optim(c(0,1),normal.lik1,y=y,method="BFGS")

This is similar to Example 3 with the exception of the starting values. Since the normal distribution contains two parameters, two starting values need to be declared. Here we set the starting value for the estimate of µ to 0 and the starting value for the estimate of σ² to 1. These two values are “bundled” using the c or concatenation operator.

  3 Output

The optim specifications discussed so far will produce several pieces of output. These come under various headings (a short sketch showing how to access them follows the list):

1. $par: This shows the MLEs of the parameters.

2. $value: This shows the value of the log-likelihood function at the MLEs. If you asked R to return -1 times the log-likelihood function, then this is the value reported here.

3. $counts: A vector that reports the number of calls to the log-likelihood function and the gradient.

4. $convergence: A value of 0 indicates normal convergence. If you see a 1 reported, this means that the iteration limit was exceeded. This limit is set to 10000 by default.

5. $message: This shows warnings of any problems that occurred during optimization. Ideally, one would like to see NULL here, since this indicates that there are no warnings.
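  For concreteness, the following minimal sketch shows where these pieces live once the result of optim is stored in an object (the object name fit and the simulated Poisson data are only for illustration):

y.sim <- rpois(50, 3)
fit <- optim(1, poisson.lik, y = y.sim, method = "BFGS")
fit$par           # MLE of mu
fit$value         # -1 times the maximized log-likelihood, as programmed above
fit$counts        # calls to the log-likelihood function and to the gradient
fit$convergence   # 0 means normal convergence
fit$message       # NULL when no problems occurred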

  4 Obtaining Standard Errors

The optim command allows one to compute standard errors based on the observed Fisher information matrix. (Unlike Stata, standard errors, test statistics, and confidence intervals are not computed by default in the optim command.) This requires that we obtain the Hessian matrix, which can be done by adding hessian=T or hessian=TRUE to the command. Since we will have to perform operations on the Hessian, it is also important that we store the results from the estimation into an object. The following linear regression example illustrates how to do this.

  

  Example 5: Imagine that we are interested in estimating a simple linear regression for some simulated data. First, we create the data matrix for the predictors:

X<-cbind(1,runif(100))

Here we draw 100 observations from a uniform distribution with limits 0 and 1. These data are bound together with the constant (1). Next, we postulate a set of values for the true parameters:

theta.true<-c(2,3,1)

Here, the first element is β_1, the second element is β_2, and the last element is σ². We can now create the dependent variable:

y<-X%*%theta.true[1:2] + rnorm(100)

where rnorm(100) generates the disturbance by drawing 100 values from the standard normal distribution. We now have the data on the dependent variable and predictor.

  

The next step is to declare the log-likelihood function. The following syntax shows one way to do this.

ols.lf<-function(theta,y,X){
  n<-nrow(X)
  k<-ncol(X)
  beta<-theta[1:k]
  sigma2<-theta[k+1]
  e<-y-X%*%beta
  logl<- -.5*n*log(2*pi)-.5*n*log(sigma2)-((t(e)%*%e)/(2*sigma2))
  return(-logl)
}

Here theta contains both the elements of β and σ². The program declares the first k elements of theta to be β and the (k + 1)st element to be σ². The vector e contains the residuals, and t(e)%*%e in the log-likelihood function causes R to compute the sum of squared residuals.

We can now start the optimization of the log-likelihood function and store the

results in an object named p (any other name would have worked just as well):

p<-optim(c(1,1,1),ols.lf,method="BFGS",hessian=T,y=y,X=X)

where c(1,1,1) sets the starting values for the estimates of β_1, β_2, and σ² equal to 1. We can now invert the Hessian to obtain the observed Fisher information matrix. (The observed Fisher information is equal to (−H)^(−1). The reason that we do not have to multiply the Hessian by -1 is that all of the evaluation has been done in terms of -1 times the log-likelihood. This means that the Hessian that is produced by optim is already multiplied by -1.) This Hessian is stored as p$hessian and it can be inverted using

OI<-solve(p$hessian)

The square roots of the diagonal elements are then the standard errors, corresponding to the estimates of β_1, β_2, and σ², respectively. These can be obtained by typing

se<-sqrt(diag(OI))
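  As a cross-check (a sketch, not part of the original example), the maximum likelihood estimates of the regression coefficients should be very close to the OLS estimates produced by lm(), and the ML estimate of σ² should be close to the residual sum of squares divided by n rather than by n − k:

ols <- lm(y ~ X - 1)         # X already contains the constant column
coef(ols)                    # compare with p$par[1:2]
sum(resid(ols)^2)/nrow(X)    # compare with p$par[3]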

5 Test Statistics and Output Control

  

With the standard errors in hand, Wald test statistics and their associated p-values can be computed. The following syntax will accomplish this task for the regression model of Example 5.

  Example 5 Cont’d: The Wald test statistic is given by the ratio of the estimates and their standard errors. The associated p-value can be computed by referring to a Student’s t-distribution with degrees of freedom equal to the number of rows minus the number of columns in X.

t<-p$par/se
pval<-2*(1-pt(abs(t),nrow(X)-ncol(X)))
results<-cbind(p$par,se,t,pval)
colnames(results)<-c("b","se","t","p")
rownames(results)<-c("Const","X1","Sigma2")
print(results,digits=3)

  

The first line generates the test statistics, the second line computes the associated p-values, while the third line brings together the estimates, estimated standard errors, test statistics, and p-values. The fourth line creates a set of column headers for the output, while the fifth line creates a set of row headers. Finally, the last line causes R to print the results to the screen, with a precision of 3 digits.

Example of MLE Computations, using R

  First of all, do you really need R to compute the MLE? Please note that the MLE in many cases has an explicit formula. Second of all, for some common distributions, even though there is no explicit formula, there are standard (existing) routines that can compute the MLE. Examples of this category include the Weibull distribution with both scale and shape parameters, logistic regression, etc. If you still cannot find anything usable, then the following notes may be useful.

  We start with a simple example so that we can cross-check the result. Suppose the observations X_1, X_2, ..., X_n are from a N(µ, σ²) distribution (2 parameters: µ and σ²).

  The log likelihood function is

  Σ_i [ −(X_i − µ)²/(2σ²) − 1/2 log 2π − 1/2 log σ² + log dX_i ]

(actually we do not have to keep the terms −1/2 log 2π and log dX_i, since they are constants).

  In R software we first store the data in a vector called xvec:

xvec <- c(2,5,3,7,-3,-2,0)    # or some other numbers

then define a function (which is the negative of the log lik, with the constant terms omitted):

fn <- function(theta) {
  sum ( 0.5*(xvec - theta[1])^2/theta[2] + 0.5* log(theta[2]) )
}

where there are two parameters: theta[1] and theta[2]. They are components of a vector theta; note that theta[2] plays the role of σ². Then we try to find the max (actually the min of the negative log lik):

nlm(fn, theta <- c(0,1), hessian=TRUE)

or

optim(theta <- c(0,1), fn, hessian=TRUE)

  You may need to try several starting values (here we used c(0,1)) for the theta. ( i.e. theta[1]=0, theta[2]=1. ) Actual R output session:

> xvec <- c(2,5,3,7,-3,-2,0)    # you may try other values
> fn                            # I have pre-defined fn
function(theta) {
  sum( 0.5*(xvec-theta[1])^2/theta[2] + 0.5* log(theta[2]) )
}
> nlm(fn, theta <- c(0,2), hessian=TRUE)    # minimization
$minimum
[1] 12.00132
$estimate
[1] 1.714284 11.346933
$gradient
[1] -3.709628e-07 -5.166134e-09
$hessian
              [,1]          [,2]
[1,]  6.169069e-01 -4.566031e-06
[2,] -4.566031e-06  2.717301e-02
$code
[1] 1
$iterations
[1] 12
> mean(xvec)
[1] 1.714286                    # this checks out with estimate[1]
> sum( (xvec - mean(xvec))^2 )/7
[1] 11.34694                    # this also checks out w/ estimate[2]
> output1 <- nlm(fn, theta <- c(2,10), hessian=TRUE)
> solve(output1$hessian)        # to compute the inverse of hessian,
                                # which is the approx. var-cor matrix
             [,1]         [,2]
[1,] 1.6209919201 3.028906e-04
[2,] 0.0003028906 3.680137e+01
> sqrt( diag(solve(output1$hessian)) )
[1] 1.273182 6.066413
> 11.34694/7
[1] 1.620991
> sqrt(11.34694/7)
[1] 1.273182                    # st. dev. of mean checks out
> optim( theta <- c(2,9), fn, hessian=TRUE)   # minimization, diff R function
$par
[1] 1.713956 11.347966
$value
[1] 12.00132
$counts
function gradient
      45       NA
$convergence
[1] 0
$message
NULL
$hessian
             [,1]         [,2]
[1,] 6.168506e-01 1.793543e-05
[2,] 1.793543e-05 2.717398e-02

  Comment: We know long ago that the variance of x̄ can be estimated by s²/n (or replace s² by the MLE of σ²) (maybe even this is news to you? then you need to review some basic stat). But how many of you know (or remember) the variance/standard deviation of the MLE of σ² (or s²)? (By the above calculation we know its standard deviation is approx. equal to 6.066413.) How about the covariance between x̄ and v? Here it is approx. 0.0003028 (very small). Theory says they are independent, so the true covariance should equal 0.

  

Example of inverting the (Wilks) likelihood ratio test to get a confidence interval

  Suppose independent observations X_1, X_2, ..., X_n are from a N(µ, σ²) distribution (one parameter: σ). µ is assumed known, for example µ = 2.

  The log likelihood function is

  Σ_i [ −(X_i − µ)²/(2σ²) − 1/2 log 2π − 1/2 log σ² + log dX_i ]

We know the log likelihood function is maximized when

  σ = sqrt( Σ_i (x_i − µ)² / n )

This is the MLE of σ.

  The Wilks statistic is

  −2 log( max_{H0} Lik / max Lik ) = 2[ log max Lik − log max_{H0} Lik ]

  In R software we first store the data in a vector called xvec:

xvec <- c(2,5,3,7,-3,-2,0)    # or some other numbers

then define a function (which is the negative of the log lik) (and omit some constants):

fn <- function(theta) {
  sum ( 0.5*(xvec - theta[1])^2/theta[2] + 0.5* log(theta[2]) )
}

  In R we can compute the Wilks statistic for testing H0: σ = 1.5 vs Ha: σ ≠ 1.5 as follows: assuming we know µ = 2, the MLE of σ is

mleSigma <- sqrt( sum( (xvec - 2)^2 ) /length(xvec))

The Wilks statistic is

WilksStat <- 2*( fn(c(2,1.5^2)) - fn(c(2,mleSigma^2)) )

The actual R session:

> xvec <- c(2,5,3,7,-3,-2,0)
> fn
function(theta) {
  sum ( 0.5*(xvec-theta[1])^2/theta[2] + 0.5* log(theta[2]) )
}
> mleSigma <- sqrt((sum((xvec - 2)^2))/length(xvec))
> mleSigma
[1] 3.380617
> 2*( fn(c(2,1.5^2)) - fn(c(2,mleSigma^2)) )
[1] 17.17925

  This is much larger than 3.84 (the 5% critical value of a chi-square distribution with one degree of freedom), so we should reject the hypothesis of σ = 1.5. After some trial and error we find

> 2*( fn(c(2,2.1635^2)) - fn(c(2,mleSigma^2)) )
[1] 3.842709
> 2*( fn(c(2,6.37^2)) - fn(c(2,mleSigma^2)) )
[1] 3.841142
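  Instead of trial and error, the two values of σ at which the Wilks statistic hits the 5% cut-off can also be found numerically with uniroot; this is only a sketch, reusing xvec, fn, and mleSigma from above:

wilks.gap <- function(s) 2*( fn(c(2, s^2)) - fn(c(2, mleSigma^2)) ) - qchisq(0.95, df=1)
uniroot(wilks.gap, c(0.5, mleSigma))$root   # lower endpoint, roughly 2.16
uniroot(wilks.gap, c(mleSigma, 20))$root    # upper endpoint, roughly 6.37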

  So the 95% confidence interval for σ is (approximately) [2.1635, 6.37]. We also see that the 95% confidence interval for σ² is [2.1635², 6.37²], a sort of invariance property (for the confidence interval). We point out that confidence intervals from the Wald construction do not have this invariance property. The Wald 95% confidence interval for σ is (using the formula we derived in the midterm exam)

3.380617 +- 1.96*3.380617/sqrt(2*length(xvec)) = [1.609742, 5.151492]

The Wald 95% confidence interval for σ² is (homework)

(3.380617)^2 +- 1.96* ...

  Define a function (the log lik of the multinomial distribution):

> loglik <- function(x, p) { sum( x * log(p) ) }

Here x is a vector of observations (integers) and p is a vector of probability proportions (adding up to one).

  We know the MLE of p is just x/N, where N is the total number of trials = Σ x_i. Therefore the −2[log lik(H0) − log lik(H0 + Ha)] is

> -2*(loglik(c(3,5,8), c(0.2,0.3,0.5))-loglik(c(3,5,8),c(3/16,5/16,8/16)))
[1] 0.02098882

This is not significant (not larger than 5.99). The cut-off values are obtained as follows:

> qchisq(0.95, df=1)
[1] 3.841459
> qchisq(0.95, df=2)
[1] 5.991465
> -2*(loglik(c(3,5,8),c(0.1,0.8,0.1))-loglik(c(3,5,8),c(3/16,5/16,8/16)))
[1] 20.12259

This is significant, since it is larger than 5.99.

  Now use Pearson’s chi square:

> chisq.test(x=c(3,5,8), p= c(0.2,0.3,0.5))

        Chi-squared test for given probabilities
data:  c(3, 5, 8)
X-squared = 0.0208, df = 2, p-value = 0.9896

Warning message:
Chi-squared approximation may be incorrect in: chisq.test(x = c(3, 5, 8), p = c(0.2, 0.3, 0.5))

> chisq.test(x=c(3,5,8), p= c(0.1,0.8,0.1))

        Chi-squared test for given probabilities
data:  c(3, 5, 8)
X-squared = 31.5781, df = 2, p-value = 1.390e-07

Warning message:
Chi-squared approximation may be incorrect in: chisq.test(x = c(3, 5, 8), p = c(0.1, 0.8, 0.1))
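  For comparison, the p-values of the two likelihood ratio statistics can also be read off the chi-square distribution with 2 degrees of freedom directly (a small sketch using only numbers already computed above):

1 - pchisq(0.02098882, df=2)   # close to the Pearson p-value of 0.9896
1 - pchisq(20.12259, df=2)     # very small, so the hypothesized p is rejected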

  1 t-test and approximate Wilks test

  Use the same function we defined before, but now we always plug in the MLE for the (nuisance parameter) σ². As for the mean µ, we plug in the MLE in one place and plug in the value specified in H0 in the other (the numerator).

> xvec <- c(2,5,3,7,-3,-2,0)
> t.test(xvec, mu=1.2)

        One Sample t-test
data:  xvec
t = 0.374, df = 6, p-value = 0.7213
alternative hypothesis: true mean is not equal to 1.2
95 percent confidence interval:
 -1.650691  5.079262
sample estimates:
mean of x
 1.714286

Now use the Wilks likelihood ratio:

> mleSigma <- sqrt((sum((xvec - mean(xvec) )^2))/length(xvec))
> mleSigma2 <- sqrt((sum((xvec - 1.2)^2))/length(xvec))
> 2*( fn(c(1.2,mleSigma2^2)) - fn(c(mean(xvec),mleSigma^2)) )
[1] 0.1612929
> pchisq(0.1612929, df=1)
[1] 0.3120310

The P-value is therefore 1 − 0.312031 = 0.687969, which is close to the t-test p-value above.

1 Optimization using the optim function

  Consider a function f(x) of a vector x. Optimization problems are concerned with the task of finding x⋆ such that f(x⋆) is a local maximum (or minimum). In the case of maximization,

  x⋆ = argmax f(x)

and in the case of minimization,

  x⋆ = argmin f(x)

Most statistical estimation problems are optimization problems. For example, if f is the likelihood function and x is a vector of parameter values, then x⋆ is the maximum likelihood estimator (MLE), which has many nice theoretical properties. When f is the posterior distribution function, then x⋆ is a popular Bayes estimator. Other well-known estimators, such as the least squares estimator in linear regression, are optima of particular objective functions.

  We will focus on using the built-in R function optim to solve minimization problems, so if you want to maximize you must supply the function multiplied by -1. The default method for optim is a derivative-free optimization routine called the Nelder-Mead simplex algorithm. The basic syntax is

optim(init, f)

where init is a vector of initial values you must specify and f is the objective function. There are many optional arguments; see the help file for details. If you have also calculated the derivative and stored it in a function df, then the syntax is

optim(init, f, df, method="CG")

There are many choices for method, but CG is probably the best. In many cases the derivative calculation itself is difficult, so the default choice will be preferred (and will be used on the homework). With some functions, particularly functions with many minimums, the initial values have a great impact on the converged point.

1.2 One dimensional examples

  Example 1: Suppose f(x) = e^(−(x−2)²). The derivative of this function is f′(x) = −2(x − 2)e^(−(x−2)²).

[Figure 1: plot of f(x) = sin(x*cos(x)) for x ∈ (0, 10).]

# we supply negative f, since we want to maximize.
f <- function(x) -exp(-( (x-2)^2 ))

######### without derivative
# I am using 1 as the initial value
# $par extracts only the argmax and nothing else
optim(1, f)$par

######### with derivative
df <- function(x) -2*(x-2)*f(x)
optim(1, f, df, method="CG")$par

Notice the derivative-free method appears more sensitive to the starting value and gives a warning message. But the converged point looks about right when the starting value is reasonable.

  Example 2: Suppose f(x) = sin(x cos(x)). This function has many local optima; see Figure 1. Let’s see how optim finds minimums of this function, which appear to occur around 2.1, 4.1, 5.8, and several others.

f <- function(x) sin(x*cos(x))
optim(2, f)$par
optim(4, f)$par
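  Because sin(x cos(x)) has so many local minima, it can be instructive to run optim from a small grid of starting values and compare the converged points. This is only a sketch; method="BFGS" is used simply to avoid the warning that one-dimensional Nelder-Mead is unreliable:

f <- function(x) sin(x*cos(x))
starts <- c(1, 2, 4, 6, 8)
sapply(starts, function(s) optim(s, f, method="BFGS")$par)   # one converged point per start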

[Figure 2: Rosenbrock function for x ∈ (−2, 2) and y ∈ (−1, 3).]

1.3 Two dimensional examples

  Example 3: Let f(x, y) = (1 − x)² + 100(y − x²)², which is called the Rosenbrock function. Let’s plot the function.

f <- function(x1,y1) (1-x1)^2 + 100*(y1 - x1^2)^2
x <- seq(-2,2,by=.15)
y <- seq(-1,3,by=.15)
z <- outer(x,y,f)
persp(x,y,z,phi=45,theta=-45,col="yellow",shade=.00000001,ticktype="detailed")

This function is strictly positive, but is 0 when y = x² and x = 1, so (1, 1) is a minimum. Let’s see if optim can figure this out. When using optim for multidimensional optimization, the input in your function definition must be a single vector.

f <- function(x) (1-x[1])^2 + 100*(x[2]-x[1]^2)^2
# starting values must be a vector now
optim( c(0,0), f )$par
[1] 0.9999564 0.9999085
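  Since the introduction noted that a derivative can be supplied, here is a sketch for the Rosenbrock function with its analytic gradient (the gradient below is ordinary calculus on f; method="BFGS" also accepts a gradient, and method="CG" could be used instead as in the earlier template):

f  <- function(x) (1-x[1])^2 + 100*(x[2]-x[1]^2)^2
df <- function(x) c( -2*(1-x[1]) - 400*x[1]*(x[2]-x[1]^2),   # derivative with respect to x[1]
                      200*(x[2]-x[1]^2) )                    # derivative with respect to x[2]
optim(c(0,0), f, df, method="BFGS")$par   # should again land near (1, 1)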

  Example 4: Let f(x, y) = (x² + y − 11)² + (x + y² − 7)², which is called Himmelblau’s function. The function is plotted (from below) in Figure 3.

[Figure 3: Himmelblau function for x ∈ (−4, 4) and y ∈ (−4, 4).]

  There appear to be four “bumps” that look like minimums in the realm of (-4,-4), (2,-2), (2,2) and (-4,4). Again this function is strictly positive, so the function is minimized when x² + y − 11 = 0 and x + y² − 7 = 0.

f <- function(x) (x[1]^2 + x[2] - 11)^2 + (x[1] + x[2]^2 - 7)^2
optim(c(-4,-4), f)$par
[1] -3.779347 -3.283172
optim(c(2,-2), f)$par
[1] 3.584370 -1.848105
optim(c(2,2), f)$par
[1] 3.000014 2.000032
optim(c(-4,4),f)$par
[1] -2.805129 3.131435

which are indeed the true minimums. This can be checked by seeing that these inputs correspond to function values that are about 0.

  2 Using optim to fit a probit regression model

  Suppose we observe a binary variable y_i ∈ {0, 1} and d covariates x_{1,i}, ..., x_{d,i} on n units. We assume that y_i ∼ Bernoulli(p_i), where

  p_i = Φ(β_0 + β_1 x_{1,i} + ... + β_d x_{d,i})    (1)

where Φ is the standard normal CDF. This is called the probit regression model and is considered an alternative to logistic regression. Notice this is basically the same as logistic regression except the link function is Φ^(−1)(p) instead of log(p/(1 − p)). For notational simplicity, define

  X_i = (1, x_{1,i}, ..., x_{d,i})

that is, the row vector of covariates for subject i. Also define

  β = (β_0, β_1, ..., β_d)′

the column vector of regression coefficients. Then (1) can be re-written as

  p_i = Φ(X_i β)    (2)

Instead of doing maximum likelihood estimation, we will place a multivariate normal prior on β. That is, β ∼ N(0, c I_{d+1}), where I_{d+1} is the (d + 1)-dimensional identity matrix and c = 10. This means that we are assuming that the β's are each independent N(0, 10) random variables. The goal of this exercise is to write a program that determines the estimated coefficients β̂ for this model by maximizing the posterior density function. This estimator is called the posterior mode or the maximum a posteriori (MAP) estimator. Once we finish writing this program we will fit the probit regression model to estimate the outcome of the 1988 presidential election based on the ctools dataset.

  Step 1: Determine the log-likelihood for β

  We know that y_i ∼ Bernoulli(p_i). This means

  p(y_i) = p_i^(y_i) (1 − p_i)^(1−y_i)

i.e. p(1) = p_i and, similarly, p(0) = 1 − p_i. Since we know that p_i has the structure defined by (2), this becomes

  p(y_i | β) = Φ(X_i β)^(y_i) (1 − Φ(X_i β))^(1−y_i)

which is the likelihood for subject i. So, the log-likelihood for subject i is

  ℓ_i(β) = log(p(y_i | β)) = y_i log(Φ(X_i β)) + (1 − y_i) log(1 − Φ(X_i β))
         = log(1 − Φ(X_i β)) + y_i log( Φ(X_i β) / (1 − Φ(X_i β)) )

By independence, the log-likelihood for the entire sample is

  L(β) = Σ_{i=1}^n ℓ_i(β) = Σ_{i=1}^n [ log(1 − Φ(X_i β)) + y_i log( Φ(X_i β) / (1 − Φ(X_i β)) ) ]

  Step 2: Determine the log-posterior

  Remember the posterior density is p(β|y), and that

  p(β|y) ∝ exp(L(β)) · p(β)
           (likelihood)   (prior)

therefore the log-posterior is

  log(p(β|y)) = const + L(β) + log(p(β))

We want to optimize log(p(β|y)). The constant term does not affect the location of the maximum, so we need only maximize L(β) + log(p(β)). Recall that β_0, ..., β_d are assumed iid N(0, 10), so for each j = 0, ..., d,

  log(p(β_j)) = const − β_j²/20

Again, dropping the constant, and using the fact that the β's are independent,

  log(p(β)) = Σ_{j=0}^d (−β_j²/20) = −(1/20) Σ_{j=0}^d β_j²

Finally the log-posterior is

  log(p(β|y)) = L(β) + log(p(β))
              = Σ_{i=1}^n [ log(1 − Φ(X_i β)) + y_i log( Φ(X_i β) / (1 − Φ(X_i β)) ) ] − (1/20) Σ_{j=0}^d β_j²

  Step 3: Write a generic function that minimizes the negative log posterior

  The following R code does this.

# Y is the binary response data.
# X is the covariate data; each row is the covariate data
# for a single subject. If an intercept is desired, there
# should be a column of 1's in X.
# V is the prior variance (10 by default);
# when V = Inf, this is maximum likelihood estimation.
posterior.mode <- function(Y, X, V=10)
{
  # sample size
  n <- length(Y)
  # number of betas
  d <- ncol(X)
  # the log-likelihood contribution of subject i
  log.like_one <- function(i, beta)
  {
    # p_i, given X_i and beta
    Phi.Xb <- pnorm( sum(X[i,] * beta) )
    Y[i]*log(Phi.Xb) + (1 - Y[i])*log(1 - Phi.Xb)
  }
  # the log-likelihood of the entire sample as a function of the vector beta
  loglike <- function(beta)
  {
    L <- 0
    for(ii in 1:n) L <- L + log.like_one(ii, beta)
    return(L)
  }
  # *negative* log posterior of the entire sample
  log.posterior <- function(beta) -loglike(beta) + (1/(2*V))*sum(beta^2)
  # return the beta which optimizes the log posterior.
  # initial values are arbitrarily chosen to be all 0's.
  return( optim( rep(0,d), log.posterior)$par )
}

We will generate some fake data such that

  P(Y_i = 1) = Φ(.43 − 1.7 x_{i,1} + 2.8 x_{i,2})

to demonstrate that this works.

# The covariates are standard normal
X <- cbind( rep(1, 20), rnorm(20), rnorm(20) )
Y <- rep(0,20)
Beta <- c(.43, -1.7, 2.8)
for(i in 1:20)
{
  # p_i
  pp <- pnorm( sum(Beta * X[i,]) )
  if( runif(1) < pp ) Y[i] <- 1
}

# fit the model with prior variance equal to 10
posterior.mode(Y,X)
[1] 0.8812282 -2.0398078 2.4052765
# very small prior variance
posterior.mode(Y,X,.1)
[1] 0.07609242 -0.47507906 0.44481367
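  As a rough sanity check (a sketch only), the V = Inf case of posterior.mode corresponds to maximum likelihood, so its output on the simulated data above should be comparable to R's built-in probit fit from glm; with such a small simulated sample, glm may warn about fitted probabilities near 0 or 1:

posterior.mode(Y, X, V = Inf)                             # maximum likelihood via our function
glm(Y ~ X - 1, family = binomial(link = "probit"))$coef   # built-in probit fit; X already has the intercept column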

  We can see that when the prior variance is smaller, the coefficients are shrunken more toward 0. When the prior variance is very large, we get the smallest amount of shrinkage.

  Step 4: 1988 elections data analysis

  We have data from 8 surveys where the responses are '1' if the person voted for George Bush and '0' otherwise. We have demographic predictors: age, education, female, and black. These predictors are categorical with 4, 4, 2, and 2 levels, respectively, so we effectively have 12 predictors (viewing the indicator of each level as a distinct predictor). For each of the 8 polls, we will use our function to estimate β, and therefore the predicted probabilities for each “class” of individuals, a class being defined as a particular configuration of the predictor variables. Call this quantity p_c:

  p̂_c = Φ(X_c β)

Since not all classes are equally prevalent in the population, we obtain N_c, the number of people in each class, from the 1988 census data and calculate

  p̂_bush = ( Σ_{c=1}^K p_c N_c ) / ( Σ_{c=1}^K N_c )

which is a weighted average and is a reasonable estimate of the proportion of the population that actually voted for George Bush. First we read in the data:

library('foreign')
A <- read.dta('polls.dta')
D <- read.dta('census88.dta')

Since we will need it for each poll, we might as well get N_c for each class first.

There are 4 · 4 · 2 · 2 = 64 different classes, comprised of all possible combinations of {1, 2, 3, 4} (age) × {1, 2, 3, 4} (edu) × {0, 1} (sex) × {0, 1} (race).

# storage for the class labels and total numbers
C <- matrix(0, 64, 2)
k <- 1
for(age in 1:4)
{
  for(edu in 1:4)
  {
    for(female in 0:1)
    {
      for(black in 0:1)
      {
        # which rows of D correspond to this 'class'
        w <- which( (D$edu==edu) & (D$age==age) & (D$female==female) & (D$black==black) )
        # the 'label' for this class
        class <- 1000*age + 100*edu + 10*female + black
        C[k,] <- c(class, sum(D$N[w]))
        k <- k + 1
      }
    }
  }
}

Now for each poll we estimate β and then calculate p̂_bush. There are 12 predictors, which are the indicators of a particular level of the demographic variables. For example, x_{i,1} is the indicator that subject i is in education group 1. There is no intercept, because it would not be identified.

# the labels for the 8 polls
surveys <- unique(A$survey)
# loop through the surveys
for(jj in 2:8)
{
  poll <- surveys[jj]
  # select out only the data corresponding to the jj'th poll
  data <- A[which(A$survey==poll),]
  # Each covariate is the indicator of a particular level of
  # the predictors
  X <- matrix(0, nrow(data), 12)
  X[,1] <- (data$age==1);    X[,2] <- (data$age==2);
  X[,3] <- (data$age==3);    X[,4] <- (data$age==4);
  X[,5] <- (data$ed==1);     X[,6] <- (data$ed==2);
  X[,7] <- (data$ed==3);     X[,8] <- (data$ed==4);
  X[,9] <- (data$female==0); X[,10] <- (data$female==1);
  X[,11] <- (data$black==0); X[,12] <- (data$black==1);
  # the binary response for this poll (variable name assumed here)
  Y <- data$bush
  # take out NAs
  w <- which(is.na(Y)==1)
  Y <- Y[-w]
  X <- X[-w,]
  # Parameter estimates
  B <- posterior.mode(Y,X)
  # storage for each class's predicted probability
  p <- rep(0, 64)
  for(j in 1:64)
  {
    # this class label
    class <- C[j,1]
    ### determine the demographic variables from the class
    # the first digit in class is age
    age <- floor(class/1000)
    # the second digit in class is ed
    ed <- floor( (class%%1000)/100 )
    # the third digit in class is sex
    female <- floor( (class - 1000*age - 100*ed)/10 )
    # the fourth digit in class is black
    black <- (class - 1000*age - 100*ed - 10*female)
    # x holds the predictor values for this class
    x <- rep(0, 12)
    x[age] <- 1
    x[ed + 4] <- 1
    if( female == 1 ) x[10] <- 1 else x[9] <- 1
    if( black == 1 ) x[12] <- 1 else x[11] <- 1
    # predicted probability
    p[j] <- pnorm( sum(x*B) )
  }
  # the final estimate for this poll: the weighted average of the
  # class probabilities, weighted by the census counts in C[,2]
  p.bush <- sum( p * C[,2] ) / sum( C[,2] )
  print(paste("Poll", poll, "estimates", round(p.bush, 5), "will vote for George Bush"))
}

[1] "Poll 9152 estimates 0.53621 will vote for George Bush"
[1] "Poll 9153 estimates 0.54021 will vote for George Bush"
[1] "Poll 9154 estimates 0.52811 will vote for George Bush"
[1] "Poll 9155 estimates 0.53882 will vote for George Bush"
[1] "Poll 9156a estimates 0.55756 will vote for George Bush"
[1] "Poll 9156b estimates 0.57444 will vote for George Bush"
[1] "Poll 9157 estimates 0.55449 will vote for George Bush"
[1] "Poll 9158 estimates 0.54445 will vote for George Bush"

The final results of the 1988 election had George Bush with 53.4% of the vote and Michael Dukakis with 45.6%, with the remaining 1% apparently voting for third-party candidates. Most of the polls seem to slightly overestimate the proportion who would vote for Bush.
