Session 11. Multiple Regression and Correlation Methods

Lecture 11
Regression and Correlation Methods

10/10/2017
sawilopo@yahoo.com

Biostatistics I: 2017-18

1

Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health

Learning Objectives
1. Describe the Linear Regression Model
2. State the Regression Modeling Steps
3. Explain Ordinary Least Squares
4. Compute Regression Coefficients
5. Understand and check model assumptions
6. Use of Computer Program

Purpose of Regression
• Estimation
   • Estimate the association between outcome and exposure, adjusted for other covariates.
• Prediction
   • Use an estimated model to predict the outcome given covariates in a new dataset.

Adjusting for Confounders
[Figure: adjusted and unadjusted estimates compared with the true value]
• Do not adjust:
   • Cofactor is a collider
   • Cofactor is in the causal path
• May or may not adjust:
   • Cofactor has missing values
   • Cofactor is measured with error

Workflow
• Scatterplots
• Bivariate analysis
• Regression
• Model fitting
   • Cofactors in/out
   • Interactions
• Test of assumptions
   • Independent errors
   • Linear effects
   • Constant error variance
• Influence (robustness)
• Interaction testing

Correlation vs Regression
• Deterministic vs. Statistical Relationship
• Correlation Coefficient
• Simple Linear Regression

Deterministic vs. Statistical Relationship
• Body Mass Index (BMI)
• Income (millions $) vs. bank's assets (billions $)

BMI and Height
• BMI = (body mass in kg) / (height in m)^2
• Fix body mass = 80 kg.
• Height from 1.5 to 2.0 m.
• Deterministic relationship
   • Mass, height → BMI
[Figure: BMI (axis 20 to 35) plotted against height (1.5 to 2.0 m) for a fixed body mass of 80 kg]
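
The deterministic BMI relationship can be illustrated with a few lines of Python. This is only a sketch: the function name `bmi` and the grid of heights are illustrative choices, while the fixed body mass of 80 kg and the height range come from the slide.

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Body mass index: mass in kg divided by squared height in m."""
    return mass_kg / height_m ** 2

# Fixed body mass of 80 kg, heights from 1.5 m to 2.0 m (illustrative grid).
heights = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
bmi_values = [bmi(80.0, h) for h in heights]

for h, v in zip(heights, bmi_values):
    print(f"height {h:.1f} m -> BMI {v:.1f}")
```

Given mass and height, the formula returns exactly one BMI value: there is no scatter around the curve, which is what makes the relationship deterministic.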


Income vs. Assets
• Income = a + b · Assets
• Assets: 3.4 to 49 billion $
• Income changes, even for banks with the same assets!
• Statistical relationship
[Figure: scatterplot of income (millions, axis 0 to 300) against assets (billions, axis 0 to 60)]

Description of Relationships
• A deterministic relationship is easy to describe:
   • by a formula.
• It allows for a perfect prediction:
   • body mass and height known → exact BMI
• Perfect prediction of quantities subject to a statistical relationship is not possible:
   • known assets → varying income
• But:
   • higher assets → higher income (on average)

Statistical Relationships: Examples
[Figure: four scatterplot panels. Axis labels recovered from the slide: Health Status Measure, Income, Age, Mental Health Score, Education Level, Physical Health Score. The panels are annotated "linear" and "quadratic".]

Strength and Direction of a Linear Association
• How well does a straight line fit the points on a two-dimensional scatterplot?
• Pearson's correlation coefficient (often simply called a correlation): r.
   • A measure of linear association: the stronger the association, the larger the absolute value of r.
   • Gives the "direction" of the relationship:
      • positive r → positive association: large values of one variable → large values of the other variable
      • negative r → negative association: large values of one variable → small values of the other variable

Pearson's Correlation Coefficient
• Assume n observations for a pair of random variables (Y, X): (x_1, y_1), …, (x_n, y_n)
• Then

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \; \sum_i (y_i - \bar{y})^2}} $$

Blood Glucose and Vcf
• 23 patients with type I diabetes.
• Velocity of circumferential shortening of the left ventricle (Vcf) seems to increase (linearly) with blood glucose.
• How to describe the relation?
   • It is not deterministic.

Blood Glucose and Vcf: Correlation

Subject   Glucose (mmol/l)   Vcf (%/s)
1         15.3               1.76
2         10.8               1.34
3         8.1                1.27
4         19.5               1.47
5         7.2                1.27
6         5.3                1.49
7         9.3                1.31
8         11.1               1.09
9         7.5                1.18
10        12.2               1.22
11        6.7                1.25
12        5.2                1.19
13        19.0               1.95
14        15.1               1.28
15        6.7                1.52
16        4.2                1.12
17        10.3               1.37
18        12.5               1.19
19        16.1               1.05
20        13.3               1.32
21        4.9                1.03
22        8.8                1.12
23        9.5                1.70

• mean glucose: 10.37; mean Vcf: 1.32
• (15.3-10.37)^2 + … + (9.5-10.37)^2 = 429.7
• (1.76-1.32)^2 + … + (1.70-1.32)^2 = 1.19
• (15.3-10.37)(1.76-1.32) + … + (9.5-10.37)(1.70-1.32) = 9.43

$$ r = \frac{9.43}{\sqrt{429.7 \cdot 1.19}} = 0.417 $$
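
The hand calculation can be checked with a short Python sketch. The data are the 23 glucose and Vcf values from the slide; the function name `pearson_r` is an illustrative choice, not something from the lecture.

```python
from math import sqrt

# Blood glucose (mmol/l) and Vcf (%/s) for the 23 patients on the slide.
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

def pearson_r(x, y):
    """Pearson's correlation coefficient from the sums of squares and products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)

r = pearson_r(glucose, vcf)
print(f"r = {r:.3f}")
```

The result should agree with the slide's r = 0.417 up to rounding of the intermediate sums.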

Correlation Coefficient: Special Values
• Perfect positive association when r = +1.
• Perfect negative association when r = -1.
• No linear association (there can still be a non-linear one!), or a linear association along a horizontal line, when r = 0.
• NOTE: r has to be in [-1, +1].

Correlation Coefficients
[Figure: four scatterplots of y against x (both axes 40 to 70), n = 100 each, with r = -0.5, r = -0.9, r = 0.0, and r = 0.9]

Significance Test for Pearson's Correlation Coefficient
• The computed value of r will usually be different from 0 due to sampling variability.
• One may want to test the null hypothesis that the true value of the correlation coefficient is 0.

$$ T = r \sqrt{\frac{n-2}{1-r^2}} $$

• If the two variables are normally distributed, under the null hypothesis, T should have Student's t distribution with n-2 degrees of freedom.

Blood Glucose and Vcf: The Test
• For the 23 patients (n = 23, r = 0.417):

$$ T = 0.417 \sqrt{\frac{23-2}{1-0.417^2}} = 2.10 $$

• p = P(|t_21| ≥ 2.10) = 0.048 < 0.05.
• We can reject the null hypothesis that the true value of the correlation coefficient is 0.
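
The test statistic and its two-sided p-value can be reproduced with the standard library alone. This is a sketch under stated assumptions: the helper names `t_pdf` and `two_sided_p` are hypothetical, and the p-value is obtained by numerically integrating the Student t density with Simpson's rule rather than by calling a statistics package.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_obs, df, steps=2000):
    """P(|T| >= |t_obs|), using Simpson's rule on [0, |t_obs|] (steps must be even)."""
    upper = abs(t_obs)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3      # area between 0 and |t_obs|
    return 1 - 2 * central_half       # remove the symmetric central area

r, n = 0.417, 23                       # values from the slide
t_stat = r * sqrt((n - 2) / (1 - r ** 2))
p_value = two_sided_p(t_stat, n - 2)
print(f"T = {t_stat:.2f}, p = {p_value:.3f}")
```

In practice a statistical package would supply the p-value directly; the integration is shown only to make the definition P(|t_21| ≥ T) concrete.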

Further Remarks on Pearson's Correlation Coefficient
• Reminder: the coefficient describes only a linear association.
• It is sensitive to outliers (i.e., observations that lie far from the main bulk of the data).
   • Often due to recording errors, but may be genuine values.
   • A non-parametric version, Spearman's rank correlation coefficient, exists.
• If non-zero, it does not imply a causal relationship.

A SIMPLE LINEAR REGRESSION


Relationship Between Blood Glucose and Vcf
• Individual observations of Vcf vary quite a bit, even for very similar levels of blood glucose.
• It seems, however, that a higher blood glucose level is associated with a higher average Vcf.
• How can we make this description more formal?

Simple Linear Regression: Blood Glucose & Vcf (1)
• Assume that Vcf is normally distributed: N(μ, σ²).
• Assume a linear regression model: the mean (average) value μ of Vcf changes linearly with the level of blood glucose:
   μ = α + β · (glucose level)

Linear Regression: Terminology (1)
• The dependent variable Y and the covariate (independent, explanatory variable) X.
   • In our example, Vcf is Y and blood glucose level is X.
• We assume that Y is normally distributed: N(μ_Y, σ²).
• We further postulate that, for X = x,
   μ_Y = μ_Y(x) = α + β · x
• α and β are the coefficients of the model.
   • α is called the intercept.
   • β is called the slope.

Simple Linear Regression
• The straight line describes the change in the mean of the dependent variable as a function of the covariate level.
• Individual observations of the dependent variable vary around the regression line, according to a normal distribution with mean 0 and a constant variance.

Linear Regression: Terminology (2)
• For an individual observation of Y we can write that
   Y = α + β · x + ε,
  where ε is normally distributed: N(0, σ²).
• Interpretation: an individual observation of Y can randomly deviate from the mean, which is a linear function of x.
• ε is called the residual random error (measurement error).
• Note that σ² is assumed constant for all x.
   • Homoscedasticity assumption.

Linear Regression: The Intercept
μ_Y(x) = α + β · x
• For x = 0, μ_Y(0) = α + β · 0 = α
   • α is the mean value of the dependent variable when x = 0.
   • But blood glucose level = 0 makes little sense...
• Use a "centered" covariate: μ_Y(x) = α + β · (x - x0)
   • Usually, one takes x0 = sample mean of the observed x values.
• For x = x0, μ_Y(x0) = α + β · (x0 - x0) = α + β · 0 = α
   • α is then the mean value when x = x0.
   • Easier to interpret.
   • Can help in estimating the model.
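
A small sketch of the effect of centering, using the glucose and Vcf data from the lecture. The helper `fit_line` is a hypothetical name for an ordinary least-squares fit; it is not taken from the slides.

```python
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

# Uncentered covariate: the intercept refers to glucose = 0.
a_raw, b_raw = fit_line(glucose, vcf)

# Centered covariate: subtract the sample mean x0 from every x.
x0 = sum(glucose) / len(glucose)
a_cen, b_cen = fit_line([x - x0 for x in glucose], vcf)

print(f"uncentered: a = {a_raw:.3f}, b = {b_raw:.4f}")
print(f"centered:   a = {a_cen:.3f}, b = {b_cen:.4f}")
```

Centering leaves the slope unchanged; only the intercept moves, and it becomes the estimated mean Vcf at the average glucose level.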

Linear Regression: The Slope
μ_Y(x) = α + β · x
• Consider two values of the covariate: x and x+1.
   • For x: μ_Y(x) = α + β · x
   • For (x+1): μ_Y(x+1) = α + β · (x+1) = α + β · x + β = μ_Y(x) + β
• β is the change in the mean value of the dependent variable corresponding to a unit change in the covariate.
   • β > 0: positive relationship (x increases, the mean increases).
   • β < 0: negative relationship (x increases, the mean decreases).
   • β = 0: no change, i.e., no relationship.

Linear Regression: Estimation
μ_Y(x) = α + β · x
• The equation describes a theoretical relationship.
• In practice, we know neither α nor β.
• We have to estimate them from the observed data.
   • This is often called fitting a model to data.
   • The estimated coefficients will be denoted by a and b.
• How to estimate α and β?

Estimation of the Coefficients of a Linear Regression Model
• Least squares method: select the line which minimizes the sum of squares of the differences between the observed values and the values predicted by the model (line).
• Result:
   Vcf(x) = 1.10 + 0.022 · x
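
The least squares result can be reproduced from the lecture data with the closed-form solution b = Sxy/Sxx and a = ȳ - b·x̄. The sketch below also compares the sum of squared differences at the fitted line with that of two slightly perturbed lines; the function name `sse` and the perturbation sizes are illustrative assumptions.

```python
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x = sum(glucose) / n
mean_y = sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))

b = sxy / sxx              # estimated slope
a = mean_y - b * mean_x    # estimated intercept

def sse(intercept, slope):
    """Sum of squared differences between observed and predicted Vcf."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(glucose, vcf))

print(f"Vcf(x) = {a:.2f} + {b:.3f} * x")
print(f"SSE at the fitted line:       {sse(a, b):.4f}")
print(f"SSE with intercept + 0.05:    {sse(a + 0.05, b):.4f}")
print(f"SSE with slope + 0.005:       {sse(a, b + 0.005):.4f}")
```

Any other line gives a larger sum of squares than the fitted one, which is the defining property of the least squares estimates.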

Linear Regression for Vcf & Blood Glucose
• Estimated model: Vcf(x) = 1.10 + 0.022 · x
• Interpretation: if the blood glucose level increases by 1 mmol/l, the mean value of Vcf increases by 0.022 %/s.
   • Positive association.
• Note that the estimate b of the slope is close to 0. Perhaps it differs from 0 only by chance…
• We need a CI for β.

Confidence Interval for the Slope
• CI for β: b ± t_{n-2, 1-α/2} · SE(b)
  (t_{n-2, 1-α/2} is a percentile of Student's t distribution with n-2 degrees of freedom; here α denotes the significance level, not the intercept).
• In our case, n = 23 and SE(b) = 0.0105
   • 95% CI for β: [0.022 ± 2.08 · 0.0105] = [0.0002, 0.0438]
   • 99% CI for β: [0.022 ± 2.83 · 0.0105] = [-0.0077, 0.0517]
   • For large n (≥ 100), the standard normal distribution can be used.
• The 95% CI does not include 0 → we can reject H0: β = 0.
   • But the 99% CI does.
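
The standard error of the slope and both intervals can be recomputed from the raw data. In this sketch SE(b) is taken as s / sqrt(Sxx), with s² the residual sum of squares divided by n-2; the t percentiles 2.08 and 2.83 are copied from the slide rather than computed.

```python
from math import sqrt

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x, mean_y = sum(glucose) / n, sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x

# Residual standard deviation with n-2 degrees of freedom.
rss = sum((y - (a + b * x)) ** 2 for x, y in zip(glucose, vcf))
s = sqrt(rss / (n - 2))
se_b = s / sqrt(sxx)

T95, T99 = 2.08, 2.83   # t percentiles for 21 df, as quoted on the slide
ci95 = (b - T95 * se_b, b + T95 * se_b)
ci99 = (b - T99 * se_b, b + T99 * se_b)

print(f"b = {b:.4f}, SE(b) = {se_b:.4f}")
print(f"95% CI: [{ci95[0]:.4f}, {ci95[1]:.4f}]")
print(f"99% CI: [{ci99[0]:.4f}, {ci99[1]:.4f}]")
```

Small differences from the slide's interval limits are expected, because the slide rounds b and SE(b) before computing them.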

Test of Significance for the Slope
• Alternatively, we could conduct a formal test.
• H0: β = 0    HA: β ≠ 0
• Under the null hypothesis, T = b / SE(b) should have Student's t distribution with n-2 degrees of freedom.
• For Vcf data, T = 0.022/0.0105 = 2.09.
   • p = P(|t_21| ≥ 2.09) = 0.049
   • p < 0.05 → we can reject H0 at the 5% significance level.
   • But not at the 1% level.
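
In simple linear regression the t statistic for the slope is algebraically the same as the t statistic for Pearson's r; the 2.09 versus 2.10 on the slides differ only through rounding. The sketch below checks this numerically from the summary sums quoted earlier in the lecture (429.7, 1.19, 9.43); the variable names are illustrative.

```python
from math import sqrt

# Summary statistics quoted on the correlation slide.
n = 23
sxx = 429.7   # sum of squares of glucose around its mean
syy = 1.19    # sum of squares of Vcf around its mean
sxy = 9.43    # sum of cross-products

# t statistic for the slope: T = b / SE(b).
b = sxy / sxx
rss = syy - b * sxy               # residual sum of squares
s = sqrt(rss / (n - 2))           # residual standard deviation
se_b = s / sqrt(sxx)
t_slope = b / se_b

# t statistic for the correlation coefficient.
r = sxy / sqrt(sxx * syy)
t_corr = r * sqrt((n - 2) / (1 - r ** 2))

print(f"T (slope)       = {t_slope:.4f}")
print(f"T (correlation) = {t_corr:.4f}")
```

Testing H0: β = 0 and testing that the true correlation is 0 therefore lead to the same conclusion for these data.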

Prediction of the Mean Value Based on a Linear Regression Model
• The prediction would be of interest, e.g., for a group of subjects with a particular value of x.
• Example:
   • Estimated model: Vcf(x) = 1.10 + 0.022 · x
   • Take x = 10: Vcf(10) = 1.10 + 0.022 · 10 = 1.32
• This point prediction is subject to an error, due to the estimation of the coefficients of the model.
• One should compute a CI for the predicted value.

Prediction Limits for the Mean Value
• The prediction limits get wider the further we are from the "center" of the scatterplot.
• I.e., the precision of the prediction decreases as we move further away from the mean of x.
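
The widening of the limits can be seen from the usual standard error of the estimated mean at x0, s · sqrt(1/n + (x0 - x̄)² / Sxx). This formula is not written on the slides, so treat the sketch as an assumption-labelled illustration; the function name `mean_half_width` and the comparison points 10 and 19 are illustrative, and 2.08 is the t percentile quoted earlier.

```python
from math import sqrt

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x, mean_y = sum(glucose) / n, sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x
s = sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(glucose, vcf)) / (n - 2))

T95 = 2.08  # t percentile for 21 degrees of freedom, from the slides

def mean_half_width(x0):
    """Half-width of the 95% CI for the mean of Y at covariate value x0."""
    return T95 * s * sqrt(1 / n + (x0 - mean_x) ** 2 / sxx)

for x0 in (10, 19):
    pred = a + b * x0
    hw = mean_half_width(x0)
    print(f"x = {x0}: mean Vcf = {pred:.2f}, 95% CI [{pred - hw:.2f}, {pred + hw:.2f}]")
```

The interval at x = 19, far from the mean glucose of about 10.4, is clearly wider than the interval at x = 10.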

Prediction of an Individual Observation
• One can also try to make a prediction for an individual observation of the dependent variable.
   • The prediction would be of interest for, e.g., an individual patient.
• The problem here is that the individual observation will randomly deviate from the mean.
   • A point prediction alone thus makes little sense.
   • We can compute an interval (prediction limits) for the observation.

Prediction Limits for an Individual Observation
• The prediction limits are wider than those for the mean value.
• The prediction error contains two components now:
   • the error due to the prediction of the mean value;
   • the error due to the variability (σ²) around the mean value.
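
The two error components can be made explicit. A common form, assumed here since the slides do not print it, is SE_individual² = s² + SE_mean², with SE_mean = s · sqrt(1/n + (x0 - x̄)² / Sxx). The function names below are illustrative.

```python
from math import sqrt

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x, mean_y = sum(glucose) / n, sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x
s = sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(glucose, vcf)) / (n - 2))

def se_mean(x0):
    """Standard error of the estimated mean of Y at x0."""
    return s * sqrt(1 / n + (x0 - mean_x) ** 2 / sxx)

def se_individual(x0):
    """Standard error for predicting one new observation of Y at x0."""
    return s * sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)

T95 = 2.08  # t percentile for 21 degrees of freedom, from the slides
x0 = 10
pred = a + b * x0
print(f"mean:       {pred:.2f} +/- {T95 * se_mean(x0):.2f}")
print(f"individual: {pred:.2f} +/- {T95 * se_individual(x0):.2f}")
```

The individual limits are dominated by the variability around the line (s), which does not shrink as more data are collected, whereas the uncertainty in the mean does.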

STATA OUTPUT


ASSUMPTIONS AND HOW TO CHECK THEM

Linear Regression Model: Assumptions
• The model is developed assuming that:
   • the observations of Y are collected independently;
   • the mean value of the dependent variable Y is a linear function of the covariate X;
   • for each value of α + β·X, the dependent variable is normally distributed with constant variance σ².
• These are assumptions: they need to be checked.
• If not fulfilled, you may need to consider
   • using another form of the covariate;
   • using a transformation of the dependent variable; etc.

Checking the Assumptions
• Recall, according to the model,
   Y = α + β · x + ε,
  where ε is normally distributed: N(0, σ²).
• We can estimate ε by
   e = y - (a + b · x)
   • These estimates are called residuals.
   • Σ e²/(n-2) will give an estimate of σ² (the divisor is n-2 because two coefficients, a and b, are estimated).
• If the assumptions are correct, the residuals should approximately have a normal distribution with mean 0.
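
The residuals for the Vcf model can be computed directly from the lecture data. This is a minimal sketch; the variable names are illustrative, and the variance estimate divides the residual sum of squares by n-2, the degrees of freedom used for the t tests in this lecture.

```python
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x, mean_y = sum(glucose) / n, sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x

# Residuals: observed value minus the value predicted by the fitted line.
residuals = [y - (a + b * x) for x, y in zip(glucose, vcf)]

# Estimate of the error variance with n-2 degrees of freedom.
sigma2_hat = sum(e ** 2 for e in residuals) / (n - 2)

print(f"sum of residuals:      {sum(residuals):.2e}")
print(f"estimated sigma^2:     {sigma2_hat:.4f}")
print(f"largest |residual|:    {max(abs(e) for e in residuals):.2f}")
```

With an intercept in the model the least squares residuals always sum to zero (up to rounding), so the informative checks are their spread and shape, not their mean.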

Analysis of Residuals (1)
• Plot the residuals against the observed covariate values.
• If the assumptions are met, the points should be evenly scattered around 0 for all covariate values.

Analysis of Residuals (2)
• The plot of the residuals may reveal non-constant variance (heteroscedasticity).
• It can also point towards a non-linear (w.r.t. the covariate values) relationship.

Blood Glucose & Vcf: Residuals
• The plot looks reasonable.






Checking Normality of Residuals
• For this purpose, the normal probability plot is used.
• Standardized residuals (residual/st. error) are ordered and plotted against the values expected from the standard normal distribution.
• The graph should look approximately linear.
   • One might have doubts in our example…
[Figure: normal probability plot, Normal F[(resid-m)/s] against Empirical P[i] = i/(N+1), both axes from 0.00 to 1.00]

Linear Regression for Log-Vcf
• Let us use ln(Vcf) as the dependent variable.
• The model changes to
   ln(Vcf) = α + β · (glucose level)
• The estimated model is
   ln(Vcf) = 0.115 + 0.015 · (glucose level)
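
The log-transformed fit can be reproduced by applying the same least squares formulas to ln(Vcf). This is a sketch using the lecture data; the variable names are illustrative, and the multiplicative factor exp(b) is printed only to connect the slope to the original Vcf scale.

```python
from math import exp, log

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

# Transform the dependent variable, then fit by least squares as before.
log_vcf = [log(v) for v in vcf]

n = len(glucose)
mean_x = sum(glucose) / n
mean_ly = sum(log_vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(glucose, log_vcf))

b_log = sxy / sxx
a_log = mean_ly - b_log * mean_x

print(f"ln(Vcf) = {a_log:.3f} + {b_log:.3f} * glucose")
print(f"multiplicative change in Vcf per unit of glucose: {exp(b_log):.3f}")
```

Because the model is linear on the log scale, a one-unit increase in glucose multiplies the fitted Vcf by exp(b) instead of adding a fixed amount.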

Model for Log-Vcf: Residuals (1)
• No major problems in the residual plot.
[Figure: residuals (axis -0.4 to 0.4) plotted against blood glucose level (5 to 20)]

Model for Log-Vcf: Residuals
• One might argue that the normal probability plot for the residuals looks better than for untransformed Vcf.
[Figure: normal probability plot, Normal F[(lresid-m)/s] against Empirical P[i] = i/(N+1), both axes from 0.00 to 1.00]

Interpretation of the Model for Log-Vcf
• The model implies that
   ln(Vcf) = 0.115 + 0.015 · (glucose level)
• It follows that, if blood glucose increases by 1 unit, then the mean value of ln(Vcf) increases by 0.015.
• Upon taking Vcf ≈ exp(ln(Vcf)),

$$ \mathrm{Vcf} = e^{0.115} \cdot e^{0.015 \cdot (\text{glucose level})} = e^{0.115} \cdot (1.015)^{(\text{glucose level})} $$

• We could conclude that the mean value of Vcf increases by a factor of exp(0.015) = 1.015 per 1 unit of blood glucose.

Choice of the Transformation
• Consider power transformations x^s or y^s (s = ..., -3, -2, -1, -½, 0 (= ln), ½, ...)
• The circle of powers.
   • Choose the quadrant which most closely resembles the pattern of the data.
   • Increase or decrease the power of x or y (relative to 1) according to the indications.
• Example: for Quadrant II,
take s1 for y.


Choice of the Transformation: Example
• Data resemble the pattern of Quadrant III.



We might want to use
s = 5)

• Continuous (e.g. pain scale, cognitive function): Linear regression
• Binary or categorical (e.g. fracture yes/no): Logistic regression
• Time-to-event (e.g. time to fracture): Kaplan-Meier statistics; Cox regression (Cox regression assumes proportional hazards between groups)

Continuous Outcome
Are the observations independent or correlated?
Outcome variable: Continuous (e.g. pain scale, cognitive function)

Independent observations:
• T-test: compares means between two independent groups
• ANOVA: compares means between more than two independent groups
• Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
• Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated observations:
• Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
• Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
• Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
• Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
• Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
• Kruskal-Wallis test: non-parametric alternative to ANOVA
• Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient

Binary or Categorical Outcomes (Proportions)
Are the observations correlated?
Outcome variable: Binary or categorical (e.g. fracture, yes/no)

Independent observations:
• Chi-square test: compares proportions between two or more groups
• Relative risks: odds ratios or risk ratios
• Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated observations:
• McNemar's chi-square test: compares binary outcome between correlated groups (e.g., before and after)
• Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
• GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
• Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells < 5).
• McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells < 5).

Time-to-Event Outcome (Survival Data)
Are the observation groups independent or correlated?
Outcome variable: Time-to-event (e.g., time to fracture)

Independent observations:
• Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with log-rank test
• Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated observations:
• n/a (already over time)

Modifications to Cox regression if proportional-hazards is violated:
• Time-dependent predictors or time-dependent hazard ratios (tricky!)
