Session 11. Multiple Regression and Correlation Methods

Lecture 11
Regression and
Correlation methods

10/10/2017
[email protected]

Biostatistics I: 2017-18

1

Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health

Learning Objectives
1. Describe the Linear Regression Model
2. State the Regression Modeling Steps
3. Explain Ordinary Least Squares
4. Compute Regression Coefficients
5. Understand and check model assumptions
6. Use of Computer Program


Purpose of regression
• Estimation
   - Estimate association between outcome and exposure adjusted for other covariates
• Prediction
   - Use an estimated model to predict the outcome given covariates in a new dataset



Adjusting for confounders
[Figure: true value compared with the adjusted estimate and the unadjusted estimate]
Do not adjust:
• Cofactor is a collider
• Cofactor is in causal path
May or may not adjust:
• Cofactor has missing values
• Cofactor has error


Workflow
• Scatterplots
• Bivariate analysis
• Regression
Model fitting
• Cofactors in/out
• Interactions
Test of assumptions
• Independent errors
• Linear effects
• Constant error variance
Influence (robustness)
Interaction testing


Correlation vs Regression
• Deterministic vs. Statistical Relationship
• Correlation Coefficient
• Simple Linear Regression

Deterministic vs. Statistical Relationship
• Body Mass Index (BMI)
• Income (millions $) vs bank’s assets (billions $)


BMI and Height
• BMI = (body mass in kg) / (height in m)²
• Fix body mass = 80 kg.
• Height from 1.5 to 2.0 m.
• Deterministic relationship
   - Mass, height → BMI
[Figure: BMI (about 20 to 35) plotted against height (1.5 to 2.0 m) for a body mass of 80 kg]
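Because the relationship is deterministic, it can be written as a one-line function. The following minimal Python sketch (the function name `bmi` is illustrative, not part of the lecture) evaluates it for the fixed body mass of 80 kg over the height range quoted on the slide.

```python
def bmi(mass_kg, height_m):
    """Body mass index: mass in kilograms divided by squared height in metres."""
    return mass_kg / height_m ** 2


MASS_KG = 80.0  # fixed body mass, as on the slide
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]

# Every height maps to exactly one BMI value: there is no random scatter.
for h in heights_m:
    print(f"height {h:.1f} m -> BMI {bmi(MASS_KG, h):.1f}")
```

Calling the function twice with the same inputs always returns the same value, which is what distinguishes this case from the statistical relationships discussed next.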



Income vs. Assets
• Income = a + b · Assets
• Assets 3.4 - 49 billion $
• Income changes, even for banks with the same assets!
• Statistical relationship
[Figure: scatterplot of income (millions $, 0 to 300) against assets (billions $, 0 to 60)]


Description of Relationships
• A deterministic relationship is easy to describe: by a formula.
• It allows for a perfect prediction:
   - body mass and height known → exact BMI
• Perfect prediction of quantities subject to a statistical relationship is not possible:
   - known assets → varying income
• But:
   - higher assets → higher income (on average)


Statistical Relationships: Examples
[Figure: scatterplot panels illustrating linear and quadratic statistical relationships; axis labels include Health Status Measure, Income, Age, Mental Health Score, Education Level and Physical Health Score]


Strength and Direction of a Linear Association
• How well does a straight line fit the points on a two-dimensional scatterplot?
• Pearson’s correlation coefficient (often simply called a correlation): r.
   - A measure of linear association: the stronger the association, the larger the absolute value of r.
   - Gives the “direction” of the relationship:
     • positive r → positive association
       large values of one variable → large values of the other variable
     • negative r → negative association
       large values of one variable → small values of the other variable


Pearson’s Correlation Coefficient
• Assume n observations for a pair of random variables (Y,X): (x1,y1), …, (xn,yn)
• Then
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$


Blood Glucose and Vcf
• 23 patients with type I diabetes.
• Velocity of circumferential shortening of the left ventricle (Vcf) seems to (linearly) increase with blood glucose.
• How to describe the relation?
   - It is not deterministic.


Blood Glucose and Vcf: Correlation

Subject  Glucose (mmol/l)  Vcf (%/s)
1        15.3              1.76
2        10.8              1.34
3        8.1               1.27
4        19.5              1.47
5        7.2               1.27
6        5.3               1.49
7        9.3               1.31
8        11.1              1.09
9        7.5               1.18
10       12.2              1.22
11       6.7               1.25
12       5.2               1.19
13       19.0              1.95
14       15.1              1.28
15       6.7               1.52
16       4.2               1.12
17       10.3              1.37
18       12.5              1.19
19       16.1              1.05
20       13.3              1.32
21       4.9               1.03
22       8.8               1.12
23       9.5               1.70

• mean glucose: 10.37; mean Vcf: 1.32
• (15.3-10.37)² + … + (9.5-10.37)² = 429.7
• (1.76-1.32)² + … + (1.70-1.32)² = 1.19
• (15.3-10.37)(1.76-1.32) + … + (9.5-10.37)(1.70-1.32) = 9.43
$$ r = \frac{9.43}{\sqrt{429.7 \cdot 1.19}} = 0.417 $$
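As a cross-check of the hand calculation, the following minimal Python sketch (standard library only; the function name `pearson_r` is illustrative) computes the three sums and r directly from the 23 tabulated pairs. The result should agree with the slide value of 0.417 up to rounding.

```python
from math import sqrt

# Blood glucose and Vcf for the 23 patients, in subject order.
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]


def pearson_r(x, y):
    """Pearson's r: cross-product sum divided by the root of the two squared sums."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


r = pearson_r(glucose, vcf)
print(f"r = {r:.3f}")
```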


Correlation Coefficient: Special Values
• Perfect positive association when r = +1.
• Perfect negative association when r = -1.
• No linear association (can be non-linear!), or linear association with a horizontal line, when r = 0.
• NOTE: r has to be in [-1,+1].


Correlation Coefficients
[Figure: four scatterplots of y against x (both axes 40 to 70), each with n = 100: r = -0.5, r = -0.9, r = 0.9 and r = 0.0]

Significance Test for Pearson’s Correlation Coefficient
• The computed value of r will usually be different from 0 due to sampling variability.
• One may want to test the null hypothesis that the true value of the coefficient is 0.
$$ T = r \sqrt{\frac{n-2}{1-r^2}} $$
• If the two variables are normally distributed, under the null hypothesis, T should have Student’s t distribution with n-2 degrees of freedom.


Blood Glucose and Vcf: The Test
$$ T = 0.417 \sqrt{\frac{23-2}{1-0.417^2}} = 2.10 $$
• p = P(|t_21| ≥ 2.10) = 0.048 < 0.05.
• We can reject the null hypothesis that the true value of the correlation coefficient is 0.
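The test statistic and its two-sided p-value can be reproduced without statistical tables. The sketch below is one possible approach, not the lecture's own method: it evaluates T from r = 0.417 and n = 23, then obtains the p-value by numerically integrating the Student t density with Simpson's rule. The helper names `t_pdf` and `two_sided_p` are illustrative.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return const * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """P(|t_df| >= |t_stat|), using Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and +|t|
    return 1 - central_area


r, n = 0.417, 23
T = r * sqrt((n - 2) / (1 - r ** 2))
p = two_sided_p(T, n - 2)
print(f"T = {T:.2f}, p = {p:.3f}")
```

In practice a statistics package would supply the p-value directly; the integration is spelled out here only to make the link between T and p explicit.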


Further Remarks on Pearson’s Correlation Coefficient
• Reminder: the coefficient describes only a linear association.
• It is sensitive to outliers (i.e., observations that lie far from the main bulk of the data).
   - Often due to recording errors, but may be genuine values.
   - A non-parametric version, Spearman’s rank correlation coefficient, exists.
• If non-zero, it does not imply a causal relationship.


A SIMPLE LINEAR REGRESSION


Relationship Between Blood Glucose and Vcf
• Individual observations on Vcf vary quite a bit even for very similar levels of blood glucose.
• It seems, however, that a higher blood glucose level is associated with a higher average Vcf.
• How can we make this description more formal?

Simple Linear Regression: Blood Glucose & Vcf (1)
• Assume that Vcf is normally distributed with N(μ, σ²).
• Assume a linear regression model: the mean (average) value of Vcf, μ, changes linearly with the level of blood glucose:
   μ = α + β · (glucose level)

Linear Regression: Terminology (1)
• The dependent variable Y and the covariate (independent, explanatory variable) X.
   - In our example, Vcf is Y, blood glucose level is X.
• We assume that Y is normally distributed with N(μ_Y, σ²).
• We further postulate that, for X = x,
   μ_Y = μ_Y(x) = α + β · x
• α and β are the coefficients of the model.
   - α is called the intercept.
   - β is called the slope.


Simple Linear Regression
• The straight line describes the increase in the mean of the dependent variable as a function of the covariate level.
• Individual observations for the dependent variable vary around the regression line; their deviations from the line follow a normal distribution with mean 0 and a constant variance.


Linear Regression: Terminology (2)
• For an individual observation of Y we can write that
   Y = α + β · x + ε,
   where ε is normally distributed with N(0, σ²).
• Interpretation: an individual observation of Y can randomly deviate from the mean, which is a linear function of x.
• ε is called the residual random error (measurement error).
• Note that σ² is assumed constant for all x.
   - Homoscedasticity assumption.


Linear Regression: The Intercept
μ_Y(x) = α + β · x
• For x = 0, μ_Y(0) = α + β · 0 = α
   - α is the mean value of the dependent variable when x = 0.
   - But blood glucose level = 0 makes little sense...
• Use “centered” covariate: μ_Y(x) = α + β · (x – x0)
   - Usually, one takes x0 = sample mean of observed x values.
• For x = x0, μ_Y(x0) = α + β · (x0 - x0) = α + β · 0 = α
   - α is then the mean value when x = x0.
   - Easier to interpret.
   - Can help in estimating the model.


Linear Regression: The Slope
μ_Y(x) = α + β · x
• Consider two values of the covariate: x and x+1.
   - For x: μ_Y(x) = α + β · x
   - For (x+1): μ_Y(x+1) = α + β · (x+1) = α + β · x + β = μ_Y(x) + β
• β is the change in the mean value of the dependent variable corresponding to a unit change in the covariate.
   - β > 0: positive relationship (x increases, the mean increases).
   - β < 0: negative relationship (x increases, the mean decreases).
   - β = 0: no change, i.e., no linear relationship.


Linear Regression: Estimation
μ_Y(x) = α + β · x
• The equation describes a theoretical relationship.
• In practice, we know neither α nor β.
• We have to estimate them from the observed data.
   - This is often called fitting a model to data.
   - The estimated coefficients will be denoted by a and b.
• How to estimate α and β?


Estimation of the Coefficients of a Linear Regression Model
• Least squares method: select the line which minimizes the sum of squares of the differences between the observed values and the values predicted by the model (line).
• Result:
   Vcf(x) = 1.10 + 0.022 · x
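For a single covariate, the least squares line has a closed-form solution: b is the cross-product sum divided by the sum of squared deviations of x, and a follows from the two means. The sketch below applies these standard formulas to the 23 observations (the function name `least_squares_line` is illustrative); the output should match the slide's 1.10 and 0.022 after rounding.

```python
# Blood glucose (x) and Vcf (y) for the 23 patients, in subject order.
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]


def least_squares_line(x, y):
    """Return (a, b) minimizing the sum of squared differences y - (a + b*x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope estimate
    a = mean_y - b * mean_x  # intercept estimate: the line passes through the means
    return a, b


a, b = least_squares_line(glucose, vcf)
print(f"Vcf(x) = {a:.2f} + {b:.3f} * x")
```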



Linear Regression for Vcf & Blood Glucose
• Estimated model: Vcf(x) = 1.10 + 0.022 · x
• Interpretation: if the blood glucose level increases by 1 mmol/l, the mean value of Vcf increases by 0.022 %/s.
   - Positive association.
• Note that the estimate b of the slope is close to 0. Perhaps it differs from 0 only by chance…
• We need a CI for β.


Confidence Interval for the Slope
• CI for β: $b \pm t_{n-2,\,1-\alpha/2} \cdot SE(b)$
   ($t_{n-2,\,1-\alpha/2}$ is a percentile of Student’s t distribution with n-2 degrees of freedom; here α is the significance level, not the intercept).
   - In our case, n = 23 and SE(b) = 0.0105
   - 95% CI for β: [0.022 ± 2.08·0.0105] = [0.0002, 0.0438]
   - 99% CI for β: [0.022 ± 2.83·0.0105] = [-0.0077, 0.0517]
     • For large n (≥100), the standard normal distribution can be used.
• 95% CI does not include 0 → we can reject H0: β = 0.
   - But 99% CI does.
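The slide quotes SE(b) without showing where it comes from. A minimal sketch follows, assuming the usual textbook formula SE(b) = s / sqrt(Σ(x - x̄)²), with s² the residual sum of squares divided by n-2. The t percentiles 2.08 and 2.83 are taken from the slide rather than computed, and the variable names are illustrative.

```python
from math import sqrt

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x = sum(glucose) / n
mean_y = sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))

b = sxy / sxx            # slope estimate
a = mean_y - b * mean_x  # intercept estimate

# Residual variance: squared residuals summed and divided by n - 2.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(glucose, vcf))
s2 = sse / (n - 2)
se_b = sqrt(s2 / sxx)

T_CRIT_95 = 2.08  # t percentile for 21 df, 95% CI (value quoted on the slide)
T_CRIT_99 = 2.83  # t percentile for 21 df, 99% CI (value quoted on the slide)

ci95 = (b - T_CRIT_95 * se_b, b + T_CRIT_95 * se_b)
ci99 = (b - T_CRIT_99 * se_b, b + T_CRIT_99 * se_b)
print(f"SE(b) = {se_b:.4f}")
print(f"95% CI: [{ci95[0]:.4f}, {ci95[1]:.4f}]")
print(f"99% CI: [{ci99[0]:.4f}, {ci99[1]:.4f}]")
```

Small differences from the slide's interval end points are expected, because the slide rounds b and SE(b) before multiplying.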


Test of Significance for the Slope
• Alternatively, we could conduct a formal test.
• H0: β = 0
   HA: β ≠ 0
• Under the null hypothesis, T = b / SE(b) should have Student’s t distribution with n-2 degrees of freedom.
• For Vcf data, T = 0.022/0.0105 = 2.09.
   - p = P(|t_21| ≥ 2.09) = 0.049
   - p < 0.05 → we can reject H0 at the 5% significance level.
   - But not at the 1% level.


Prediction of the Mean Value Based on a Linear Regression Model
• The prediction would be of interest, e.g., for a group of subjects with a particular value of x.
• Example:
   - Estimated model: Vcf(x) = 1.10 + 0.022 · x
   - Take x = 10: Vcf(10) = 1.10 + 0.022 · 10 = 1.32
• This point prediction is subject to an error, due to the estimation of the coefficients of the model.
• One should compute a CI for the predicted value.

Prediction Limits for the Mean Value
• The prediction limits get wider the further we are from the “center” of the scatterplot.
• I.e., precision of the prediction decreases if we move further away from the mean of x.

Prediction of an Individual Observation
• One can also try to make a prediction for an individual observation of the dependent variable.
   - The prediction would be of interest for, e.g., an individual patient.
• The problem here is that the individual observation will randomly deviate from the mean.
   - A point prediction on its own is thus of little use.
• We can compute an interval (prediction limits) for the observation.


Prediction Limits for an Individual Observation
• The prediction limits are wider than those for the mean value.
• The prediction error contains two components now:
   - the error due to the prediction of the mean value;
   - the error due to the variability (σ²) around the mean value.
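The slides show the two sets of limits only graphically. The sketch below assumes the usual textbook standard errors, s·sqrt(1/n + (x0 - x̄)²/Sxx) for the mean value and s·sqrt(1 + 1/n + (x0 - x̄)²/Sxx) for an individual observation, and reuses the 95% t percentile 2.08 quoted earlier for 21 degrees of freedom. The function name `limits` is illustrative.

```python
from math import sqrt

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x = sum(glucose) / n
mean_y = sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x
s = sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(glucose, vcf)) / (n - 2))

T_CRIT_95 = 2.08  # t percentile for 21 df, as quoted for the slope CI


def limits(x0, individual=False):
    """95% limits at covariate value x0, for the mean or for one observation."""
    fitted = a + b * x0
    variance_factor = 1 / n + (x0 - mean_x) ** 2 / sxx  # error in the estimated mean
    if individual:
        variance_factor += 1.0  # extra scatter of one observation around the mean
    half_width = T_CRIT_95 * s * sqrt(variance_factor)
    return fitted - half_width, fitted + half_width


for x0 in (10.0, 19.0):
    lo_m, hi_m = limits(x0)
    lo_i, hi_i = limits(x0, individual=True)
    print(f"x = {x0}: mean [{lo_m:.2f}, {hi_m:.2f}], individual [{lo_i:.2f}, {hi_i:.2f}]")
```

The two terms in `variance_factor` correspond to the two error components listed above: the first shrinks with n and grows away from the mean of x, the second does not depend on x0 at all.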

STATA OUTPUT


ASSUMPTIONS AND HOW TO CHECK THEM

Linear Regression Model: Assumptions
• The model is developed assuming that:
   - the observations of Y are collected independently;
   - the mean value of the dependent variable Y is a linear function of the covariate X;
   - for each value of α + β·X, the dependent variable is normally distributed with constant variance σ².
• These are assumptions: they need to be checked.
• If not fulfilled, you may need to consider
   - using another form of the covariate;
   - using a transformation of the dependent variable; etc.


Checking the Assumptions
• Recall, according to the model,
   Y = α + β · x + ε,
   where ε is normally distributed with N(0, σ²).
• We can estimate ε by
   e = y – (a + b · x)
   - These estimates are called residuals.
   - Σ e²/(n-2) will give an estimate of σ² (the divisor is n-2 because two coefficients, a and b, are estimated).
• If the assumptions are correct, the residuals should approximately have a normal distribution with mean 0.
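A short sketch of the residual calculation for the Vcf data, using the closed-form least squares estimates and the n-2 divisor for the residual variance (variable names are illustrative). With an intercept in the model, the residuals sum to zero by construction, so the useful checks are their spread and pattern rather than their mean.

```python
from math import sqrt

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x = sum(glucose) / n
mean_y = sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x

# Residuals: observed value minus the value predicted by the fitted line.
residuals = [y - (a + b * x) for x, y in zip(glucose, vcf)]

# Estimate of the error variance and standard deviation.
sigma2_hat = sum(e ** 2 for e in residuals) / (n - 2)
sigma_hat = sqrt(sigma2_hat)

print(f"sum of residuals = {sum(residuals):.2e}")
print(f"estimated sigma  = {sigma_hat:.3f}")

# Pairs (x, e) are what a residual-versus-covariate plot would display.
residual_plot_points = list(zip(glucose, residuals))
```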

Analysis of Residuals (1)
• Plot the residuals against the observed covariate values.
• If the assumptions are met, the points should be evenly scattered around 0 for all covariate values.


Analysis of Residuals (2)
• The plot of the residuals may reveal non-constant variance (heteroscedasticity).
• It can also point towards a non-linear (w.r.t. the covariate values) relationship.


Blood Glucose & Vcf: Residuals
• The plot looks reasonable.






Checking Normality of Residuals
• To this aim, the normal probability plot is used.
   - Standardized residuals (residual/st.error) are ordered and plotted against the values expected from the standard normal distribution.
   - The graph should look approximately linear.
• One might have doubts in our example…
[Figure: normal probability plot; x-axis Empirical P[i] = i/(N+1), y-axis Normal F[(resid-m)/s], both from 0.00 to 1.00]
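The axis labels of the plot suggest how its points are built: the i-th smallest residual is placed at empirical probability i/(N+1) on one axis and at the standard normal cumulative probability of its standardized value on the other. The sketch below computes those coordinates for the Vcf residuals. It assumes m and s are the sample mean and sample standard deviation of the residuals, which the slide does not state explicitly.

```python
from statistics import NormalDist, mean, stdev

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

# Least squares fit and residuals.
n = len(glucose)
mean_x, mean_y = mean(glucose), mean(vcf)
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x
residuals = [y - (a + b * x) for x, y in zip(glucose, vcf)]

# Standardize, order, and pair with the empirical probabilities i/(N+1).
m, s = mean(residuals), stdev(residuals)  # assumption: sample mean and sample SD
std_normal = NormalDist()                 # mean 0, standard deviation 1
ordered = sorted(residuals)
pp_points = [
    (i / (n + 1), std_normal.cdf((e - m) / s))
    for i, e in enumerate(ordered, start=1)
]

# If the residuals are roughly normal, the points lie close to the diagonal.
largest_gap = max(abs(px - py) for px, py in pp_points)
print(f"largest distance from the diagonal: {largest_gap:.3f}")
```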


Linear Regression for Log-Vcf
• Let us use ln(Vcf) as the dependent variable.
• The model changes to
   ln(Vcf) = α + β · (glucose level)
• The estimated model is
   ln(Vcf) = 0.115 + 0.015 · (glucose level)


Model for Log-Vcf: Residuals (1)
• No major problems in the residual plot.
[Figure: residuals (-0.4 to 0.4) plotted against blood glucose level (5 to 20)]


Model for Log-Vcf: Residuals
• One might argue that the normal probability plot for the residuals looks better than for untransformed Vcf.
[Figure: normal probability plot; x-axis Empirical P[i] = i/(N+1), y-axis Normal F[(lresid-m)/s], both from 0.00 to 1.00]


Interpretation of the Model for Log-Vcf
• The model implies that
   ln(Vcf) = 0.115 + 0.015 · (glucose level)
• It follows that, if blood glucose increases by 1 unit, then the mean value of ln(Vcf) increases by 0.015.
• Upon taking Vcf ≈ exp(ln(Vcf)),
$$ \mathrm{Vcf} = e^{0.115} \cdot e^{0.015 \cdot (\text{glucose level})} = e^{0.115} \cdot (1.015)^{(\text{glucose level})} $$
• We could conclude that the mean value of Vcf increases exp(0.015) = 1.015 times per 1 unit of blood glucose.
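The same least squares formulas can be applied after taking natural logarithms of Vcf, and the slope can then be converted into a multiplicative factor. The sketch below is a minimal illustration with illustrative variable names; the fitted values should be close to the slide's 0.115 and 0.015, with small differences due to rounding.

```python
from math import exp, log

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

log_vcf = [log(v) for v in vcf]  # natural logarithm of the dependent variable

n = len(glucose)
mean_x = sum(glucose) / n
mean_ly = sum(log_vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(glucose, log_vcf))

b_log = sxy / sxx                # slope on the log scale
a_log = mean_ly - b_log * mean_x  # intercept on the log scale

# Back-transformation: a one-unit increase in glucose multiplies Vcf by exp(b).
factor_per_unit = exp(b_log)

print(f"ln(Vcf) = {a_log:.3f} + {b_log:.3f} * glucose")
print(f"multiplicative change per 1 unit of glucose: {factor_per_unit:.3f}")
```

Because the effect is multiplicative on the original scale, an increase of k units corresponds to a factor of `factor_per_unit ** k`, not to k times a fixed amount.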


Choice of the Transformation
• Consider power transformations x^s or y^s (s = ..., -3, -2, -1, -½, 0 (= ln), ½, ...)
• The circle of powers.
   - Choose the quadrant which most closely resembles the pattern of the data.
   - Increase or decrease the power of x or y (relative to 1) according to the indications.
     • Example: for Quadrant II, take s<1 for x or s>1 for y.


Choice of the Transformation: Example
• Data resemble the pattern of Quadrant III.
• We might want to use s …

Outcome variable and corresponding method:
• Continuous (e.g. pain scale, cognitive function): Linear regression
• Binary or categorical (e.g. fracture yes/no): Logistic regression
• Time-to-event (e.g. time to fracture): Kaplan-Meier statistics, Cox regression
   - Cox regression assumes proportional hazards between groups

Continuous outcome
Are the observations independent or correlated?
Outcome variable: Continuous (e.g. pain scale, cognitive function)

Independent:
• T-test: compares means between two independent groups
• ANOVA: compares means between more than two independent groups
• Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
• Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:
• Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
• Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
• Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
• Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
• Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
• Kruskal-Wallis test: non-parametric alternative to ANOVA
• Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient

Binary or categorical outcomes (proportions)
Are the observations correlated?
Outcome variable: Binary or categorical (e.g. fracture, yes/no)

Independent:
• Chi-square test: compares proportions between two or more groups
• Relative risks: odds ratios or risk ratios
• Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated:
• McNemar’s chi-square test: compares binary outcome between correlated groups (e.g., before and after)
• Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
• GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternative to the chi-square test if sparse cells:
• Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells < 5).
• McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells < 5).

Time-to-event outcome (survival data)
Are the observation groups independent or correlated?
Outcome variable: Time-to-event (e.g., time to fracture)

Independent:
• Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with log-rank test
• Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated:
• n/a (already over time)

Modifications to Cox regression if proportional-hazards is violated:
• Time-dependent predictors or time-dependent hazard ratios (tricky!)
