Session 11. Multiple Regression and Correlation Methods
Lecture 11
Regression and Correlation methods
10/10/2017
sawilopo@yahoo.com
Biostatistics I: 2017-18
1
Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health

Learning Objectives
1. Describe the Linear Regression Model
2. State the Regression Modeling Steps
3. Explain Ordinary Least Squares
4. Compute Regression Coefficients
5. Understand and check model assumptions
6. Use of Computer Program

Purpose of regression
• Estimation
– Estimate the association between outcome and exposure, adjusted for other covariates
• Prediction
– Use an estimated model to predict the outcome given covariates in a new dataset

Adjusting for confounders
[Figure: adjusted and unadjusted estimates compared with the true value]
• Do not adjust
– Cofactor is a collider
– Cofactor is in the causal path
• May or may not adjust
– Cofactor has missing values
– Cofactor is measured with error

Workflow
• Scatterplots
• Bivariate analysis
• Regression
– Model fitting
  • Cofactors in/out
  • Interactions
– Test of assumptions
  • Independent errors
  • Linear effects
  • Constant error variance
– Influence (robustness)
– Interaction testing

Correlation vs Regression
• Deterministic vs. Statistical Relationship
• Correlation Coefficient
• Simple Linear Regression

Deterministic vs. Statistical Relationship
• Body Mass Index (BMI)
• Income (millions $) vs bank’s assets (billions $)

BMI and Height
$$\mathrm{BMI} = \frac{\text{body mass (kg)}}{(\text{height (m)})^2}$$
Fix body mass = 80 kg.
Height from 1.5 to 2.0 m.
Deterministic relationship: mass, height → BMI
[Figure: BMI (20 to 35) plotted against height (1.5 to 2.0 m) for body mass fixed at 80 kg]
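The deterministic BMI relationship can be sketched in a few lines of Python. This is only an illustration: the function name `bmi` and the grid of heights are assumptions, not part of the lecture. The same inputs always give the same output, with no scatter.

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Body mass index: mass in kg divided by squared height in m."""
    return mass_kg / height_m ** 2


# Body mass fixed at 80 kg, height varied from 1.5 m to 2.0 m as on the slide.
heights = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
bmi_values = [bmi(80.0, h) for h in heights]

for h, value in zip(heights, bmi_values):
    print(f"height {h:.1f} m -> BMI {value:.1f}")
```

Because the relationship is deterministic, knowing mass and height fixes BMI exactly.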

Income vs. Assets
Income = a + b · Assets
Assets: 3.4 to 49 billion $
Income changes, even for banks with the same assets!
Statistical relationship
[Figure: scatterplot of income (millions $, 0 to 300) against assets (billions $, 0 to 60)]

Description of Relationships
A deterministic relationship is easy to describe: by a formula.
It allows for a perfect prediction:
body mass and height known → exact BMI
Perfect prediction of quantities subject to a statistical relationship is not possible:
known assets → varying income
But:
higher assets → higher income (on average)

Statistical Relationships: Examples
[Figure: four example scatterplots. Axis labels: Health Status Measure, Income, Age, Mental Health Score, Education Level, Physical Health Score. Panel annotations: linear, quadratic.]

Strength and Direction of a Linear Association
How well does a straight line fit the points on a two-dimensional scatterplot?
Pearson’s correlation coefficient (often simply called a correlation): r.
A measure of linear association: the stronger the association, the larger the absolute value of r.
Gives the “direction” of the relationship:
• positive r → positive association
large values of one variable → large values of the other variable
• negative r → negative association
large values of one variable → small values of the other variable

Pearson’s Correlation Coefficient
Assume n observations for a pair of random variables (Y, X):
(x1,y1), …, (xn,yn)
Then
$$r = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i}(x_i-\bar{x})^2 \, \sum_{i}(y_i-\bar{y})^2}}$$

Blood Glucose and Vcf
23 patients with type I diabetes.
Velocity of circumferential shortening of the left ventricle (Vcf) seems to (linearly) increase with blood glucose.
How to describe the relation?
It is not deterministic.

Blood Glucose and Vcf: Correlation
Subject  Glucose (mmol/l)  Vcf (%/s)
1        15.3              1.76
2        10.8              1.34
3        8.1               1.27
4        19.5              1.47
5        7.2               1.27
6        5.3               1.49
7        9.3               1.31
8        11.1              1.09
9        7.5               1.18
10       12.2              1.22
11       6.7               1.25
12       5.2               1.19
13       19.0              1.95
14       15.1              1.28
15       6.7               1.52
16       4.2               1.12
17       10.3              1.37
18       12.5              1.19
19       16.1              1.05
20       13.3              1.32
21       4.9               1.03
22       8.8               1.12
23       9.5               1.70
mean glucose: 10.37; mean Vcf: 1.32
$$\sum_i (x_i-\bar{x})^2 = (15.3-10.37)^2 + \dots + (9.5-10.37)^2 = 429.7$$
$$\sum_i (y_i-\bar{y})^2 = (1.76-1.32)^2 + \dots + (1.70-1.32)^2 = 1.19$$
$$\sum_i (x_i-\bar{x})(y_i-\bar{y}) = (15.3-10.37)(1.76-1.32) + \dots + (9.5-10.37)(1.70-1.32) = 9.43$$
$$r = \frac{9.43}{\sqrt{429.7 \cdot 1.19}} = 0.417$$
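The hand computation above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the variable and function names (`glucose`, `vcf`, `pearson_sums`) are illustrative choices, and the data are the 23 pairs from the slide.

```python
from math import sqrt

# Blood glucose (mmol/l) and Vcf (%/s) for the 23 patients on the slide.
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]


def pearson_sums(x, y):
    """Return (Sxx, Syy, Sxy, r) for paired samples x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxx, syy, sxy, sxy / sqrt(sxx * syy)


sxx, syy, sxy, r = pearson_sums(glucose, vcf)
print(f"Sxx = {sxx:.1f}, Syy = {syy:.2f}, Sxy = {sxy:.2f}, r = {r:.3f}")
```

The three sums are kept separate because the same quantities reappear when the regression slope and its standard error are computed.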

Correlation Coefficient: Special Values
Perfect positive association when r = +1.
Perfect negative association when r = -1.
No linear association (the association can still be non-linear!), or a linear association with a horizontal line, when r = 0.
NOTE: r has to be in [-1,+1].

Correlation Coefficients
[Figure: four scatterplots of y against x (both axes 40 to 70, n = 100 each) with r = -0.5, r = -0.9, r = 0.9, and r = 0.0]

Significance Test for Pearson’s Correlation Coefficient
The computed value of r will usually be different from 0 due to sampling variability.
One may want to test the null hypothesis that the true value of the coefficient is 0.
$$T = r\sqrt{\frac{n-2}{1-r^2}}$$
If the two variables are normally distributed, under the null hypothesis, T should have Student’s t distribution with n-2 degrees of freedom.
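As a rough sketch of how the test statistic and its two-sided p-value could be computed without a statistics package: the helper names below are assumptions, and the p-value is obtained by numerically integrating the Student t density with Simpson's rule, which is one simple option and not necessarily what statistical software does internally. The inputs r = 0.417 and n = 23 come from the blood glucose example.

```python
from math import gamma, pi, sqrt


def t_statistic(r: float, n: int) -> float:
    """T = r * sqrt((n - 2) / (1 - r^2))."""
    return r * sqrt((n - 2) / (1 - r * r))


def t_pdf(t: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_value: float, df: int, steps: int = 2000) -> float:
    """P(|t_df| >= |t_value|), using Simpson's rule on [0, |t_value|]."""
    upper = abs(t_value)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 <= t <= |t_value|)
    return 1 - 2 * central_area


T = t_statistic(0.417, 23)
p = two_sided_p(T, 23 - 2)
print(f"T = {T:.2f}, p = {p:.3f}")
```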

Blood Glucose and Vcf: The Test
For the 23 (glucose, Vcf) pairs listed earlier, r = 0.417 and n = 23:
$$T = 0.417\sqrt{\frac{23-2}{1-0.417^2}} = 2.10$$
p = P(|t21| ≥ 2.10) = 0.048 < 0.05.
We can reject the null hypothesis that the true value of the correlation coefficient is 0.

Further Remarks on Pearson’s Correlation Coefficient
Reminder: the coefficient describes only a linear association.
It is sensitive to outliers (i.e., observations that lie far from the main bulk of the data).
Outliers are often due to recording errors, but may be genuine values.
A non-parametric version, Spearman’s rank correlation coefficient, exists.
A non-zero coefficient does not imply a causal relationship.
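Spearman's coefficient can be viewed as Pearson's coefficient applied to the ranks of the data. The sketch below assumes average ranks for ties; the helper names and the small example data are made up for illustration and do not come from the lecture.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of sorted positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


def spearman_r(x, y):
    """Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical monotone but non-linear data: y = x squared.
x_demo = [1, 2, 3, 4, 5]
y_demo = [1, 4, 9, 16, 25]
print(f"Pearson  r = {pearson_r(x_demo, y_demo):.3f}")
print(f"Spearman r = {spearman_r(x_demo, y_demo):.3f}")
```

For this monotone relationship the rank-based coefficient equals 1, while Pearson's r stays below 1 because the relationship is not a straight line.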
A SIMPLE LINEAR REGRESSION

Relationship Between Blood Glucose and Vcf
Individual observations on Vcf vary quite a bit even for very similar levels of blood glucose.
It seems, however, that a higher blood glucose level is associated with a higher average Vcf.
How can we make this description more formal?

Simple Linear Regression: Blood Glucose & Vcf (1)
Assume that Vcf is normally distributed with N(μ, σ²).
Assume a linear regression model: the mean (average) value of Vcf changes linearly with the level of blood glucose:
μ = α + β · (glucose level)

Linear Regression: Terminology (1)
The dependent variable Y and the covariate (independent, explanatory variable) X.
In our example, Vcf is Y, blood glucose level is X.
We assume that Y is normally distributed with N(μY, σ²).
We further postulate that, for X = x,
μY = μY(x) = α + β · x
α and β are the coefficients of the model.
α is called the intercept.
β is called the slope.

Simple Linear Regression
The straight line describes the increase in the mean of the dependent variable as a function of the covariate level.
Individual observations of the dependent variable vary around the regression line; their deviations from the line follow a normal distribution with mean 0 and a constant variance.

Linear Regression: Terminology (2)
For an individual observation of Y we can write that
Y = α + β · x + ε,
where ε is normally distributed with N(0, σ²).
Interpretation: an individual observation of Y can randomly deviate from the mean, which is a linear function of x.
ε is called the residual random error (measurement error).
Note that σ² is assumed constant for all x: the homoscedasticity assumption.

Linear Regression: The Intercept
μY(x) = α + β · x
For x = 0, μY(0) = α + β · 0 = α
α is the mean value of the dependent variable when x = 0.
But blood glucose level = 0 makes little sense...
Use a “centered” covariate: μY(x) = α + β · (x – x0)
Usually, one takes x0 = sample mean of observed x values.
For x = x0, μY(x0) = α + β · (x0 – x0) = α + β · 0 = α
α is then the mean value when x = x0.
Easier to interpret.
Can help in estimating the model.
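A small sketch of the effect of centering, under stated assumptions: the helper `fit_line` uses the closed-form least squares estimates, and the five data points are invented purely for illustration. Centering the covariate at its sample mean leaves the slope unchanged and turns the intercept into the mean of y.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical illustration data (not from the lecture).
x_raw = [4.0, 6.0, 8.0, 10.0, 12.0]
y_obs = [1.10, 1.20, 1.20, 1.40, 1.50]

x0 = sum(x_raw) / len(x_raw)          # centering value: sample mean of x
x_centered = [xi - x0 for xi in x_raw]

a_raw, b_raw = fit_line(x_raw, y_obs)
a_cen, b_cen = fit_line(x_centered, y_obs)

print(f"raw:      intercept {a_raw:.3f}, slope {b_raw:.4f}")
print(f"centered: intercept {a_cen:.3f}, slope {b_cen:.4f}")
```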

Linear Regression: The Slope
μY(x) = α + β · x
Consider two values of the covariate: x and x+1.
For x: μY(x) = α + β · x
For (x+1): μY(x+1) = α + β · (x+1) = α + β · x + β = μY(x) + β
β is the change in the mean value of the dependent variable corresponding to a unit change in the covariate.
β > 0: positive relationship (x increases, the mean increases).
β < 0: negative relationship (x increases, the mean decreases).
β = 0: no change, i.e., no relationship.

Linear Regression: Estimation
μY(x) = α + β · x
The equation describes a theoretical relationship.
In practice, we know neither α nor β.
We have to estimate them from the observed data.
This is often called fitting a model to data.
The estimated coefficients will be denoted by a and b.
How to estimate α and β?

Estimation of the Coefficients of a Linear Regression Model
Least squares method: select the line which minimizes the sum of squares of the differences between the observed values and the values predicted by the model (line).
Result:
Vcf(x) = 1.10 + 0.022 · x
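For a single covariate the least squares line has a closed form: b = Sxy / Sxx and a = ȳ − b · x̄. The sketch below applies it to the 23 glucose/Vcf pairs from the lecture; the function name `least_squares` is an illustrative choice, and the closed form is the standard textbook result, not something spelled out on the slide.

```python
# Blood glucose (mmol/l) and Vcf (%/s) for the 23 patients in the lecture.
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]


def least_squares(x, y):
    """Return (a, b) minimizing sum((y_i - a - b * x_i)^2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


a, b = least_squares(glucose, vcf)
print(f"Vcf(x) = {a:.2f} + {b:.3f} * x")
```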

Linear Regression for Vcf & Blood Glucose
Estimated model: Vcf(x) = 1.10 + 0.022 · x
Interpretation: if the blood glucose level increases by 1 mmol/l, the mean value of Vcf increases by 0.022 %/s.
Positive association.
Note that the estimate b of the slope is close to 0. Perhaps it differs from 0 only by chance…
We need a CI for β.

Confidence Interval for the Slope
CI for β: $b \pm t_{n-2,\,1-\alpha/2} \cdot \mathrm{SE}(b)$
($t_{n-2,\,1-\alpha/2}$ is a percentile from Student’s $t_{n-2}$ distribution; here α denotes the significance level, not the intercept.)
In our case, n = 23 and SE(b) = 0.0105
95% CI for β: [0.022 ± 2.08·0.0105] = [0.0002, 0.0438]
99% CI for β: [0.022 ± 2.83·0.0105] = [-0.0077, 0.0517]
• For large n (≥100), the standard normal distribution can be used.
The 95% CI does not include 0 → we can reject H0: β = 0 at the 5% level.
But the 99% CI does include 0.
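The standard error of the slope and both intervals can be recomputed from the raw data. This sketch assumes the usual formula SE(b) = sqrt(s² / Sxx) with s² = Σe² / (n − 2), which is not written out on the slide, and takes the t percentiles 2.08 and 2.83 from the slide rather than computing them.

```python
from math import sqrt

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x = sum(glucose) / n
mean_y = sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x

# Residual variance with n - 2 degrees of freedom (two estimated coefficients).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(glucose, vcf))
se_b = sqrt(sse / (n - 2) / sxx)

# t percentiles for 21 degrees of freedom, as quoted on the slide.
T_95, T_99 = 2.08, 2.83
ci95 = (b - T_95 * se_b, b + T_95 * se_b)
ci99 = (b - T_99 * se_b, b + T_99 * se_b)

print(f"b = {b:.4f}, SE(b) = {se_b:.4f}")
print(f"95% CI: [{ci95[0]:.4f}, {ci95[1]:.4f}]")
print(f"99% CI: [{ci99[0]:.4f}, {ci99[1]:.4f}]")
```

Small differences from the slide in the last digit arise because the slide rounds b and SE(b) before forming the intervals.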

Test of Significance for the Slope
Alternatively, we could conduct a formal test.
H0: β = 0
HA: β ≠ 0
Under the null hypothesis, T = b / SE(b) should have Student’s t distribution with n-2 degrees of freedom.
For Vcf data, T = 0.022/0.0105 = 2.09.
p = P(|t21| ≥ 2.09) = 0.049
p < 0.05 → we can reject H0 at the 5% significance level.
But not at the 1% level.

Prediction of the Mean Value Based on a Linear Regression Model
The prediction would be of interest, e.g., for a group of subjects with a particular value of x.
Example:
Estimated model: Vcf(x) = 1.10 + 0.022 · x
Take x = 10: Vcf(10) = 1.10 + 0.022 · 10 = 1.32
This point prediction is subject to an error, due to the estimation of the coefficients of the model.
One should compute a CI for the predicted value.

Prediction Limits for the Mean Value
The prediction limits get wider the further we are from the “center” of the scatterplot.
I.e., precision of the prediction decreases if we move further away from the mean of x.

Prediction of an Individual Observation
One can also try to make a prediction for an individual observation of the dependent variable.
The prediction would be of interest for, e.g., an individual patient.
The problem here is that the individual observation will randomly deviate from the mean.
A point prediction on its own thus makes little sense.
We can compute a CI for the observation.

Prediction Limits for an Individual Observation
The prediction limits are wider than those for the mean value.
The prediction error contains two components now:
– the error due to the prediction of the mean value;
– the error due to the variability (σ²) around the mean value.
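The two kinds of limits can be compared numerically. The sketch below uses the standard textbook standard errors, which are an assumption here because the slides show the limits only graphically: for the mean at x0, s · sqrt(1/n + (x0 − x̄)²/Sxx); for an individual observation, s · sqrt(1 + 1/n + (x0 − x̄)²/Sxx). The t percentile 2.08 (21 degrees of freedom) is taken from the confidence interval slide.

```python
from math import sqrt

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x = sum(glucose) / n
mean_y = sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x
s = sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(glucose, vcf)) / (n - 2))

T_95 = 2.08  # t percentile for 21 degrees of freedom, from the slides


def limits(x0: float, individual: bool):
    """95% limits at x0 for the mean value or for an individual observation."""
    extra = 1.0 if individual else 0.0  # adds the variability around the mean
    se = s * sqrt(extra + 1 / n + (x0 - mean_x) ** 2 / sxx)
    centre = a + b * x0
    return centre - T_95 * se, centre + T_95 * se


for x0 in (10.0, 19.0):
    lo_m, hi_m = limits(x0, individual=False)
    lo_i, hi_i = limits(x0, individual=True)
    print(f"x = {x0:4.1f}: mean [{lo_m:.2f}, {hi_m:.2f}], "
          f"individual [{lo_i:.2f}, {hi_i:.2f}]")
```

The interval for the mean widens as x0 moves away from the mean of x, and the interval for an individual observation is wider than the one for the mean at every x0.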
STATA OUTPUT
ASSUMPTIONS AND HOW TO CHECK THEM

Linear Regression Model: Assumptions
The model is developed assuming that:
– the observations of Y are collected independently;
– the mean value of the dependent variable Y is a linear function of the covariate X;
– for each value of X, the dependent variable is normally distributed around α + β·X with constant variance σ².
These are assumptions: they need to be checked.
If not fulfilled, you may need to consider
– using another form of the covariate;
– using a transformation of the dependent variable; etc.

Checking the Assumptions
Recall, according to the model,
Y = α + β · x + ε,
where ε is normally distributed with N(0, σ²).
We can estimate ε by
e = y – (a + b · x)
These estimates are called residuals.
Σ e²/(n-2) will give an estimate of σ² (two coefficients are estimated, leaving n-2 degrees of freedom).
If the assumptions are correct, the residuals should approximately have a normal distribution with mean 0.
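A minimal sketch of computing the residuals for the blood glucose example. It uses the divisor n − 2 for the variance estimate, the usual unbiased choice in simple linear regression, which also reproduces the SE(b) = 0.0105 quoted for the slope. The variable names are illustrative.

```python
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x = sum(glucose) / n
mean_y = sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x

# Residual: observed value minus the value predicted by the fitted line.
residuals = [y - (a + b * x) for x, y in zip(glucose, vcf)]
sigma2_hat = sum(e * e for e in residuals) / (n - 2)

print(f"sum of residuals = {sum(residuals):.2e}")
print(f"estimated residual variance = {sigma2_hat:.4f}")
for x, e in sorted(zip(glucose, residuals)):
    print(f"glucose {x:5.1f}  residual {e:+.3f}")
```

Listing the residuals in order of the covariate is a text-only stand-in for the residual plot: the signs and sizes should show no systematic pattern along x.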

Analysis of Residuals (1)
Plot the residuals against the observed covariate values.
If the assumptions are met, the plot should be evenly scattered for all covariate values.

Analysis of Residuals (2)
The plot of the residuals may reveal non-constant variance (heteroscedasticity).
It can also point towards a non-linear (w.r.t. the covariate values) relationship.

Blood Glucose & Vcf: Residuals
The plot looks reasonable.

Checking Normality of Residuals
To check the normality of the residuals, the normal probability plot is used.
Standardized residuals (residual/st.error) are ordered and plotted against the values expected from the standard normal distribution.
The graph should look approximately linear.
One might have doubts in our example…
[Figure: normal probability plot, Normal F[(resid-m)/s] against Empirical P[i] = i/(N+1), both axes 0.00 to 1.00]

Linear Regression for Log-Vcf
Let us use ln(Vcf) as the dependent variable.
The model changes to
ln(Vcf) = α + β · (glucose level)
The estimated model is
ln(Vcf) = 0.115 + 0.015 · (glucose level)

Model for Log-Vcf: Residuals (1)
No major problems in the residual plot.
[Figure: residuals (-0.4 to 0.4) against blood glucose level (5 to 20)]

Model for Log-Vcf: Residuals
One might argue that the normal probability plot for the residuals looks better than for untransformed Vcf.
[Figure: normal probability plot, Normal F[(lresid-m)/s] against Empirical P[i] = i/(N+1), both axes 0.00 to 1.00]

Interpretation of the Model for Log-Vcf
The model implies that
ln(Vcf) = 0.115 + 0.015 · (glucose level)
It follows that, if blood glucose increases by 1 unit, then the mean value of ln(Vcf) increases by 0.015.
Upon taking Vcf ≈ exp(ln(Vcf)),
$$\mathrm{Vcf} = e^{0.115} \cdot e^{0.015 \cdot (\text{glucose level})} = e^{0.115} \cdot (1.015)^{(\text{glucose level})}$$
We could conclude that the mean value of Vcf increases exp(0.015) = 1.015 times per 1 unit of blood glucose.
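The log model can be refitted from the raw data to see where 0.115, 0.015 and the factor 1.015 come from. This is a sketch with illustrative names, using the same closed-form least squares estimates as for the untransformed model and the natural logarithm of Vcf as the dependent variable.

```python
from math import exp, log

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

log_vcf = [log(v) for v in vcf]  # natural logarithm of the outcome

n = len(glucose)
mean_x = sum(glucose) / n
mean_ly = sum(log_vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(glucose, log_vcf))
b_log = sxy / sxx
a_log = mean_ly - b_log * mean_x

# Back on the original scale, one extra unit of glucose multiplies Vcf by exp(b).
factor = exp(b_log)

print(f"ln(Vcf) = {a_log:.3f} + {b_log:.3f} * glucose")
print(f"multiplicative change per unit of glucose: {factor:.3f}")
```

On the log scale the effect is additive; exponentiating the slope turns it into a multiplicative change of roughly 1.5% per unit of blood glucose.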

Choice of the Transformation
Consider power transformations $x^s$ or $y^s$ (s = ..., -3, -2, -1, -½, 0 (= ln), ½, ...)
The circle of powers.
Choose the quadrant which most closely resembles the pattern of the data.
Increase or decrease the power of x or y (relative to 1) according to the indications.
• Example: for Quadrant II,
take s1 for y.

Choice of the Transformation: Example
Data resemble the pattern of Quadrant III.
We might want to use
s = 5)

Outcome variable and regression method
• Continuous (e.g. pain scale, cognitive function): Linear regression
• Binary or categorical (e.g. fracture yes/no): Logistic regression
• Time-to-event (e.g. time to fracture): Kaplan-Meier statistics; Cox regression (correlated observations: n/a)
Cox regression assumes proportional hazards between groups

Continuous outcome
Outcome variable: Continuous (e.g. pain scale, cognitive function)
Are the observations independent or correlated?

Independent:
• t-test: compares means between two independent groups
• ANOVA: compares means between more than two independent groups
• Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
• Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:
• Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
• Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
• Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
• Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
• Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
• Kruskal-Wallis test: non-parametric alternative to ANOVA
• Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient

Binary or categorical outcomes (proportions)
Outcome variable: Binary or categorical (e.g. fracture, yes/no)
Are the observations correlated?

Independent:
• Chi-square test: compares proportions between two or more groups
• Relative risks: odds ratios or risk ratios
• Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated:
• McNemar’s chi-square test: compares binary outcome between correlated groups (e.g., before and after)
• Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
• GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
• Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells < 5).
• McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells < 5).

Time-to-event outcome (survival data)
Outcome variable: Time-to-event (e.g., time to fracture)
Are the observation groups independent or correlated?

Independent:
• Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with log-rank test
• Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated: n/a (already over time)

Modifications to Cox regression if proportional-hazards is violated:
• Time-dependent predictors or time-dependent hazard ratios (tricky!)
Regression and
Correlation methods
10/10/2017
sawilopo@yahoo.com
Biostatistics I: 2017-18
1
Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health
Learni ng Obj ecti ves
1.
Describe the Linear Regression Model
2.
State the Regression Modeling Steps
3.
Explain Ordinary Least Squares
4.
Compute Regression Coefficients
5.
Understand and check model
assumptions
6.
Use of Computer Program
10/10/2017
sawilopo@yahoo.com
Biostatistics I: 2017-18
2
Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health
Purpose of regressi on
Estimation
Estimate association between outcome
and exposure adjusted for other
covariates
Prediction
Use an estimated model to predict the
outcome given covariates in a new dataset
10/10/2017
sawilopo@yahoo.com
Biostatistics I: 2017-18
3
Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health
Adj usti ng for confounders
True value
True value
•
Adjusted estimate
Not adjust
–
–
•
Unadjusted estimate
Cofactor is a collider
Cofactor is in causal path
May or may not adjust
–
–
Cofactor has missing
Cofactor has error
10/10/2017
sawilopo@yahoo.com
Biostatistics I: 2017-18
4
Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health
Workfl ow
•
•
•
Scatterplots
Bivariate analysis
Regression
–
Model fitting
• Cofactors in/out
• Interactions
–
Test of assumptions
• Independent errors
• Linear effects
• Constant error variance
–
–
Influence (robustness)
Interactiom testing
10/10/2017
sawilopo@yahoo.com
Biostatistics I: 2017-18
5
Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health
Correl ati on vs Regressi on
Deterministic vs. Statistical
Relationship
Correlation Coefficient
Simple Linear Regression
Biostatistics I: 2017-18
sawilopo@yahoo.com
10/10/2017
6
Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population
Health
Determi ni sti c vs. Stati sti cal Rel ati onshi p
Body
Mass Index (BMI)
Income (millions $) vs bank’s assets
(billions $)
10/10/2017
sawilopo@yahoo.com
Biostatistics I: 2017-18
7
Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population Health
BMI and Hei ght
BMI=
(body mass kg)/(height m)2
35
Fix body mass = 80 kg.
Height from 1.5 to 2.0 m.
Deterministic relationship
Mass, height BMI
BMI
30
25
20
1.5
sawilopo@yahoo.com
1.6
1.7
1.8
wzrost (m)
1.9
2
8 Health
Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population
Income vs. Assets
Income = a + b Assets
Assets 3.4 - 49 billion $
Income changes, even
for banks with the
same assets!
300
income (millions)
250
200
150
100
50
0
Statistical relationship
sawilopo@yahoo.com
0
20
40
60
assets (billions)
9 Health
Universitas Gadjah Mada, Faculty of Medicine, Department of Biostatistics, Epidemiology and Population
Descri pti on of Rel ati onshi ps
A detertministic relationship is easy to
describe:
It allows for a perfect prediction:
body mass and height known exact BMI
Perfect prediction of quantities subject to a
statistical relationship is not possible:
by a formula
known assets varying income
But:
higher assets higher income (on average)

Statistical Relationships: Examples
[Figure: example scatterplots of statistical relationships. Axis labels appearing in the panels: Health Status Measure, Income, Age, Mental Health Score, Education Level, Physical Health Score. One panel is annotated "linear" and one "quadratic".]

Strength and Direction of a Linear Association
How well does a straight line fit the points on a
two-dimensional scatterplot?
Pearson’s correlation coefficient (often simply
called a correlation): r.
A measure of a linear association: the stronger the
association, the larger the absolute value of r.
Gives the “direction” of the relationship:
• positive r → positive association
large values of one variable → large values of the other
variable
• negative r → negative association
large values of one variable → small values of the other
variable

Pearson’s Correlation Coefficient
Assume n observations for a pair of random variables (Y,X):
(x1,y1), …, (xn,yn)
Then
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Blood Glucose and Vcf
23 patients with type I
diabetes.
Velocity of circumferential shortening of
the left ventricle (Vcf) seems
to (linearly) increase with
blood glucose.
How to describe the
relation?
It is not deterministic.

Blood Glucose and Vcf: Correlation
Subject  Glucose (mmol/l)  Vcf (%/s)
1        15.3              1.76
2        10.8              1.34
3        8.1               1.27
4        19.5              1.47
5        7.2               1.27
6        5.3               1.49
7        9.3               1.31
8        11.1              1.09
9        7.5               1.18
10       12.2              1.22
11       6.7               1.25
12       5.2               1.19
13       19.0              1.95
14       15.1              1.28
15       6.7               1.52
16       4.2               1.12
17       10.3              1.37
18       12.5              1.19
19       16.1              1.05
20       13.3              1.32
21       4.9               1.03
22       8.8               1.12
23       9.5               1.70
mean glucose: 10.37; mean Vcf: 1.32
(15.3-10.37)^2 + … + (9.5-10.37)^2 = 429.7
(1.76-1.32)^2 + … + (1.70-1.32)^2 = 1.19
(15.3-10.37)(1.76-1.32) + … + (9.5-10.37)(1.70-1.32) = 9.43
$$r = \frac{9.43}{\sqrt{429.7 \cdot 1.19}} = 0.417$$
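The hand calculation above can be reproduced in a few lines. The following is a minimal Python sketch and not part of the original lecture; the function name `pearson_r` and the variable names are illustrative assumptions. It works with unrounded means, so the last printed digit may differ slightly from the slide.

```python
import math

# Blood glucose (mmol/l) and Vcf (%/s) for the 23 patients in the table
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]


def pearson_r(x, y):
    """Return (r, Sxx, Syy, Sxy) computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy), sxx, syy, sxy


r, sxx, syy, sxy = pearson_r(glucose, vcf)
print(f"Sxx = {sxx:.1f}, Syy = {syy:.2f}, Sxy = {sxy:.2f}, r = {r:.3f}")
```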

Correlation Coefficient: Special Values
Perfect positive association when r = +1.
Perfect negative association when r = -1.
No linear association (can be non-linear!),
or linear association with a horizontal line
when r = 0.
NOTE: r has to be in [-1,+1].
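The special values can be illustrated with small made-up data sets. This is a hedged sketch: the numbers and the helper `pearson_r` are invented for illustration and do not come from the lecture. An exact increasing line gives r = +1, an exact decreasing line gives r = -1, and a symmetric parabola gives r = 0 even though y is completely determined by x.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]                            # illustrative covariate values
r_up = pearson_r(x, [2 * v + 1 for v in x])      # exact increasing line
r_down = pearson_r(x, [-3 * v + 4 for v in x])   # exact decreasing line
r_curve = pearson_r(x, [v ** 2 for v in x])      # exact parabola: no linear trend
print(r_up, r_down, r_curve)
```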

Correlation Coefficients
[Figure: four scatterplots of y against x (both axes 40 to 70), each with n = 100, showing r = -0.9, r = -0.5, r = 0.0 and r = 0.9]

Significance Test for Pearson’s Correlation Coefficient
The computed value of r will usually be
different from 0 due to sampling
variability.
One may want to test the null hypothesis
that the true value of the coefficient is 0.
$$T = r \sqrt{\frac{n-2}{1-r^2}}$$
If the two variables are normally distributed, under the null hypothesis, T should
have Student’s t distribution with n-2 degrees of freedom.

Blood Glucose and Vcf: The Test
For the same 23 patients (r = 0.417, n = 23):
$$T = 0.417 \sqrt{\frac{23-2}{1-0.417^2}} = 2.10$$
p = P(|t21| ≥ 2.10) = 0.048 < 0.05.
We can reject the null hypothesis that the true value
of the correlation coefficient is 0.
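A short sketch of the same test in Python, under stated assumptions: r and n are taken from the lecture, and the critical value 2.080 is assumed to be the tabulated 97.5th percentile of Student's t distribution with 21 degrees of freedom (the lecture quotes 2.08 for the confidence interval of the slope). The exact p-value needs the t distribution function, which is not in the Python standard library, so the sketch only compares T with the critical value.

```python
import math

r = 0.417   # sample correlation between glucose and Vcf (from the lecture)
n = 23      # number of patients

# T = r * sqrt((n - 2) / (1 - r^2)); under H0 it has a t distribution with n - 2 df
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Assumption: tabulated two-sided 5% critical value of t with 21 df
t_crit_5pct = 2.080

reject_at_5pct = abs(t_stat) >= t_crit_5pct
print(f"T = {t_stat:.2f}, reject H0 at the 5% level: {reject_at_5pct}")
```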

Further Remarks on Pearson’s Correlation Coefficient
Reminder: the coefficient describes only a
linear association.
It is sensitive to outliers (i.e., the observations
which are away from the main bulk of data).
Often due to recording errors, but may be genuine
values.
A non-parametric version, Spearman’s rank
correlation coefficient, exists.
If non-zero, it does not imply a causal
relationship.
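The sensitivity to outliers and the rank-based alternative can be sketched as follows. The data are made up for illustration, and the helpers `ranks` and `spearman_rho` are hypothetical names; Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks, with tied values sharing their average rank.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 60]   # monotone, but the last value is an outlier

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

In this toy example the single outlier pulls Pearson's r well below 1, while the rank-based coefficient still reports a perfectly monotone relationship.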

A SIMPLE LINEAR REGRESSION

Relationship Between Blood Glucose and Vcf
Individual observations on
Vcf vary quite a bit even
for very similar levels of
blood glucose.
It seems, however, that a
higher blood glucose level
goes with a higher average
Vcf.
How can we make this
description more formal?

Simple Linear Regression: Blood Glucose & Vcf (1)
Assume that Vcf is normally distributed
with N(μ, σ²).
Assume a linear regression model:
the mean (average) value of Vcf changes
linearly with the level of blood glucose:
μ = α + β · (glucose level)

Linear Regression: Terminology (1)
The dependent variable Y and the covariate
(independent, explanatory variable) X.
In our example, Vcf is Y, blood glucose level is
X.
We assume that Y is normally distributed
with N(μ_Y, σ²).
We further postulate that, for X = x,
μ_Y = μ_Y(x) = α + β · x
α and β are the coefficients of the model.
α is called the intercept.
β is called the slope.

Simple Linear Regression
The straight line
describes the increase
in the mean of the
dependent variable as
a function of the
covariate level.
Individual observations
for the dependent
variable vary around
the regression line,
according to a normal
distribution with mean
0 and a constant
variance.

Linear Regression: Terminology (2)
For an individual observation of Y we can write that
Y = α + β · x + ε,
where ε is normally distributed with N(0, σ²).
Interpretation: an individual observation of Y can
randomly deviate from the mean, which is a linear
function of x.
ε is called the residual random error (measurement
error).
Note that σ² is assumed constant for all x
(homoscedasticity assumption).

Linear Regression: The Intercept
μ_Y(x) = α + β · x
For x = 0, μ_Y(0) = α + β · 0 = α
α is the mean value of the dependent variable when x = 0.
But blood glucose level = 0 makes little sense...
Use “centered” covariate: μ_Y(x) = α + β · (x – x0)
Usually, one takes x0 = sample mean of observed x
values.
For x = x0, μ_Y(x0) = α + β · (x0 – x0) = α + β · 0 = α
α is then the mean value when x = x0.
Easier to interpret.
Can help in estimating the model.
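The effect of centering can be checked numerically. This is a minimal sketch, not the lecture's own computation: it uses the least squares estimates that the lecture introduces a few slides later, and the helper `fit_line` is a hypothetical name. With x0 set to the sample mean of glucose, the estimated intercept becomes the sample mean of Vcf, while the slope is unchanged.

```python
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]


def fit_line(x, y):
    """Least-squares intercept a and slope b for the line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


x0 = sum(glucose) / len(glucose)              # sample mean of glucose
a_raw, b_raw = fit_line(glucose, vcf)         # intercept refers to glucose = 0
a_cen, b_cen = fit_line([x - x0 for x in glucose], vcf)  # intercept refers to glucose = x0

print(f"uncentered: a = {a_raw:.3f}, b = {b_raw:.4f}")
print(f"centered:   a = {a_cen:.3f}, b = {b_cen:.4f}")
```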

Linear Regression: The Slope
μ_Y(x) = α + β · x
Consider two values of the covariate: x and x+1.
For x : μ_Y(x) = α + β · x
For (x+1) : μ_Y(x+1) = α + β · (x+1) = α + β · x + β = μ_Y(x) + β
β is the change in the mean value of the dependent
variable corresponding to a unit change in the
covariate.
β > 0: positive relationship (x increases, the mean
increases).
β < 0: negative relationship (x increases, the mean
decreases).
β = 0: no change in the mean, i.e., no linear relationship.

Linear Regression: Estimation
μ_Y(x) = α + β · x
The equation describes a theoretical relationship.
In practice, we know neither α nor β .
We have to estimate them from the observed data.
This is often called fitting a model to data.
The estimated coefficients will be denoted by a and b.
How to estimate α and β ?

Estimation of the Coefficients of a Linear Regression Model
Least squares method:
select the line which
minimizes
the sum of squares of the
differences
between the observed
values and
the values predicted by
the model (line).
Result:
Vcf (x) = 1.10 + 0.022 · x
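For a single covariate the least squares line has a closed-form solution: b = Sxy / Sxx and a = mean(y) - b · mean(x). The sketch below applies it to the 23 observations; the function names are illustrative assumptions, and the lecture itself obtains the estimates from statistical software.

```python
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]


def least_squares(x, y):
    """Closed-form least-squares estimates (a, b) for the line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_of_squares(a, b, x, y):
    """Sum of squared differences between observed y and the line a + b * x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = least_squares(glucose, vcf)
best = sum_of_squares(a, b, glucose, vcf)
# Any other line has a larger sum of squares, for example a slightly steeper one
worse = sum_of_squares(a, b + 0.005, glucose, vcf)
print(f"a = {a:.2f}, b = {b:.3f}, SS = {best:.3f}, SS (steeper line) = {worse:.3f}")
```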

Linear Regression for Vcf & Blood Glucose
Estimated model: Vcf(x) = 1.10 + 0.022 · x
Interpretation: if the blood glucose level
increases by 1 mmol/l, the mean value of Vcf
increases by 0.022 %/s.
Positive association.
Note that the estimate b of the slope is close to
0. Perhaps it differs from 0 only by chance…
We need a CI for β.

Confidence Interval for the Slope
CI for β: b ± t_{n-2, 1-α/2} · SE(b)
(t_{n-2, 1-α/2} is a percentile from Student’s t distribution with n-2 degrees of freedom; here α denotes the significance level, not the intercept).
• For large n (≥100), the standard normal distribution can be used.
In our case, n = 23 and SE(b) = 0.0105
95% CI for β: [0.022 ± 2.08·0.0105] = [0.0002, 0.0438]
99% CI for β: [0.022 ± 2.83·0.0105] = [-0.0077, 0.0517]
95% CI does not include 0 → we can reject H0: β = 0 at the 5% level.
But the 99% CI does include 0, so we cannot reject H0 at the 1% level.
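The standard error and both intervals can be recomputed from the data. This is a sketch under stated assumptions: SE(b) is taken as s / sqrt(Sxx) with s² = (residual sum of squares) / (n - 2), and the percentiles 2.080 and 2.831 are assumed tabulated values of Student's t distribution with 21 degrees of freedom (the lecture quotes 2.08 and 2.83). All variable names are illustrative.

```python
import math

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x, mean_y = sum(glucose) / n, sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))

b = sxy / sxx                 # estimated slope
a = mean_y - b * mean_x       # estimated intercept

# Residual variance with n - 2 degrees of freedom, then SE of the slope
rss = sum((y - (a + b * x)) ** 2 for x, y in zip(glucose, vcf))
s2 = rss / (n - 2)
se_b = math.sqrt(s2 / sxx)

# Assumption: tabulated percentiles of t with 21 df
T21_975 = 2.080   # for the 95% interval
T21_995 = 2.831   # for the 99% interval

ci95 = (b - T21_975 * se_b, b + T21_975 * se_b)
ci99 = (b - T21_995 * se_b, b + T21_995 * se_b)
t_stat = b / se_b             # statistic for H0: beta = 0

print(f"b = {b:.4f}, SE(b) = {se_b:.4f}, T = {t_stat:.2f}")
print(f"95% CI: [{ci95[0]:.4f}, {ci95[1]:.4f}]")
print(f"99% CI: [{ci99[0]:.4f}, {ci99[1]:.4f}]")
```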

Test of Significance for the Slope
Alternatively, we could conduct a formal test.
H0: β = 0
HA: β ≠ 0
Under the null hypothesis, T = b / SE(b) should have
Student’s t distribution with n-2 degrees of freedom.
For Vcf data, T = 0.022/0.0105 = 2.09 (2.10 with unrounded b and SE(b), the same value as in the test for the correlation coefficient).
p = P (|t21| ≥ 2.09) = 0.049
p < 0.05 → we can reject H0 at the 5% significance level.
But not at the 1% level.

Prediction of the Mean Value Based on a Linear Regression Model
The prediction would be of interest, e.g., for a
group of subjects with a particular value of x.
Example:
Estimated model: Vcf(x) = 1.10 + 0.022 · x
Take x = 10: Vcf(10) = 1.10 + 0.022 · 10 = 1.32
This point prediction is subject to an error, due to
the estimation of the coefficients of the model.
One should compute a CI for the predicted value.
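A sketch of such an interval for the mean value, under stated assumptions: the standard error of the predicted mean at x0 is taken as s · sqrt(1/n + (x0 - mean(x))² / Sxx), and 2.080 is the assumed tabulated 97.5th percentile of t with 21 degrees of freedom. The function `mean_prediction` is a hypothetical helper, not something defined in the lecture.

```python
import math

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x, mean_y = sum(glucose) / n, sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x
rss = sum((y - (a + b * x)) ** 2 for x, y in zip(glucose, vcf))
s = math.sqrt(rss / (n - 2))      # residual standard deviation

T21_975 = 2.080                   # assumption: tabulated t percentile, 21 df


def mean_prediction(x0):
    """Predicted mean of Vcf at glucose x0 with approximate 95% limits."""
    fit = a + b * x0
    se = s * math.sqrt(1 / n + (x0 - mean_x) ** 2 / sxx)
    return fit, fit - T21_975 * se, fit + T21_975 * se


fit10, low10, high10 = mean_prediction(10)   # close to the mean of glucose
fit19, low19, high19 = mean_prediction(19)   # far from the mean of glucose
print(f"x = 10: {fit10:.2f} [{low10:.2f}, {high10:.2f}]")
print(f"x = 19: {fit19:.2f} [{low19:.2f}, {high19:.2f}]")
```

The interval at x = 19 comes out wider than the one at x = 10, because the term (x0 - mean(x))² grows with the distance from the mean of x.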

Prediction Limits for the Mean Value
The prediction
limits get wider
the further we are
from the “center”
of the scatterplot.
I.e., precision of
the prediction
decreases if we
move further
away from the
mean of x.

Prediction of an Individual Observation
One can also try to make a prediction for an
individual observation of the dependent variable.
The prediction would be of interest for, e.g., an
individual patient.
The problem here is that the individual
observation will randomly deviate from the
mean.
A point prediction on its own is thus of little use.
We can compute prediction limits (an interval) for the observation.

Prediction Limits for an Individual Observation
The prediction limits
are wider than those
for the mean value.
The prediction error
contains two
components now:
the error due to the
prediction of the
mean value;
the error due to the
variability (σ²)
around the mean
value.
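The two components can be made explicit in a short sketch. It assumes the usual formulas: the squared standard error for the mean is s² · (1/n + (x0 - mean(x))² / Sxx), and for an individual observation the residual variance s² is added to it. The name `prediction_variances` and the percentile 2.080 (t with 21 degrees of freedom) are assumptions for illustration.

```python
import math

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x, mean_y = sum(glucose) / n, sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x
s2 = sum((y - (a + b * x)) ** 2 for x, y in zip(glucose, vcf)) / (n - 2)

T21_975 = 2.080   # assumption: tabulated t percentile, 21 df


def prediction_variances(x0):
    """Return (variance for the mean, variance for an individual) at x0."""
    var_mean = s2 * (1 / n + (x0 - mean_x) ** 2 / sxx)   # error in the estimated mean
    var_individual = var_mean + s2                       # plus variability around the mean
    return var_mean, var_individual


var_mean, var_ind = prediction_variances(10)
fit = a + b * 10
half_mean = T21_975 * math.sqrt(var_mean)
half_ind = T21_975 * math.sqrt(var_ind)
print(f"mean value:  {fit:.2f} ± {half_mean:.2f}")
print(f"individual:  {fit:.2f} ± {half_ind:.2f}")
```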

STATA OUTPUT

ASSUMPTIONS AND HOW TO CHECK THEM

Linear Regression Model: Assumptions
The model is developed assuming that:
the observations of Y are independently collected;
the mean value of the dependent variable Y is a linear
function of the covariate X;
for each value of X, the dependent variable is normally
distributed around α + β·X with constant variance σ².
These are assumptions: they need to be checked.
If not fulfilled, you may need to consider
using another form of the covariate;
using a transformation of the dependent variable; etc.

Checking the Assumptions
Recall, according to the model,
Y = α + β · x + ε,
where ε is normally distributed with N(0, σ²).
We can estimate ε by
e = y – (a + b · x)
These estimates are called residuals.
Σ e²/(n-2) will give an estimate of σ² (the divisor is n-2 because two coefficients, a and b, are estimated).
If the assumptions are correct, the residuals
should approximately have a normal
distribution with mean 0.
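The residuals for the Vcf data can be computed directly. In this hedged sketch (variable names are illustrative) the variance estimate divides by n - 2, the residual degrees of freedom in a model with an intercept and one slope. The residuals of a least squares fit with an intercept always sum to zero, up to rounding error.

```python
import math

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
mean_x, mean_y = sum(glucose) / n, sum(vcf) / n
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x

# Residuals: observed value minus the value predicted by the fitted line
residuals = [y - (a + b * x) for x, y in zip(glucose, vcf)]

# Estimate of the error variance with n - 2 degrees of freedom
s2 = sum(e ** 2 for e in residuals) / (n - 2)

print(f"sum of residuals = {sum(residuals):.1e}")
print(f"estimated variance = {s2:.4f}, standard deviation = {math.sqrt(s2):.3f}")
```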

Analysis of Residuals (1)
Plot the residuals against the observed
covariate values.
If the assumptions are met, the plot should be
evenly scattered for all covariate values.

Analysis of Residuals (2)
The plot of the
residuals may reveal
non-constant
variance
(heteroscedasticity).
It can also point towards a non-linear (w.r.t. the covariate values)
relationship.

Blood Glucose & Vcf: Residuals
The plot looks
reasonable.

Checking Normality of Residuals
To this aim, the normal probability plot is used.
Standardized residuals (residual/st.error) are
ordered and plotted against the values
expected from the standard normal distribution.
The graph should look approximately linear.
One might have doubts in our example…
[Figure: normal probability plot of the residuals; vertical axis Normal F[(resid-m)/s], horizontal axis Empirical P[i] = i/(N+1), both from 0.00 to 1.00]
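The coordinates of such a plot can be sketched with the Python standard library. The assumptions are labeled in the code: residuals come from the least squares fit of Vcf on glucose, m and s are taken as the sample mean and sample standard deviation of the residuals, and the axes follow the labels in the figure (Normal F[(resid-m)/s] against i/(N+1)). This illustrates the idea and is not the software output shown in the lecture.

```python
from statistics import NormalDist, mean, stdev

glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

# Least-squares fit and residuals
n = len(glucose)
mean_x, mean_y = mean(glucose), mean(vcf)
sxx = sum((x - mean_x) ** 2 for x in glucose)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(glucose, vcf))
b = sxy / sxx
a = mean_y - b * mean_x
residuals = [y - (a + b * x) for x, y in zip(glucose, vcf)]

# Assumption: standardize with the sample mean m and sample standard deviation s
m, s = mean(residuals), stdev(residuals)
ordered = sorted(residuals)

# Horizontal axis: empirical P[i] = i / (N + 1) for the i-th smallest residual
empirical = [(i + 1) / (n + 1) for i in range(n)]
# Vertical axis: standard normal distribution function of the standardized residual
theoretical = [NormalDist().cdf((e - m) / s) for e in ordered]

# Under normality the points lie close to the diagonal; report the largest gap
max_gap = max(abs(t - p) for t, p in zip(theoretical, empirical))
for p, t in zip(empirical, theoretical):
    print(f"{p:.3f}  {t:.3f}")
print(f"largest distance from the diagonal: {max_gap:.3f}")
```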

Linear Regression for Log-Vcf
Let us use ln(Vcf) as the dependent variable.
The model changes to
ln(Vcf) = α + β · (glucose level)
The estimated model is
ln(Vcf) = 0.115 + 0.015 · (glucose level)

Model for Log-Vcf: Residuals (1)
No major problems in the residual plot.
[Figure: residuals (-0.4 to 0.4) plotted against blood glucose level (5 to 20)]

Model for Log-Vcf: Residuals
One might argue that the normal probability plot for
the residuals looks better than for untransformed Vcf.
[Figure: normal probability plot of the residuals from the log-Vcf model; vertical axis Normal F[(lresid-m)/s], horizontal axis Empirical P[i] = i/(N+1), both from 0.00 to 1.00]

Interpretation of the Model for Log-Vcf
The model implies that
ln(Vcf) = 0.115 + 0.015 · (glucose level)
It follows that, if blood glucose increases by 1 unit,
then the mean value of ln(Vcf) increases by 0.015.
Upon taking Vcf ≈ exp(ln(Vcf)),
Vcf = e^{0.115} · e^{0.015 · (glucose level)} = e^{0.115} · (1.015)^{(glucose level)}
We could conclude that the mean value of Vcf
is multiplied by exp(0.015) = 1.015 (an increase of about 1.5%) per 1 unit of
blood glucose.
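The multiplicative interpretation can be verified numerically. The sketch uses the coefficients quoted in the lecture (0.115 and 0.015); the function `predicted_vcf` is a hypothetical helper. Strictly speaking, exponentiating a mean on the log scale gives a geometric mean, which is why the lecture writes "≈" for this step.

```python
import math

a, b = 0.115, 0.015   # estimated intercept and slope on the ln(Vcf) scale


def predicted_vcf(glucose_level):
    """Back-transformed prediction exp(a + b * glucose_level)."""
    return math.exp(a + b * glucose_level)


factor = math.exp(b)                             # multiplicative change per unit
ratio = predicted_vcf(11) / predicted_vcf(10)    # same ratio for any starting level
print(f"exp(b) = {factor:.4f}, ratio of predictions = {ratio:.4f}")
```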

Choice of the Transformation
Consider power transformations x^s or y^s
(s = ..., -3, -2, -1, -½, 0 (= ln), ½, ...)
The circle of powers.
Choose the quadrant,
which most closely
resembles the pattern
of the data.
Increase or decrease
the power of x or y
(relative to 1) according
to the indications.
• Example: for Quadrant II,
take s1 for y.

Choice of the Transformation: Example
Data resemble the
pattern of Quadrant
III.
We might want to use
s = 5)

Continuous (e.g. pain scale, cognitive function): Linear regression
Binary or categorical (e.g. fracture yes/no): Logistic regression
Time-to-event (e.g. time to fracture): Kaplan-Meier statistics, Cox regression; correlated: n/a
Cox regression assumes proportional hazards between groups

Continuous outcome
Are the observations independent or correlated?
Outcome variable: Continuous (e.g. pain scale, cognitive function)
Independent:
• T-test: compares means between two independent groups
• ANOVA: compares means between more than two independent groups
• Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
• Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes
Correlated:
• Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
• Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
• Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time
Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
• Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
• Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
• Kruskal-Wallis test: non-parametric alternative to ANOVA
• Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient

Binary or categorical outcomes (proportions)
Are the observations correlated?
Outcome variable: Binary or categorical (e.g. fracture, yes/no)
Independent:
• Chi-square test: compares proportions between two or more groups
• Relative risks: odds ratios or risk ratios
• Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios
Correlated:
• McNemar’s chi-square test: compares binary outcome between correlated groups (e.g., before and after)
• Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
• GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)
Alternative to the chi-square test if sparse cells:
• Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells < 5).
• McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells < 5).

Time-to-event outcome (survival data)
Are the observation groups independent or correlated?
Outcome variable: Time-to-event (e.g., time to fracture)
Independent:
• Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with log-rank test
• Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios
Correlated: n/a (already over time)
Modifications to Cox regression if proportional-hazards is violated:
• Time-dependent predictors or time-dependent hazard ratios (tricky!)