Mathematical Model

An equation or formula, e.g.

E = mc²
y = x^(2/3) cos²(log θ)
V = IR

Ideal gas law: PV = NRT
P: pressure, V: volume, T: temperature, N: number of moles, R: universal gas constant

Q1: Is this relationship true?
Q2: What is the value of the constant R?
Mathematical Model

Answer these questions by a set of measurements (Pi, Vi, Ti, Ni):

PV = NRT
P: pressure, V: volume, T: temperature, N: number of moles
Statistical Model

Errors due to unknown outside factors exist. The observed values p, v, t, n contain unobserved, random measurement errors δ:

p = P + δp,  v = V + δv,  t = T + δt,  n = N + δn

Substituting into the ideal gas law,

pv  = (P + δp)(V + δv) = PV + Pδv + Vδp + δpδv
nRt = (N + δn)R(T + δt) = NRT + R(Nδt + Tδn + δnδt)

Components of the statistical model:
Systematic component: PV = NRT, with an unknown parameter in the systematic component (e.g. the universal gas constant R)
Random component: the unobserved measurement errors
Data: the observed measurements
Statistical Model

Ideal gas law: PV = NRT, with R the universal gas constant (model parameter).
Assumptions: ideal gas; static and closed environment.
Each observed measurement (Pi, Vi, Ti, Ni) gives an estimate of R:

Ri = PiVi / (NiTi)

One-way ANOVA

Compare multiple populations:

Population 1: N(μ1, σ1²), observed data Y11, Y12, …, Y1n1
Population 2: N(μ2, σ2²), observed data Y21, Y22, …, Y2n2
…………..
Population a: N(μa, σa²), observed data Ya1, Ya2, …, Yana

Total sample size: N = Σ ni

Assumptions:
1. Normal
2. Equal variances
3. Independence

One-way ANOVA model:

Yij = μ + αi + εij,  i = 1, 2, …, a,  j = 1, 2, …, ni
εij ~ iid N(0, σ²) : random errors

μ = (1/N) Σ_{i=1}^{a} ni μi : overall population mean (grand mean)
αi = μi − μ : ith treatment effect, with Σ_{i=1}^{a} ni αi = 0
εij = Yij − μi = Yij − μ − αi : random errors

Between group vs within group: observations in a group share the same systematic part μ + αi and differ only through the random errors, e.g. for group 2:

μ + α2 + ε21 = Y21
μ + α2 + ε22 = Y22
………….
μ + α2 + ε2n2 = Y2n2
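The one-way ANOVA model above can be simulated directly; a minimal sketch, where the values of μ, the αi and σ are made-up for illustration:

```python
# Simulating data from the one-way ANOVA model Yij = mu + alpha_i + eps_ij,
# with eps_ij iid N(0, sigma^2). Parameter values here are hypothetical.
import random

random.seed(0)
mu, sigma = 30.0, 2.0
alpha = [-3.0, 1.0, 2.0]   # treatment effects; with equal n_i, sum(alpha) = 0
n_i = 15                   # observations per group
Y = [[mu + a + random.gauss(0.0, sigma) for _ in range(n_i)] for a in alpha]
group_means = [sum(g) / n_i for g in Y]   # estimates of mu + alpha_i
```

Each group mean estimates μ + αi, so the simulated means should land near 27, 31 and 32.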
Test for Treatment Effects

H0: α1 = α2 = ⋯ = αa = 0  vs  H1: some αi ≠ 0

Equivalently:
H0: the population means are all the same  vs  H1: the population means are not all the same
H0: there is no treatment effect  vs  H1: there are treatment effects

Sample means:
Ȳi = (1/ni) Σ_{j=1}^{ni} Yij : ith sample mean
Ȳ : overall sample mean
Break down of sum of squares

Ȳ = (1/N) Σ_{i=1}^{a} Σ_{j=1}^{ni} Yij = (1/N) Σ_{i=1}^{a} ni Ȳi : overall sample mean

Total sum of squares:      SST = Σ_{i=1}^{a} Σ_{j=1}^{ni} (Yij − Ȳ)²
Treatment sum of squares (between-group variation):  SSA = Σ_{i=1}^{a} Σ_{j=1}^{ni} (Ȳi − Ȳ)² = Σ_{i=1}^{a} ni (Ȳi − Ȳ)²
Error sum of squares (within-group variation):       SSE = Σ_{i=1}^{a} Σ_{j=1}^{ni} (Yij − Ȳi)²

Treatment mean squares: MSA = SSA/(a − 1) = (1/(a − 1)) Σ_{i=1}^{a} ni (Ȳi − Ȳ)²
Error mean squares:     MSE = SSE/(N − a) = (1/(N − a)) Σ_{i=1}^{a} Σ_{j=1}^{ni} (Yij − Ȳi)²

When H1 is true (the μi are not all the same), the variation of the Ȳi around Ȳ is large, so MSA tends to be large; MSE is unaffected by the population means.
Test for Treatment Effects

Writing Yij − Ȳ = (Ȳi − Ȳ) + (Yij − Ȳi) and squaring, the cross terms sum to zero, so

SST = Σ_{i=1}^{a} Σ_{j=1}^{ni} (Yij − Ȳ)² = Σ_{i=1}^{a} Σ_{j=1}^{ni} (Ȳi − Ȳ)² + Σ_{i=1}^{a} Σ_{j=1}^{ni} (Yij − Ȳi)² = SSA + SSE

Test statistic: F = MSA / MSE
Reject H0 if Fobs > F(a − 1, N − a, α), obtained from the F distribution table.

F Distribution

X ~ F(r1, r2), with density

f(x) = [Γ((r1 + r2)/2) / (Γ(r1/2) Γ(r2/2))] (r1/r2)^(r1/2) x^(r1/2 − 1) (1 + r1 x / r2)^(−(r1 + r2)/2),  x > 0

E(X) = r2 / (r2 − 2)
Var(X) = 2 r2² (r1 + r2 − 2) / [r1 (r2 − 2)² (r2 − 4)]

[Plot: F densities for (r1, r2) = (2, 4), (4, 6), (9, 9), (12, 12), x from 0 to 5.]
F Distribution Table

For X ~ F(r1, r2), the table gives the upper-α point F(r1, r2, α), the value cut off with probability α in the upper tail.

[Plot: F(r1, r2) density with upper-tail area α to the right of F(r1, r2, α).]

F(3, 4, 0.05) = 6.59
F(4, 6, 0.01) = 9.15
ANOVA Table

H0: α1 = α2 = ⋯ = αa = 0  vs  H1: some αi ≠ 0
Test statistic F = MSA / MSE; reject H0 if Fobs > F(a − 1, N − a, α).

Source    | SS  | d.f. | MS         | F-ratio
Treatment | SSA | a−1  | SSA/(a−1)  | MSA/MSE
Error     | SSE | N−a  | SSE/(N−a)  |
Total     | SST | N−1  |            |

Computational Formulae

Ti = Σ_{j=1}^{ni} Yij : ith total;  T.. = Σ_{i=1}^{a} Ti : overall total
SST = Σ_{i=1}^{a} Σ_{j=1}^{ni} Yij² − T..²/N
SSA = Σ_{i=1}^{a} Ti²/ni − T..²/N
SSE = SST − SSA = Σ_{i=1}^{a} Σ_{j=1}^{ni} Yij² − Σ_{i=1}^{a} Ti²/ni
One-way ANOVA

Example: Color brightness of films. a = 3 brands, n1 = n2 = n3 = 15, N = 45.
H0: α1 = α2 = α3 = 0 vs H1: not H0, at α = 0.05.

Brand | Data                                                           | Ti
Kodak | 32, 34, 31, 30, 37, 28, 28, 27, 30, 32, 26, 29, 27, 30, 31     | T1 = 452
Fuji  | 44, 50, 47, 32, 32, 36, 35, 34, 32, 38, 38, 40, 36, 43, 41     | T2 = 578
Agfa  | 23, 24, 25, 21, 26, 25, 27, 26, 22, 25, 27, 30, 25, 25, 27     | T3 = 378

T.. = 452 + 578 + 378 = 1408,  ΣΣ Yij² = 46040

SST = ΣΣ Yij² − T..²/N = 46040 − 1408²/45 = 1985.24
SSA = Σ Ti²/ni − T..²/N = (452² + 578² + 378²)/15 − 1408²/45 = 1363.38
SSE = SST − SSA = 1985.24 − 1363.38 = 621.86

Source    | SS      | d.f. | MS     | F-ratio
Treatment | 1363.38 | 2    | 681.69 | 46.03
Error     | 621.86  | 42   | 14.81  |
Total     | 1985.24 | 44   |        |

From the F distribution table, F(2, 42, 0.05) ≈ F(2, 40, 0.05) = 3.23.
F-ratio = 46.03 > 3.23, so reject H0 at α = 0.05.
The color brightness of the three brands of film differs significantly.
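The film-brightness ANOVA can be checked in a few lines; a sketch in pure Python (no statistics library), reproducing the totals and the F-ratio:

```python
# Recomputing the one-way ANOVA for the film-brightness example
# using the computational formulae.
kodak = [32, 34, 31, 30, 37, 28, 28, 27, 30, 32, 26, 29, 27, 30, 31]
fuji  = [44, 50, 47, 32, 32, 36, 35, 34, 32, 38, 38, 40, 36, 43, 41]
agfa  = [23, 24, 25, 21, 26, 25, 27, 26, 22, 25, 27, 30, 25, 25, 27]
groups = [kodak, fuji, agfa]

N = sum(len(g) for g in groups)                 # total sample size, 45
T = sum(sum(g) for g in groups)                 # overall total T.. = 1408
sum_sq = sum(y * y for g in groups for y in g)  # sum of Yij^2 = 46040

SST = sum_sq - T ** 2 / N                                     # total SS
SSA = sum(sum(g) ** 2 / len(g) for g in groups) - T ** 2 / N  # treatment SS
SSE = SST - SSA                                               # error SS
a = len(groups)
MSA, MSE = SSA / (a - 1), SSE / (N - a)
F = MSA / MSE    # ≈ 46.0, far above F(2, 42, 0.05) ≈ 3.23
```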
Estimation

Treatment effect αi:
Point estimate:    Ȳi − Ȳ
Interval estimate: (Ȳi − Ȳ) ± t_{N−a, α/2} √(MSE (1/ni − 1/N))

Difference in treatment effects αi − αj:
Point estimate: Ȳi − Ȳj

Example: Color brightness of films.
Ȳ1 = 452/15 = 30.13,  Ȳ2 = 578/15 = 38.53,  Ȳ3 = 378/15 = 25.2,  Ȳ = 1408/45 = 31.29

95% C.I. for α1:       (30.13 − 31.29) ± (2.021)√(14.81(1/15 − 1/45)) = −1.16 ± 1.64 = [−2.80, 0.48]
95% C.I. for α2 − α3:  (38.53 − 25.2) ± (2.021)√(14.81(1/15 + 1/15)) = 13.33 ± 2.84 = [10.49, 16.17]  ⇒  α2 > α3
Estimation

Interval estimate for αi − αj: (Ȳi − Ȳj) ± t_{N−a, α/2} √(MSE (1/ni + 1/nj))

95% C.I. for α1 − α2: [−11.24, −5.56]  ⇒  α1 < α2
95% C.I. for α1 − α3: [2.09, 7.77]     ⇒  α1 > α3

Together: α2 > α1 > α3.
Note: the overall confidence of the three intervals jointly is < 95%.
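One of the intervals above can be reproduced directly; a sketch for the Fuji vs Agfa comparison, using MSE from the ANOVA table and the t-table value 2.021:

```python
# 95% confidence interval for alpha_2 - alpha_3 (Fuji vs Agfa) in the
# film example: (Ybar2 - Ybar3) +/- t(42, 0.025) * sqrt(MSE*(1/15 + 1/15)).
import math

ybar2, ybar3 = 578 / 15, 378 / 15   # Fuji and Agfa sample means
MSE = 621.87 / 42                   # error mean square from the ANOVA table
t_crit = 2.021                      # t(42, 0.025), read from a t table
half = t_crit * math.sqrt(MSE * (1 / 15 + 1 / 15))
lo, hi = ybar2 - ybar3 - half, ybar2 - ybar3 + half
# the interval lies entirely above 0, so alpha_2 > alpha_3
```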
Two-way ANOVA

Example: Brightness of synthetic fabric.

Time (cycles) | 350°F      | 375°F      | 400°F
40            | 38, 32, 30 | 37, 35, 40 | 36, 39, 43
50            | 40, 45, 36 | 39, 42, 46 | 39, 48, 47

Two-way factorial ANOVA model:

Yijk = μ + αi + βj + γij + εijk,  i = 1, 2 (Time), j = 1, 2, 3 (Temperature), k = 1, 2, 3
Σi αi = Σj βj = Σi γij = Σj γij = 0
εijk ~ iid N(0, σ²)
γij : interaction

Minitab session:
MTB > print 'Bright' 'Time' 'Temp'

Data Display
Row Bright Time Temp
1   38     40   350
2   32     40   350
3   30     40   350
4   37     40   375
5   35     40   375
6   40     40   375
7   36     40   400
8   39     40   400
9   43     40   400
10  40     50   350
11  45     50   350
12  36     50   350
……………………………

MTB > ANOVA 'Bright' = Time Temp Time*Temp.

Analysis of Variance (Balanced Designs)

Factor Type  Levels Values
Time   fixed 2      40 50
Temp   fixed 3      350 375 400

Analysis of Variance for Bright

Source    DF  SS      MS      F     P
Time      1   150.22  150.22  9.69  0.009   (significant)
Temp      2   80.78   40.39   2.61  0.115
Time*Temp 2   3.44    1.72    0.11  0.896
Error     12  186.00  15.50
Total     17  420.44
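The sums of squares in this balanced two-way layout can be reproduced by hand from row, column and cell means; a sketch:

```python
# Balanced two-way ANOVA sums of squares for the fabric-brightness data.
data = {  # (time, temp) -> replicates
    (40, 350): [38, 32, 30], (40, 375): [37, 35, 40], (40, 400): [36, 39, 43],
    (50, 350): [40, 45, 36], (50, 375): [39, 42, 46], (50, 400): [39, 48, 47],
}
times, temps, r = [40, 50], [350, 375, 400], 3
N = r * len(times) * len(temps)
grand = sum(sum(v) for v in data.values()) / N

def mean(xs):
    return sum(xs) / len(xs)

time_mean = {i: mean([y for (ti, tp), ys in data.items() if ti == i for y in ys]) for i in times}
temp_mean = {j: mean([y for (ti, tp), ys in data.items() if tp == j for y in ys]) for j in temps}
cell_mean = {k: mean(v) for k, v in data.items()}

SS_time = r * len(temps) * sum((time_mean[i] - grand) ** 2 for i in times)   # 150.22
SS_temp = r * len(times) * sum((temp_mean[j] - grand) ** 2 for j in temps)   # 80.78
SS_int = r * sum((cell_mean[(i, j)] - time_mean[i] - temp_mean[j] + grand) ** 2
                 for i in times for j in temps)                              # 3.44
SS_err = sum((y - cell_mean[k]) ** 2 for k, v in data.items() for y in v)    # 186.00
F_time = (SS_time / 1) / (SS_err / 12)                                       # ≈ 9.69
```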
[Interaction plots: group mean brightness vs Temperature (350, 375, 400) with lines for Time = 40 and Time = 50; parallel lines indicate an additive model, non-parallel lines indicate interaction.]

Regression

Sir Francis Galton (1822 – 1911)

[Scatterplot: Height of Son vs Height of Father.]

The heights of the sons of fathers regressed towards the mean height of the population.

Simple Linear Regression

Model the relationship between a dependent variable and independent variable(s); simple linear regression has one independent variable.

Examples:
Dependent variable (Y) | Independent variable (X)
Job performance        | Extent of training
Return of a stock      | Risk of the stock
Overall CGA            | A-Level Score
Tree age (by C14)      | Tree age (by tree rings)

[Scatterplot with regression line: a line that fits the data well.]
Simple Linear Regression Model

Data: {(X1, Y1), (X2, Y2), …, (Xn, Yn)}

Yi = α + βXi + εi,  εi ~ iid N(0, σ²),  i = 1, 2, …, n  (assumptions)

Example: Y = height of son (in cm), X = height of father (in cm).

Suppose the true relation were given exactly by Y = 0.9X + 15. Then fathers with the same height would all have sons with the same height. Unrealistic!

More reasonable: the relationship holds on average, E(Y) = 0.9X + 15, with a random error ε = Y − E(Y):

X   | E(Y) = 0.9X + 15 | Y (observed) | ε (random error)
170 | 168              | 169.3        | 1.3
175 | 172.5            | 171.7        | −0.8
180 | 177              | 174.6        | −2.4
185 | 181.5            | 182.2        | 0.7

X and Y are observed; E(Y) and ε are unobserved. We therefore estimate the regression line from the observed data, i.e. fit a regression line to the data.
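The random-error column of the table can be reproduced directly from the assumed true line; a small sketch:

```python
# Random errors eps = Y - E(Y) for the father/son illustration,
# with the assumed true line E(Y) = 0.9 X + 15.
X = [170, 175, 180, 185]            # father's height (cm)
Y = [169.3, 171.7, 174.6, 182.2]    # observed son's height (cm)
EY = [0.9 * x + 15 for x in X]      # systematic part (unobserved in practice)
eps = [round(y - m, 1) for y, m in zip(Y, EY)]   # random errors
```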
Estimation of Model Parameters

Sample statistics:

X̄ = (1/n) Σ_{i=1}^{n} Xi,  Ȳ = (1/n) Σ_{i=1}^{n} Yi
Sxx = Σ (Xi − X̄)² = Σ Xi² − nX̄²
Syy = Σ (Yi − Ȳ)² = Σ Yi² − nȲ²
Sxy = Σ (Xi − X̄)(Yi − Ȳ) = Σ XiYi − nX̄Ȳ

Least squares estimates:
β̂ = b = Sxy / Sxx
α̂ = a = Ȳ − bX̄

Fitting Regression Line

Example: Study of how wheat yield depends on fertilizer.
X = fertilizer (in lb/acre), Y = yield (in bu/acre)

X | 100 | 200 | 300 | 400 | 500 | 600 | 700
Y | 40  | 50  | 50  | 70  | 65  | 65  | 80
Fitting Regression Line

n = 7, X̄ = 400, Ȳ = 60
Σ Xi² = 1400000,  Σ Yi² = 26350,  Σ XiYi = 184500

Sxx = Σ Xi² − nX̄² = 1400000 − (7)(400)² = 280000
Syy = Σ Yi² − nȲ² = 26350 − (7)(60)² = 1150
Sxy = Σ XiYi − nX̄Ȳ = 184500 − (7)(400)(60) = 16500

b = Sxy/Sxx = 16500/280000 = 0.059

Fitted line: Ŷ = 36.43 + 0.059X
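These least-squares computations can be checked in a few lines of pure Python:

```python
# Least-squares fit for the wheat-yield example, via the Sxx/Sxy formulas.
X = [100, 200, 300, 400, 500, 600, 700]   # fertilizer (lb/acre)
Y = [40, 50, 50, 70, 65, 65, 80]          # yield (bu/acre)
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum(x * x for x in X) - n * xbar ** 2                # 280000
Sxy = sum(x * y for x, y in zip(X, Y)) - n * xbar * ybar   # 16500
b = Sxy / Sxx            # slope ≈ 0.0589 (the slides round to 0.059)
a = ybar - b * xbar      # intercept ≈ 36.43
```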
True regression line:   E(Y) = α + βX  (unknown)
Fitted regression line: Ŷ = a + bX, where in general a ≠ α and b ≠ β
Here a = Ȳ − bX̄ = 60 − (0.059)(400) = 36.43.

Prediction

Fitted regression line: Ŷ = 36.43 + 0.059X
At X0 = 400: Ŷ0 = 36.43 + (0.059)(400) = 60.03
At X0 = 650: Ŷ0 = 36.43 + (0.059)(650) = 74.78
At X0 = 0:   Ŷ0 = 36.43 ?
Danger of Extrapolation

[Charts: SARS trend, number of cases and number of patients in hospital vs. date (28-Feb through 8-Jun); trend curves fitted to the early data and extrapolated beyond the observed range diverge sharply from the later observations.]
Association ≠ Causation
Simpson's Paradox

Example: Price and demand for gas.

Year   | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969
Price  | 30   | 31   | 37   | 42   | 43   | 45   | 50   | 54   | 54   | 57
Demand | 134  | 112  | 136  | 109  | 105  | 87   | 56   | 43   | 77   | 35

Year   | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979
Price  | 58   | 58   | 60   | 73   | 88   | 89   | 92   | 97   | 100  | 102
Demand | 65   | 56   | 58   | 55   | 49   | 39   | 36   | 46   | 40   | 42

[Scatterplot of Demand vs Price, with years grouped: 1960–1965, 1966–1973, 1974–1979.]

Fitted regression line: Demand = 139.24 − 1.11 Price

? Low demand is due to high price. ?
Test For Regression Effect

Fitted values: Ŷi = a + bXi
Residuals:     ri = Yi − Ŷi  ≠  εi = Yi − α − βXi (the random error)

Test H0: β = 0 vs H1: β ≠ 0.

Decomposition of variation:

Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)
(variation of Y = explained variation + unexplained variation)

Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Ŷi − Ȳ)² + Σ_{i=1}^{n} (Yi − Ŷi)²
SST = SSR + SSE
(total sum of squares = regression sum of squares + error sum of squares)

SST = Syy
SSR = Σ (Ŷi − Ȳ)² = Σ (a + bXi − Ȳ)² = Σ (bXi − bX̄)² = b² Sxx = Sxy²/Sxx
SSE = SST − SSR = Syy − b² Sxx = Syy − Sxy²/Sxx
Test For Regression Effect

MSR = SSR/1 = SSR,  MSE = SSE/(n − 2)
Test statistic F = MSR / MSE; reject H0 if Fobs > F(1, n − 2, α).

ANOVA table:
Source     | SS  | d.f. | MS         | F-ratio
Regression | SSR | 1    | SSR        | MSR/MSE
Error      | SSE | n−2  | SSE/(n−2)  |
Total      | SST | n−1  |            |

Example: Wheat yield example. Regression line Ŷ = 36.43 + 0.059X; Sxx = 280000, Sxy = 16500, Syy = 1150.

SST = Syy = 1150
SSR = b² Sxx = (0.059)²(280000) = 974.68
SSE = SST − SSR = 1150 − 974.68 = 175.32

Source     | SS     | d.f. | MS     | F-ratio
Regression | 974.68 | 1    | 974.68 | 27.805
Error      | 175.32 | 5    | 35.064 |
Total      | 1150   | 6    |        |

F(1, 5, 0.05) = 6.61 < 27.805, so reject H0 at α = 0.05.
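The F test above can be verified directly. Note that the slides round b to 0.059 (giving SSR = 974.68 and F = 27.805); exact arithmetic gives slightly different numbers but the same conclusion:

```python
# F test for regression effect in the wheat example, exact arithmetic.
Sxx, Sxy, Syy, n = 280000, 16500, 1150, 7
SSR = Sxy ** 2 / Sxx        # explained (regression) sum of squares ≈ 972.32
SSE = Syy - SSR             # unexplained (error) sum of squares
MSE = SSE / (n - 2)
F = (SSR / 1) / MSE         # ≈ 27.4 (27.805 with the rounded b)
R2 = SSR / Syy              # coefficient of determination ≈ 0.85
reject = F > 6.61           # F(1, 5, 0.05) = 6.61 from the table
```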
Coefficient of Determination

R² = SSR / SST = explained variation / total variation,  0 ≤ R² ≤ 1

R² = 1: perfect linear relationship; R² = 0: no linear relationship.
A large R² indicates a strong linear relationship and high prediction power.

Example: R² = 974.68 / 1150 = 84.8%
C.I. For Regression Parameters

100(1 − α)% C.I. for β:  b ± t_{n−2, α/2} √(MSE / Sxx)
100(1 − α)% C.I. for α:  a ± t_{n−2, α/2} √(MSE (1/n + X̄²/Sxx))

A large Sxx gives more accurate estimates.

Example: Wheat yield example. Regression line Ŷ = 36.43 + 0.059X.

Source     | SS     | d.f. | MS     | F-ratio
Regression | 974.68 | 1    | 974.68 | 27.805
Error      | 175.32 | 5    | 35.064 |
Total      | 1150   | 6    |        |

95% C.I. for β: 0.059 ± (2.571)√(35.064/280000) = 0.059 ± 0.0288 = [0.0302, 0.0878]
95% C.I. for α: 36.43 ± (2.571)√(35.064(1/7 + 400²/280000)) ≈ 36.43 ± 12.87 ≈ [23.56, 49.30]
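The slope interval can be reproduced with the slide's rounded values b = 0.059, MSE = 35.064 and t(5, 0.025) = 2.571:

```python
# 95% C.I. for the slope beta in the wheat example.
import math

b, MSE, Sxx, t_crit = 0.059, 35.064, 280000, 2.571
half = t_crit * math.sqrt(MSE / Sxx)
lo, hi = b - half, b + half    # ≈ [0.0302, 0.0878]; the interval excludes 0
```

The interval excluding 0 agrees with the F test's rejection of H0: β = 0.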
Prediction

Predict the value of Y0 at a fixed value X = X0.

Point prediction: Ŷ0 = a + bX0

100(1 − α)% prediction interval (P.I.):
Ŷ0 ± t_{n−2, α/2} √(MSE (1 + 1/n + (X0 − X̄)²/Sxx))

Example: Wheat yield example. Regression line Ŷ = 36.43 + 0.059X, MSE = 35.064.

At X0 = 450: Ŷ0 = 36.43 + (0.059)(450) = 62.98

90% prediction interval:
62.98 ± (2.015)√(35.064(1 + 1/7 + (450 − 400)²/280000)) = 62.98 ± 12.81 = [50.17, 75.79]
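The prediction interval can be checked directly from the formula, using the slide's rounded values:

```python
# 90% prediction interval at X0 = 450 for the wheat example, with the
# fitted line 36.43 + 0.059 X, MSE = 35.064 and t(5, 0.05) = 2.015.
import math

a, b, MSE, Sxx, n, xbar = 36.43, 0.059, 35.064, 280000, 7, 400
X0 = 450
Y0 = a + b * X0                       # point prediction ≈ 62.98
half = 2.015 * math.sqrt(MSE * (1 + 1 / n + (X0 - xbar) ** 2 / Sxx))
lo, hi = Y0 - half, Y0 + half         # ≈ [50.17, 75.79]
```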
Multiple Linear Regression

Example: Fuel consumption data.

Model: FUEL = β0 + β1 TAX + β2 DLIC + β3 INC + β4 ROAD + ε

Data Display (first 12 rows shown)

Row State POP   TAX   NLIC INC   ROAD   FUELC DLIC
1   ME    1029  9.00  540  3.571 1.976  557   52.4781
2   NH    771   9.00  441  4.092 1.250  404   57.1984
3   VT    462   9.00  268  3.865 1.586  259   58.0087
4   MA    5787  7.50  3060 4.870 2.351  2396  52.8771
5   RI    968   8.00  527  4.399 0.431  397   54.4422
6   CN    3082  10.00 1760 5.342 1.333  1408  57.1058
7   NY    18366 8.00  8278 5.319 11.868 6312  45.0724
8   NJ    7367  8.00  4074 5.126 2.138  3439  55.3007
9   PA    11926 8.00  6312 4.447 8.577  5528  52.9264
10  OH    10783 7.00  5948 4.512 8.507  5375  55.1609
11  IN    5291  8.00  2804 4.391 5.939  3068  52.9957
12  IL    11251 7.50  5903 5.126 14.186 5301  52.4664
…………………………………………………..
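The coefficients in a multiple-regression fit like the one below are the solution of the normal equations (XᵀX)β = Xᵀy. A self-contained sketch on a small hypothetical dataset (not the fuel data, whose full set of rows is not listed here):

```python
# Multiple linear regression via the normal equations, solved with
# Gaussian elimination in pure Python. The data below are hypothetical.
def fit(rows, y):
    X = [[1.0] + list(r) for r in rows]           # add intercept column
    p = len(X[0])
    XtX = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(p)]
           for i in range(p)]
    Xty = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(p)]
    A = [row[:] + [v] for row, v in zip(XtX, Xty)]  # augmented system
    for c in range(p):                              # forward elimination
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    beta = [0.0] * p                                # back substitution
    for i in reversed(range(p)):
        beta[i] = (A[i][p] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    return beta

# y is generated exactly as 2 + 3*x1 - 1*x2, so the fit recovers it
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (2, 3)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in rows]
beta = fit(rows, y)   # ≈ [2.0, 3.0, -1.0]
```

In practice a statistical package (like the Minitab session below) also reports standard errors, t-ratios and the ANOVA decomposition.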
Multiple Linear Regression

Example: Fuel consumption data — Minitab output.

Regression Analysis

The regression equation is
FUEL = 37.7 - 3.48 TAX + 1.34 DLIC - 6.65 INC - 0.242 ROAD

Predictor  Coef     Stdev   t-ratio  p
Constant   37.68    18.57   2.03     0.049
TAX        -3.478   1.298   -2.68    0.010
DLIC       1.3366   0.1924  6.95     0.000
INC        -6.651   1.723   -3.86    0.000
ROAD       -0.2417  0.3391  -0.71    0.480

s = 6.633   R-sq = 67.8%   R-sq(adj) = 64.9%

Analysis of Variance

SOURCE      DF  SS       MS      F      p
Regression  4   3991.92  997.98  22.68  0.000
Error       43  1892.05  44.00
Total       47  5883.96

Unusual Observations
Obs.  TAX  FUEL    Fit     Stdev.Fit  Residual  St.Resid
37    5.0  63.963  64.758  3.723      -0.795    -0.14 X
40    7.0  96.812  73.371  2.102      23.441    3.73 R

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.