Mathematical Model

Ideal gas law : PV = NRT
  P : pressure
  V : volume
  T : temperature
  N : number of moles
  R : universal gas constant

Other familiar equations and formulae: E = mc^2,  V = IR,  y = x^{2/3} cos^2(log θ).

Q1 : Is this relationship true?
Q2 : What is the value of the constant R?

Answer these questions by a set of measurements: (Pi, Vi, Ti, Ni), i = 1, 2, ...
Statistical Model

Errors due to unknown outside factors exist. Write each measured quantity as its true value plus an unobserved random measurement error:

  p = P + δp,  v = V + δv,  t = T + δt,  n = N + δn

Substituting the measured values into the ideal gas law,

  pv = (P + δp)(V + δv) = PV + Pδv + Vδp + δpδv
  nRt = R(N + δn)(T + δt) = NRT + NRδt + RTδn + Rδnδt

so the data satisfy the law only up to error terms:

  pv − nRt = (PV − NRT) + (Pδv + Vδp + δpδv) − (NRδt + RTδn + Rδnδt)

Systematic component: PV = NRT, containing the unknown parameter in the systematic component (e.g. the universal gas constant R).
Random component: the unobserved measurement errors δp, δv, δt, δn.
Data: the observed measurements.
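The statistical model above can be exercised numerically. A minimal sketch, with all values my own illustrative assumptions (R = 8.314 J/(mol·K), 1% Gaussian measurement errors): each noisy measurement gives an estimate Ri = pivi/(niti), and averaging many of them recovers R.

```python
import random

random.seed(0)

R_TRUE = 8.314  # J/(mol*K), universal gas constant

# Simulate noisy measurements (p, v, t, n) around true values satisfying
# PV = NRT, then estimate R by averaging R_i = p*v / (n*t).
estimates = []
for _ in range(1000):
    N, T = 1.0, 300.0             # true moles and temperature (illustrative)
    V = 0.05                      # true volume in m^3 (illustrative)
    P = N * R_TRUE * T / V        # true pressure from the gas law
    p = P + random.gauss(0, 0.01 * P)   # measured value = true value + error
    v = V + random.gauss(0, 0.01 * V)
    t = T + random.gauss(0, 0.01 * T)
    n = N + random.gauss(0, 0.01 * N)
    estimates.append(p * v / (n * t))

R_hat = sum(estimates) / len(estimates)   # close to 8.314
```

With 1% errors the individual Ri scatter around R; the average of 1000 of them pins R down to within a few hundredths.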

For the gas law, each measurement gives an estimate of the model parameter:

  Ri = PiVi / (NiTi)

Model parameter: R, the universal gas constant.
Assumptions: ideal gas; static and closed environment.
Observed data: (Pi, Vi, Ti, Ni).

One-way ANOVA

Compare multiple populations. Data from a groups:

  Y11, Y12, ..., Y1n1   from N(μ1, σ1²)
  Y21, Y22, ..., Y2n2   from N(μ2, σ2²)
  ..........
  Ya1, Ya2, ..., Yana   from N(μa, σa²)

Total sample size: N = Σ_{i=1}^{a} ni.

Assumptions:
1. Normality
2. Equal variances (σ1² = σ2² = ... = σa² = σ²)
3. Independence

ANOVA model:

  Yij = μ + αi + εij,   i = 1, 2, ..., a;  j = 1, 2, ..., ni
  εij ~ iid N(0, σ²)    (unobserved random errors)

μ : overall population mean (grand mean), μ = (1/N) Σ_{i=1}^{a} ni μi
αi = μi − μ : ith treatment effect, with the constraint Σ_{i=1}^{a} ni αi = 0
εij = Yij − μi = Yij − μ − αi : random errors

Between group: the treatment means μ + α1, μ + α2, ..., μ + αa.
Within group: the observations scattered around each treatment mean, e.g.

  μ + α2 + ε21 = Y21
  μ + α2 + ε22 = Y22
  ............
  μ + α2 + ε2n2 = Y2n2
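To make the model concrete, here is a small simulation of one-way ANOVA data. All parameter values are my own illustrative assumptions, not from the slides: a = 3 groups with μ = 30, treatment effects (−2, 0, 2), which satisfy Σ ni αi = 0 for equal ni, and σ = 2.

```python
import random

random.seed(1)

# Illustrative (made-up) parameters: a = 3 groups
mu = 30.0
alpha = [-2.0, 0.0, 2.0]      # treatment effects; equal n_i, so sum(n_i * alpha_i) = 0
n_i = [200, 200, 200]
sigma = 2.0

# Generate Y_ij = mu + alpha_i + eps_ij with eps_ij ~ iid N(0, sigma^2)
groups = [[mu + a_i + random.gauss(0, sigma) for _ in range(n)]
          for a_i, n in zip(alpha, n_i)]

# Each group's sample mean estimates its treatment mean mu + alpha_i
group_means = [sum(g) / len(g) for g in groups]
```

The three group means land near 28, 30 and 32, i.e. near μ + αi, while the spread within each group reflects σ.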

Test for Treatment Effects

  H0 : α1 = α2 = ... = αa = 0   vs   H1 : some αi ≠ 0

Equivalently:
  H0 : the population means are all the same   vs   H1 : the population means are not all the same;
  H0 : there is no treatment effect            vs   H1 : there are treatment effects.
ith sample mean:      Ȳi = (1/ni) Σ_{j=1}^{ni} Yij
Overall sample mean:  Ȳ = (1/N) Σ_{i=1}^{a} Σ_{j=1}^{ni} Yij = (1/N) Σ_{i=1}^{a} ni Ȳi

Breakdown of sum of squares:

  Σ_{i=1}^{a} Σ_{j=1}^{ni} (Yij − Ȳ)² = Σ_{i=1}^{a} Σ_{j=1}^{ni} [(Ȳi − Ȳ) + (Yij − Ȳi)]²
                                      = Σ_{i=1}^{a} Σ_{j=1}^{ni} (Ȳi − Ȳ)² + Σ_{i=1}^{a} Σ_{j=1}^{ni} (Yij − Ȳi)²

  i.e.  SST = SSA + SSE

Total sum of squares:                                      SST = ΣΣ (Yij − Ȳ)²
Treatment sum of squares (between-group variation):        SSA = ΣΣ (Ȳi − Ȳ)² = Σ ni (Ȳi − Ȳ)²
Error sum of squares (within-group variation):             SSE = ΣΣ (Yij − Ȳi)²

Treatment mean squares:  MSA = SSA/(a − 1) = (1/(a − 1)) Σ ni (Ȳi − Ȳ)²
Error mean squares:      MSE = SSE/(N − a) = (1/(N − a)) ΣΣ (Yij − Ȳi)²

When H1 is true (the μi are not all the same), the variation of the Ȳi around Ȳ is large, so MSA tends to be large; MSE is unaffected by the population means.

Test for Treatment Effects

Test statistic:  F = MSA / MSE

F Distribution

X ~ F(r1, r2) has density

  f(x) = [Γ((r1 + r2)/2) / (Γ(r1/2) Γ(r2/2))] (r1/r2)^{r1/2} x^{r1/2 − 1} (1 + r1 x / r2)^{−(r1 + r2)/2},   x > 0

[Figure: F densities for (r1, r2) = (2, 4), (4, 6), (9, 9), (12, 12).]

Mean and variance (for r2 > 2 and r2 > 4 respectively):

  E(X) = r2 / (r2 − 2)
  Var(X) = 2 r2² (r1 + r2 − 2) / [r1 (r2 − 2)² (r2 − 4)]

Under H1, F = MSA/MSE tends to be large, so reject H0 if Fobs is too large:
reject H0 if Fobs > F(a − 1, N − a, α), the critical value obtained from the F distribution table.

F Distribution Table

F(r1, r2, α) is the point that leaves upper-tail area α under the F(r1, r2) density.

Examples:  F(3, 4, 0.05) = 6.59,   F(4, 6, 0.01) = 9.15.
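Table values like these can be checked directly from the density above by numerical integration. A crude trapezoid-rule sketch (the upper cutoff of 2000 and the 200,000 steps are arbitrary choices of mine, good enough here because the tail decays fast):

```python
from math import gamma

def f_density(x, r1, r2):
    # F(r1, r2) density as given above
    c = gamma((r1 + r2) / 2) / (gamma(r1 / 2) * gamma(r2 / 2)) * (r1 / r2) ** (r1 / 2)
    return c * x ** (r1 / 2 - 1) * (1 + r1 * x / r2) ** (-(r1 + r2) / 2)

def upper_tail(q, r1, r2, hi=2000.0, steps=200_000):
    # Trapezoid integration of the density from q to hi (tail beyond hi is negligible)
    h = (hi - q) / steps
    total = 0.5 * (f_density(q, r1, r2) + f_density(hi, r1, r2))
    for k in range(1, steps):
        total += f_density(q + k * h, r1, r2)
    return total * h

# Check the tabled critical value F(3, 4, 0.05) = 6.59:
p = upper_tail(6.59, 3, 4)   # close to 0.05
```

The computed tail area at 6.59 agrees with the tabled 5% level.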

ANOVA Table

H0 : α1 = α2 = ... = αa = 0   vs   H1 : some αi ≠ 0
Test statistic F = MSA/MSE; reject H0 if Fobs > F(a − 1, N − a, α).

Source      SS    d.f.    MS            F-ratio
Treatment   SSA   a − 1   SSA/(a − 1)   MSA/MSE
Error       SSE   N − a   SSE/(N − a)
Total       SST   N − 1

Computational Formulae

Ti = Σ_{j=1}^{ni} Yij (ith group total),   T.. = Σ_{i=1}^{a} Ti (overall total),   N = Σ ni

  SST = ΣΣ Yij² − T..²/N
  SSA = Σ_{i=1}^{a} Ti²/ni − T..²/N
  SSE = SST − SSA = ΣΣ Yij² − Σ_{i=1}^{a} Ti²/ni
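The computational formulae translate directly into code. A minimal sketch (the function name `one_way_anova` and the toy data are my own):

```python
def one_way_anova(groups):
    # groups: list of lists of observations, one list per treatment
    a = len(groups)
    N = sum(len(g) for g in groups)
    T = [sum(g) for g in groups]              # group totals T_i
    T_all = sum(T)                            # overall total T..
    sum_sq = sum(y * y for g in groups for y in g)

    SST = sum_sq - T_all ** 2 / N             # total sum of squares
    SSA = sum(t ** 2 / len(g) for t, g in zip(T, groups)) - T_all ** 2 / N
    SSE = SST - SSA

    MSA = SSA / (a - 1)
    MSE = SSE / (N - a)
    return SST, SSA, SSE, MSA / MSE

# Tiny hand-checkable example: group means 2 and 4, grand mean 3
SST, SSA, SSE, F = one_way_anova([[1, 2, 3], [3, 4, 5]])
# SST = 10, SSA = 6, SSE = 4, F = (6/1)/(4/4) = 6
```

Compare Fobs = 6 with F(1, 4, α) from the table to complete the test.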

One-way ANOVA

Example : Color brightness of films.  a = 3 brands, n1 = n2 = n3 = 15, N = 45.

Brand   Data                                                          Ti
Kodak   32, 34, 31, 30, 37, 28, 28, 27, 30, 32, 26, 29, 27, 30, 31   452
Fuji    44, 50, 47, 43, 41, 32, 32, 36, 35, 34, 32, 38, 38, 40, 36   578
Agfa    23, 24, 25, 21, 26, 25, 27, 26, 22, 25, 27, 30, 25, 25, 27   378

T.. = 452 + 578 + 378 = 1408,   ΣΣ Yij² = 46040

H0 : α1 = α2 = α3 = 0   vs   H1 : not H0,  at α = 0.05.

  SST = ΣΣ Yij² − T..²/N = 46040 − 1408²/45 = 1985.24
  SSA = Σ Ti²/ni − T..²/N = 452²/15 + 578²/15 + 378²/15 − 1408²/45 = 1363.38
  SSE = SST − SSA = 1985.24 − 1363.38 = 621.86

Source      SS        d.f.   MS       F-ratio
Treatment   1363.38   2      681.69   46.03
Error       621.86    42     14.81
Total       1985.24   44

From the F distribution table, F(2, 42, 0.05) ≈ F(2, 40, 0.05) = 3.23.
F-ratio = 46.03 > 3.23, so reject H0 at α = 0.05.
The color brightness of the three brands of film is significantly different.

Estimation

Treatment effect αi:
  Point estimate:  Ȳi − Ȳ
  Interval:        (Ȳi − Ȳ) ± t_{N−a, α/2} √(MSE (1/ni − 1/N))

Difference in treatment effects αi − αj:
  Point estimate:  Ȳi − Ȳj
  Interval:        (Ȳi − Ȳj) ± t_{N−a, α/2} √(MSE (1/ni + 1/nj))

Example (film data):

  Ȳ1 = 452/15 = 30.13,   Ȳ2 = 578/15 = 38.53,   Ȳ3 = 378/15 = 25.2,   Ȳ = 1408/45 = 31.29

95% C.I. for α1:       (30.13 − 31.29) ± (2.021)√(14.81(1/15 − 1/45)) = −1.16 ± 1.64 = [−2.80, 0.48]
95% C.I. for α2 − α3:  (38.53 − 25.2) ± (2.021)√(14.81(1/15 + 1/15)) = 13.33 ± 2.84 = [10.49, 16.17]  ⇒  α2 > α3
95% C.I. for α1 − α2:  [−11.24, −5.56]  ⇒  α1 < α2
95% C.I. for α1 − α3:  [2.09, 7.77]     ⇒  α1 > α3

Conclusion: α2 > α1 > α3. Note that the overall confidence of these simultaneous statements is < 95%.
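The arithmetic in this example can be reproduced exactly (using the tabled value t(42, 0.025) = 2.021 quoted above; tiny differences from the slides' figures are rounding):

```python
kodak = [32, 34, 31, 30, 37, 28, 28, 27, 30, 32, 26, 29, 27, 30, 31]
fuji  = [44, 50, 47, 43, 41, 32, 32, 36, 35, 34, 32, 38, 38, 40, 36]
agfa  = [23, 24, 25, 21, 26, 25, 27, 26, 22, 25, 27, 30, 25, 25, 27]
groups = [kodak, fuji, agfa]

a, N = 3, 45
T = [sum(g) for g in groups]                      # group totals [452, 578, 378]
T_all = sum(T)                                    # overall total 1408
sum_sq = sum(y * y for g in groups for y in g)    # 46040

SST = sum_sq - T_all ** 2 / N                     # 1985.24
SSA = sum(t ** 2 / 15 for t in T) - T_all ** 2 / N   # 1363.38
SSE = SST - SSA                                   # 621.87
MSE = SSE / (N - a)
F = (SSA / (a - 1)) / MSE                         # about 46.0

# 95% C.I. for alpha_2 - alpha_3 (Fuji vs Agfa), t(42, 0.025) = 2.021 from tables
diff = T[1] / 15 - T[2] / 15                      # 38.53 - 25.2
margin = 2.021 * (MSE * (1 / 15 + 1 / 15)) ** 0.5
ci = (diff - margin, diff + margin)               # about (10.49, 16.17)
```

The interval excludes 0, confirming α2 > α3.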

Two-way ANOVA

Example : Brightness of synthetic fabric.

                 Temperature
Time (cycles)    350°F         375°F         400°F
40               38, 32, 30    37, 35, 40    36, 39, 43
50               40, 45, 36    39, 42, 46    39, 48, 47

Two-way factorial ANOVA model:

  Yijk = μ + αi + βj + γij + εijk,   i = 1, 2;  j = 1, 2, 3;  k = 1, 2, 3
  Σi αi = Σj βj = Σi γij = Σj γij = 0
  εijk ~ iid N(0, σ²)

γij is the interaction effect between the ith time level and the jth temperature level.

Minitab session:

MTB > ANOVA 'Bright' = Time Temp Time*Temp
MTB > print 'Bright' 'Time' 'Temp'

Data Display (one row per observation)
Row  Bright  Time  Temp
1    38      40    350
2    32      40    350
3    30      40    350
4    37      40    375
5    35      40    375
6    40      40    375
7    36      40    400
8    39      40    400
9    43      40    400
10   40      50    350
11   45      50    350
12   36      50    350
……………………………

Analysis of Variance (Balanced Designs)

Factor  Type   Levels  Values
Time    fixed  2       40, 50
Temp    fixed  3       350, 375, 400

Analysis of Variance for Bright

Source     DF   SS       MS       F      P
Time       1    150.22   150.22   9.69   0.009   <- significant
Temp       2    80.78    40.39    2.61   0.115
Time*Temp  2    3.44     1.72     0.11   0.896
Error      12   186.00   15.50
Total      17   420.44
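The Minitab table can be reproduced with the standard balanced two-way decomposition. A pure-Python sketch (variable names are mine):

```python
# data[i][j] = replicates at Time level i (40, 50) and Temp level j (350, 375, 400)
data = [
    [[38, 32, 30], [37, 35, 40], [36, 39, 43]],   # Time = 40
    [[40, 45, 36], [39, 42, 46], [39, 48, 47]],   # Time = 50
]
I, J, K = 2, 3, 3
N = I * J * K

all_obs = [y for row in data for cell in row for y in cell]
grand = sum(all_obs) / N

time_means = [sum(y for cell in row for y in cell) / (J * K) for row in data]
temp_means = [sum(y for row in data for y in row[j]) / (I * K) for j in range(J)]
cell_means = [[sum(cell) / K for cell in row] for row in data]

SS_time = J * K * sum((m - grand) ** 2 for m in time_means)       # 150.22
SS_temp = I * K * sum((m - grand) ** 2 for m in temp_means)       # 80.78
SS_cells = K * sum((cell_means[i][j] - grand) ** 2
                   for i in range(I) for j in range(J))
SS_inter = SS_cells - SS_time - SS_temp                           # 3.44
SS_total = sum((y - grand) ** 2 for y in all_obs)                 # 420.44
SS_error = SS_total - SS_cells                                    # 186.00

F_time = (SS_time / 1) / (SS_error / 12)                          # about 9.69
```

These match the Source/SS column of the Minitab output above; dividing each SS by its DF gives the MS column.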

[Figure: group means of Bright plotted against Temperature (350, 375, 400) for Time = 40 and Time = 50. Roughly parallel profiles indicate an additive model; non-parallel profiles indicate a non-additive model (interaction).]

Regression

Sir Francis Galton (1822 – 1911) plotted the heights of sons against the heights of their fathers [scatterplot]. The heights of the sons of tall (or short) fathers regressed towards the mean height of the population — hence the name "regression".

Simple Linear Regression

Model the relationship between a dependent variable and independent variable(s); simple linear regression uses one independent variable. The regression line is a line that fits the data well [scatterplot with fitted regression line].

Examples:

Dependent variable (Y)    Independent variable (X)
Job performance           Extent of training
Return of a stock         Risk of the stock
Overall CGA               A-Level Score
Tree age (by C14)         Tree age (by tree rings)


Simple Linear Regression Model

Data : {(X1, Y1), (X2, Y2), ..., (Xn, Yn)}

Model (with assumptions):

  Yi = α + βXi + εi,   εi ~ iid N(0, σ²),   i = 1, 2, ..., n

Example : Y = height of son (in cm), X = height of father (in cm).
Suppose the true relation is given by Y = 0.9X + 15. Unrealistic! Fathers with the same height can have sons with different heights, and sons with the same height can have fathers with different heights. A more reasonable relationship is E(Y) = 0.9X + 15, with Y = E(Y) + ε:

X (observed)   E(Y) (unobserved)   Y (observed)   ε (unobserved random error)
170            168                 169.3           1.3
175            172.5               171.7          −0.8
180            177                 174.6          −2.4
185            181.5               182.2           0.7

The true regression line cannot be observed; we estimate it by fitting a regression line to the observed data.
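A quick simulation of this father-son model (the grid of father heights and σ = 2 are illustrative assumptions of mine):

```python
import random

random.seed(2)

# Simulate Y = 0.9 X + 15 + eps, eps ~ N(0, sigma^2), sigma = 2 (illustrative)
fathers = [165 + 0.5 * i for i in range(60)]          # heights 165 .. 194.5 cm
sons = [0.9 * x + 15 + random.gauss(0, 2.0) for x in fathers]

# Only the pairs (X_i, Y_i) are observable; E(Y) and eps are not.
mean_son_at_180 = 0.9 * 180 + 15                      # E(Y | X = 180) = 177
```

Plotting `sons` against `fathers` would show points scattered around the unobserved line E(Y) = 0.9X + 15.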

Estimation of Model Parameters

Sample statistics:

  X̄ = (1/n) Σ Xi,   Ȳ = (1/n) Σ Yi
  Sxx = Σ (Xi − X̄)² = Σ Xi² − nX̄²
  Syy = Σ (Yi − Ȳ)² = Σ Yi² − nȲ²
  Sxy = Σ (Xi − X̄)(Yi − Ȳ) = Σ XiYi − nX̄Ȳ

Least squares estimates:

  β̂ = b = Sxy / Sxx
  α̂ = a = Ȳ − bX̄

Fitting Regression Line

Example : Study of how wheat yield depends on fertilizer.
X = fertilizer (in lb/acre), Y = yield (in bu/acre).

X   100  200  300  400  500  600  700
Y    40   50   50   70   65   65   80

n = 7,   X̄ = 400,   Ȳ = 60,   Σ Xi² = 1400000,   Σ XiYi = 184500,   Σ Yi² = 26350

Fitting Regression Line

  Sxx = 1400000 − (7)(400)² = 280000
  Sxy = 184500 − (7)(400)(60) = 16500
  Syy = 26350 − (7)(60)² = 1150

  b = Sxy / Sxx = 16500 / 280000 = 0.059
  a = Ȳ − bX̄ = 60 − (0.059)(400) = 36.43

True regression line:    E(Y) = α + βX
Fitted regression line:  Ŷ = a + bX = 36.43 + 0.059X

Prediction:
  At X0 = 400:  Ŷ0 = 36.43 + (0.059)(400) = 60.03
  At X0 = 650:  Ŷ0 = 36.43 + (0.059)(650) = 74.78
  At X0 = 0:    Ŷ0 = 36.43 ?   (an extrapolation far outside the observed data)
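The least squares fit for the wheat data in code (with the unrounded slope, the prediction at X = 650 is 74.73 rather than the slides' 74.78, which uses the rounded b = 0.059):

```python
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)

x_bar = sum(X) / n                                       # 400
y_bar = sum(Y) / n                                       # 60
Sxx = sum(x * x for x in X) - n * x_bar ** 2             # 280000
Sxy = sum(x * y for x, y in zip(X, Y)) - n * x_bar * y_bar   # 16500

b = Sxy / Sxx                    # 0.0589..., about 0.059
a = y_bar - b * x_bar            # about 36.43

y_at_650 = a + b * 650           # about 74.73
```

The fitted line Ŷ = a + bX then answers prediction questions within the observed fertilizer range.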

Danger of Extrapolation

Example : the 2003 SARS trend.

[Figures: "SARS Trend" — no. of cases and no. of patients in hospital plotted against date, over windows running from late February to early June. A trend fitted to an early part of the series extrapolates badly outside the observed range: the projected counts keep rising, or even run negative, while the actual series levels off and then declines.]

Nonlinear Relationships

[Figure: data for which a fitted straight line is a poor description because the true relationship is nonlinear.]

Association ≠ Causation — Simpson’s Paradox

Example : Price and demand for gas.

Year    1960  1961  1962  1963  1964  1965  1966  1967  1968  1969
Price     30    31    37    42    43    45    50    54    54    57
Demand   134   112   136   109   105    87    56    43    77    35

Year    1970  1971  1972  1973  1974  1975  1976  1977  1978  1979
Price     58    58    60    73    88    89    92    97   100   102
Demand    65    56    58    55    49    39    36    46    40    42

Fitted regression line: Demand = 139.24 − 1.11 Price

? Low demand is due to high price. ?

[Figure: scatterplot of demand against price with the points marked by period — 1960–1965, 1966–1973, 1974–1979. The strong overall downward trend is much weaker within each period.]
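Refitting the line from the table above in pure Python gives a slope and intercept close to the slides' Demand = 139.24 − 1.11 Price (the exact figures differ slightly, plausibly from rounding in the slides):

```python
price  = [30, 31, 37, 42, 43, 45, 50, 54, 54, 57,
          58, 58, 60, 73, 88, 89, 92, 97, 100, 102]
demand = [134, 112, 136, 109, 105, 87, 56, 43, 77, 35,
          65, 56, 58, 55, 49, 39, 36, 46, 40, 42]
n = len(price)

x_bar = sum(price) / n
y_bar = sum(demand) / n
Sxx = sum(x * x for x in price) - n * x_bar ** 2
Sxy = sum(x * y for x, y in zip(price, demand)) - n * x_bar * y_bar

b = Sxy / Sxx            # negative slope: demand falls as price rises
a = y_bar - b * x_bar
```

The strong negative slope describes association over the whole 1960-1979 period; it does not by itself establish that high price causes low demand, which is exactly the point of the slide.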

Test For Regression Effect

Fitted values:   Ŷi = a + bXi
Residuals:       ri = Yi − Ŷi
Random error:    εi = Yi − α − βXi

Test  H0 : β = 0   vs   H1 : β ≠ 0.

Decomposition of variation:

  Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)
  (variation of Y = explained variation + unexplained variation)

Breakdown of sum of squares:

  Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Ŷi − Ȳ)² + Σ_{i=1}^{n} (Yi − Ŷi)²
  i.e.  SST = SSR + SSE

  SST = Syy                                                         (total sum of squares)
  SSR = Σ (Ŷi − Ȳ)² = Σ (a + bXi − Ȳ)² = b² Σ (Xi − X̄)² = b² Sxx = Sxy²/Sxx   (regression sum of squares)
  SSE = SST − SSR = Syy − Sxy²/Sxx                                  (error sum of squares)

  MSR = SSR / 1 = SSR,   MSE = SSE / (n − 2)

Test statistic:  F = MSR / MSE.   Reject H0 if Fobs > F(1, n − 2, α).

ANOVA table

Source      SS    d.f.   MS           F-ratio
Regression  SSR   1      SSR          MSR/MSE
Error       SSE   n − 2  SSE/(n − 2)
Total       SST   n − 1

Example : Wheat yield example. Regression line Ŷ = 36.43 + 0.059X;  Sxx = 280000, Sxy = 16500, Syy = 1150.

  SSR = b² Sxx = (0.059)²(280000) = 974.68
  SST = Syy = 1150
  SSE = SST − SSR = 1150 − 974.68 = 175.32

Source      SS       d.f.   MS       F-ratio
Regression  974.68   1      974.68   27.805
Error       175.32   5      35.064
Total       1150     6

F(1, 5, 0.05) = 6.61 < 27.805, so reject H0 at α = 0.05.
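The same numbers follow from Sxx, Sxy, Syy alone. One caveat: with the unrounded slope, SSR = Sxy²/Sxx ≈ 972.3 and F ≈ 27.4, while the slides' 974.68 and 27.805 come from the rounded b = 0.059; the test's conclusion is unchanged. The sketch also computes R² and the 95% C.I. for β (covered in the following sections), using t(5, 0.025) = 2.571 from tables:

```python
Sxx, Sxy, Syy, n = 280000.0, 16500.0, 1150.0, 7

SSR = Sxy ** 2 / Sxx        # about 972.3 (slides: 974.68, via rounded b)
SST = Syy
SSE = SST - SSR
MSE = SSE / (n - 2)

F = (SSR / 1) / MSE         # about 27.4 (slides: 27.805), still far above F(1, 5, 0.05) = 6.61
R2 = SSR / SST              # about 0.845

# 95% C.I. for beta, with t(5, 0.025) = 2.571 from tables
b = Sxy / Sxx
margin = 2.571 * (MSE / Sxx) ** 0.5
ci_beta = (b - margin, b + margin)   # close to the slides' [0.0302, 0.0878]
```

Either way Fobs greatly exceeds the 5% critical value, and the C.I. for β excludes 0, which is the same conclusion as the F test.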

Coefficient of Determination

  R² = SSR / SST = explained variation / total variation,   0 ≤ R² ≤ 1

R² = 1: perfect linear relationship; R² = 0: no linear relationship. A large R² indicates a strong linear relationship and high prediction power.

Example :  R² = 974.68 / 1150 = 84.8%
C.I. For Regression Parameters

100(1 − α)% C.I. for β:   b ± t_{n−2, α/2} √(MSE / Sxx)
100(1 − α)% C.I. for α:   a ± t_{n−2, α/2} √(MSE (1/n + X̄²/Sxx))

A large Sxx gives more accurate estimates (demonstration).

Example : Wheat yield example. Regression line Ŷ = 36.43 + 0.059X.

Source      SS       d.f.   MS       F-ratio
Regression  974.68   1      974.68   27.805
Error       175.32   5      35.064
Total       1150     6

95% C.I. for β:  0.059 ± t_{5, 0.025} √(35.064/280000) = 0.059 ± (2.57)(0.0112) = 0.059 ± 0.0288 = [0.0302, 0.0878]
95% C.I. for α:  36.43 ± t_{5, 0.025} √(35.064 (1/7 + 400²/280000)) = 36.43 ± (2.57)(5.00) ≈ 36.43 ± 12.86 = [23.57, 49.29]

Prediction

Predict the value of Y0 at a fixed value of X = X0.

Point prediction:   Ŷ0 = a + bX0

100(1 − α)% prediction interval (P.I.):

  Ŷ0 ± t_{n−2, α/2} √(MSE (1 + 1/n + (X0 − X̄)²/Sxx))

Example : Wheat yield example. Regression line Ŷ = 36.43 + 0.059X.

Source      SS       d.f.   MS       F-ratio
Regression  974.68   1      974.68   27.805
Error       175.32   5      35.064
Total       1150     6

At X0 = 450:  Ŷ0 = 36.43 + (0.059)(450) = 62.98

90% prediction interval:

  62.98 ± t_{5, 0.05} √(35.064 (1 + 1/7 + (450 − 400)²/280000)) = 62.98 ± (2.015)(6.36) ≈ 62.98 ± 12.81 = [50.17, 75.79]
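Reproducing the 90% P.I. (with the unrounded slope the point prediction is 62.95 rather than the slides' 62.98; t(5, 0.05) = 2.015 as above):

```python
Sxx, Syy, Sxy, n = 280000.0, 1150.0, 16500.0, 7
x_bar, y_bar = 400.0, 60.0

b = Sxy / Sxx
a = y_bar - b * x_bar
SSE = Syy - Sxy ** 2 / Sxx
MSE = SSE / (n - 2)

x0 = 450.0
y0_hat = a + b * x0              # about 62.95 (slides: 62.98, via rounded b)

# 90% P.I., with t(5, 0.05) = 2.015 from tables
margin = 2.015 * (MSE * (1 + 1 / n + (x0 - x_bar) ** 2 / Sxx)) ** 0.5
pi = (y0_hat - margin, y0_hat + margin)   # close to the slides' [50.17, 75.79]
```

Note the extra "1 +" inside the square root: a prediction interval for a new observation is wider than a confidence interval for the mean response, because it must also cover the new observation's own error.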

Multiple Linear Regression

Example : Fuel consumption data.

Model:  FUEL = β0 + β1 TAX + β2 DLIC + β3 INC + β4 ROAD + ε

Data Display

Row  State  POP    TAX    NLIC  INC    ROAD    FUELC  DLIC
1    ME     1029   9.00   540   3.571  1.976   557    52.4781
2    NH     771    9.00   441   4.092  1.250   404    57.1984
3    VT     462    9.00   268   3.865  1.586   259    58.0087
4    MA     5787   7.50   3060  4.870  2.351   2396   52.8771
5    RI     968    8.00   527   4.399  0.431   397    54.4422
6    CN     3082   10.00  1760  5.342  1.333   1408   57.1058
7    NY     18366  8.00   8278  5.319  11.868  6312   45.0724
8    NJ     7367   8.00   4074  5.126  2.138   3439   55.3007
9    PA     11926  8.00   6312  4.447  8.577   5528   52.9264
10   OH     10783  7.00   5948  4.512  8.507   5375   55.1609
11   IN     5291   8.00   2804  4.391  5.939   3068   52.9957
12   IL     11251  7.50   5903  5.126  14.186  5301   52.4664
…………………………………………………..

Regression Analysis

The regression equation is
FUEL = 37.7 - 3.48 TAX + 1.34 DLIC - 6.65 INC - 0.242 ROAD

Predictor  Coef     Stdev    t-ratio  p
Constant   37.68    18.57    2.03     0.049
TAX        -3.478   1.298    -2.68    0.010
DLIC       1.3366   0.1924   6.95     0.000
INC        -6.651   1.723    -3.86    0.000
ROAD       -0.2417  0.3391   -0.71    0.480

s = 6.633   R-sq = 67.8%   R-sq(adj) = 64.9%

Analysis of Variance

SOURCE      DF  SS       MS      F      p
Regression  4   3991.92  997.98  22.68  0.000
Error       43  1892.05  44.00
Total       47  5883.96

Unusual Observations
Obs.  TAX  FUEL    Fit     Stdev.Fit  Residual  St.Resid
37    5.0  63.963  64.758  3.723      -0.795    -0.14 X
40    7.0  96.812  73.371  2.102      23.441    3.73 R

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.
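Outputs like the Minitab table above come from an ordinary least squares fit with several predictors. A minimal pure-Python sketch of the mechanics — solving the normal equations (X'X)b = X'y by Gaussian elimination — on synthetic, noise-free data (the coefficients 2, 3, −1 are made up for illustration, not from the fuel data):

```python
# Fit y = b0 + b1*x1 + b2*x2 by solving the normal equations (X'X) b = X'y.
rows = [(x1, x2) for x1 in range(5) for x2 in range(5)]
X = [[1.0, x1, x2] for x1, x2 in rows]            # design matrix with intercept column
y = [2.0 + 3.0 * x1 - 1.0 * x2 for x1, x2 in rows]   # known coefficients (2, 3, -1)

p = 3
XtX = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(p)]
       for i in range(p)]
Xty = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(p)]

# Gaussian elimination with partial pivoting on the augmented system
A = [XtX[i] + [Xty[i]] for i in range(p)]
for col in range(p):
    piv = max(range(col, p), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    for r in range(col + 1, p):
        f = A[r][col] / A[col][col]
        for c in range(col, p + 1):
            A[r][c] -= f * A[col][c]

beta = [0.0] * p
for i in reversed(range(p)):
    beta[i] = (A[i][p] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
# beta recovers [2.0, 3.0, -1.0] on this noise-free data
```

With real, noisy data such as the fuel example, the same solve yields the Coef column, and the residual sum of squares feeds the ANOVA table and R-sq.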