Founders of modern statistics
History
Statistics
Ancient Babylonians & Egyptians
Statistics deals with the collection and
analysis of data to solve real-world
problems.
History
$
People
Marine insurance rates determined by past
statistical data (fourteenth century)
People
Blasé Pascal
(1623-1662)
Pr(E) = ?
Pierre de Fermat
(1601-1665)
People
Pierre Simon Laplace
(1749-1827)
Jacob Bernoulli
(1654-1705)
relative frequency
→ probability
Law of large numbers
1
(X 1 + X 2 + m + X n ) → E(X )
n
Karl Friedrich
Gauss
(1777-1855)
1
People
Founders of modern statistics
Height of Son
Study
Sir Francis Galton
(1822 – 1911)
Karl Pearson
(1857-1936)
• Biology
• Evolution
• Genetics
Develop
Height of Father
• MLE
• ANOVA
• Correlation
• Chi-square test
Height of the sons of fathers regressed
towards the mean height of the population
Regression Analysis
Founders of modern statistics
Egon Pearson
(1895-1980)
Develop
Ronald Fisher
(1892-1962)
Era of information
Huge datasets
Data mining
Data reduction
Demographics
• Testing Hypothesis
• Confidence I ntervals
Transaction data
Customer behavior
Multidisciplinary
statistics methodology
Human genome
Satellite photographs
Jerzy Neyman
(1894-1981)
…………………
………..
Elements of Statistics
Experimental Design
Data Collection
Survey Sampling
Observational Study
Statistics
Descriptive Statistics
And Statistical Graphics
Data Analysis
Statistical I nference
• Estimation
• Testing Hypothesis
Descriptive Statistics
Data:
a set of numbers representing characteristics of observations
Year
Tutorial
3
3
3
1
*
2
2
3
1
3
B
B
A
A
C
B
C
B
A
B
2
3
1
1
3
3
3
1
1
1
A
A
B
B
B
B
A
B
B
A
Quiz
Midterm
Final
90.50
96.0
100
96.50
99.0
95
90.00
95.0
100
94.00
99.0
96
97.00
98.0
94
88.50
97.0
96
86.50
100.0
94
90.75
95.0
95
93.00
96.0
93
94.00
92.0
94
……………….…………………..
…………………………………..
53.00
49.5
42
69.50
35.0
44
59.00
35.0
47
48.00
45.0
44
51.50
65.0
27
41.00
31.0
43
56.00
19.0
38
47.50
32.0
27
23.00
29.0
27
39.50
20.0
18
Overall
Grade
96.90
96.50
96.50
96.50
95.80
94.80
94.30
94.15
93.90
93.40
A+
A+
A+
A+
A+
A
A+
A+
A
A
46.45
46.40
45.80
45.10
43.30
39.00
35.90
32.60
26.80
22.90
D
D
D
D
D
F
F
F
F
F
2
Descriptive Statistics
Statistical Graphics
Scales of Measurement
Nominal Scale
Dot diagram
Male/ Female
Chinese/ British/ American/ …
Ordinal Scale
disagree/ agree/ strongly agree
+ - ×÷
97
85
103
97
114
97
120 103
+ - ×÷
low/ medium/ high
I nterval Scale
°C, °F, altitude of a place
+ - ×÷
Ratio Scale
length, weight
+ - ×÷
Frequency Distribution
Table: Grade of Students
Table: Flow of vehicles
Vehicles
Frequency
Percentage
Cars
45
59
MTB > Tally ‘Code-Grade';
SUBC>
Counts;
SUBC>
CumCounts;
SUBC>
Percents;
SUBC>
CumPercents.
Lorries
22
29
Summary Statistics for Discrete Variables
Motorcycles
6
8
Buses
3
4
Total
76
100
80
90
100
110
120
Bar Chart
MTB > Chart Mean( Overall ) * 'Year';
SUBC>
Bar;
SUBC>
Type 1;
SUBC>
EColor 1;
SUBC>
Color 2;
SUBC>
Title "Mean overall scores of Math244 students";
SUBC>
Title "(by year)";
SUBC>
Minimum 2 0.
Grade Grade_n Count CumCnt Percent CumPct
A+
1
7
7
1.39
1.39
A
2
26
33
5.17
6.56
A3
57
90
11.33 17.89
B+
4
92
182
18.29 36.18
B
5
89
271
17.69 53.88
B6
76
347
15.11 68.99
C+
7
58
405
11.53 80.52
C
8
36
441
7.16 87.67
C9
23
464
4.57 92.25
D
10
34
498
6.76 99.01
F
11
5
503
0.99 100.00
N=
503
Pie Chart
Histogram
* * * * Area represent f requency * * * *
MTB > %Pie 'Grade';
SUBC>
Title 'Grades distribution of Math244 students'.
I NCORRECT
3
Percentile
Percentile from raw data
“Joining Mensa Society requires scoring in the
of the standard intelligence test.”
98 th
percentile in one
68, 75, 58, 47, 83, 34, 90, 71, 63, 79
sorted
BMI less than 5th percentile: Underweight*
34
BMI greater than 85th percentile: At risk of Overweight*
BMI greater than 95th percentile: Overweight*
47
58
63
68
71
75
79
30 30
50 50
Xpercentile
X
(1percentile
) ≤ Xpercentile
(percentile
2 ) ≤ h ≤
th
BMI between 5th and 85th percentile: Healthy Weight*
th
th
approximately np observations less than or equal to
(100p)th percentile
th
integer
(100 p )th percentile = X ( ) + f ( X (
fp
Penetration Depth (mm)
Frequency
58 – 60
5
5
60 – 62
3
8
d (np − F p −1 )
62 – 64
6
14
64 – 66
3
17
fp
66 – 68
1
18
×d =
X ( np ) = L p +
fp
np − Fp−1
(X (
Cumulative Frequency
68 – 70
0
18
fp
70 – 72
2
20
d
Total
20
p = 0.75 , np = 15
Blue colored
area = np
− X (r ) )
Table: Frequency table for 20 grain bullet penetration depths
into oak wood from a distance of 15 feet
area on the left of (100p)th percentile = np
np ) − L p )
r +1)
Percentile
Percentile from histogram
(X (
(n )
fractional
r
np − Fp−1
Lp
)
np ) − L p
90
(n + 1) p = r. f
Dataset of n observations
Fp-1
83
Lp + d
L p = 64
75th percentile = 64 +
(100p )th
percentile
Fp −1 = 14
fp = 3 , d = 2
2(15 − 14 )
= 64.67
3
Boxplots
Boxplot
Five number summary
(25th)
(50th)
(75th)
QL
Median
QU
Min
55
60
65
Max
70
75
MTB > Boxplot 'Overall'*'Year';
SUBC>
Transpose;
SUBC>
Box;
SUBC>
Type 1;
SUBC>
Color 2;
SUBC>
Symbol;
SUBC>
Outlier;
SUBC>
Title "Overall scores of Math244 students";
SUBC>
Title "(by year)".
4
Stem-and-Leaf Display
Detecting Outliers
suspected
outliers
78
outlier
65
90
86
*
1.5 I QR
Outer
lower fence
1.5 I QR
1.5 I QR
inner
lower fence
1.5 I QR
inner
upper fence
Stem-and-Leaf Display
Outer
upper fence
79
51
79
5*
1
6
5
2 5
7
8 9 9
8
4 6
6
9
0
10 *
1
62
84
101
N = 10
Leaf unit = 1
stem
leaf
Lying with Statistical Graphics
Market penetration of four brands of cigarette: A, B, C, D
3*
8
4
0014556779
5*
0138
3+
8
4*
0014
4+
556779
5*
013
5+
8
N = 15
Leaf unit = 1
D
D
A
18%
18%
27%
18%
18%
N = 15
Leaf unit = 1
B
A
37%
27%
Monthly sales of the four brands (in $m)
1.5
1.2
* for 0 – 4
+ for 5 – 9
1.1
1.0
1.0
0.5
0
0
A
Bad Graphic Designs
B
37%
C
C
B
C
D
A
B
C
D
Bad Graphic Designs
5
Bad Graphic Designs
Measures of Central Location
Percentage of college
Enrolment with age 25 and over
Measure of Central Location
30 15
Xi : 10 12 7 300
15
X(i) : 7 10 12 15 30
300
1
(X 1 + X 2 + l + X n )
n
(raw data)
X=
f1 X 1 + f 2 X 2 + l + f n X n
f1 + f 2 + l + f n
(ungrouped frequency table)
X=
f1m1 + f 2 m2 + m + f n mn
f1 + f 2 + m + f n
78
65
90
86
79
(grouped frequency table)
m i = midpoints of i
n odd
51
79
th
class
62
84
101
mode = 79
M = 79
X = 77.5
range = 50 IQR = 22.75 MAD = 10.9 SD = 13.88
1 1− X
range
=MAD
X
51 = 50
IQR
− Q)L===101
.25
10
.988 −
SD
=X=(=nQ
1364
)∑∑
U( X −(−1X)X
n
n even
n
n ni =1 i =1
78
XX ==14
68.8.8
45
110
106
2
i i
79
21
79
32
84
141
range = 120 IQR = 65.25 MAD = 26.9 SD = 35.02
M
M == XX( 3( 3) ) ==12
12
Skewness
Standard Deviation
s=
X=
Measures of Variation
median / 50th percentile / second quartile
X n+1
2
M =
1
X n + X n
+1
2 2
2
mean / average / arithmetic mean
skewed to right
(+ ve skewed)
1 n
2
∑ (X i − X )
n − 1 i =1
skewed to left
(-ve skewed)
Why do we use ( n-1) instead of n ?
n= 1
⇒
SD = 0
more
reasonable
s = undefined
γ1 > 0 γ 2 > 0
X 1 − X = nX − ( X 2 + f X n ) − X
= −( X 2 − X ) − ( X 3 − X ) − h − ( X n − X )
only ( n-1) pieces of information in ∑ (X
n
i =1
− X)
1 n
3
∑ (X i − X )
γ 1 = n i =1
3
(SD )
2
i
γ2 =
γ1 < 0 γ 2 < 0
mean − median
SD
6
Statistics
Ancient Babylonians & Egyptians
Statistics deals with the collection and
analysis of data to solve real-world
problems.
History
$
People
Marine insurance rates determined by past
statistical data (fourteenth century)
People
Blasé Pascal
(1623-1662)
Pr(E) = ?
Pierre de Fermat
(1601-1665)
People
Pierre Simon Laplace
(1749-1827)
Jacob Bernoulli
(1654-1705)
relative frequency
→ probability
Law of large numbers
1
(X 1 + X 2 + m + X n ) → E(X )
n
Karl Friedrich
Gauss
(1777-1855)
1
People
Founders of modern statistics
Height of Son
Study
Sir Francis Galton
(1822 – 1911)
Karl Pearson
(1857-1936)
• Biology
• Evolution
• Genetics
Develop
Height of Father
• MLE
• ANOVA
• Correlation
• Chi-square test
Height of the sons of fathers regressed
towards the mean height of the population
Regression Analysis
Founders of modern statistics
Egon Pearson
(1895-1980)
Develop
Ronald Fisher
(1892-1962)
Era of information
Huge datasets
Data mining
Data reduction
Demographics
• Testing Hypothesis
• Confidence I ntervals
Transaction data
Customer behavior
Multidisciplinary
statistics methodology
Human genome
Satellite photographs
Jerzy Neyman
(1894-1981)
…………………
………..
Elements of Statistics
Experimental Design
Data Collection
Survey Sampling
Observational Study
Statistics
Descriptive Statistics
And Statistical Graphics
Data Analysis
Statistical I nference
• Estimation
• Testing Hypothesis
Descriptive Statistics
Data:
a set of numbers representing characteristics of observations
Year
Tutorial
3
3
3
1
*
2
2
3
1
3
B
B
A
A
C
B
C
B
A
B
2
3
1
1
3
3
3
1
1
1
A
A
B
B
B
B
A
B
B
A
Quiz
Midterm
Final
90.50
96.0
100
96.50
99.0
95
90.00
95.0
100
94.00
99.0
96
97.00
98.0
94
88.50
97.0
96
86.50
100.0
94
90.75
95.0
95
93.00
96.0
93
94.00
92.0
94
……………….…………………..
…………………………………..
53.00
49.5
42
69.50
35.0
44
59.00
35.0
47
48.00
45.0
44
51.50
65.0
27
41.00
31.0
43
56.00
19.0
38
47.50
32.0
27
23.00
29.0
27
39.50
20.0
18
Overall
Grade
96.90
96.50
96.50
96.50
95.80
94.80
94.30
94.15
93.90
93.40
A+
A+
A+
A+
A+
A
A+
A+
A
A
46.45
46.40
45.80
45.10
43.30
39.00
35.90
32.60
26.80
22.90
D
D
D
D
D
F
F
F
F
F
2
Descriptive Statistics
Statistical Graphics
Scales of Measurement
Nominal Scale
Dot diagram
Male/ Female
Chinese/ British/ American/ …
Ordinal Scale
disagree/ agree/ strongly agree
+ - ×÷
97
85
103
97
114
97
120 103
+ - ×÷
low/ medium/ high
I nterval Scale
°C, °F, altitude of a place
+ - ×÷
Ratio Scale
length, weight
+ - ×÷
Frequency Distribution
Table: Grade of Students
Table: Flow of vehicles
Vehicles
Frequency
Percentage
Cars
45
59
MTB > Tally ‘Code-Grade';
SUBC>
Counts;
SUBC>
CumCounts;
SUBC>
Percents;
SUBC>
CumPercents.
Lorries
22
29
Summary Statistics for Discrete Variables
Motorcycles
6
8
Buses
3
4
Total
76
100
80
90
100
110
120
Bar Chart
MTB > Chart Mean( Overall ) * 'Year';
SUBC>
Bar;
SUBC>
Type 1;
SUBC>
EColor 1;
SUBC>
Color 2;
SUBC>
Title "Mean overall scores of Math244 students";
SUBC>
Title "(by year)";
SUBC>
Minimum 2 0.
Grade Grade_n Count CumCnt Percent CumPct
A+
1
7
7
1.39
1.39
A
2
26
33
5.17
6.56
A3
57
90
11.33 17.89
B+
4
92
182
18.29 36.18
B
5
89
271
17.69 53.88
B6
76
347
15.11 68.99
C+
7
58
405
11.53 80.52
C
8
36
441
7.16 87.67
C9
23
464
4.57 92.25
D
10
34
498
6.76 99.01
F
11
5
503
0.99 100.00
N=
503
Pie Chart
Histogram
* * * * Area represent f requency * * * *
MTB > %Pie 'Grade';
SUBC>
Title 'Grades distribution of Math244 students'.
I NCORRECT
3
Percentile
Percentile from raw data
“Joining Mensa Society requires scoring in the
of the standard intelligence test.”
98 th
percentile in one
68, 75, 58, 47, 83, 34, 90, 71, 63, 79
sorted
BMI less than 5th percentile: Underweight*
34
BMI greater than 85th percentile: At risk of Overweight*
BMI greater than 95th percentile: Overweight*
47
58
63
68
71
75
79
30 30
50 50
Xpercentile
X
(1percentile
) ≤ Xpercentile
(percentile
2 ) ≤ h ≤
th
BMI between 5th and 85th percentile: Healthy Weight*
th
th
approximately np observations less than or equal to
(100p)th percentile
th
integer
(100 p )th percentile = X ( ) + f ( X (
fp
Penetration Depth (mm)
Frequency
58 – 60
5
5
60 – 62
3
8
d (np − F p −1 )
62 – 64
6
14
64 – 66
3
17
fp
66 – 68
1
18
×d =
X ( np ) = L p +
fp
np − Fp−1
(X (
Cumulative Frequency
68 – 70
0
18
fp
70 – 72
2
20
d
Total
20
p = 0.75 , np = 15
Blue colored
area = np
− X (r ) )
Table: Frequency table for 20 grain bullet penetration depths
into oak wood from a distance of 15 feet
area on the left of (100p)th percentile = np
np ) − L p )
r +1)
Percentile
Percentile from histogram
(X (
(n )
fractional
r
np − Fp−1
Lp
)
np ) − L p
90
(n + 1) p = r. f
Dataset of n observations
Fp-1
83
Lp + d
L p = 64
75th percentile = 64 +
(100p )th
percentile
Fp −1 = 14
fp = 3 , d = 2
2(15 − 14 )
= 64.67
3
Boxplots
Boxplot
Five number summary
(25th)
(50th)
(75th)
QL
Median
QU
Min
55
60
65
Max
70
75
MTB > Boxplot 'Overall'*'Year';
SUBC>
Transpose;
SUBC>
Box;
SUBC>
Type 1;
SUBC>
Color 2;
SUBC>
Symbol;
SUBC>
Outlier;
SUBC>
Title "Overall scores of Math244 students";
SUBC>
Title "(by year)".
4
Stem-and-Leaf Display
Detecting Outliers
suspected
outliers
78
outlier
65
90
86
*
1.5 I QR
Outer
lower fence
1.5 I QR
1.5 I QR
inner
lower fence
1.5 I QR
inner
upper fence
Stem-and-Leaf Display
Outer
upper fence
79
51
79
5*
1
6
5
2 5
7
8 9 9
8
4 6
6
9
0
10 *
1
62
84
101
N = 10
Leaf unit = 1
stem
leaf
Lying with Statistical Graphics
Market penetration of four brands of cigarette: A, B, C, D
3*
8
4
0014556779
5*
0138
3+
8
4*
0014
4+
556779
5*
013
5+
8
N = 15
Leaf unit = 1
D
D
A
18%
18%
27%
18%
18%
N = 15
Leaf unit = 1
B
A
37%
27%
Monthly sales of the four brands (in $m)
1.5
1.2
* for 0 – 4
+ for 5 – 9
1.1
1.0
1.0
0.5
0
0
A
Bad Graphic Designs
B
37%
C
C
B
C
D
A
B
C
D
Bad Graphic Designs
5
Bad Graphic Designs
Measures of Central Location
Percentage of college
Enrolment with age 25 and over
Measure of Central Location
30 15
Xi : 10 12 7 300
15
X(i) : 7 10 12 15 30
300
1
(X 1 + X 2 + l + X n )
n
(raw data)
X=
f1 X 1 + f 2 X 2 + l + f n X n
f1 + f 2 + l + f n
(ungrouped frequency table)
X=
f1m1 + f 2 m2 + m + f n mn
f1 + f 2 + m + f n
78
65
90
86
79
(grouped frequency table)
m i = midpoints of i
n odd
51
79
th
class
62
84
101
mode = 79
M = 79
X = 77.5
range = 50 IQR = 22.75 MAD = 10.9 SD = 13.88
1 1− X
range
=MAD
X
51 = 50
IQR
− Q)L===101
.25
10
.988 −
SD
=X=(=nQ
1364
)∑∑
U( X −(−1X)X
n
n even
n
n ni =1 i =1
78
XX ==14
68.8.8
45
110
106
2
i i
79
21
79
32
84
141
range = 120 IQR = 65.25 MAD = 26.9 SD = 35.02
M
M == XX( 3( 3) ) ==12
12
Skewness
Standard Deviation
s=
X=
Measures of Variation
median / 50th percentile / second quartile
X n+1
2
M =
1
X n + X n
+1
2 2
2
mean / average / arithmetic mean
skewed to right
(+ ve skewed)
1 n
2
∑ (X i − X )
n − 1 i =1
skewed to left
(-ve skewed)
Why do we use ( n-1) instead of n ?
n= 1
⇒
SD = 0
more
reasonable
s = undefined
γ1 > 0 γ 2 > 0
X 1 − X = nX − ( X 2 + f X n ) − X
= −( X 2 − X ) − ( X 3 − X ) − h − ( X n − X )
only ( n-1) pieces of information in ∑ (X
n
i =1
− X)
1 n
3
∑ (X i − X )
γ 1 = n i =1
3
(SD )
2
i
γ2 =
γ1 < 0 γ 2 < 0
mean − median
SD
6