Founders of modern statistics

History

Statistics

Ancient Babylonians & Egyptians

Statistics deals with the collection and
analysis of data to solve real-world
problems.

History


People

Marine insurance rates determined by past
statistical data (fourteenth century)

People


Blaise Pascal
(1623-1662)

Pr(E) = ?

Pierre de Fermat
(1601-1665)

People
Pierre Simon Laplace
(1749-1827)

Jacob Bernoulli
(1654-1705)

relative frequency 
→ probability
Law of large numbers

$\frac{1}{n}(X_1 + X_2 + \cdots + X_n) \to E(X)$
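The convergence of the sample mean to E(X) can be seen in a small simulation. This is only an illustrative sketch: the fair die, the fixed seed, and the helper name `running_means` are assumptions chosen for the example, not part of the slides.

```python
import random

def running_means(draws):
    """Return the running sample means (X_1 + ... + X_n) / n for n = 1, 2, ..."""
    means = []
    total = 0.0
    for n, x in enumerate(draws, start=1):
        total += x
        means.append(total / n)
    return means

rng = random.Random(0)  # fixed seed (arbitrary) so the run is reproducible
rolls = [rng.randint(1, 6) for _ in range(100_000)]  # fair die, E(X) = 3.5
means = running_means(rolls)

# The sample mean should drift towards 3.5 as n grows.
for n in (10, 1_000, 100_000):
    print(f"n = {n:>6}: sample mean = {means[n - 1]:.4f}")
```

The same idea applies to relative frequencies: replace each roll by an indicator (1 if the event occurs, 0 otherwise) and the running mean becomes the relative frequency of the event.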

Karl Friedrich
Gauss
(1777-1855)


People

Founders of modern statistics

Sir Francis Galton
(1822 – 1911)

Karl Pearson
(1857-1936)

Study
• Biology
• Evolution
• Genetics

Develop
• MLE
• ANOVA
• Correlation
• Chi-square test

Regression Analysis
[Figure: Height of Son plotted against Height of Father]
The heights of sons regressed towards the mean height of the population.


Founders of modern statistics

Egon Pearson
(1895-1980)

Ronald Fisher
(1892-1962)

Jerzy Neyman
(1894-1981)

Develop
• Testing Hypothesis
• Confidence Intervals

Era of information

Huge datasets
Demographics
Transaction data
Customer behavior
Human genome
Satellite photographs
…

Data mining
Data reduction
Multidisciplinary statistics methodology

Elements of Statistics
Experimental Design
Data Collection


Survey Sampling
Observational Study

Statistics
Descriptive Statistics
And Statistical Graphics
Data Analysis

Statistical Inference
• Estimation
• Testing Hypothesis

Descriptive Statistics
Data: a set of numbers representing characteristics of observations

Year  Tutorial  Quiz    Midterm  Final  Overall  Grade
3     B         90.50    96.0    100    96.90    A+
3     B         96.50    99.0     95    96.50    A+
3     A         90.00    95.0    100    96.50    A+
1     A         94.00    99.0     96    96.50    A+
*     C         97.00    98.0     94    95.80    A+
2     B         88.50    97.0     96    94.80    A
2     C         86.50   100.0     94    94.30    A+
3     B         90.75    95.0     95    94.15    A+
1     A         93.00    96.0     93    93.90    A
3     B         94.00    92.0     94    93.40    A
……………….…………………..
2     A         53.00    49.5     42    46.45    D
3     A         69.50    35.0     44    46.40    D
1     B         59.00    35.0     47    45.80    D
1     B         48.00    45.0     44    45.10    D
3     B         51.50    65.0     27    43.30    D
3     B         41.00    31.0     43    39.00    F
3     A         56.00    19.0     38    35.90    F
1     B         47.50    32.0     27    32.60    F
1     B         23.00    29.0     27    26.80    F
1     A         39.50    20.0     18    22.90    F


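Descriptive Statistics

Scales of Measurement

Nominal Scale
Male/ Female
Chinese/ British/ American/ …
+ − × ÷ : none meaningful

Ordinal Scale
disagree/ agree/ strongly agree
low/ medium/ high
+ − × ÷ : none meaningful (only the ordering carries information)

Interval Scale
°C, °F, altitude of a place
+ − meaningful; × ÷ not meaningful

Ratio Scale
length, weight
+ − × ÷ : all meaningful

Statistical Graphics

Dot diagram
[Figure: dot diagram of the values 97, 85, 103, 97, 114, 97, 120, 103]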

Frequency Distribution

Table: Flow of vehicles

Vehicles      Frequency  Percentage
Cars             45          59
Lorries          22          29
Motorcycles       6           8
Buses             3           4
Total            76         100

Table: Grade of Students

MTB > Tally 'Code-Grade';
SUBC>   Counts;
SUBC>   CumCounts;
SUBC>   Percents;
SUBC>   CumPercents.

Summary Statistics for Discrete Variables

Grade  Grade_n  Count  CumCnt  Percent  CumPct
A+        1        7       7     1.39     1.39
A         2       26      33     5.17     6.56
A-        3       57      90    11.33    17.89
B+        4       92     182    18.29    36.18
B         5       89     271    17.69    53.88
B-        6       76     347    15.11    68.99
C+        7       58     405    11.53    80.52
C         8       36     441     7.16    87.67
C-        9       23     464     4.57    92.25
D        10       34     498     6.76    99.01
F        11        5     503     0.99   100.00
N=      503

Bar Chart

MTB > Chart Mean( Overall ) * 'Year';
SUBC>   Bar;
SUBC>   Type 1;
SUBC>   EColor 1;
SUBC>   Color 2;
SUBC>   Title "Mean overall scores of Math244 students";
SUBC>   Title "(by year)";
SUBC>   Minimum 2 0.
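For readers without Minitab, the counts, cumulative counts, and percentages of a frequency table can be sketched in Python. This is a minimal illustration rather than an equivalent of the Tally command; the function name `frequency_table` is hypothetical, and the observations are rebuilt from the vehicle-flow frequencies (45 cars, 22 lorries, 6 motorcycles, 3 buses).

```python
from collections import Counter

def frequency_table(observations):
    """Return rows of (category, count, cumulative count, percent, cumulative percent)."""
    counts = Counter(observations)
    n = sum(counts.values())
    rows = []
    cumulative = 0
    # most_common() lists categories from the most to the least frequent
    for category, count in counts.most_common():
        cumulative += count
        rows.append((category, count, cumulative,
                     100 * count / n, 100 * cumulative / n))
    return rows

# Rebuild the raw observations from the vehicle-flow frequencies.
vehicles = ["Cars"] * 45 + ["Lorries"] * 22 + ["Motorcycles"] * 6 + ["Buses"] * 3
table = frequency_table(vehicles)

print(f"{'Vehicles':<12}{'Count':>6}{'CumCnt':>8}{'Percent':>9}{'CumPct':>8}")
for category, count, cum, pct, cum_pct in table:
    print(f"{category:<12}{count:>6}{cum:>8}{pct:>9.2f}{cum_pct:>8.2f}")
```

Rounding the percent column to whole numbers gives the 59, 29, 8, 4 shown in the vehicle table.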

Pie Chart

MTB > %Pie 'Grade';
SUBC>   Title 'Grades distribution of Math244 students'.

Histogram
**** Area represents frequency ****
[Figure: example histogram labelled INCORRECT]


Percentile

“Joining Mensa Society requires scoring in the 98th percentile in one of the standard intelligence tests.”

BMI less than 5th percentile: Underweight*
BMI between 5th and 85th percentile: Healthy Weight*
BMI greater than 85th percentile: At risk of Overweight*
BMI greater than 95th percentile: Overweight*

Dataset of n observations, sorted: $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$
(100p)th percentile: approximately np observations are less than or equal to it

Percentile from raw data

68, 75, 58, 47, 83, 34, 90, 71, 63, 79
sorted: 34 47 58 63 68 71 75 79 83 90

$(n + 1)p = r.f$  (r = integer part, f = fractional part)

$(100p)\text{th percentile} = X_{(r)} + f\,(X_{(r+1)} - X_{(r)})$

[Figure: positions of the 30th and 50th percentiles in the sorted data]

Percentile from histogram

Area on the left of the (100p)th percentile = np (blue colored area = np)

$\frac{X_{(np)} - L_p}{d} \times f_p = np - F_{p-1}$

$X_{(np)} = L_p + \frac{d\,(np - F_{p-1})}{f_p}$

where $L_p$ is the lower boundary of the class containing the percentile, $d$ its width, $f_p$ its frequency, and $F_{p-1}$ the cumulative frequency of the classes below it.

Table: Frequency table for 20 grain bullet penetration depths into oak wood from a distance of 15 feet

Penetration Depth (mm)  Frequency  Cumulative Frequency
58 – 60                     5              5
60 – 62                     3              8
62 – 64                     6             14
64 – 66                     3             17
66 – 68                     1             18
68 – 70                     0             18
70 – 72                     2             20
Total                      20

$p = 0.75$, $np = 15$, $L_p = 64$, $F_{p-1} = 14$, $f_p = 3$, $d = 2$

$\text{75th percentile} = 64 + \frac{2(15 - 14)}{3} = 64.67$
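Both percentile rules can be checked numerically. The sketch below assumes the (n + 1)p = r.f rule for raw data and linear interpolation inside the class for grouped data. The function names and the clamping at the ends of the sorted data are choices made for this example, not something the slides specify.

```python
import math

def percentile_raw(data, p):
    """(100p)th percentile of raw data using (n + 1)p = r.f."""
    xs = sorted(data)
    n = len(xs)
    position = (n + 1) * p
    r = math.floor(position)      # integer part
    f = position - r              # fractional part
    if r < 1:                     # assumption: clamp to the smallest value
        return xs[0]
    if r >= n:                    # assumption: clamp to the largest value
        return xs[-1]
    return xs[r - 1] + f * (xs[r] - xs[r - 1])

def percentile_grouped(classes, p):
    """(100p)th percentile from (lower, upper, frequency) classes in increasing order."""
    n = sum(freq for _, _, freq in classes)
    target = n * p                # area to the left of the percentile
    cumulative = 0                # F_{p-1}: cumulative frequency below the current class
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= target:
            return lower + (upper - lower) * (target - cumulative) / freq
        cumulative += freq
    raise ValueError("p must lie between 0 and 1")

scores = [68, 75, 58, 47, 83, 34, 90, 71, 63, 79]
depths = [(58, 60, 5), (60, 62, 3), (62, 64, 6), (64, 66, 3),
          (66, 68, 1), (68, 70, 0), (70, 72, 2)]

print(f"30th percentile (raw): {percentile_raw(scores, 0.30):.2f}")          # 59.50
print(f"50th percentile (raw): {percentile_raw(scores, 0.50):.2f}")          # 69.50
print(f"75th percentile (grouped): {percentile_grouped(depths, 0.75):.2f}")  # 64.67
```

Other software may use a different interpolation convention for raw data, so small differences from these values are expected there.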

Boxplots

Five number summary
Min, QL (25th percentile), Median (50th percentile), QU (75th percentile), Max

MTB > Boxplot 'Overall'*'Year';
SUBC>   Transpose;
SUBC>   Box;
SUBC>   Type 1;
SUBC>   Color 2;
SUBC>   Symbol;
SUBC>   Outlier;
SUBC>   Title "Overall scores of Math244 students";
SUBC>   Title "(by year)".


Detecting Outliers

inner lower fence = QL − 1.5 IQR
inner upper fence = QU + 1.5 IQR
outer lower fence = inner lower fence − 1.5 IQR
outer upper fence = inner upper fence + 1.5 IQR

suspected outliers: observations between the inner and outer fences
outlier: observation beyond the outer fences
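The fence rule can be sketched in a few lines. This example assumes quartiles are computed with the (n + 1)p = r.f percentile rule and uses ten exam scores as sample data. The function names and the two extra test values (125 and 160) are hypothetical, added only to show each category.

```python
import math

def percentile(sorted_xs, p):
    """(100p)th percentile of already sorted data using (n + 1)p = r.f."""
    n = len(sorted_xs)
    position = (n + 1) * p
    r = math.floor(position)
    f = position - r
    if r < 1:
        return sorted_xs[0]
    if r >= n:
        return sorted_xs[-1]
    return sorted_xs[r - 1] + f * (sorted_xs[r] - sorted_xs[r - 1])

def fences(data):
    """Return the inner and outer fences, each as a (lower, upper) pair."""
    xs = sorted(data)
    q_l, q_u = percentile(xs, 0.25), percentile(xs, 0.75)
    step = 1.5 * (q_u - q_l)      # 1.5 IQR
    return {"inner": (q_l - step, q_u + step),
            "outer": (q_l - 2 * step, q_u + 2 * step)}

def classify(x, fence):
    """Label a value as 'outlier', 'suspected outlier', or 'regular'."""
    inner_lo, inner_hi = fence["inner"]
    outer_lo, outer_hi = fence["outer"]
    if x < outer_lo or x > outer_hi:
        return "outlier"
    if x < inner_lo or x > inner_hi:
        return "suspected outlier"
    return "regular"

scores = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
fence = fences(scores)
print(fence)                      # inner (30.125, 121.125), outer (-4.0, 155.25)
for value in scores + [125, 160]: # 125 and 160 are hypothetical extra values
    print(value, classify(value, fence))
```

None of the ten scores falls outside the inner fences, so a boxplot of this data would show no outlier symbols.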

Stem-and-Leaf Display

78 65 90 86 79 51 79 62 84 101

stem | leaf
   5 | 1
   6 | 2 5
   7 | 8 9 9
   8 | 4 6
   9 | 0
  10 | 1

N = 10
Leaf unit = 1
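A display like the one above can be generated programmatically. The following is a rough sketch that assumes non-negative data and one stem per ten leaf units; `stem_and_leaf` is a hypothetical helper, not a Minitab command.

```python
from collections import defaultdict

def stem_and_leaf(data, leaf_unit=1):
    """Return the lines of a stem-and-leaf display (non-negative data assumed)."""
    stems = defaultdict(list)
    for x in sorted(data):
        scaled = int(round(x / leaf_unit))
        stem, leaf = divmod(scaled, 10)   # last digit is the leaf
        stems[stem].append(leaf)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

scores = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
display = stem_and_leaf(scores)
print("\n".join(display))
print(f"N = {len(scores)}")
print("Leaf unit = 1")
```

Because the leaves are sorted within each stem, the median and quartiles can be read directly off the display by counting leaves.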

Stem-and-Leaf Display

 3 | 8
 4 | 0014556779
 5 | 0138

N = 15
Leaf unit = 1

Split stems (* for 0 – 4, + for 5 – 9):

 3+ | 8
 4* | 0014
 4+ | 556779
 5* | 013
 5+ | 8

N = 15
Leaf unit = 1

Lying with Statistical Graphics

Market penetration of four brands of cigarette: A, B, C, D
[Figure: charts of market penetration; shares shown are 37%, 27%, 18%, 18%]

Monthly sales of the four brands (in $m)
[Figure: charts of monthly sales for brands A, B, C, D; values shown include 1.5, 1.2, 1.1, 1.0]

Bad Graphic Designs


Bad Graphic Designs

[Figure: Percentage of college enrolment with age 25 and over]

Measures of Central Location

mean / average / arithmetic mean

$\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$  (raw data)

$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_n X_n}{f_1 + f_2 + \cdots + f_n}$  (ungrouped frequency table)

$\bar{X} = \frac{f_1 m_1 + f_2 m_2 + \cdots + f_n m_n}{f_1 + f_2 + \cdots + f_n}$  (grouped frequency table), $m_i$ = midpoint of the $i$th class

median / 50th percentile / second quartile

$M = X_{\left(\frac{n+1}{2}\right)}$  (n odd)

$M = \frac{1}{2}\left(X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}\right)$  (n even)

$X_i$: 10 12 7 30 15;  $X_{(i)}$: 7 10 12 15 30;  $\bar{X} = 14.8$, $M = X_{(3)} = 12$
$X_i$: 10 12 7 300 15;  $X_{(i)}$: 7 10 12 15 300;  $\bar{X} = 68.8$, $M = X_{(3)} = 12$

78 65 90 86 79 51 79 62 84 101
mode = 79, M = 79, $\bar{X}$ = 77.5

Measures of Variation

$\text{range} = X_{(n)} - X_{(1)}$

$\text{IQR} = Q_U - Q_L$

$\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} |X_i - \bar{X}|$

$\text{SD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2}$

78 65 90 86 79 51 79 62 84 101
range = 101 − 51 = 50, IQR = 22.75, MAD = 10.9, SD = 13.88

78 45 110 106 79 21 79 32 84 141
range = 120, IQR = 65.25, MAD = 26.9, SD = 35.02

Standard Deviation

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2}$

Why do we use (n−1) instead of n?

n = 1: SD = 0, s = undefined (more reasonable)

$X_1 - \bar{X} = n\bar{X} - (X_2 + \cdots + X_n) - \bar{X}$
$= -(X_2 - \bar{X}) - (X_3 - \bar{X}) - \cdots - (X_n - \bar{X})$

only (n−1) pieces of information in $\sum_{i=1}^{n} (X_i - \bar{X})^2$

Skewness

$\gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^3}{(\text{SD})^3}$

$\gamma_2 = \frac{\text{mean} - \text{median}}{\text{SD}}$

skewed to right (+ve skewed): $\gamma_1 > 0$, $\gamma_2 > 0$
skewed to left (−ve skewed): $\gamma_1 < 0$, $\gamma_2 < 0$
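The summary measures above can be verified with a short script. This sketch follows the definitions in this section (SD with divisor n, s with divisor n − 1, the two skewness coefficients). All function names are chosen for this example, and the data are the two small samples used in the section.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    # n odd: middle value; n even: average of the two middle values
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def mad(xs):
    """Mean absolute deviation about the mean."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def sd(xs):
    """Standard deviation with divisor n."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def s(xs):
    """Sample standard deviation with divisor n - 1 (undefined for n = 1)."""
    if len(xs) < 2:
        raise ValueError("s is undefined for a single observation")
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def skewness(xs):
    """Return (gamma_1, gamma_2) as defined in this section."""
    m, spread = mean(xs), sd(xs)
    gamma_1 = (sum((x - m) ** 3 for x in xs) / len(xs)) / spread ** 3
    gamma_2 = (m - median(xs)) / spread
    return gamma_1, gamma_2

scores = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
print(f"mean = {mean(scores)}, median = {median(scores)}")
print(f"range = {max(scores) - min(scores)}, MAD = {mad(scores):.1f}, SD = {sd(scores):.2f}")
print(f"s = {s(scores):.2f}, skewness = {skewness(scores)}")

# One extreme value moves the mean a lot but leaves the median unchanged.
for sample in ([10, 12, 7, 30, 15], [10, 12, 7, 300, 15]):
    print(sample, "mean =", mean(sample), "median =", median(sample))
```

Python's standard `statistics` module offers `pstdev` and `stdev` for the two divisors; they are written out here only to mirror the formulas.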
