Lecture 02 2017 18 Ch02 EDA

Exploratory Data Analysis
Prof. dr. Siswanto Agus Wilopo, M.Sc., Sc.D.
Department of Biostatistics, Epidemiology and
Population Health
Faculty of Medicine
Universitas Gadjah Mada
sawilopo@yahoo.com

Universitas Gadjah Mada, Faculty of Medicine, Department of Public Health

Table:











Assess the use of tables for each type of data.
Explain what a frequency distribution is.
Create a frequency table from raw data.
Construct relative frequency, cumulative
frequency and relative cumulative frequency
tables.
Construct grouped frequency tables.
Construct a cross-tabulation table.
Illustrate the use of a contingency table.
Create a table with rank data.

I: 2013
Universitas Gadjah Mada, Faculty of Medicine, Department
of Biostatistics, Epidemiology and Population Health


Graph:












Assess the most appropriate chart for a given data type.
Construct pie charts and simple, clustered and stacked bar
charts.
Create histograms.
Create step charts and ogives.
Construct time series charts, including statistical process control
(SPC) charts.

Interpret and assess what a chart reveals.
Assess the meaning of the ‘shape’ of a frequency
distribution.
Appraise negatively skewed, symmetric and positively skewed
distributions.
Describe a bimodal distribution.
Describe the approximate shape of a frequency distribution from
a frequency table or chart.
Assess whether data can be considered normally distributed.


Numeric Summary:











Describe what a summary measure of location is, and understand
the meaning of, and the difference between, the mode, the
median and the mean.
Compute the mode, median and mean for a set of values.
Formulate the role of data type and distributional shape in
choosing the most appropriate measure of location.
Describe what a percentile is, and calculate any given
percentile value.
Describe what a summary measure of spread is.
Differentiate between, and calculate, the
range, the interquartile range and the standard deviation.

Interpret and estimate percentile values.
Formulate the role of data type and distributional shape in
choosing the most appropriate measure of spread.


The Big Picture
Recall “The Big Picture,” the four-step process
that encompasses statistics (as it is presented in
this course):

1. Producing Data: choosing a sample from the
population of interest and collecting data.
2. Exploratory Data Analysis (EDA) or Descriptive
Statistics: summarizing the data we’ve collected.
3. and 4. Probability and Inference: drawing conclusions
about the entire population based on the data collected
from the sample.

Even though in practice it is the second step in the
process, we are going to look at Exploratory Data
Analysis (EDA) first.


Goals of EDA


Exploratory Data Analysis (EDA) is how
we make sense of the data by
converting them from their raw form to
a more informative one.


EDA consists of:






• organizing and summarizing the raw
data,
• discovering important features and
patterns in the data and any striking
deviations from those patterns, and then
• interpreting our findings in the context of
the problem.


(continued)
EDA can also be useful for:
• describing the distribution of a single
variable (center, spread, shape, outliers)
• checking data (for errors or other
problems)
• checking the assumptions of more complex
statistical analyses
• investigating relationships between
variables

EDA




Exploratory data analysis (EDA) methods are
often called Descriptive Statistics due to the
fact that they simply describe, or provide
estimates based on, the data at hand.
Comparisons can be visualized and values of
interest estimated using EDA but descriptive
statistics alone will provide no information
about the certainty of our conclusions.


Important Features of Exploratory Data
Analysis
There are two important features to the
structure of the EDA unit in this course:
• The material in this unit covers two
broad topics:
   - Examining Distributions: exploring data one
     variable at a time.
   - Examining Relationships: exploring data two
     variables at a time.


Important Features of Exploratory Data
Analysis


In Exploratory Data Analysis, our
exploration of data will always consist
of the following two elements:



visual displays, supplemented by
numerical measures.

Try to remember these structural
themes, as they will help you orient
yourself along the path of this unit.

EXAMINING DISTRIBUTIONS


Examining Distributions
We will begin the EDA part of the course
by exploring (or looking at) one variable
at a time.
• As we have seen, the data for each
variable consist of a long list of values
(whether numerical or not), and are not
very informative in that form.

Examining Distributions




In order to convert these raw data into
useful information, we need to summarize
and then examine the distribution of the
variable.
By distribution of a variable, we mean:



what values the variable takes, and
how often the variable takes those values.

We will first learn how to summarize and
examine the distribution of a single
categorical variable, and then do the same
for a single quantitative variable.

ONE CATEGORICAL VARIABLE


Example:
Distribution of One Categorical Variable




What is your perception of your own
body? Do you feel that you are
overweight, underweight, or about right?
A random sample of 1,200 college
students were asked this question as
part of a larger survey. The following
table shows part of the responses:


Example: raw data for 5 of the 1,200 students

Student       Body Image
student 25    overweight
student 26    about right
student 27    underweight
student 28    about right
student 29    about right



Here is some information that would be
interesting to get from these data:






What percentage of the sampled students fall into
each category?
How are students divided across the three body
image categories?
Are they equally divided? If not, do the
percentages follow some other kind of pattern?






There is no way that we can answer
these questions by looking at the raw
data, which are in the form of a long list
of 1,200 responses, and thus not very
useful.
However, all of these questions will be
easily answered once we summarize and
look at the distribution of the variable
Body Image (i.e., once we summarize
how often each of the categories
occurs).

Numerical Measures




In order to summarize the distribution of
a categorical variable, we first create a
table of the different values (categories)
the variable takes, how many times each
value occurs (count) and, more
importantly, how often each value occurs
(by converting the counts to
percentages).
The result is often called a Frequency
Distribution or Frequency Table.

A Frequency Distribution or Frequency
Table

Category       Count     Percent
About right    855       (855/1200)*100 = 71.3%
Overweight     235       (235/1200)*100 = 19.6%
Underweight    110       (110/1200)*100 = 9.2%
Total          n=1200    100%
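
The frequency table can also be built programmatically. The sketch below is only an illustration: the helper name `frequency_table` is hypothetical, and the list of 1,200 responses is rebuilt from the counts in the table rather than taken from the actual survey file.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (category, count, percent), most frequent first."""
    counts = Counter(values)
    n = len(values)
    return [(cat, c, 100 * c / n) for cat, c in counts.most_common()]

# Rebuild the 1,200 responses from the counts shown in the table
# (assumption: the real data file is not available here).
responses = (["about right"] * 855
             + ["overweight"] * 235
             + ["underweight"] * 110)

table = frequency_table(responses)
for cat, count, pct in table:
    # Two decimals are printed; the slide rounds these to one decimal.
    print(f"{cat:<12} {count:>5} {pct:>7.2f}%")
```

Counting first and converting to percentages afterwards mirrors the two steps described for summarizing a categorical variable.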


Visual or Graphical Displays: Pie Chart


Visual or Graphical Displays

OR


ONE QUANTITATIVE VARIABLE






To display data from one quantitative
variable graphically, we can use either
a histogram or boxplot.
We will also present several “by-hand”
displays such as the stemplot and
dotplot


Numerical Measures




The overall pattern of the distribution of
a quantitative variable is described by
its shape, center, and spread.
By inspecting the histogram or boxplot,
we can describe the shape of the
distribution, but we can only get a rough
estimate for the center and spread.


Numerical Measures


A description of the distribution of a
quantitative variable must include, in
addition to the graphical display, a
more precise numerical description of
the center and spread of the
distribution.


Numerical Measures







We will learn:
• how to quantify the center and spread of
a distribution with various numerical
measures;
• some of the properties of those numerical
measures; and
• how to choose the appropriate numerical
measures of center and spread to
supplement the histogram.
We will also discuss a few measures of
position or location, which allow us to
quantify where a particular value is in
the distribution of all values.

How To Create Histograms
Here are the exam grades of 15 students:
88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73
Score       Count
[40-50)     1
[50-60)     2
[60-70)     4
[70-80)     5
[80-90)     2
[90-100)    1
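
The counts in this table can be reproduced with a few lines of code. This is a minimal sketch, assuming equal-width, half-open intervals [lo, lo + width) as in the table; the function name `bin_counts` is hypothetical, and values outside the stated range are not handled.

```python
def bin_counts(values, start, stop, width):
    """Count values falling in half-open intervals [lo, lo + width)."""
    counts = {(lo, lo + width): 0 for lo in range(start, stop, width)}
    for v in values:
        # Lower edge of the interval that contains v.
        lo = start + ((v - start) // width) * width
        counts[(lo, lo + width)] += 1
    return counts

grades = [88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73]
counts = bin_counts(grades, 40, 100, 10)

# Print a text histogram: one star per student in each interval.
for (lo, hi), c in counts.items():
    print(f"[{lo}-{hi})  {'*' * c}  {c}")
```

A plotted histogram draws one bar per interval with height equal to the count (or the percentage) in that interval.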

Stemplot (Stem and Leaf Plot)




The stemplot (also called stem and leaf plot) is
another graphical display of the distribution of
a quantitative variable.
The idea is to separate each data point into a
stem and leaf, as follows:







The leaf is the right-most digit.
The stem is everything except the right-most digit.
So, if the data point is 34, then 3 is the stem and 4 is the leaf.
If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.

Note: For this to work, ALL data points should
be rounded to the same number of decimal
places.

Stemplot (Stem and Leaf Plot)
EXAMPLE: Best Actress Oscar Winners
• We will use the Best Actress Oscar winners
example:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21
41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
To make a stemplot:
• Separate each observation into a stem and a leaf.
• Write the stems in a vertical column with the
smallest at the top, and draw a vertical line at the
right of this column.
• Go through the data points, and write each leaf in
the row to the right of its stem.
• Rearrange the leaves in increasing order.
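
The four steps can be sketched in code. This is an illustration only, assuming non-negative whole numbers so that the leaf is the last digit and the stem is everything before it; the function name `stemplot` is hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows (smallest stem first) for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):              # sorting puts the leaves in increasing order
        leaves[v // 10].append(v % 10)    # stem = all but last digit, leaf = last digit
    rows = []
    # Stems with no observations are kept so that gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
    return rows

data = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21,
        41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

rows = stemplot(data)
print("\n".join(rows))
```

Keeping the empty stem (5) in the output preserves the gap between the bulk of the data and the three largest values.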

When rotated 90 degrees counterclockwise, the stemplot
visually resembles a histogram.


Summary Measures
Describing Data Numerically
Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Shape: Skewness


Central Tendency


Measures of Central Tendency
Overview
Central Tendency
• Arithmetic Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
• Median: midpoint of ranked values
• Mode: most frequently observed value
• Geometric Mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$


Arithmetic Mean


The arithmetic mean (sample mean)
is the most common measure of
central tendency


For a sample of size n:

$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$

where n is the sample size and the $X_i$ are the observed values.


Arithmetic Mean





(continued)

The most common measure of central tendency
Mean = sum of values divided by the number of
values
Affected by extreme values (outliers)

Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4


Median


In an ordered array, the median is the
“middle” number (50% above, 50%
below)

[Figure: two dot plots on a 0 to 10 axis; Median = 3 in both]



Not affected by extreme values


Finding the Median


The location of the median:

Median position = $\frac{n+1}{2}$ position in the ordered data

• If the number of values is odd, the median is the middle
number
• If the number of values is even, the median is the average of
the two middle numbers

Note that $\frac{n+1}{2}$ is not the value of the median, only
the position of the median in the ranked data
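
The position rule translates directly into code. The sketch below is illustrative; the function name `median_by_position` is hypothetical, and it assumes a non-empty list of numbers.

```python
def median_by_position(values):
    """Median found via the (n + 1) / 2 position in the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    pos = (n + 1) / 2                 # 1-based position, may end in .5
    lower = ranked[int(pos) - 1]      # value at the whole-number part of the position
    if pos == int(pos):               # odd n: the middle value itself
        return lower
    upper = ranked[int(pos)]          # even n: average the two middle values
    return (lower + upper) / 2

odd_case = median_by_position([1, 2, 3, 4, 10])   # n = 5, position 3
even_case = median_by_position([4, 1, 3, 2])      # n = 4, position 2.5
print(odd_case, even_case)
```

Sorting first matters: the position refers to the ranked data, not to the order in which the values were collected.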


Mode








A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or
categorical (nominal) data
There may be no mode
There may be several modes

[Figure: dot plot on a 0 to 14 axis; Mode = 9]
[Figure: dot plot on a 0 to 6 axis; No Mode]


Problem
Which measure of location
is the “best”?


Mean is generally used, unless
extreme values (outliers) exist



Then median is often used, since the
median is not sensitive to extreme
values.


Measures of Location
Comparison of Mean and Median
Let us use cholesterol data as an example:

145, 159, 166, 166, 195, 205, 250
We found the mean is 183.7 and the
median is 166.

Measures of Location
Comparison of Mean and Median
Suppose we replace 250 with 215:

145, 159, 166, 166, 195, 205, 215
We will find the mean is 178.7 and the
median remains 166.
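
The comparison can be checked with Python's standard `statistics` module. This is a small sketch; the variable names are chosen here for illustration.

```python
from statistics import mean, median

original = [145, 159, 166, 166, 195, 205, 250]
modified = [145, 159, 166, 166, 195, 205, 215]   # 250 replaced with 215

for label, data in (("original", original), ("modified", modified)):
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
```

Changing only the largest value moves the mean by 5 units (35/7) while the median does not move at all, which is why the median is described as not sensitive to extreme values.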

Geometric Mean


Geometric mean

• Used to measure the rate of change of a variable
over time

$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$

Geometric mean rate of return

• Measures the status of an investment over time

$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$

where $R_i$ is the rate of return in time period i

Example
An investment of $100,000 declined to $50,000 at
the end of year one and rebounded to $100,000
at end of year two:

X1 = $100,000
X2 = $50,000 (50% decrease)
X3 = $100,000 (100% increase)

The overall two-year return is zero, since it started and
ended at the same level.

Example

(continued)

Use the 1-year returns to compute the
arithmetic mean and the geometric mean:

Arithmetic mean rate of return (misleading result):

$\bar{X} = \frac{(-50\%) + (100\%)}{2} = 25\%$

Geometric mean rate of return (more accurate result):

$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$
$= [(1 + (-50\%)) \times (1 + (100\%))]^{1/2} - 1$
$= [(0.50) \times (2)]^{1/2} - 1 = 1^{1/2} - 1 = 0\%$
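
The same two calculations can be reproduced in code. This is a minimal sketch in which returns are written as decimal fractions (-0.50 for a 50% decrease); the function name `geometric_mean_return` is hypothetical.

```python
import math

def geometric_mean_return(returns):
    """Geometric mean rate of return for per-period returns given as fractions."""
    growth = math.prod(1 + r for r in returns)   # overall growth factor
    return growth ** (1 / len(returns)) - 1

returns = [-0.50, 1.00]          # year 1: 50% decrease, year 2: 100% increase

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(f"arithmetic mean return: {arithmetic:.0%}")
print(f"geometric mean return:  {geometric:.0%}")
```

The geometric mean works on growth factors (1 + R), so a halving followed by a doubling multiplies to 1, which matches the fact that the investment ended where it started.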


MEASURE OF VARIATION


Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation

Measures of variation give
information on the spread
or variability of the data
values.
[Figure: two distributions with the same center and different variation]


Range



Simplest measure of variation
Difference between the largest and
the smallest values in a set of data:
Range = Xlargest – Xsmallest

Example:
[Figure: dot plot on a 0 to 14 axis, smallest value 1 and largest value 14]
Range = 14 - 1 = 13

Disadvantages of the Range


Ignores the way in which data are
distributed
[Figure: two dot plots on a 7 to 12 axis with different distributions; Range = 12 - 7 = 5 for both]

Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
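
The effect of a single outlier on the range is easy to verify. In this short sketch the helper name `value_range` is hypothetical, and the two lists are the ones shown above.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

# 24 values shared by both data sets, followed by the last value that differs.
common = [1] * 11 + [2] * 8 + [3] * 4 + [4]
without_outlier = common + [5]
with_outlier = common + [120]

print(value_range(without_outlier), value_range(with_outlier))
```

Only the two most extreme values enter the calculation, so replacing one observation out of 25 changes the range from 4 to 119.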


Quartiles


Quartiles split the ranked data into 4 segments
with an equal number of values per segment
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are
larger)
Only 25% of the observations are greater than the third
quartile


Quartile Formulas
Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4

where n is the number of observed values


Calculating Quartiles


Example: Find the first quartile

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22

(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data
so use the value half way between the 2nd and 3rd values,
so

Q1 = 12.5

Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency

Quartiles


(continued)

Example:

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22

(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = 12.5
Q2 is in the (9+1)/2 = 5th position of the ranked data,
so Q2 = median = 16
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
so Q3 = 19.5
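
The position formulas can be sketched in code. The function names below are hypothetical. One assumption goes beyond the slides: when a position is fractional, the value is found by linear interpolation between its two neighbours. That reproduces the "half way between" rule for positions ending in .5, but some textbooks instead round positions ending in .25 or .75 to the nearest whole position. The sketch also assumes at least three observations so that every position falls inside the data.

```python
def value_at_position(ranked, pos):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return ranked[lo - 1]
    # Assumption: interpolate linearly between the two neighbouring values.
    return ranked[lo - 1] + frac * (ranked[lo] - ranked[lo - 1])

def quartiles(values):
    """Q1, Q2, Q3 using positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ranked = sorted(values)
    n = len(ranked)
    return tuple(value_at_position(ranked, k * (n + 1) / 4) for k in (1, 2, 3))

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3, "IQR =", q3 - q1)
```

For this sample, the standard library call `statistics.quantiles(sample, n=4)` with its default "exclusive" method is based on the same (n + 1) positions and gives the same three values.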

Interquartile Range


Can eliminate some outlier problems by
using the interquartile range



Eliminate some high- and low-valued
observations and calculate the range
from the remaining values



Interquartile range = 3rd quartile – 1st quartile

= Q3 – Q1

Interquartile Range
Example:
Boxplot summary (each of the four segments holds 25% of the observations):

X minimum    Q1    Median (Q2)    Q3    X maximum
12           30    45             57    70

Interquartile range
= 57 - 30 = 27


Variance


Average (approximately) of squared
deviations of values from the mean
Sample variance:

$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

where
$\bar{X}$ = mean
n = sample size
$X_i$ = ith value of the variable X


Standard Deviation






Most commonly used measure of
variation
Shows variation about the mean
Is the square root of the variance
Has the same units as the original data
Sample standard deviation:

$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$


Calculation Example:
Sample Standard Deviation
Sample Data ($X_i$): 10 12 14 15 17 18 18 24

n = 8, Mean = $\bar{X}$ = 16

$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$
$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$
$= \sqrt{\frac{130}{7}} = 4.3095$

A measure of the “average”
scatter around the mean
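
The hand calculation can be checked step by step. This is a small sketch using only the standard library; the variable names are chosen here for illustration.

```python
import math
from statistics import stdev

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
xbar = sum(data) / n                                   # sample mean
sum_sq_dev = sum((x - xbar) ** 2 for x in data)        # sum of squared deviations
s = math.sqrt(sum_sq_dev / (n - 1))                    # divide by n - 1, then take the root

print(f"mean = {xbar}, sum of squared deviations = {sum_sq_dev}, S = {s:.4f}")
print(f"statistics.stdev gives {stdev(data):.4f}")
```

`statistics.stdev` uses the same n - 1 divisor, so the two results agree.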


Measuring variation
Small standard deviation
Large standard deviation


Comparing Standard Deviations
[Figure: three dot plots on an 11 to 21 axis, each with the same mean]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.567


Advantages of Variance and
Standard Deviation


Each value in the data set is used in
the calculation



Values far from the mean are given
extra weight
(because deviations from the mean are squared)


Coefficient of Variation


Measures relative variation



Always in percentage (%)



Shows variation relative to mean



Can be used to compare two or more
sets of data measured in different
units

$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$




Comparing Coefficient
of Variation


Hospital A:
• Average surplus in the last 10 years = 50 Billion Rp.
• Standard deviation = 5 Billion Rp.

$CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{5 \text{ Billion Rp.}}{50 \text{ Billion Rp.}} \cdot 100\% = 10\%$

Hospital B:
• Average surplus in the last 10 years = 100 Billion Rp.
• Standard deviation = 5 Billion Rp.

$CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{5 \text{ Billion Rp.}}{100 \text{ Billion Rp.}} \cdot 100\% = 5\%$

Both hospitals have the same standard deviation, but hospital B is
less variable relative to its surplus.
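
The comparison of the two hospitals can be written as a short calculation. This is a sketch only; the function name `coefficient_of_variation` is hypothetical, and amounts are in billions of Rp.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation as a percentage: (S / mean) * 100."""
    return std_dev / mean * 100

cv_a = coefficient_of_variation(5, 50)     # Hospital A
cv_b = coefficient_of_variation(5, 100)    # Hospital B

print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```

Because the units cancel in S / mean, the result is a pure percentage, which is what allows data sets measured in different units to be compared.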



Standardized Scores (Z-Scores)








Z-scores use the mean and standard deviation as the
primary measures of center and spread and are therefore
most useful when the mean and standard deviation are
appropriate, i.e. when the distribution is reasonably
symmetric with no extreme outliers.
For any individual, the z-score tells us how many standard
deviations the raw score for that individual deviates from
the mean and in what direction.
To calculate a z-score, we take the individual value and
subtract the mean and then divide this difference by the
standard deviation.
A positive z-score indicates the individual is above
average and a negative z-score indicates the individual is
below average.


Z Scores


A measure of distance from the mean (for
example, a Z-score of 2.0 means that a value is 2.0
standard deviations from the mean)



The difference between a value and the mean,
divided by the standard deviation



A value with a Z score above 3.0 or below -3.0 is considered an
outlier

$Z = \frac{X - \bar{X}}{S}$


Z Scores

(continued)

Example:


If the mean is 14.0 and the standard deviation is
3.0, what is the Z score for the value 18.5?

$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$


The value 18.5 is 1.5 standard deviations above the
mean



(A negative Z-score would mean that a value is less
than the mean)
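
The worked example, together with the outlier rule from the previous slide, can be sketched as follows. The function name `z_score` is hypothetical.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

z = z_score(18.5, 14.0, 3.0)
is_outlier = abs(z) > 3.0          # rule of thumb: beyond 3 standard deviations

print(f"Z = {z}, outlier: {is_outlier}")
```

A value below the mean, for example 9.5 with the same mean and standard deviation, gives a negative Z score of the same size.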


Quantitative and Graphical Approach:

MEASURE SPREAD AND DISTRIBUTION


DESCRIBING DISTRIBUTIONS


Features of Distributions of
Quantitative Variables


Shape


When describing the shape of a
distribution, we should consider:
• Symmetry/skewness of the
distribution.
• Peakedness (modality): the
number of peaks (modes) the
distribution has.


Symmetry/skewness of the
distribution.


Shape of a Distribution


Describes how data are distributed



Measures of shape


Symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean


Numerical Measures
for a Population




Population summary measures are called
parameters
The population mean is the sum of the values in the
population divided by the population size, N
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$

where
μ = population mean
N = population size
$X_i$ = ith value of the variable X


Population Variance


Average of squared deviations of values from the mean

Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

where
μ = population mean
N = population size
Xi = ith value of the variable X


Population Standard Deviation







Most commonly used measure of variation
Shows variation about the mean
Is the square root of the population variance
Has the same units as the original data

Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
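
The three population formulas (mean, variance, standard deviation) translate directly into code. This is a minimal sketch in Python; the list `population` is a made-up set of eight values treated as a complete population, and the function names are illustrative.

```python
from math import sqrt


def population_mean(values):
    """mu: sum of the values divided by the population size N."""
    return sum(values) / len(values)


def population_variance(values):
    """sigma^2: average squared deviation from mu (divides by N, not N - 1)."""
    mu = population_mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def population_sd(values):
    """sigma: square root of the population variance, in the data's own units."""
    return sqrt(population_variance(values))


# Hypothetical population, invented for illustration only.
population = [2, 4, 4, 4, 5, 5, 7, 9]

mu = population_mean(population)
sigma_squared = population_variance(population)
sigma = population_sd(population)
print(mu, sigma_squared, sigma)
```

Dividing by N is appropriate only when the values are the whole population; the sample formulas later in the lecture divide by n − 1.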


The Sample Covariance




The sample covariance measures the strength of the linear relationship between two variables (called bivariate data)
The sample covariance:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

Only concerned with the strength of the relationship (its size depends on the units of X and Y)

No causal effect is implied
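
The sample covariance formula can be evaluated by hand in a few lines. This is a minimal sketch in Python; the paired values `x` and `y` are made up for illustration, and `sample_covariance` is an illustrative helper name.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the two sample means, over n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical bivariate data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
print(cov_xy)  # positive: x and y tend to move in the same direction
```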


Interpreting Covariance


Covariance between two random variables:

cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not establish independence)
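
A covariance of zero rules out only a linear association. As a hedged illustration with made-up values, Y below is completely determined by X (Y = X²), yet the sample covariance is exactly zero, so zero covariance on its own should not be read as independence.

```python
def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: x is symmetric around 0 and y depends on x exactly.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]  # y = x squared, a perfect but non-linear relationship

cov_xy = sample_covariance(x, y)
print(cov_xy)  # zero, even though y is a function of x
```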


Coefficient of Correlation




Measures the relative strength of the linear relationship between two variables
Sample coefficient of correlation:

$$r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}$$

where

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$
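
The definition of r as the covariance divided by the two sample standard deviations can be followed step by step. This is a minimal sketch in Python on made-up paired values; the helper names are illustrative.

```python
from math import sqrt


def sample_sd(values):
    """S: square root of the sum of squared deviations over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_covariance(xs, ys):
    """cov(X, Y) with the n - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y); unit free, between -1 and 1."""
    return sample_covariance(xs, ys) / (sample_sd(xs) * sample_sd(ys))


# Hypothetical bivariate data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_rescaled = correlation([100 * v for v in x], y)  # change the units of x
print(round(r, 4), round(r_rescaled, 4))
```

Rescaling x changes the covariance but leaves r unchanged, which is what "unit free" means in practice.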


Features of
Correlation Coefficient, r








Unit free
Ranges between –1 and 1
The closer to –1, the stronger the
negative linear relationship
The closer to 1, the stronger the
positive linear relationship
The closer to 0, the weaker the linear
relationship


Scatter Plots of Data with Various
Correlation Coefficients
[Figure: six scatter plots of Y against X, with panels labeled r = -1, r = -.6, r = 0, r = +1, r = +.3, and r = 0]


The Empirical Rule




If the data distribution is approximately bell-shaped, then the interval:
μ ± 1σ contains about 68% of the values in the population or the sample

[Figure: bell-shaped curve with 68% marked over the interval μ ± 1σ]

The Empirical Rule

μ ± 2σ contains about 95% of the values in the population or the sample
μ ± 3σ contains about 99.7% of the values in the population or the sample

[Figure: bell-shaped curves with 95% marked over μ ± 2σ and 99.7% marked over μ ± 3σ]
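
The 68%, 95% and 99.7% figures are the values an exactly normal distribution gives. The sketch below, in Python, computes them from the standard error function; it assumes a perfectly normal distribution, so real bell-shaped data will only match approximately.

```python
from math import erf, sqrt


def normal_share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return erf(k / sqrt(2))


shares = {k: normal_share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"mu +/- {k} sigma: {100 * share:.1f}%")
```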


Chebyshev Rule


Regardless of how the data are distributed, at least (1 − 1/k²) × 100% of the values will fall within k standard deviations of the mean (the bound is informative only for k > 1)

Examples:

At least                        within
(1 − 1/1²) × 100% = 0%          k = 1 (μ ± 1σ)
(1 − 1/2²) × 100% = 75%         k = 2 (μ ± 2σ)
(1 − 1/3²) × 100% ≈ 89%         k = 3 (μ ± 3σ)
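
The Chebyshev bound is a one-line calculation. This is a minimal sketch in Python; the function name is illustrative, and clamping at zero for k below 1 is a choice made here so that the result always reads as a valid share.

```python
def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the formula gives 0 or less, i.e. no information.
    return max(0.0, 1 - 1 / k ** 2)


bounds = {k: chebyshev_lower_bound(k) for k in (1, 2, 3)}
for k, bound in bounds.items():
    print(f"k = {k}: at least {100 * bound:.0f}%")
```

These bounds are much looser than the empirical rule (75% versus about 95% for k = 2) because they make no assumption about the shape of the distribution.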


MEASURES OF SPREAD


Five-Number Summary








The combination of the five numbers (min, Q1, M, Q3, Max) is called the five-number summary.
It provides a quick numerical description of both the center and spread of a distribution.
Each of the values represents a measure of position in the dataset.
The min and max provide the boundaries, and the quartiles and median provide information about the 25th, 50th, and 75th percentiles.


Inter-Quartile Range (IQR)

IQR = Q3 − Q1, the distance between the first and third quartiles.


The 1.5(IQR) Criterion for Outliers


An observation is considered a suspected outlier or potential outlier if it is:
• below Q1 – 1.5(IQR) or
• above Q3 + 1.5(IQR)



The following picture (not to scale)
illustrates this rule:


EXAMPLE:
Best Actress Oscar Winners
We will continue with the Best Actress Oscar winners example
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33
35 45 49 39 34 26 25 35 33
We can now use the 1.5(IQR) criterion to check whether the three highest ages should indeed be classified as potential outliers:
For this example, we found Q1 = 32 and Q3 = 41.5, which give IQR = 9.5

Q1 – 1.5(IQR) = 32 – (1.5)(9.5) = 17.75
Q3 + 1.5(IQR) = 41.5 + (1.5)(9.5) = 55.75

The 1.5(IQR) criterion tells us that any
observation with an age that is below
17.75 or above 55.75 is considered a
suspected outlier.
We therefore conclude that the
observations with ages of 61, 74 and
80 should be flagged as suspected
outliers in the distribution of ages.
Note that since the smallest observation is 21,
there are no suspected low outliers in this
distribution.
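
The arithmetic of this example can be reproduced directly. The sketch below, in Python, uses the 32 ages listed in the example and takes Q1 = 32 and Q3 = 41.5 as given by the lecture rather than recomputing them; the variable names are illustrative.

```python
# Ages of Best Actress Oscar winners, as listed in the example.
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

# Quartiles as reported in the lecture (not recomputed here).
q1, q3 = 32, 41.5
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# An observation is a suspected outlier if it falls outside the fences.
suspected_outliers = sorted(a for a in ages if a < lower_fence or a > upper_fence)

print(lower_fence, upper_fence)
print(suspected_outliers)
```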


Possible methods for handling
outliers in practice
Why is it important to identify possible outliers, and how should
they be dealt with? The answers to these questions depend on the
reasons for the outlying values.
Here are several possibilities:
• Even though it is an extreme value, if an outlier can be
understood to have been produced by essentially the same sort
of physical or biological process as the rest of the data, and if
such extreme values are expected to eventually occur again,
then such an outlier indicates something important and
interesting about the process you’re investigating, and it should
be kept in the data.




• If an outlier can be explained to have been produced
under fundamentally different conditions from the rest of
the data (or by a fundamentally different process), such
an outlier can be removed from the data if your goal is to
investigate only the process that produced the rest of the
data.

• An outlier might indicate a mistake in the data (like a
typo, or a measuring error), in which case it should be
corrected if possible or else removed from the data
before calculating summary statistics or making
inferences from the data (and the reason for the mistake
should be investigated).


Identification for suspected outliers

BOXPLOTS


EXAMPLE: Best Actress Oscar Winners
We will use data on the Best Actress Oscar winners as an example:
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
The five-number summary of the ages of Best Actress Oscar winners (1970-2001) is:
min = 21, Q1 = 32, M = 35, Q3 = 41.5, Max = 80
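
The five numbers above can be reproduced from the raw ages. The sketch below, in Python, assumes the common textbook convention that Q1 and Q3 are the medians of the lower and upper halves of the sorted data (with the middle value left out when n is odd); this convention matches the values quoted here, but other quartile conventions, including those built into some software, can give slightly different quartiles.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n the middle observation belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return min(data), median(lower), median(data), median(upper), max(data)


# Ages of Best Actress Oscar winners, as listed in the example.
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

summary = five_number_summary(ages)
print(summary)
```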

Box Plot and Outliers






Lines extend from the edges of the box to the smallest and largest observations that were not classified as suspected outliers (using the 1.5(IQR) criterion).
In our example, we have no low outliers, so the bottom line goes down to the smallest observation, which is 21.
Since we have three high outliers (61, 74, and 80), the top line extends only up to 49, which is the largest observation that has not been flagged as an outlier.
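
Where the two lines (whiskers) end follows mechanically from the 1.5(IQR) fences. This is a minimal sketch in Python that takes Q1 = 32 and Q3 = 41.5 from the example as given; `whisker_ends` is an illustrative helper name, not a library function.

```python
def whisker_ends(values, q1, q3):
    """Smallest and largest observations that lie inside the 1.5(IQR) fences."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return min(inside), max(inside)


# Ages of Best Actress Oscar winners, as listed in the example.
ages = [34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
        21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33]

bottom, top = whisker_ends(ages, q1=32, q3=41.5)
print(bottom, top)
```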


The following information is visually
depicted in the boxplot






• the five-number summary (blue)
• the range and IQR (red)
• outliers (green)


Side-by-side boxplots of the age
distributions by gender


Box Plot Summarized








The five-number summary of a distribution consists of M, Q1, Q3 and the extremes Min, Max.
The median describes the center, and the extremes (which give the range) and the quartiles (which give the IQR) describe the spread.
The boxplot visually displays the five-number summary and any suspected outliers identified by the 1.5(IQR) criterion.
Boxplots can be presented side by side to compare and contrast distributions from two or more groups.


ROLE-TYPE CLASSIFICATION


Classification


In most studies involving two variables, each of the variables has a role. We distinguish between:

• the response variable (dependent): the outcome of the study; and
• the explanatory variable (independent): the variable that claims to explain, predict or affect the response.

The variable we wish to predict is commonly called the dependent variable, the outcome variable, or the response variable.
Any variable we are using to predict (or explain differences in) the outcome is commonly called an explanatory variable, an independent variable, a predictor variable, or a covariate.


If we further classify each of the two relevant variables according to type (categorical or quantitative), we get the following 4 possibilities for “role-type classification”:

• Categorical explanatory and quantitative response (Case C→Q)
• Categorical explanatory and categorical response (Case C→C)
• Quantitative explanatory and quantitative response (Case Q→Q)
• Quantitative explanatory and categorical response (Case Q→C)


Case C→Q:




Exploring the relationship amounts to comparing the distributions of the quantitative response variable for each category of the explanatory variable.
To do this, we use:

• Display: side-by-side boxplots.
• Numerical summaries: descriptive statistics of the response variable, for each value (category) of the explanatory variable separately.
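
The numerical summaries for Case C→Q amount to splitting the response values by category and describing each group on its own. This is a minimal sketch in Python; the (category, value) records are made up for illustration, and `summarize_by_group` is an illustrative helper name.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_group(pairs):
    """Descriptive statistics of the response, separately for each category."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    return {
        category: {
            "n": len(values),
            "min": min(values),
            "median": median(values),
            "mean": mean(values),
            "max": max(values),
        }
        for category, values in groups.items()
    }


# Hypothetical (explanatory category, quantitative response) records.
records = [("female", 34), ("male", 45), ("female", 26), ("male", 51),
           ("female", 41), ("male", 38), ("female", 35), ("male", 60)]

summary = summarize_by_group(records)
for category, stats in summary.items():
    print(category, stats)
```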


Case C→C:




Exploring the relationship amounts to comparing the distributions of the categorical response variable, for each category of the explanatory variable.
To do this, we use:

• Display: two-way table.
• Numerical summaries: conditional percentages (of the response variable for each value (category) of the explanatory variable separately).
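
Conditional percentages are computed within each category of the explanatory variable, so every row of the two-way table is divided by its own row total. This is a minimal sketch in Python; the counts and the labels "group A", "group B", "yes" and "no" are made up for illustration.

```python
def conditional_percentages(two_way_counts):
    """Percentage of each response category within each explanatory category."""
    result = {}
    for explanatory, response_counts in two_way_counts.items():
        row_total = sum(response_counts.values())
        result[explanatory] = {
            response: 100 * count / row_total
            for response, count in response_counts.items()
        }
    return result


# Hypothetical two-way table: rows are explanatory categories,
# columns are response categories, cells are counts.
counts = {
    "group A": {"yes": 20, "no": 60},
    "group B": {"yes": 30, "no": 20},
}

percentages = conditional_percentages(counts)
for explanatory, row in percentages.items():
    print(explanatory, row)
```

Comparing 25% "yes" in group A with 60% "yes" in group B is meaningful even though the groups differ in size, which is why percentages rather than raw counts are compared.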


Here is the two-way table for example:


Another way to visualize the conditional percentages, instead of a
table, is the double bar chart


Case Q→Q



We examine the relationship using:
Display: scatterplot.
When describing the relationship as displayed by the scatterplot, be sure to consider:

• Overall pattern → direction, form, strength.
• Deviations from the pattern → outliers.

Labeling the scatterplot (including a relevant third categorical variable in our analysis) might add some insight into the nature of the relationship.


Scatter Plot


Interpreting Scatterplots
• How do we explore the relationship between two
quantitative variables using the scatterplot?
• What should we look at, or pay attention to?


The direction of the relationship can
be positive, negative, or neither:


The strength of the linear relationship


In the special case that the scatterplot displays a linear relationship (and only then), we supplement the scatterplot with:

Numerical summaries: Pearson’s correlation coefficient (r) measures the direction and, more importantly, the strength of the linear relationship. The closer r is to 1 (or -1), the stronger the positive (or negative) linear relationship. r is unitless, influenced by outliers, and should be used only as a supplement to the scatterplot.
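
How strongly a single outlier can move r is easy to demonstrate. The sketch below, in Python, uses made-up points: six points on a perfect line, then the same points plus one point far from the line. The helper `pearson_r` is an illustrative hand-written version of the usual formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from sums of squared deviations and cross-products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: a perfect positive linear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 6]
r_without_outlier = pearson_r(x, y)

# The same data with one added point (7, 0) far from the line.
r_with_outlier = pearson_r(x + [7], y + [0])

print(round(r_without_outlier, 2), round(r_with_outlier, 2))
```

One point out of seven is enough to turn a perfect correlation into a weak one, which is why r should be read alongside the scatterplot rather than instead of it.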


linear relationship and outliers




When the relationship is linear (as
displayed by the scatterplot, and
supported by the correlation r), we can
summarize the linear pattern using
the least squares regression line.
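
The least squares regression line can be computed from the same sums used for the correlation. This is a minimal sketch in Python using the standard least squares formulas (slope equal to the sum of cross-products over the sum of squared x deviations, and a line passing through the point of means); the data and the function name are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the line minimizing squared vertical errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return intercept, slope


# Hypothetical bivariate data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = least_squares_line(x, y)
predicted_at_6 = intercept + slope * 6
print(round(intercept, 2), round(slope, 2), round(predicted_at_6, 2))
```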


Remember that:





The slope of the regression line tells us the average cha