Relation of Two Numeric Variables As the values of

8.2 Relation of Two Numeric Variables As the values of

one variable increase, the values

Do the values of two variables tend to change together or separately? Is there a relation between of the other tend

to either increase

Salary and the number of Years employed at the company? First note that both of these variables or decrease.

182 Correlation

are continuous. Two continuous variables are related if, as the values of one variable increase,

+ relation: The

values of both

the values of the other variable tend to either systematically increase, or systematically decrease.

variables tend to increase together.

Relationships can be positive or negative. For a positive relationship, both variables tend to increase together, whereas for a negative relationship the values of the variables tend in opposite

directions.

− relation: As one

variable increases, the values of the other variable tend to decrease.

8.2.1 The Scatter Plot

To illustrate the positive relationship of Salary and Years consider again the Employee data set.

Scenario Assess the relation between annual Salary and Years of experience At a specific company the data table is available for 37 employees, including the variables annual Salary and Years employed. What is the relationship, if any, between the number of Years employed and Salary?

Employee data table, Figure 1.7 ,

The employee data with the associated variable labels are available from within the lessR

p. 21

package.

> mydata <- Read("Employee", format="lessR")

scatter plot: Graph with one

Graphically express this relationship with a scatter plot, a plot of the values of one or more

axis per variable, one plotted point

variables for each row of data, one axis for each variable. Previously we plotted the distribution

for each set of data

of values for a single variable along a single axis with a one-variable scatter plot. The two-variable

values. one-variable scatter

scatter plot illustrates the relation between two variables plotted with two axes. Plot each pair

plot, Section 5.4.1 ,

of data values as a single point with its two coordinates equal to the corresponding two data

p. 110

values. Obtain the scatter plot of two variables with the lessR function ScatterPlot . The

ScatterPlot

abbreviation is sp .

function: Draw a scatter plot for one or two variables.

lessR Input Default scatter plot of Years employed with annual Salary > ScatterPlot(Years, Salary)

or

> sp(Years, Salary)

The scatter plot in Figure 8.1 shows that working more Years tends to be associated with

a higher Salary. Each plotted point represents one employee’s data values for Years and Salary. There are 36 employees with data values for both variables, so the scatter plot consists of 36 points. By default the transparency level of each point is set at 0.66, on a scale from 0 to 1, from no transparency to complete transparency. The default partial transparency distinguishes overlapping points by darker colors.

Consider another version of the Figure 8.1 scatter plot.

lessR Input

> set(colors="gray.black", trans.fill.pt=0) > ScatterPlot(Years, Salary, fit.line="loess")

Correlation 183

100000 y (USD)

0000 8 u al Salar

Years Employed in the Company Figure 8.1 Default (gray scale) scatter plot of Years employed with annual Salary.

The version in Figure 8.2 illustrates the inverted gray scale color theme, optimized for projection onto a screen such as for a slide show. Use the lessR function set to choose the color themes, color theme with the colors options, and the option trans.fill.pt to set the transparency transparency, Section 1.4.1 , level, such as 0 for no transparency.

al Salary (USD) ● u

● ● Ann ● 60000

5 10 15 20 Years Employed in the Company

Figure 8.2 Scatter plot of Years employed and annual Salary and a loess curve with the gray.black color theme and no transparency for the plotted points.

184 Correlation

A smooth curve fit through the points in Figure 8.2 summarizes the relationship between the

fit.line option: Fit

two variables. The fitted curve is called a loess curve for locally weighted scatter plot smoothing.

a curve through the scatter plot,

The position of each point on the curve is computed only from nearby points in the scatter

either a

plot. The result is a non-parametric procedure for summarizing the relationship between the

least-squares line, "ls", or

two variables with a nonlinear curve. For this company, Salary does not increase much each

"loess".

year for the beginning levels of employment, but increases noticeably after five or so years.

8.2.2 The Correlation Coefficient

linear

For two related variables, such as Years and Salary, what is the extent of the relationship? This

relationship: For a two-variable

assessment provided here is for a linear relationship. A perfect linear relationship means that all

scatter plot, a

the points in the scatter plot lie on a straight line. No linear relationship means that there is no

straight line summarizes the

tendency for the value of one variable to systematically increase or decrease as the value of the

relationship.

other variable increases. As the value of one variable increases, the value of the other variable

correlation

is just as likely to increase as decrease.

coefficient: Index of the extent of the

There are several implementations of a correlation coefficient. The correlation coefficient

relationship

that gauges the magnitude of a linear relationship is the Pearson product-moment correlation

between two variables, which

coefficient, indicated in the sample with r. Pearson correlation coefficients of 1 and − 1 indicate

varies from −1 to 0

perfect positive and negative linear relationships, respectively. A correlation of 0 indicates no

to 1.

linear relationship between the variables. Obtain the Pearson correlation coefficient, and its inferential analysis, from the ScatterPlot

Correlation

function. To obtain only text output, without the scatter plot, use the lessR function

function: Provides analysis of the

Correlation . The Correlation function also provides what is called the covariance, a version

correlation

of the correlation coefficient in which the variables are expressed in their original measurement

coefficient without the scatter plot.

units. The abbreviation for Correlation is cr . To illustrate, consider a scatter plot of two uncorrelated variables. Here we generate the data by simulating the process of randomly sampling from a normal distribution with the R function rnorm . Randomly generate 250 simulated data values for a variable X from a normal population with a mean of µ= 50 and a standard deviation of σ=

10. Then randomly sample another 250

assignment

values from the same population for a variable Y. Use the R assignment operator, <- , to store the

operator, Section 1.3.3 , p. 9

generated values in either X or Y.

R Input Generate randomly sampled data values from a normal distribution > X <- rnorm(n=250, mean=50, sd=10)

> Y <- rnorm(n=250, mean=50, sd=10)

To view the actual simulated data values, enter the name of the variable at the command prompt, > . The data values are randomly generated, so a different set of data values are generated

user’s

for each repetition of these function calls.

workspace: Area to store data

The creation of the variables X and Y illustrates another property of the R environment.

frames and other

The (simulated) data values for these variables do not exist in a data frame such as mydata .

created objects.

Instead they exist in a workspace called the user’s workspace, or sometimes the global environment.

ls function: List the contents of a

This space can be used for creating variables and other objects apart from a data frame.

workspace.

The user’s workspace is also where the default mydata data frame is stored. To view the

Correlation 185

contents of this workspace at any time during an R session use the list function ls without any arguments.

> ls() Most lessR functions that process data can also access variables stored directly in the

user workspace instead of within a data frame in that workspace. These functions by default automatically scan for the specified variables in both the specified or default data frame as well as the user’s workspace. For example, ScatterPlot can access any variable in mydata as well as these variables X and Y in the user’s workspace.

95% data

Consider the 95% data ellipse drawn around the points in the scatter plot. The purpose ellipse: Contains about 95% of the of this ellipse is to provide a visual summary of the extent of the relationship between the points in a sample

two variables. The narrower is the ellipse, the stronger is the relationship. For a perfect linear scatter plot of two

normally

relationship the ellipse collapses to a line. For two uncorrelated variables measured on the same distributed scale, the ellipse expands outward to a circle. variables.

The function that draws this ellipse ( dataEllipse ) is from John Fox’s car package (Fox & option: Draw the Weisberg, 2011), which is automatically downloaded when lessR is downloaded. 1 The lessR 95% data ellipse

ellipse=TRUE

ScatterPlot around the values function accesses the car ellipse function directly from its ellipse option, which

in a scatter plot.

is set to TRUE to draw the ellipse. Change the default color of the ellipse with the col.ellipse option. By default the ellipse is filled with a light, transparent color, which can be turned off with fill.ellipse=FALSE .

col.ellipse, fill.ellipse

options: Set the color of the ellipse

lessR Input Scatter plot with ellipse

and specify if to be filled.

> ScatterPlot(X, Y, ellipse=TRUE)

Figure 8.3 Scatter plot with ellipse for a correlation very near zero, r = 0.016.

186 Correlation

The simulated data values for variables X and Y are randomly and separately sampled, so there is no correlation between X and Y in the population. Sampling error is always present, so the sample correlation differs from its corresponding population value of 0. For a sample of 250 paired data values there should be a sample correlation between the variables X and Y that is

close to but not exactly equal to 0. The obtained correlation of r = 0 . 016 for this sample appears in the text output for ScatterPlot , partially reproduced in Listing 8.1 .

Number of paired values with neither missing, n: 250 Number of cases (rows of data) deleted: 0

Sample Correlation of X and Y: r = 0.016

Listing 8.1 Descriptive statistics from ScatterPlot.

The 95% data ellipse in Figure 8.3 is almost a circle, which indicates the lack of relationship between the variables X and Y. For any one specific value of X, the corresponding value of Y is as likely to be larger than its mean near 50 as it is to be smaller than its mean. The circular form of this ellipse is the geometric expression of the corresponding correlation coefficient of

almost zero, r = 0 . 016.

To view the data ellipse for a well-defined positive relationship, consider again the scatter plot of Years and Salary from Figures 8.1 and 8.2 , but now in Figure 8.4 with a scatter plot added by setting ellipse=TRUE in the call to ScatterPlot .

y (USD)

al Salar u

Ann

40000 5 10 15 20 Years Employed in the Company

Figure 8.4 Scatter plot of Years employed and annual Salary with the default data ellipse. One issue with including the ellipse in the scatter plot is that the default limits of both axes

are set according to the range of the data values along each respective axis. Yet to include the

xlim, ylim

entire ellipse often requires larger limits for the axes beyond the range of data. Explicitly set the

options: Set the limits of the axes.

limits of the axes with the standard R options xlim and ylim to specify both the minimum and maximum values of the axes.

Correlation 187

lessR Input Scatter plot with explicit limits for each axis > ScatterPlot(Years, Salary, ellipse=TRUE,

xlim=c(-10,25), ylim=c(10000,125000))

c function,

As always with R , enclose a list of values separated by commas with the combine function c . Section 1.3.6 , p. 15 The result appears in Figure 8.5 .

100000 y (USD)

al Salar u

Years Employed in the Company Figure 8.5 Scatter plot of Years employed and annual Salary with the data ellipse and explicit axes limits.

Compared to the almost circular ellipse in Figure 8.3 , the ellipse in Figure 8.5 is much narrower. This narrow ellipse is the geometric representation of a large correlation, which for

Years and Salary is r = 0 . 852, as reported in Listing 8.2 .

Number of paired values with neither missing, n: 36 Number of cases (rows of data) deleted: 1

Pearson’s product-moment correlation Years, Annual Salary (USD)

Salary, Years Employed in the Company Sample Correlation of Years and Salary: r = 0.852

Listing 8.2 Background information and sample estimate of the correlation coefficient from ScatterPlot and also Correlation.

Text output in Listing 8.2 is from either ScatterPlot or Correlation . The latter function does not provide the scatter plot.

188 Correlation

8.2.3 Inferential Analysis

As is true of virtually any data analysis, the goal is not a description of a sample but information regarding the population from which the sample was obtained. The sample correlation is

r = 0 . 852, but what is the population value of interest? As usual, address this question with

a hypothesis test and/or confidence interval as in Listing 8.3 from either ScatterPlot or Correlation .

Alternative Hypothesis: True correlation is not equal to 0 t-value: 9.501,

df: 34, p-value: 0.000

95% Confidence Interval of Population Correlation Lower Bound: 0.727

Upper Bound: 0.923

Listing 8.3 Inferential analysis of the correlation coefficient from ScatterPlot and also Correlation.

The hypothesis test focuses on the null hypothesis of a population value of 0, of no linear relationship between the two variables. The analysis in Listing 8.3 is a two-tailed test in which substantially large differences in either the positive or negative direction from 0 lead to rejection of 0 as the population value.

Years and Salary Relationship: p -value = 0 . 000 <α= 0 . 05 , reject H 0

The rejection of the null hypothesis implies that Years and Salary are related. What is the extent of the relationship? The confidence interval for the population value of the correlation provides the answer as best can be ascertained from the available information. From Listing 8.3 , we are 95% confident that the true correlation between Years and Salary is between 0.73 and

0.92, so the correlation is rather substantial.

alternative option,

Specify one-tailed hypothesis tests in the R environment with the alternative option. The

Section 6.2.2 , p. 126

values for the alternative option that specify a one-tailed test are "less" and "greater" . The advantage of a one-tailed test is that a difference in the direction of the rejection region is easier to detect than the two-tailed alternative. The disadvantage of the one-tailed test is that if

a difference in the opposite direction occurs, then the null hypothesis cannot be rejected, so the difference, even if substantial, is not detected. Usually the researcher would find a substantial deviation from the null in either direction to be of interest, regardless whether the deviation is predicted or desired, so usually a two-tailed test is preferred.

8.2.4 The Scatter Plot with a Classification Variable

Points in a scatter plot can also be plotted with different plotting symbols and/or colors according to different values of a third variable, usually represented by a few number of

by option: Specify

categories. Specify a third variable with the by option. For example, view the scatter plot of

a categorical variable for which

Years and Salary with different plot symbols for men and women.

to classify each point in the scatter plot.

lessR Input A scatter plot with Gender as a classification variable > ScatterPlot(Years, Salary, by=Gender)

Correlation 189

The resulting scatter plot appears in Figure 8.6 . By default, points for the first level, females, are plotted with a circle, whereas a diamond is used to plot paired data values for the second level, males. By default the color of the plotted points is that provided by the current color theme and transparency level.

100000 y (USD)

Gender

u al Salar Ann

Years Employed in the Company Figure 8.6 Scatter plot of Years and Salary with different plotting symbols for Gender.

The plot symbol used to plot the points at each level of the classification variable can be shape.pts option: customized with the shape.pts option. Valid values are "circle" , "square" , "diamond" , Plot symbol for

"triup" each level of the , and "tridown" , which also apply to the point shapes when there is no by variable.

by variable.

The order of the specified symbols reflects the order in which the levels of the classification variable are defined. By default R defines the order of the levels alphabetically, though the order factor function for can be changed with the factor function.

ordering levels,

If the classification variable is a factor, verify the order of the levels with the standard R Section 3.3.2 , p. 60

function levels . Because levels is an R function, the location of the referenced variable must

be specified, which by default is the data frame mydata . For example, precede the name of the levels function: Displays the levels

referenced variable with mydata$ as in Listing 8.4 .

of a factor variable.

> levels(mydata$Gender) [1] "F" "M"

Listing 8.4 The levels function lists the levels of a factor.

According to Listing 8.4 , the ordering of the levels for Gender is the default alphabetical order. To specify custom plot symbols with the shape.pts option, specify the order of the symbols with the symbol for females listed first. In this example use X for females and Y for males.

cex option: Specify

The size of these letters plotted as symbols tends to be a little large. Reduce their size using the size of the graphics symbol in the standard R graphics option cex , which controls the size of the plotted symbols. The default relation to 1.0. value is 1.0, so to reduce the size to 80% of the default specify a value of cex=.8 .

190 Correlation

lessR Input Scatter plot with custom symbols for a classification variable > ScatterPlot(Years, Salary, by=Gender,

shape.pts=c("X","Y"), cex=.8)

The resulting scatter plot appears in Figure 8.7 .

YX

y (USD)

Y X Gender

Y al Salar Y u

5 10 15 20 Years Employed in the Company

Figure 8.7 Scatter plot of Years and Salary with custom plotting symbols for Gender.

An additional customization of a scatter plot with a classification variable is to provide

color choices,

custom colors for symbols plotted for each level of the variable. To do so use the option

Section 1.1 , p. 17

col.stroke , one color for each level of the by variable. If points are plotted, the default setting of the fill color is relied upon, which is the transparency level applied to col.stroke . Again, the order of the specified colors is the order of the values of the by variable.

Custom symbols and custom colors can be specified for the same plot. A loess or least-squares line of best fit can also be specified with the fit.line option. The line of best fit is plotted for each value of the by variable in the specified color. To emphasize the fit lines and deemphasize the points, plot the points at only 50% of the default point size.

lessR Input Scatter plot with custom symbols and colors > ScatterPlot(Years, Salary, by=Gender, fit.line="loess",

col.stroke=c("black", "gray60"), shape.pts=c("X","Y"), cex=.5)

Viewing Figures 8.6 , 8.7 , or 8.8 reveals that at this company three of the highest four salaries are for men and three of the lowest four salaries are for women. The scatter plot also reveals,

Correlation 191

YY

100000 y (USD)

Gender

0000 F

al Salar Y u

Y X Ann Y Y X 60000

Years Employed in the Company Figure 8.8 Scatter plot of Years and Salary with custom plotting symbols and color for Gender and loess lines

of best fit.

however, that women tend to be concentrated at the lower end of the axis for the number of Years employed. The eight employees with the least Years of employment are all women.

8.2.5 Scatter Plot of Likert Data

Likert data are the responses to attitude items on the Disagree/Agree continuum, usually 4 to 7 different scale points, coded as integer values. A traditional scatter plot of two variables assessed Likert data, on a Likert scale is not usually informative because of the small number of potential unique Section 1.6.6 , p. 27 data values. A scatter plot of a 5-point Likert scale yields only 25 unique combinations of data values for the two variables. Without some transparency of the plotted points there is no way to tell if a plotted point, such as for a 3 on the first item and a 2 on the second item, represents one set of data values or many. Even with transparency most of the 25 positions for plotted points in this example may be filled over with many replications.

bubble plot: The

The lessR solution to this problem automatically provides a form of a scatter plot called a size of each plotted point in a scatter bubble plot if there are less than 10 unique values for each variable. The larger the frequency, plot reflects the larger the size of each plotted point, that is, bubble.

another value such as frequency.

To illustrate we return to the Machiavellianism data set where each item is assessed with values from 0 to 5. The two most highly correlated Mach IV items are m06 and m07 .

Mach IV data, Listing 1.8 , p. 27

Scenario Investigate the relationship between Mach IV Items 6 and 7

6. Honesty is the best policy in all cases. 7. There is no excuse for lying to someone else.

These items correlate r = 0 .

53. Explore their relationship graphically.

192 Correlation

First read the Machiavellianism data directly from within lessR .

> mydata <- Read("Mach4", format="lessR")

Obtain the scatter plot with the ScatterPlot function. To further summarize the relation- ship, include a loess line.

lessR Input Bubble plot of Likert data with loess fit line > ScatterPlot(m06, m07, fit.line="loess")

Their scatter plot appears in Figure 8.9 . Because there are only six unique values for each variable, which is less than the threshold of 10, the ScatterPlot function automatically provides a bubble plot form of the scatter plot.

u se for lying to someone else.

There is no exc

0 1 2 3 4 5 Honesty is the best policy in all cases.

Figure 8.9 Scatter plot in the form of a bubble plot of Mach IV items m06 and m07 with the loess fit line.

The linear relationship of m06 and m07 is apparent from Figure 8.9 as the largest bubbles lie along the increasing diagonal axis. Consistent with this pattern, the loess fit line is approximately linear along this diagonal. The largest bubble is for the Agree response to each item, coded as a 4. Only one possible pair of data values did not occur at all for these 351 respondents, the pairing of Strongly Disagree coded as a 0 for m06 and Agree coded as a 4 for m07 .

The ScatterPlot function kind option provides an override for the default style of the

kind option: Specify the type of

scatter plot. By default a regular scatter plot is produced when at least one of the two variables

scatter plot,

has 10 or more unique values. If both variables have less than 10 unique values, then a bubble

"regular", "bubble", or

plot is produced. For any analysis, to specify a regular plot, set kind="regular" . The other

"sunflower".

values are "bubble" and "sunflower" .

Correlation 193

The sunflower plot is an alternative to the bubble plot. A sunflower plot plots each point as an iconic sunflower such that the more petals the larger the value of another variable. Here the number of petals relates to the joint frequency of the paired data values for the corresponding point.

lessR Input Sunflower plot of Likert data > ScatterPlot(m06, m07, kind="sunflower")

The sunflower plot of m06 and m07 appears in Figure 8.10 . The point with the largest frequency, <4,4> , has so many petals that it almost appears as a filled circle.

u se for lying to someone else.

012345 There is no exc

Honesty is the best policy in all cases. Figure 8.10 Scatter plot in the form of a sunflower plot of Mach IV items m06 and m07.

Figures 8.9 and 8.10 provide useful scatter plots of Likert data. Also of interest are the actual joint frequencies, the count of each possible pair of data values. The previously discussed lessR Cross-tabulation function BarChart applied to two categorical variables generates the cross-tabulation table of table, Section 4.3 , p. 87 joint frequencies.

lessR Input Cross-tabulation of Likert data > BarChart(m06, m07)

The text output of BarChart includes the cross-tabulations in Figure 8.5 . BarChart always tabulates the joint frequencies of the specified variables, regardless whether

the type of variable is integer, real number with decimal digits, or a factor. This same cross-tabulation table can also be obtained from SummaryStats . An issue, however, is that

194 Correlation

m06, Honesty is the best policy in all cases. m07, There is no excuse for lying to someone else. --------------------------------------------------

Listing 8.5 Cross-tabulation table for Mach IV items m06 and m07.

SummaryStats adjusts the provided analysis to the type of variable. The cross-tabulation table is by default only provided for categorical variables formally defined as factors, yet the Likert data in the Machiavellian data set are coded as integers.

To inform SummaryStats to treat the variables as categorical for purposes of this one

n.cat option,

analysis, include the n.cat=6 specification as part of the function call. To view only the cross-

Section 2.2.7 , p. 39

tabulation table, add the brief=TRUE option, or use the abbreviation ss.brief when calling the function.

> SummaryStats(m06, m07, n.cat=6)

The n.cat option provides for the treatment of an integer variable as categorical. Its value specifies the maximum number of unique values a variable can have and still be treated as categorical. Here the Likert data reflect a six-point Likert scale.