8.3 The Correlation Matrix

Up until this point in this chapter the focus has been on the linear relation of two variables with each other, in terms of their scatter plot and their Pearson correlation coefficient. Often there are many variables of interest such as the items on an attitude survey. In this situation the focus shifts from just two variables to the relations between all the pairs of variables of interest.

We begin with the correlation coefficients of all the pairwise combinations of the variables. The same lessR function Correlation that calculates a single correlation coefficient between two variables also calculates a correlation matrix, a table of all the possible correlation coefficients between all pairs of the specified variables. To calculate this matrix, in the function call to Correlation pass a single list of variables instead of two separate variables. Or, do not specify any variables.

Going one step further, if no arguments are specified in the call to the function Correlation, the correlation matrix for all numerical variables in the default data frame mydata is calculated. Usually write this matrix to the matrix mycor as specified by the R assignment operator, <- , though any valid name can be specified.

assignment operator, Section 1.3.3, p. 9

Correlation 195

lessR Input Correlation matrix from all numerical variables in mydata
> mycor <- Correlation()

The Correlation function scans the input data frame for non-numerical variables, notes the existence of such variables, and then excludes them from further analysis. By default, the variables to be analyzed are in the data frame mydata, which can be changed by the usual lessR option data. The matrix mycor is a square R matrix, and its name is the default for the lessR factor and item analysis procedures. A matrix is a simpler version of a data frame in that all the columns in the matrix must be of the same data type. In mycor all the entries are real numbers, that is, numeric with decimal digits. By default the number of digits for each calculated correlation coefficient is two, a value that can be changed with the usual lessR option digits.d.

matrix: A storage container in which all the entries are of the same data type.
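As a rough base R analogue of this screening step, and not the lessR implementation itself, the following sketch drops the non-numerical columns of a small data frame, computes the correlation matrix with the base R function cor, and rounds the result to two digits. The data frame d and its values are hypothetical, invented only for illustration.

```r
# Hypothetical data: one non-numerical variable and three numerical items
d <- data.frame(
  Gender = factor(c("F", "M", "F", "M", "F", "M")),
  m06 = c(0, 4, 1, 5, 2, 3),
  m07 = c(1, 4, 0, 5, 3, 2),
  m09 = c(2, 3, 1, 4, 0, 5)
)

# Flag the numerical columns and report the ones to be excluded
is_num <- vapply(d, is.numeric, logical(1))
excluded <- names(d)[!is_num]
cat("Excluded non-numerical variables:", excluded, "\n")

# Correlation matrix of the remaining columns, rounded to two digits
r <- round(cor(d[is_num]), 2)
print(r)
```

The result r is an ordinary R matrix, so every entry shares the same numeric data type.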

digits.d option, Section 1.3.5, p. 14

If not all numerical variables in the input data frame are to be included in the correlation analysis, specify a list of variables that exist within the input data frame. A variable list can be written several different ways. As always in R, if commas separate any of the variables in the list, enclose the list with the c function. For example, list the four variables on the Mach IV Deceit subscale followed by the two items of the Flattery subscale.

c function, Section 1.3.6, p. 15

Mach IV subscales, Section 11.10, p. 273

c(m06, m07, m09, m10, m15, m02)

A list of contiguous variables in the data frame can be specified with the : notation instead of listing each variable name individually. Here specify the first 10 Mach IV variables.

m01:m10

Or, the c function and the : notation can be combined to specify a list.

c(m01:m05, m10, m13, m15:m18)

The rule is that if there is a comma in the list, then the c function must be used to enclose all the items in the list.
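The same style of variable list is understood by the select argument of the base R function subset, which offers one way to check which columns a given list picks out. The sketch below assumes a hypothetical data frame d with made-up values for 18 items named m01 to m18. It illustrates the notation and is not the mechanism that Correlation uses internally.

```r
# Hypothetical data frame with 18 items named m01, m02, ..., m18
set.seed(1)
d <- as.data.frame(matrix(sample(0:5, 20 * 18, replace = TRUE), ncol = 18))
names(d) <- sprintf("m%02d", 1:18)

# A contiguous range of variables with the : notation
first10 <- subset(d, select = m01:m10)

# Ranges and single variables combined, so c() must enclose the list
mixed <- subset(d, select = c(m01:m05, m10, m13, m15:m18))

print(names(first10))
print(names(mixed))
```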

8.3.1 No Missing Data

Any of the preceding lists of variables can be passed to the Correlation function as a single argument. Here calculate the correlation matrix of the items of the Deceit and Flattery subscales. Access the data from the data frame mydata and write the computed correlations to the matrix mycor . This example contains no missing data, a topic to be discussed with a later example.

lessR Input Calculate a correlation matrix from a list of variables
> mycor <- Correlation(c(m06, m07, m09, m10, m15, m02))

The first part of the output appears in Listing 8.6 .

Correlation matrix calculated
Name: mycor
Number of variables: 6
Missing data deletion: pairwise

>>> No missing data

Listing 8.6 First section of output for a correlation matrix with the Correlation function.

The correlation matrix follows in the output. An annotated version of the matrix appears in Figure 8.11 . The matrix consists of three primary sections: the main diagonal and the lower and upper triangles.

       m06   m07   m09   m10   m15   m02
m06   1.00  0.52  0.25  0.41 -0.15 -0.04
m07   0.52  1.00  0.32  0.40 -0.17 -0.09
m09   0.25  0.32  1.00  0.25 -0.17 -0.18
m10   0.41  0.40  0.25  1.00 -0.18 -0.22
m15  -0.15 -0.17 -0.17 -0.18  1.00  0.25
m02  -0.04 -0.09 -0.18 -0.22  0.25  1.00

Annotations: Main Diagonal (the 1.00 entries), Upper Triangle (entries above the diagonal), Lower Triangle (entries below the diagonal).

Figure 8.11 Annotated correlation matrix of six Mach IV items from the Correlation function.

The correlation coefficient is symmetrical. The correlation of Variable X with Variable Y is the same as the correlation of Y with X. As such, two correlations in the correlation matrix represent each pair of variables, once in the lower triangle and once in the upper triangle. For example, the correlation of m06 and m07 is 0.52, which appears twice in the top left of the matrix. Also, each item correlates with itself a perfect 1.0, the value that appears in the main diagonal.
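These properties can be checked directly in R. The following sketch enters the values from Figure 8.11 by hand into a matrix, here given the illustrative name R, and then verifies the symmetry, the main diagonal, and the number of distinct pairs of variables.

```r
# Correlations copied from Figure 8.11
items <- c("m06", "m07", "m09", "m10", "m15", "m02")
R <- matrix(c(
   1.00,  0.52,  0.25,  0.41, -0.15, -0.04,
   0.52,  1.00,  0.32,  0.40, -0.17, -0.09,
   0.25,  0.32,  1.00,  0.25, -0.17, -0.18,
   0.41,  0.40,  0.25,  1.00, -0.18, -0.22,
  -0.15, -0.17, -0.17, -0.18,  1.00,  0.25,
  -0.04, -0.09, -0.18, -0.22,  0.25,  1.00
), nrow = 6, byrow = TRUE, dimnames = list(items, items))

# The upper triangle mirrors the lower triangle
print(isSymmetric(R))

# Each item correlates perfectly with itself
print(all(diag(R) == 1))

# The same coefficient appears in both triangles
print(c(R["m06", "m07"], R["m07", "m06"]))

# Number of distinct pairs of variables: entries in one triangle
n_pairs <- sum(lower.tri(R))
print(n_pairs)
```

With six variables there are 6 × 5 / 2 = 15 distinct pairs, each of which appears twice in the full matrix.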

8.3.2 Missing Data

The Machiavellian data set has no missing data. To illustrate how the Correlation function addresses missing data, remove one value from the data set. With the R function fix applied to mydata, the data value for m06 for the first row of data was removed. The result is shown in Listing 8.7, which displays the NA value for m06 that indicates a missing data value in an R data frame.

fix function, Section 3.2, p. 54

> head(mydata)
  Gender m01 m02 m03 m04 m05 m06 m07 m08 m09 m10 m11 m12 m13 m14 ... m20
1      0   0   4   1   5   0  NA

Listing 8.7 First two rows of mydata from the head function.

Next the same call to the Correlation function is run that generated the correlation matrix in Figure 8.11. The additional output present for missing data appears in Listing 8.8. The default method for addressing missing data is pairwise deletion, in which each correlation is calculated from the rows of data that have non-missing values for both variables.

pairwise deletion: Calculate each correlation coefficient from all non-missing data for the two variables.

Missing data deletion: pairwise

--- Missing Data Analysis ---
350

Listing 8.8 Sample size for each computed correlation coefficient.

The pattern of missing data can be different for different pairs of variables, so the sample size upon which each correlation is based can also differ. When the correlation matrix is calculated with pairwise deletion in the presence of missing data, the sample size for each correlation should be examined. In extreme cases some correlations could be based on much less data than other correlations, depending on the pattern of missing values. The data table contained only one missing value, for m06 , in all 351 rows of data. This row of data is then dropped in the calculation of the correlation of m06 with all other variables. The result is that the sample size for all the correlations of m06 with other variables is reduced by 1 to 350.

Also present in Listing 8.8 are the summary statistics for the missing data counts. There are six variables in this correlation matrix, so there are 6 × 6 = 36 entries in the correlation matrix. Of these 36 entries, 25 are based on a sample size of 351. The 11 correlations that involve Item m06 are based on a sample size of 350.
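The counts of 25 and 11 can be reproduced with base R. The sketch below is an illustration and not the lessR code. It builds a hypothetical data frame d of 351 rows of made-up values for the six items, removes the value of m06 in the first row, and counts the rows available for each pair of variables. The base R function cor with use="pairwise.complete.obs" is the base R counterpart of pairwise deletion.

```r
# Hypothetical data: 351 rows of made-up responses for the six items
set.seed(8)
d <- as.data.frame(matrix(sample(0:5, 351 * 6, replace = TRUE), ncol = 6))
names(d) <- c("m06", "m07", "m09", "m10", "m15", "m02")

# Remove one data value, as done in the text with fix
d[1, "m06"] <- NA

# 1 where a value is present, 0 where it is missing
present <- (!is.na(as.matrix(d))) + 0

# Entry (i, j) counts the rows in which both variable i and variable j are present
n_pair <- crossprod(present)
print(n_pair)

# Summary of the sample sizes over all 36 entries
print(table(n_pair))

# Correlations with pairwise deletion in base R
r_pair <- cor(d, use = "pairwise.complete.obs")
print(round(r_pair, 2))
```

The 11 entries with a sample size of 350 are the six entries in the m06 row plus the six in the m06 column, with the shared diagonal entry counted once.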

These summary statistics are particularly relevant for larger correlation matrices. For a correlation matrix with more than 15 variables the Correlation function by default displays neither the sample size matrix nor the correlation matrix. In this situation the minimum sample size encountered in the calculation of any of the correlation coefficients in the matrix is evident in the summary statistics. To examine the individual sample sizes of each coefficient, the sample size matrix can still be displayed with the option show.n=TRUE.

show.n=TRUE option: Display the matrix of sample sizes for individual correlations.

Listwise deletion is another common method for addressing missing data. If a row of data has any missing data values, then that entire row of data is deleted from the analysis. Specify listwise deletion with the miss option set to listwise, of which the default value is pairwise.

listwise deletion: One missing data value in a row leads to the deletion of the entire row of data.

lessR Input Correlation matrix with listwise deletion
> mycor <- Correlation(c(m06, m07, m09, m10, m15, m02), miss="listwise")

The relevant part of the Correlation output appears in Listing 8.9 .

Missing data deletion: listwise
Sample size after deleted rows: 350

Listing 8.9 Relevant output of Correlation for listwise deletion.

For listwise deletion all correlation coefficients are calculated with the same sample size. The value reported in Listing 8.9 is 350, the data that remain after deleting the first row of data from the analysis due to its one missing value.
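Listwise deletion can be sketched in base R with the function complete.cases, which flags the rows that have no missing values. The data frame d below is hypothetical, with made-up values and one value of m06 removed. The base R option use="complete.obs" of cor is a counterpart of miss="listwise" and is shown here only for comparison.

```r
# Hypothetical data: 351 rows of made-up responses, one value of m06 removed
set.seed(8)
d <- as.data.frame(matrix(sample(0:5, 351 * 6, replace = TRUE), ncol = 6))
names(d) <- c("m06", "m07", "m09", "m10", "m15", "m02")
d[1, "m06"] <- NA

# Keep only the rows with no missing values
d_complete <- d[complete.cases(d), ]
print(nrow(d_complete))

# Every correlation is now based on the same rows of data
r_list <- cor(d, use = "complete.obs")

# Same result as correlating the retained rows directly
print(all.equal(r_list, cor(d_complete)))
```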

Pairwise deletion is generally preferred over listwise deletion because the listwise procedure loses more data: deleting an entire row discards every value in that row, including the values that were present in the original data table. The potential problem with pairwise deletion, however, is that some correlations may be calculated from a greatly diminished sample size, so check the sample sizes to ensure this has not occurred.

8.3.3 Graphics

Optional graphical portrayals of the correlation matrix are also available. One graphic is a scatter plot matrix and the other is a heat map. Set graphics=TRUE to view these graphics in the standard graphics windows, or set pdf=TRUE to write the graphics to their respective files.

graphics=TRUE: View correlation matrix graphics.

pdf=TRUE: Write correlation matrix graphics to pdf files.

A scatter plot matrix is a table of scatter plots, one for each correlation in the correlation matrix. The scatter plot matrix for the six Mach IV items is shown in Figure 8.12. Just as each correlation in the correlation matrix appears twice, each pair of variables is represented by two scatter plots in the scatter plot matrix. One plot is in the lower triangle and the other plot is in the upper triangle of the matrix. The variables in the scatter plots have Likert data values, so a small number of possibilities limits the configuration of plotted points. Fortunately, each scatter plot also contains a loess line of best fit, which can help gauge the extent of the relationship. For example, the scatter plots for the two most highly correlated items, m06 and m07, contain a fit line of pronounced slope.

scatter plot matrix: Table of two variable scatter plots.
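A comparable scatter plot matrix can be sketched with the base R function pairs. This is an illustration and not the graphic produced by Correlation. The data frame d holds hypothetical responses on an assumed scale of 0 to 5, and the fit line comes from the base R panel function panel.smooth, which uses a lowess smoother rather than loess.

```r
# Hypothetical Likert-type responses on an assumed 0 to 5 scale
set.seed(12)
n <- 200
base <- rnorm(n)
likert <- function(z) pmin(5, pmax(0, round(2.5 + z)))
d <- data.frame(
  m06 = likert(base + rnorm(n)),
  m07 = likert(base + rnorm(n)),
  m15 = likert(-0.3 * base + rnorm(n))
)

# One scatter plot per pair of variables, in both triangles,
# each with a smoothed fit line
pairs(d, panel = panel.smooth, main = "Scatter plot matrix (hypothetical data)")
```

Because the responses take only a few distinct values, many points overlap, which is why the fit line is the more informative part of each panel.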

The other optional graph is a heat map of the correlation matrix, which appears in Figure 8.13. The heat map is a graphical portrayal of the matrix with each correlation coefficient replaced by a colored square. The larger the correlation, the darker is the color. The diagonal elements of the heat map are treated differently. To provide more color separation for off-diagonal elements, the diagonal elements of the matrix for computing the heat map are set to 0. The largest correlation in the matrix, 0.52 between Items m06 and m07, is represented with the two darkest colored squares, which are at the top left of the heat map. The lowest correlations in the matrix are between the items in the two different scales. These correlations are represented by white or very light gray colored squares in the second-to-last and last rows of the matrix as well as the second-to-last and last columns. The differentiation of the two different sub-domains of Mach IV items, Deceit and Flattery, is clearly visible in Figure 8.13.

heat map: Graphical representation of a matrix with each number replaced by a colored square.
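The construction of such a heat map can be sketched with the base R function image, using the correlations from Figure 8.11. This is one possible rendering under assumed choices, namely a gray scale of 20 shades with darker shades for larger values, and it is not the lessR graphic itself. The names R and R0 are illustrative.

```r
# Correlations copied from Figure 8.11
items <- c("m06", "m07", "m09", "m10", "m15", "m02")
R <- matrix(c(
   1.00,  0.52,  0.25,  0.41, -0.15, -0.04,
   0.52,  1.00,  0.32,  0.40, -0.17, -0.09,
   0.25,  0.32,  1.00,  0.25, -0.17, -0.18,
   0.41,  0.40,  0.25,  1.00, -0.18, -0.22,
  -0.15, -0.17, -0.17, -0.18,  1.00,  0.25,
  -0.04, -0.09, -0.18, -0.22,  0.25,  1.00
), nrow = 6, byrow = TRUE, dimnames = list(items, items))

# Set the diagonal to 0 so the off-diagonal values use the full color range
R0 <- R
diag(R0) <- 0

# Locate the largest off-diagonal correlation
largest <- which(R0 == max(R0), arr.ind = TRUE)
print(largest)

# Draw the squares with row 1 at the top; white is lowest, black is highest
image(1:6, 1:6, t(R0[6:1, ]),
      col = gray(seq(1, 0, length.out = 20)),
      axes = FALSE, xlab = "", ylab = "",
      main = "Heat map of six Mach IV items")
axis(1, at = 1:6, labels = items, las = 2)
axis(2, at = 1:6, labels = rev(items), las = 2)
```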

Specify a title for the heat map with the usual R graphics option main. Depending on the size of the variable names, the bottom and the right margins of the heat map might be too narrow to accommodate the full names. To widen the margins, use the bottom and right options, such as bottom=5 to specify five lines for the bottom margin. The scatter plot matrix and heat map can be written to a pdf file instead of displayed in a graphics window. To do so, invoke the usual lessR option pdf=TRUE, and if desired, the accompanying size specifications with pdf.width and pdf.height.

main option: Heat map title.

bottom, right options: Number of lines for each margin.

pdf options, Section 1.4.4, p. 19

Figure 8.12 Scatter plot matrix of six Mach IV items.

Figure 8.13 Heat map of the correlation matrix of six Mach IV items.