Advanced Multivariate EDA

Exercise 10 Advanced Multivariate EDA

10.1 Objectives

This exercise deals with more advanced methods to explore the multivariate relationships among variables, by means of conditional plots and a three-dimensional scatter plot.

At the end of the exercise, you should know how to:
• create conditional histograms, box plots and scatter plots
• change the conditioning intervals in the conditional plots
• create a three-dimensional scatter plot
• zoom and rotate the 3D scatter plot
• select observations in the 3D scatter plot
• brush the 3D scatter plot

More detailed information on these operations can be found in the Release Notes, pp. 33–43.

10.2 Conditional Plots

We continue the exploration of multivariate patterns with the POLICE data set from Exercise 9. Before proceeding, make sure that the county centroids are added to the data table (follow the instructions in Section 6.2.1 on p. 39). In the example, we will refer to these centroids as XCOO and YCOO.

Figure 10.1: Conditional plot function.

Figure 10.2: Conditional scatter plot option.

A conditional plot consists of 9 micro plots, each computed for a subset of the observations. The subsets are obtained by conditioning on two other variables. Three intervals for each variable (for a total of 9 pairs of intervals) define the subsets.
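The subsetting behind the nine micro plots can be sketched in a few lines of Python. This is only an illustration, not the program's actual implementation: the function names, the made-up sample rows, and the rule that a value lying exactly on a break goes to the lower interval are all assumptions. The equal-width rule for the default intervals is also an assumption, although it is consistent with the XCOO intervals reported for Figure 10.5 later in this section.

```python
def equal_breaks(values, k=3):
    """Interior break points cutting [min, max] into k equal-width intervals (assumed default)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def interval_index(value, breaks):
    """Index of the interval holding value (0 = lowest); a value on a break stays in the lower one."""
    return sum(value > b for b in breaks)


def condition(rows, xvar, yvar, k=3):
    """Group rows (dicts) into a k-by-k grid keyed by (x interval, y interval)."""
    xb = equal_breaks([r[xvar] for r in rows], k)
    yb = equal_breaks([r[yvar] for r in rows], k)
    cells = {(i, j): [] for i in range(k) for j in range(k)}
    for r in rows:
        key = (interval_index(r[xvar], xb), interval_index(r[yvar], yb))
        cells[key].append(r)
    return cells


# Hypothetical observations; the coordinates are invented for the illustration.
rows = [
    {"id": 1, "XCOO": 0.0, "YCOO": 0.0},
    {"id": 2, "XCOO": 1.5, "YCOO": 0.2},
    {"id": 3, "XCOO": 3.0, "YCOO": 3.0},
    {"id": 4, "XCOO": 0.4, "YCOO": 2.9},
    {"id": 5, "XCOO": 1.6, "YCOO": 1.4},
]
cells = condition(rows, "XCOO", "YCOO")
for key in sorted(cells):
    print(key, [r["id"] for r in cells[key]])
```

Each key is a pair of interval indices, so (0, 0) is the cell with the lowest values on both conditioning variables and (2, 2) the cell with the highest. Every observation falls in exactly one of the nine cells, and some cells may be empty.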

Start the conditional plot by clicking the associated toolbar icon, or by selecting Explore > Conditional Plot from the menu, as in Figure 10.1. This brings up a dialog with a choice of four different types of plots, shown in Figure 10.2. Two of the plots are univariate (histogram and box plot, covered in Exercise 7), one is bivariate (scatter plot, covered in Exercise 8), and one is a map.¹ Select the radio button next to Scatter Plot and click OK.

Next, you need to specify the conditioning variables as well as the variables of interest. In the variable selection dialog, shown in Figure 10.3 on p. 71, select a variable from the drop down list and move it to the text boxes on the right hand side by clicking on the > button. Specifically, choose XCOO for the first conditioning variable (the X variable in the dialog), YCOO for the second conditioning variable (Y variable), POLICE as the variable for the y-axis in the scatter plot (Variable 1), and CRIME as the variable for the x-axis (Variable 2).² The complete setup should be as in Figure 10.4.

¹ Conditional maps are covered in Exercise 12.

Figure 10.3: Conditional scatter plot variable selection.

Figure 10.4: Variables selected in conditional scatter plot.

Click on OK to create the conditional scatter plot, shown in Figure 10.5 on p. 72.

Consider the figure more closely. On the horizontal axis is the first conditioning variable, which we have taken to be the location of the county centroid in the West-East dimension. The counties are classified in three “bins,” depending on whether their centroid X coordinate (XCOO) falls in the range −91.44603 to −90.37621, −90.37621 to −89.30639, or −89.30639 to −88.23657. The second conditioning variable is on the vertical axis and corresponds to the South-North dimension, again resulting in three intervals. Consequently, the nine micro plots correspond to subsets of the counties arranged by their geographic location from the southwestern corner to the northeastern corner.³

The scatter plots suggest some strong regional differences in the slope of the regression of police expenditures on crime. Note that this is still exploratory and should be interpreted with caution. Since the number of observations in each of the plots differs, the precision of the estimated slope coefficient will differ as well. This would be taken into account in a more rigorous comparison, such as in an analysis of variance. Nevertheless, the plots confirm the strong effect of the state capital on this bivariate relationship (the middle plot in the left hand column), which yields by far the steepest slope. In two of the plots, the slope is even slightly negative.

² The dialog is similar for the other conditional plots, except that only one variable can be specified in addition to the conditioning variables.

³ The conditioning variables do not have to be geographic, but can be any dimension of interest.

Figure 10.5: Conditional scatter plot.

The categories can be adjusted by moving the “handles” sideways or up and down. The handles are the small circles on the two interval bars. To convert the matrix of plots to a two by two format by collapsing the east-most and northern-most categories together, pull the right-most handle to the right, as in Figure 10.6 on p. 73. Similarly, pull the top-most handle to the top. The two by two classification still suggests a difference for the western-most counties. Experiment with moving the classification handles to get a better sense of how the plots change as the definition of the subsets is altered. Also, try using different variables (such as tax and white) as the conditioning variables.
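The effect of moving a handle on the slopes can be mimicked outside the program with a short Python sketch. It is an illustration under stated assumptions, not the program's own computation: the function names are hypothetical, the seven sample rows and the YCOO breaks are invented, and a value lying exactly on a break is assigned to the lower interval. Only the two XCOO breaks are taken from the intervals quoted above. Pulling the right-most handle all the way to the right is represented by setting the upper break to the largest XCOO value, which leaves the east-most category empty.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x; None with fewer than 2 points or no variation in x."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def cell_slopes(rows, xbreaks, ybreaks):
    """Observation count and slope of POLICE on CRIME for each non-empty (XCOO, YCOO) cell."""
    cells = {}
    for r in rows:
        i = sum(r["XCOO"] > b for b in xbreaks)  # interval index in the West-East dimension
        j = sum(r["YCOO"] > b for b in ybreaks)  # interval index in the South-North dimension
        cells.setdefault((i, j), []).append(r)
    return {
        key: (len(obs), ols_slope([o["CRIME"] for o in obs], [o["POLICE"] for o in obs]))
        for key, obs in cells.items()
    }


# Invented observations, all placed in the southern-most band to keep the example small.
rows = [
    {"XCOO": -91.0, "YCOO": 31.0, "CRIME": 1.0, "POLICE": 2.0},
    {"XCOO": -91.2, "YCOO": 31.2, "CRIME": 2.0, "POLICE": 4.0},
    {"XCOO": -91.1, "YCOO": 31.1, "CRIME": 3.0, "POLICE": 6.0},
    {"XCOO": -90.0, "YCOO": 31.0, "CRIME": 1.0, "POLICE": 1.0},
    {"XCOO": -89.9, "YCOO": 31.3, "CRIME": 3.0, "POLICE": 2.0},
    {"XCOO": -88.5, "YCOO": 31.1, "CRIME": 2.0, "POLICE": 1.0},
    {"XCOO": -88.4, "YCOO": 31.2, "CRIME": 4.0, "POLICE": 0.0},
]
ybreaks = [32.0, 33.0]  # hypothetical YCOO breaks

# Three West-East categories, using the XCOO breaks quoted in the text.
three_cols = cell_slopes(rows, [-90.37621, -89.30639], ybreaks)

# Two categories: the upper break is moved to the maximum XCOO value.
x_max = max(r["XCOO"] for r in rows)
two_cols = cell_slopes(rows, [-90.37621, x_max], ybreaks)

print(three_cols)
print(two_cols)
```

The count returned with each slope is a reminder that the cells hold different numbers of observations, so the slopes are not estimated with equal precision. In this made-up example, merging the middle and eastern categories pools their observations into one cell with a single slope, while the western-most cell is unaffected.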

Figure 10.6: Moving the category breaks in a conditional scatter plot.