EDA Basics, Linking

7.3 Linking Box Plots

Figure 7.10: Histogram with 12 intervals.

Figure 7.11: Base map for St. Louis homicide data set.

The second basic EDA technique to depict the non-spatial distribution of a variable is the box plot (sometimes referred to as box and whisker plot). It shows the median and the first and third quartiles of a distribution (the 50%, 25% and 75% points in the cumulative distribution) as well as a notion of outlier. An observation is classified as an outlier when it lies more than a given multiple of the interquartile range (the difference in value between the 75% and 25% observations) above the value for the 75th percentile or below the value for the 25th percentile. The standard multiples used are 1.5 and 3 times the interquartile range. GeoDa supports both values.
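The outlier rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample values are made up, and the quartiles come from the "inclusive" interpolation method of the standard library, which may differ slightly from the quartile computation GeoDa uses internally.

```python
from statistics import quantiles


def box_plot_fences(values, hinge=1.5):
    """Return (q1, median, q3, lower_fence, upper_fence) for a list of numbers.

    `hinge` is the multiple of the interquartile range (1.5 or 3.0 in the text).
    """
    # Assumption: "inclusive" quartile interpolation, not necessarily GeoDa's.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, median, q3, q1 - hinge * iqr, q3 + hinge * iqr


def outliers(values, hinge=1.5):
    """Values lying below the lower fence or above the upper fence."""
    _, _, _, lower, upper = box_plot_fences(values, hinge)
    return [v for v in values if v < lower or v > upper]


# Made-up rates, not taken from the St. Louis data set.
rates = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 20.0, 40.0]

print(box_plot_fences(rates, hinge=1.5))  # (3.25, 5.5, 7.75, -3.5, 14.5)
print(outliers(rates, hinge=1.5))         # [20.0, 40.0]
print(outliers(rates, hinge=3.0))         # [40.0]
```

With the larger multiple the fences move further from the quartiles, so fewer observations qualify as outliers.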

Figure 7.12: Box plot function.

Figure 7.13: Variable selection in box plot.

Clear all windows and start a new project using the stl_hom.shp homicide sample data set (use FIPSNO as the Key). The opening screen should show the base map with 78 counties as in Figure 7.11 on p. 49. Invoke the box plot by selecting Explore > Box Plot from the menu (Figure 7.12), or by clicking on the Box Plot toolbar icon. Next, choose the variable HR8893 (homicide rate over the period 1988–93) in the dialog, as in Figure 7.13. Click on OK to create the box plot, shown in the left hand panel of Figure 7.14 on p. 51. The rectangle represents the cumulative distribution of the variable, sorted by value. The value in parentheses in the upper right corner is the number of observations.

The red bar in the middle corresponds to the median, and the dark part shows the interquartile range (going from the 25th percentile to the 75th percentile). The individual observations in the first and fourth quartiles are shown as blue dots. The thin line is the hinge, here corresponding to the default criterion of 1.5. With this criterion, six counties are classified as outliers for this variable.

Figure 7.14: Box plot using 1.5 as hinge.

Figure 7.15: Box plot using 3.0 as hinge.

Figure 7.16: Changing the hinge criterion for a box plot.

The hinge criterion determines how extreme observations need to be before they are classified as outliers. It can be changed by selecting Option > Hinge from the menu, or by right clicking in the box plot itself, as shown in Figure 7.16. Select 3.0 as the new criterion and observe how the number of outliers gets reduced to 2, as in the right hand panel of Figure 7.15.

Specific observations in the box plot can be selected in the usual fashion, by clicking on them, or by click-dragging a selection rectangle. The selection is immediately reflected in all other open windows through the linking mechanism. For example, make sure you have the table and base map open for the St. Louis data. Select the outlier observations in the box plot by dragging a selection rectangle around them, as illustrated in the upper left panel of Figure 7.17 on p. 52. Note how the selected counties are highlighted in the map and in the table (you may need to use the Promotion feature to get the selected counties to show up at the top of the table). Similarly, you can select rows in the table and see where they stack up in the box plot (or any other graph, for that matter).
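Conceptually, linking means that every open view shares a single set of selected observation keys. The sketch below illustrates that idea only; it is not GeoDa's implementation. The class names, the `promote` helper and the key values are hypothetical, with the keys standing in for FIPSNO values.

```python
class SelectionHub:
    """Holds the shared selection and notifies every registered view."""

    def __init__(self):
        self.views = []
        self.selected = frozenset()

    def register(self, view):
        self.views.append(view)

    def select(self, keys):
        # A selection made in any view is pushed to all views.
        self.selected = frozenset(keys)
        for view in self.views:
            view.on_selection(self.selected)


class LinkedView:
    """Stand-in for a box plot, map or table window."""

    def __init__(self, name, hub):
        self.name = name
        self.highlighted = frozenset()
        hub.register(self)

    def on_selection(self, keys):
        self.highlighted = keys


def promote(row_keys, selected):
    """Move selected rows to the top, keeping the original order otherwise."""
    return sorted(row_keys, key=lambda key: key not in selected)


hub = SelectionHub()
box_plot = LinkedView("box plot", hub)
base_map = LinkedView("base map", hub)
table = LinkedView("table", hub)

hub.select({103, 105})          # e.g. a rectangle dragged in the box plot
rows = [101, 102, 103, 104, 105]
print(sorted(base_map.highlighted))       # [103, 105]
print(promote(rows, table.highlighted))   # [103, 105, 101, 102, 104]
```

Because the selection is stored as keys rather than as screen positions, it does not matter which window the selection originated in.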

Figure 7.17: Linked box plot, table and map.

Make a small selection rectangle in the box plot, hold down the Control key and let go. The selection rectangle will blink. This indicates that you have started brushing, which is a way to change the selection dynamically. Move the brush slowly up or down over the box plot and note how the selected observations change in all the linked maps and graphs. We return to brushing in more detail in Exercise 8.
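Brushing can be thought of as a selection window of fixed size that slides along the value axis, with the selection recomputed at every position. The following minimal sketch assumes a made-up dictionary of rates keyed by hypothetical observation identifiers; the function name `brush_selection` is also an illustration, not part of GeoDa.

```python
def brush_selection(values_by_key, low, height):
    """Keys whose value falls inside the brush window [low, low + height]."""
    high = low + height
    return {key for key, value in values_by_key.items() if low <= value <= high}


# Made-up rates keyed by hypothetical observation identifiers.
rates = {101: 2.0, 102: 4.5, 103: 5.0, 104: 9.0, 105: 15.0}

# Move a brush of height 3.0 upward and watch the selection change.
for low in (0.0, 3.0, 6.0, 12.0):
    print(low, sorted(brush_selection(rates, low, height=3.0)))
# 0.0 [101]
# 3.0 [102, 103]
# 6.0 [104]
# 12.0 [105]
```

In a linked environment, each new selection would be passed on to the other open views as the brush moves.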

7.4 Practice

Use the St. Louis data set and histograms linked to the map to investigate the regional distribution (e.g., East vs West, core vs periphery) of the homicide rates (HR****) in the three periods. Use the box plot to assess the extent to which outlier counties are consistent over time. Use the table to identify the names of the counties as well as their actual number of homicides (HC****).

Alternatively, experiment with any of the other polygon sample data sets to carry out a similar analysis, e.g., investigating outliers in SIDS rates in North Carolina counties, or in the greenness index for the NDVI regular grid data.
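The question of whether outlier counties are consistent over time can also be checked outside GeoDa once the rates are exported. The sketch below is one possible approach under stated assumptions: the period labels, keys and rates are made up, the helper name `outlier_keys` is hypothetical, and the quartiles use the standard library's "inclusive" method, which may not match GeoDa exactly.

```python
from statistics import quantiles


def outlier_keys(values_by_key, hinge=1.5):
    """Keys whose value lies beyond hinge * IQR from the quartiles."""
    # Assumption: "inclusive" quartile interpolation, not necessarily GeoDa's.
    q1, _, q3 = quantiles(values_by_key.values(), n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - hinge * iqr, q3 + hinge * iqr
    return {key for key, v in values_by_key.items() if v < lower or v > upper}


# Made-up rates for eight hypothetical counties in three hypothetical periods.
rates_by_period = {
    "period_1": {1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0, 5: 6.0, 6: 7.0, 7: 30.0, 8: 25.0},
    "period_2": {1: 2.5, 2: 3.5, 3: 4.0, 4: 5.5, 5: 6.0, 6: 28.0, 7: 35.0, 8: 7.0},
    "period_3": {1: 3.0, 2: 3.0, 3: 5.0, 4: 5.0, 5: 6.5, 6: 8.0, 7: 40.0, 8: 9.0},
}

flagged = {period: outlier_keys(rates) for period, rates in rates_by_period.items()}

always_outlier = set.intersection(*flagged.values())  # outlier in every period
ever_outlier = set.union(*flagged.values())           # outlier in at least one

print(flagged)         # {'period_1': {8, 7}, ...} (set order may vary)
print(always_outlier)  # {7}
print(sorted(ever_outlier))  # [6, 7, 8]
```

Keys in `always_outlier` correspond to counties that stay extreme throughout, while the remaining keys in `ever_outlier` are extreme in only some periods.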