Scatter Plot and Box Plot

5.4 Scatter Plot and Box Plot

The histogram is the traditional graphical presentation of the distribution of a continuous variable, but there are other possibilities.

5.4.1 One-variable Scatter Plot

one-variable scatter plot: Each value plotted, such as with a small circle, along the value axis.

Perhaps the simplest plot of a distribution of values for a continuous variable is its scatter plot, or dot plot. The scatter plot for a single variable is particularly appropriate for a relatively small number of data values. For each data value along a numerical scale, the scatter plot includes a mark, usually plotted as a dot. The scatter plot applies to the analysis of the employee data with measurements on 37 employees, but would be less effective for 370 employees. The distribution of a large number of individual data values is better expressed with the bins of a histogram.

Scenario: Obtain the one-variable scatter plot

The salaries have been recorded for only 37 employees, a number small enough that the individual salaries could be successfully plotted as individual points. How are these 37 salaries distributed?

Continuous Variables 111

Obtain the scatter plot of a variable as with any other lessR function: specify the function name, here ScatterPlot, and then the relevant variable name enclosed in parentheses. As with most lessR functions, there is also an abbreviated form of the function name, here sp.

ScatterPlot function: One- or two-variable scatter plot.

lessR Input: Scatter plot for one variable, or dot plot

> ScatterPlot(Salary)

or

> sp(Salary)

The output of ScatterPlot for the portrayal of the distribution of the 37 sample values of Salary is shown in Figure 5.5. To facilitate interpretation, each point by default is plotted with a transparent background so that the overlapped points display in a darker color to indicate the extent of the overlap.

Figure 5.5 Gray scale lessR one-variable scatter plot, sometimes called a dot plot. Horizontal axis: Annual Salary (USD).

outlier, Section 5.3.1, p. 106

potential outlier: A value that could be an outlier.

actual outlier: A value that appears to be an outlier.

An outlier is a value in a distribution that is considerably different from most of the other values. For graphs with color, a potential outlier, following Tukey’s (1977) definition explained in Section 5.4.3 (page 113), is displayed in dark red. Actual outliers, more extreme data values, are labeled in a more vivid red. For the gray scale plot in Figure 5.5, potential and actual outliers are displayed with a square and a diamond, respectively. The largest value in this distribution is considered a potential outlier according to this definition.

5.4.2 Available Options for the ScatterPlot Function

Variable specifications. The variable to plot is the first value supplied to the ScatterPlot function. If the variable is in the data frame mydata, then the name of the corresponding data frame need not be specified. Otherwise, specify the relevant data frame with data.

color themes, Section 1.4.1, p. 16

col.fill, etc., Table 1.1, p. 17

Colors. The default colors are set by the current color theme from the set function. Any color theme can be modified by choosing an individual color of the bars, the bar borders, the background and the grid lines with col.fill, col.stroke, col.bg, and col.grid, respectively. The color of the points for outliers is set by col.out15 with a default of "firebrick4" and by col.out30 for more extreme outliers with a default of "firebrick2". Set any of these colors to "transparent" to remove the color.

Plot symbols. For regular points, not outliers, the default plot symbol is a circle, with a default of partial transparency, as specified by the set function option trans.fill.pt. The plot symbol for outliers is a solid circle. The corresponding options, pt.reg and pt.out, are set equal to their default values of the numbers 21 and 19, respectively. Obtain the list of available plotting symbols and their corresponding reference numbers with ?points. For example, an unfilled diamond is number 23.
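The symbol numbers mentioned above are the standard R pch codes, so they can be previewed with base R graphics alone. The following is a minimal sketch, not lessR output: the values in x and the names in symbols are hypothetical and serve only to place and label three symbols.

```r
# Hypothetical positions, used only to display the plotting symbols.
x <- c(2, 4, 6)

# pch codes named in the text: 21 (regular points), 19 (outliers), 23 (diamond).
symbols <- c(regular = 21, outlier = 19, diamond = 23)

# Draw each symbol on a single row, with the y-axis suppressed.
plot(x, rep(1, 3), pch = unname(symbols), cex = 2, yaxt = "n",
     xlab = "", ylab = "", xlim = c(1, 7), ylim = c(0, 2))

# Label each symbol with its name and pch code.
text(x, rep(0.5, 3), labels = paste0(names(symbols), " (", symbols, ")"))
```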

Labels. As applies to most R graphs, set the label for the x-axis with the xlab option. Set the graph title with the main option. If variable labels are present, then the axis label is set to the variable label unless overridden with the xlab option.

plot.method option: Set to "jitter" to randomly move points.

Other options. More options are available that are defined in the constituent R function stripchart upon which ScatterPlot relies for a one-variable plot. One default setting for ScatterPlot is method="stack", which means that multiple points with the same value are stacked on top of each other instead of overprinting. Another possibility is to set method="jitter", which randomly moves each point up or down so that points with the same value are not aligned over each other. Control the amount of jitter with a separate option, jitter, with a default setting of 0.1.
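Because these two methods belong to the base R function stripchart, their effect can be compared directly without lessR. The sketch below assumes a small set of hypothetical salaries with tied values, not the employee data, and calls stripchart rather than ScatterPlot.

```r
# Hypothetical salaries in thousands of USD with repeated values.
salary <- c(40, 40, 40, 45, 45, 50, 55, 55, 60, 90)

# How many points share each value: the height of each stack.
stack_heights <- table(salary)
print(stack_heights)

# Two panels, one per method, for comparison.
old_par <- par(mfrow = c(2, 1))

# method = "stack": tied values pile up instead of overprinting.
stripchart(salary, method = "stack", pch = 21, main = "stack")

# method = "jitter": each point is moved by a small random amount.
set.seed(1)  # makes the random jitter reproducible
stripchart(salary, method = "jitter", jitter = 0.1, pch = 21, main = "jitter")

par(old_par)  # restore the previous graphics settings
```

With stacking, the three salaries of 40 appear as a column of three points. With jittering, the same three points are scattered slightly so that none hides another.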

As is true with all the graphic plots, more options from the graphics function par are also available. To control the size of the axis labels, use cex.axis. For the size of the axis names, use cex.names. Both have a default magnification factor of 1. Also, the graph’s margins can be set with the mar option, as explained in ?par.

5.4.3 Box Plot

IQR, Section 5.3.1, p. 106

The “box” in a box plot is based on the interquartile range or IQR, the positive difference between what are essentially the first and third quartiles. The IQR is the range of data that contains the middle 50% of all the data values. The width of the box is approximately the IQR, with a line through the median and perpendicular lines extending out from the edges. Tukey did not literally use the first and third quartiles, but rather an approximation called “hinges”, apparently because they are easier to compute than quartiles, an important consideration with pre-computer technology. For our purposes, we consider the box plot based on the quartiles, which are nearly, if not exactly, equal to these hinges.
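The small difference between hinges and quartiles can be seen with base R. The sketch below uses a hypothetical sample of eight values. It assumes the hinges as returned by fivenum and the quartiles as returned by the default method of quantile, which is only one of several quartile definitions.

```r
# Hypothetical sample of eight values; not the employee data.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum.
hinges <- fivenum(x)[c(2, 4)]

# First and third quartiles from the default quantile method.
quartiles <- unname(quantile(x, probs = c(0.25, 0.75)))

print(hinges)     # 2.5 6.5
print(quartiles)  # 2.75 6.25

# Width of the box under each definition.
print(diff(hinges))  # 4
print(IQR(x))        # 3.5
```

For this sample the hinges and quartiles differ by a quarter of a unit. With larger samples the two usually move closer together.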

Scenario: Generate a box plot

The employee data set contains the salaries of 37 employees. What is the pattern of the distribution of these salaries? To visualize this distribution, generate the box plot of these salaries and identify any potential and actual outliers. Also display the basic summary statistics of Salary.

potential outlier: An extreme value that may be an outlier.

actual outlier: An extreme value that likely is an outlier.

whisker: A line from a box’s edge that extends to the most extreme data value that is not a potential outlier.

The box plot is particularly useful to identify outliers. The inventor of the box plot, Tukey (1977), identified two types of outliers, based on the concept of an IQR. The potential outlier lies between 1.5 IQRs and 3.0 IQRs from the edges of the box. An actual outlier, according to this definition, lies more than 3.0 IQRs from either edge of the box. Points past the whisker are likely outliers. There are many ways to define an outlier, but Tukey’s definition appears to work well in practice.

Obtain the lessR box plot with BoxPlot, abbreviated bx.

lessR Input: Box plot

> BoxPlot(Salary)

or

> bx(Salary)

BoxPlot function: Generate a box plot.

The resulting box plot of Salary in Figure 5.6 shows one potential outlier beyond the right whisker. The plot also shows some right-tailed skew: the part of the box to the right of the median bar is longer than the part to the left, and the right whisker is longer than the left whisker. Because variable labels are present, the label for the horizontal axis, or x-axis, is provided automatically.

Figure 5.6 Default lessR box plot. Horizontal axis: Annual Salary (USD).
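Tukey’s two categories of outliers, described earlier in this section, can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the computation that BoxPlot performs: the salaries are hypothetical, the box edges are taken as the quartiles from the default method of quantile, and the names beyond and status are introduced only for this example.

```r
# Hypothetical salaries in thousands of USD; not the employee data.
salary <- c(40, 45, 50, 52, 55, 58, 60, 65, 70, 110, 160)

# Box edges (first and third quartiles) and the width of the box.
q <- unname(quantile(salary, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Distance of each value beyond the nearer box edge, in IQR units.
# Values inside the box get a distance of 0.
beyond <- pmax(q[1] - salary, salary - q[2], 0) / iqr

# Up to 1.5 IQRs: regular. From 1.5 to 3.0 IQRs: potential outlier.
# More than 3.0 IQRs: actual outlier.
status <- cut(beyond, breaks = c(-Inf, 1.5, 3, Inf),
              labels = c("regular", "potential outlier", "actual outlier"))

print(data.frame(salary, beyond = round(beyond, 2), status))
```

In this made-up sample the box runs from 51 to 67.5, so the IQR is 16.5. The value 110 lies about 2.6 IQRs past the upper edge and is flagged as a potential outlier, and the value 160 lies about 5.6 IQRs past it and is flagged as an actual outlier.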

5.4.4 Available Options

Variable specifications. The variable plotted occupies the first position in the list of options, so list its name first. If the variable is not in the data frame mydata, then specify the relevant data frame with data.

color themes, Section 1.4.1, p. 16

col.fill, etc., Table 1.1, p. 17

Colors. Either rely upon the default blue color theme, or set the colors with the current color theme from the set function. Modify any color theme by choosing an individual color of the bars, the bar borders, the background and the grid lines with col.fill, col.stroke, col.bg, and col.grid, respectively.

Orientation. The default orientation is a horizontal plot. To plot vertically, specify horiz=FALSE.

add.points=TRUE option: Superimpose a dot plot over the box plot.

The concept of a box plot and a one-variable scatter plot, or dot plot, can be combined on the same graph. To do this, invoke the add.points=TRUE option for the box plot.


lessR Input: Box plot with superimposed scatter plot

> BoxPlot(Salary, add.points=TRUE)

The combined box plot and dot plot in Figure 5.7 shows in one graph the highlighted potential outlier, the overall shape of the distribution, and the specific values upon which the box plot is based.

Figure 5.7 Box plot with superimposed scatter plot.

The text output in Listing 5.6 from BoxPlot reveals the value of the potential outlier.

--- Salary, Annual Salary (USD) ---

Present: 37
Missing: 0
Total  : 37
...
Outlier: 124419.2

Listing 5.6 Outlier identification.
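A comparable overlay and outlier listing can be approximated with base R graphics. The sketch below is an assumption-laden stand-in rather than the lessR implementation: it uses hypothetical salaries, draws with boxplot and stripchart instead of BoxPlot, and relies on boxplot.stats, which flags any value more than 1.5 box widths beyond a hinge without separating potential from actual outliers.

```r
# Hypothetical salaries in thousands of USD; not the employee data.
salary <- c(40, 45, 50, 52, 55, 58, 60, 65, 70, 110)

# Horizontal box plot, the same orientation as the lessR default.
boxplot(salary, horizontal = TRUE, xlab = "Salary (hypothetical)")

# Superimpose the individual values as a one-variable scatter plot.
stripchart(salary, method = "stack", pch = 21, add = TRUE)

# Values that lie beyond the whiskers.
outliers <- boxplot.stats(salary)$out
print(outliers)
```

For these made-up values the hinges are 50 and 65, so the upper limit for a whisker is 87.5 and the single value reported is 110.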

Histograms also can be used to identify outliers, but the box plot explicitly identifies such data values as part of the plot.