5.5 Density Plot

A recent development that has become feasible with the advent of computer graphics, the density plot, extends the concept of the histogram to a more modern version. A density plot is a kind of idealized histogram in which the bin width of a histogram diminishes to zero, leaving a smooth curve instead of a jagged histogram. This smooth curve better represents the shape of the underlying distribution, which likely is not characterized by the sharp edges of rectangles, but rather a smooth continuity. The most well-known example of this smooth frequency-like density curve is the normal curve.

density plot: Smoothed out histogram, such as a normal curve.

Continuous Variables 115

5.5.1 Default Density Plot and Analysis

A density curve can be estimated whenever a histogram is computed.

Scenario Obtain a density plot of a continuous variable

From a sample of data values of Salary from the Employee data set, estimate the corresponding smooth curve, the density curve, that approximates the true shape of the underlying distribution of Salary without the jagged edges of histogram bins. Also simultaneously show the estimated density curve superimposed over a histogram of the data.

The lessR function for a density plot is Density, abbreviated dn. This function presents several enhancements over plotting the output of the corresponding R function density. The function Density by default imposes over a histogram both the normal densities as well as a general density curve.

normal densities: The smooth normal curve.

To invoke the lessR function, just enter its name and the relevant variable name, enclosed in parentheses.

Density function: Plot smoothed normal and general density distributions.

lessR Input Density plot

> Density(Salary)

or

> dn(Salary)

This simple statement results in the histogram and two density curves in Figure 5.8. The data file includes the variable labels. That way, the more descriptive variable label appears on the graph instead of the more concise but less descriptive variable name.

Figure 5.8 Default histogram with superimposed normal and general density curves. Horizontal axis: Annual Salary (USD).
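For comparison, a display of this kind can be roughly approximated with base R alone. The following is only a sketch under stated assumptions: the salary values are simulated stand-ins rather than the Employee data, and the styling does not attempt to match the lessR output. It uses the base R functions hist, dnorm, and density.

```r
# Simulated, right-skewed stand-in for Salary (assumption: NOT the Employee data)
set.seed(1)
salary <- round(rlnorm(37, meanlog = log(70000), sdlog = 0.3))

# Histogram on the density scale so that the bars and curves share one axis
h <- hist(salary, freq = FALSE, main = "", xlab = "Annual Salary (USD)")

# Normal curve based on the sample mean and standard deviation
curve(dnorm(x, mean = mean(salary), sd = sd(salary)), add = TRUE, lty = 2)

# General (kernel) density estimate with the default bandwidth
d <- density(salary)
lines(d)
```

Plotting the histogram with freq = FALSE matters here: the bars then have a total area of 1, the same scale on which both density curves are drawn.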

The Density function also generates text output. One key aspect of data analysis is the actual sample size that underlies the analysis. There are 37 rows of data in the data table, but it is possible that there is much missing data for any one particular variable.

Sample Size: 37 Missing Values: 0

Here all 37 potential values of Salary are present.
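The distinction between rows in the data table and values actually analyzed can be illustrated in base R. This is a minimal sketch with hypothetical values, two of which are missing; it mimics the layout of the text output above but is not the lessR implementation.

```r
# Hypothetical values with two missing entries (assumption: NOT the Employee data)
salary <- c(43000, 51500, NA, 67250, 88400, NA, 59900)

n_total   <- length(salary)       # rows in the data table
n_missing <- sum(is.na(salary))   # missing values for this variable
n_present <- n_total - n_missing  # values that enter the analysis

cat("Sample Size:", n_present, "  Missing Values:", n_missing, "\n")
```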

Scenario Test the hypothesis that a distribution is normal

A sample of data values is never perfectly normal. Is it reasonable, however, to conclude that the population from which the data values are sampled is normal? Use the Shapiro–Wilk statistic to test the null hypothesis that a distribution of data values is sampled from a normal distribution.

This text output includes the formal test of the null hypothesis that the distribution is sampled from a normal population.

Null hypothesis is a normal population
Shapiro-Wilk normality test: W = 0.9117, p-value = 0.0063
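The same kind of test is available in base R as shapiro.test. The sketch below applies it to two simulated samples of 37 values, one drawn from a normal population and one from a strongly right-skewed population. Both samples are assumptions made for illustration, so the resulting values will differ from those reported for Salary.

```r
set.seed(2)
x_normal <- rnorm(37, mean = 70000, sd = 20000)  # sampled from a normal population
x_skewed <- rexp(37, rate = 1 / 70000)           # strongly right-skewed population

sw_normal <- shapiro.test(x_normal)
sw_skewed <- shapiro.test(x_skewed)

# W lies between 0 and 1; values near 1 are consistent with normality
c(W = unname(sw_normal$statistic), p = sw_normal$p.value)
c(W = unname(sw_skewed$statistic), p = sw_skewed$p.value)
```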

The test statistic, computed under the assumption of the null hypothesis that the population is normal, is the Shapiro–Wilk statistic, W = 0.9117. Values of W close to 1 are consistent with normality, so how far below 1 is too far? The p-value provides the answer. If the null hypothesis is true, then the probability of obtaining a W statistic as small as or smaller than W = 0.9117 is only p-value = 0.006, less than the usual alpha criterion of α = 0.05, the definition of an improbable event. The value of W is too small to be consistent with the null hypothesis of normality, a result due in part to the slight right skew, the small tail on the right-hand side of the distribution.

bandwidth: Extent of diminishing influence of nearby values to calculate the position of a point on a density curve.

Also reported is the bandwidth used to construct the estimated density curve.

Density bandwidth for general curve: 9529.045
For a smoother curve, increase bandwidth with option: bw

To estimate each point on the curve, the surrounding data values are considered with a set of diminishing weights. Data values close to the given point are given much influence in the location of the point on the density curve, whereas data values far from the given point have little if any influence on the location of the given position on the estimated curve. The bandwidth option, bw, specifies the influence that data values have on the location of the current point on the density curve depending on their distance from that point. Increasing the bandwidth option, bw, further smooths the graph because more surrounding data values contribute to the location of each point on the density curve.

bandwidth option: Set the bandwidth used to estimate the density curve.
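The weighting scheme described above can be sketched directly in base R. The example assumes a Gaussian kernel, which is the default of the base R density function but is an assumption with respect to lessR, and it uses simulated data. The helper names kde_at, curve_at, and n_peaks are hypothetical, introduced only for this illustration.

```r
# Kernel density estimate at a single point x0: the average of Gaussian
# weights, one per data value, where the weight falls off with the distance
# between the data value and x0 at a rate set by the bandwidth bw
kde_at <- function(x0, x, bw) {
  mean(dnorm(x0, mean = x, sd = bw))
}

# Simulated data with two clusters (assumption: NOT the Employee data)
set.seed(3)
x <- c(rnorm(20, 50000, 8000), rnorm(17, 90000, 12000))

# Evaluate the estimated curve over a grid of points for a given bandwidth
grid <- seq(min(x) - 30000, max(x) + 30000, length.out = 400)
curve_at <- function(bw) sapply(grid, kde_at, x = x, bw = bw)

# Count local maxima: places where the curve changes from rising to falling
n_peaks <- function(y) sum(diff(sign(diff(y))) < 0)

narrow <- curve_at(bw = 2000)   # small bandwidth: little smoothing
wide   <- curve_at(bw = 30000)  # large bandwidth: heavy smoothing

c(narrow = n_peaks(narrow), wide = n_peaks(wide))
```

With a small bandwidth only the nearest data values carry any weight, so the curve tends to follow individual clusters of points. With a large bandwidth distant values also contribute, and the number of peaks cannot increase.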

5.5.2 Other Available Options

Variable specifications. The variable plotted occupies the first position in the list of options. If the variable is in the data frame mydata, then the name of the corresponding data frame need not be specified. Otherwise, specify the relevant data frame with data.

Colors. Either use the default blue color theme or choose the color theme with the set function. The individual colors of the background and grid lines are set by col.bg and col.grid, respectively. Set the color of the histogram bars with col.fill. The borders of the curves default to "black", but can be changed according to col.nrm and col.gen. The fill color for the normal curve, col.fill.nrm, and general density curve, col.fill.gen, are each set to be partially transparent so that their overlap can be directly viewed, as well as the histogram plotted behind them. Set any of these colors to "transparent" to remove the corresponding fill color from the graph.

set function, Section 1.4.1, p. 16

Bins. The default specification for setting the histogram bins is the same as Histogram, the "Sturges" algorithm. Obtain another set of bins with either one or both of bin.start and bin.width. The breaks option is not set at the user level as it is for Histogram.

bin.start, bin.width options, Section 5.2.2, p. 102

To apply some of these options, revise the first density plot in Figure 5.8. First, the histogram is too jagged, so a new density plot is generated with a bin width for the histogram of $15,000. Also, the general density curve is a bit wobbly on its right side, a characteristic that probably reflects sampling error. Increasing the bandwidth to $12,000 further smooths this curve.

Generate the revised graph in Figure 5.9 according to the following function call.

lessR Input Density plot with specified histogram and bandwidth

> Density(Salary, bin.start=25000, bin.width=15000,
          type="general", bw=12000)

Figure 5.9 Histogram with superimposed general density curve and customized histogram. Horizontal axis: Annual Salary (USD).
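To see what a starting value and a bin width imply for the histogram, the bin boundaries can be constructed by hand in base R. This is a sketch under stated assumptions: the data are simulated, the variable names bin_start and bin_width are hypothetical, and the construction of the boundaries is one plausible reading of these options rather than the lessR implementation.

```r
# Simulated stand-in for Salary (assumption: NOT the Employee data)
set.seed(4)
salary <- round(runif(37, min = 30000, max = 130000))

bin_start <- 25000
bin_width <- 15000

# Equal-width boundaries from bin_start up to the first boundary
# at or above the largest data value
n_bins <- ceiling((max(salary) - bin_start) / bin_width)
breaks <- seq(from = bin_start, by = bin_width, length.out = n_bins + 1)

h <- hist(salary, breaks = breaks, plot = FALSE)  # bin counts only, no plot
d <- density(salary, bw = 12000)                  # fixed bandwidth, not the default
```

The objects h and d can then be drawn together with plot and lines, provided the histogram is shown on the density scale.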

This graph succinctly summarizes the distribution of Salary for those 37 employees with a smooth curve. The plot reveals the minimum and maximum values and the relative frequency of data values across that range. The slight right skew is also apparent from this figure.
