5.3 Summary Statistics

Summary statistics describe characteristics of the distribution of sample data values. Summary statistics also describe characteristics of how two or more distributions relate to each other.

5.3.1 Overview of Summary Statistics

There are three different types of summary statistics: sample size information, parametric statistics, and order statistics. The sample size information is the number of actual data values present and the number of missing data values. The other types of statistics are described next. All three types of statistics are useful for summarizing a distribution, and all three should be examined for each variable of interest.

parametric statistics: Numbers that apply to interval data values.

interval data, Section 1.6.3, p. 23

mean: Arithmetic average.

Parametric statistics only apply to variables with at least interval data values. Interval data have equal intervals of magnitude between any two values one unit apart. Numbers that represent a lower quality of measurement only specify a rank ordering of the data values. The most commonly applied parametric statistics are the mean and the standard deviation. Less frequently encountered are the skewness and kurtosis.

The mean is the usual arithmetic average, the sum of all the data values divided by the total number of non-missing data values. Its purpose is to provide an indicator of the middle of the distribution. The mean is probably the most widely reported characteristic of a distribution.
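As a minimal base R sketch of the sample size information and the mean (not the lessR output itself), the following uses a small hypothetical vector with one missing value. The names x, n, n_miss, and m are illustrative assumptions.

```r
# Hypothetical data values, one of them missing (NA)
x <- c(52, 61, 47, NA, 58, 70)

n      <- sum(!is.na(x))   # number of non-missing data values
n_miss <- sum(is.na(x))    # number of missing data values

# Mean: sum of the non-missing values divided by their count
m <- sum(x, na.rm = TRUE) / n

# Base R's mean() agrees once the missing value is dropped
same_as_mean <- isTRUE(all.equal(m, mean(x, na.rm = TRUE)))

print(c(n = n, miss = n_miss, mean = m))
```

Without na.rm = TRUE, sum() and mean() return NA when any value is missing, which is why the count of non-missing values is tracked separately.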

106 Continuous Variables

standard deviation: Indicator of variability.

The standard deviation is an indicator of the variability of a distribution about the mean. If the standard deviation is large, then the values tend to be widely spread out about the mean. An example would be scores on a Final for a class that varied from 50% up to 100%, with scores equally distributed throughout this range. That is, many scores fall between 50% and 60%, between 60% and 70%, and so on up to many scores between 90% and 100%. If the standard deviation is small, then the values do not vary much about the mean. For the Final scores with a low standard deviation, most of the values might be between 85% and 95%, with only a few scores outside of this interval.
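The contrast between a large and a small standard deviation can be sketched in base R with two hypothetical sets of Final scores. Both vectors are invented for illustration and only mimic the two situations described.

```r
# Hypothetical Final scores (percent), spread evenly from 50 to 100
spread_out <- seq(50, 100, by = 2.5)

# Hypothetical Final scores concentrated between 85 and 95
concentrated <- c(85, 86, 88, 89, 90, 90, 91, 92, 94, 95)

# sd() computes the sample standard deviation (n - 1 in the denominator)
sd_spread <- sd(spread_out)
sd_conc   <- sd(concentrated)

# The evenly spread scores vary much more about their mean
print(round(c(spread_out = sd_spread, concentrated = sd_conc), 2))
```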

A key property of the standard deviation is that it relates closely to normally distributed data. For a population of normally distributed data, about 68.3% of all the data values are within one standard deviation of the mean, and about 95.4% of all data values are within two standard deviations of the mean. For example, height is approximately normally distributed, and for US women, the mean is about 65.5 inches with a standard deviation of 2.5 inches. Two standard deviations equals (2)(2.5) or 5 inches, so a little more than 95% of all US women have a height between 60.5 inches and 70.5 inches. This relationship between the standard deviation and the normal distribution lies at the core of much statistical analysis.
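The percentages for the normal distribution can be checked with the base R function pnorm, which returns the cumulative probability of a normal distribution. The helper within_sd is a hypothetical name introduced only for this sketch.

```r
# Proportion of a normal population within k standard deviations of the mean
within_sd <- function(k) pnorm(k) - pnorm(-k)

p1 <- within_sd(1)   # about 0.683
p2 <- within_sd(2)   # about 0.954

# Height example from the text: mean 65.5 inches, standard deviation 2.5 inches
mu    <- 65.5
sigma <- 2.5
lower <- mu - 2 * sigma   # 60.5
upper <- mu + 2 * sigma   # 70.5
p_height <- pnorm(upper, mean = mu, sd = sigma) -
            pnorm(lower, mean = mu, sd = sigma)

print(round(c(one_sd = p1, two_sd = p2, height = p_height), 3))
```

The proportion for the height interval equals the two standard deviation proportion, because the result depends only on the number of standard deviations from the mean, not on the units of measurement.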

skewness: Indicator of symmetry.

Skewness is an indicator of the symmetry of a distribution of values. For a symmetric distribution, the right side of the distribution is a mirror image of the distribution’s left side. Negative skewness values indicate that the distribution tends to have a tail on the left side and positive skewness values indicate a tail on the right side.

kurtosis: Indicator of the peakedness.

Similar to skewness, kurtosis is an indicator of the shape of a distribution, such as the measured data values for a variable. Specifically, kurtosis indicates a distribution’s “peakedness” relative to the normal curve. A large value of kurtosis indicates that the values of the variable are more spread out so that the distribution has “fat” tails. A low value of kurtosis indicates that the values are concentrated around the mean, resulting in “skinny” tails.
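Base R has no built-in skewness or kurtosis function, so the sketch below defines two hypothetical helpers from one common moment-based definition. This is an assumption for illustration: statistical packages, possibly including lessR, may apply different small-sample adjustments and so report somewhat different values.

```r
# Moment-based skewness: third central moment over the cubed standard deviation
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared variance, minus 3,
# so that a normal distribution has a value of about 0
kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric  <- c(1, 2, 3, 4, 5)          # mirror image about 3
right_tail <- c(1, 2, 2, 3, 3, 3, 20)   # one large value forms a right tail

print(round(c(skew_symmetric = skewness(symmetric),
              skew_right     = skewness(right_tail),
              skew_left      = skewness(-right_tail),
              kurt_right     = kurtosis(right_tail)), 3))
```

Reversing the sign of the values moves the tail to the left side, so the skewness changes sign while the kurtosis stays the same.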

order statistic: Specify position in an ordered set of data values.

median: The value midway in an ordered distribution.

Order statistics are applicable to a wider range of distributions of data values than are parametric statistics. An order statistic specifies some characteristic of the position of a specific value within that distribution, which may or may not be an actual data value. Compute an order statistic only after the values of a variable have been sorted from the smallest value to the largest. The most well-known order statistic is the median, the value at the middle position of the sorted distribution, with as many values below it as above it.
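A short base R sketch of the median follows, using two hypothetical vectors. It illustrates that an order statistic may or may not be an actual data value.

```r
# Hypothetical unsorted data values
odd_n  <- c(7, 3, 9, 1, 5)        # five values
even_n <- c(7, 3, 9, 1, 5, 11)    # six values

sorted_odd  <- sort(odd_n)    # 1 3 5 7 9: the middle (third) value is 5
sorted_even <- sort(even_n)   # 1 3 5 7 9 11: the middle falls between 5 and 7

med_odd  <- median(odd_n)    # 5, an actual data value
med_even <- median(even_n)   # 6, the average of 5 and 7, not a data value

print(c(odd = med_odd, even = med_even))
```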

quartile: A value that separates the values of an ordered distribution into the first, second, third, or fourth quarter.

IQR: Interquartile range, difference between first and third quartiles.

outlier: A value far from most of the remaining data values.

To derive the median, split the sorted distribution into two parts with the same number of values in each part. Generalizing, the quartiles split the ordered distribution into four equal parts. The median is the second quartile in this context. The first quartile cuts off the bottom 25% of the distribution and the third quartile cuts off the bottom 75% of the distribution. The most common order statistic for expressing variability is the interquartile range or IQR. The IQR is the difference between the third and first quartiles of a distribution. That is, the IQR specifies the range of the middle 50% of the data values, the range that contains the median.

An outlier is a value considerably different from most remaining values of the distribution. There are many ways to more precisely define an outlier. The definition applied here is based on the concept of a box plot, more fully described in the next section. Outliers always should be identified for any variable because their values could represent a coding error. Or, more fundamentally, an outlier could represent the outcome of a process different from the process that generated all or most of the other values of the distribution. If so, then mixing all the values into a single analysis may be accurate numerically, but may not represent any process that actually exists in the real world.
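The quartiles, the IQR, and a box plot style outlier check can be sketched in base R as follows. The data are hypothetical. The 1.5 IQR rule used here is the usual box plot convention and is an assumption, because the text defers its own definition to the next section. Quartile values also depend on the interpolation method; quantile() uses its default method here.

```r
# Hypothetical salaries in thousands of USD, with one unusually large value
x <- c(41, 45, 48, 52, 55, 57, 60, 63, 66, 70, 125)

# First, second (median), and third quartiles
q  <- quantile(x, probs = c(0.25, 0.50, 0.75))
q1 <- unname(q[1])
q2 <- unname(q[2])
q3 <- unname(q[3])

# Interquartile range: third quartile minus first quartile
iqr <- q3 - q1

# Assumed box plot convention: values more than 1.5 IQRs beyond
# the first or third quartile are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

print(c(Q1 = q1, median = q2, Q3 = q3, IQR = iqr))
print(outliers)
```

The base R function IQR(x) returns the same interquartile range directly.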


5.3.2 Summary Statistics for a Single Numerical Variable

All analyses of a variable should include its basic summary statistics.

Scenario: Obtain summary statistics
Obtain the summary statistics of the variable Salary in the Employee data set, both parametric and non-parametric, that numerically describe key characteristics of the distribution of Salary across the 37 employees.

The primary lessR function for numerical summaries of a variable is SummaryStats. Summary statistics can also be obtained from Histogram and BarChart, but a direct call to SummaryStats by default provides more statistics, without the graphics. To invoke SummaryStats, abbreviated ss, follow the usual pattern for lessR functions, illustrated here for the variable Salary.

SummaryStats function: Calculate summary statistics.

lessR Input: Summary statistics
> SummaryStats(Salary)
or
> ss(Salary)

In this example, the variable labels have been included in the analysis, and so the variable label for Salary is displayed as part of the output. The result appears in Listing 5.2.

variable labels, Section 2.4.1, p. 46

--- Salary, Annual Salary (USD) ---
n miss

Listing 5.2 Parametric and order summary statistics for a continuous variable.

Alternatively, invoke the brief version of SummaryStats, the version referenced by the functions Histogram, BoxPlot, and the one-variable version of ScatterPlot. Listing 5.3 shows the abbreviated form of the function call. As can be seen from the output, the largest value in the distribution of data values is considered an outlier, apart from the remaining 36 values. None of the values are missing, so the value of Salary is present for each of the 37 employees. The mean is a little larger than the median, indicating that the distribution may have a tail on the upper side, exhibiting skew, a lack of symmetry, consistent with the positive value of skewness. The standard deviation is almost $22,000, so if the population from which the data were obtained is normal, roughly 95% of all the data values would be within two standard deviations, or a little less than $44,000, on either side of the mean of about $63,800.

> ss.brief(Salary)
--- Salary, Annual Salary (USD) ---

n miss

Listing 5.3 Output of the brief version of summary statistics.

5.3.3 Available Options for Summary Statistics

Variable specification. The primary variable for analysis is always the first value passed to SummaryStats. If there is a second variable, a categorical variable, then either place it in the second position, or specify it with the by option, usually still in the second position though not necessarily. If the specified variable is in the data frame mydata, then the name of the corresponding data frame need not be specified. Otherwise, specify the relevant data frame with data.

Other options. A less complete version of the summary statistics can be obtained by setting the option brief=TRUE, or with the abbreviation ss.brief. The brief version limits the display to the sample size information, the mean, standard deviation, minimum, median, and maximum values. The number of displayed decimal digits can be changed from the default value by setting digits.d. The n.cat option specifies the maximum number of unique values that a numeric variable can have and still be interpreted as a categorical variable.

digits.d function, Section 1.3.5, p. 14

n.cat option, Section 2.2.7, p. 39
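The statistics listed for the brief version can be approximated in base R, as in the following sketch. The helper brief_stats and the salary values are hypothetical. This is not the lessR implementation, and the values are not the Employee data.

```r
# Hypothetical helper: the statistics named for the brief summary, via base R
brief_stats <- function(x) {
  c(n      = sum(!is.na(x)),
    miss   = sum(is.na(x)),
    mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# Hypothetical salaries (USD) with one missing value and one large value
salary <- c(43000, 51500, 58200, 61000, 66750, 72400, NA, 128000)

stats <- brief_stats(salary)
print(round(stats, 1))
```

With these invented values the mean exceeds the median, the same pattern the text associates with a tail on the upper side of the distribution.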

5.3.4 Summary Statistics for All the Variables in a Data Frame

The most basic and usually first analysis of the variables in a data frame is to examine their distributions, both for continuous and categorical variables. The numeric summaries of a distribution are the summary statistics, which differ for continuous and categorical variables.

Scenario: Display the summary statistics for all variables
For all numerical variables in a data frame, provide statistics such as the mean, median, standard deviation, and others. For all categorical variables, provide the values and the frequency and proportion of occurrence for each value.

If the first value in the call to SummaryStats is a data frame, all the variables in the data frame are analyzed. If there is no value passed to SummaryStats, then the data frame mydata is assumed.

lessR Input: Summary statistics for all variables in a data frame
> SummaryStats()
or
> ss()


With this option, SummaryStats classifies each variable as either numerical or categorical. Then SummaryStats provides the appropriate summary statistics, either means, etc., or a table of the frequencies for each category. It is better, however, to declare all categorical variables in the analysis as R factors. If the variable is a factor, then SummaryStats always analyzes it as a categorical variable.

factor function, Section 1.6.3, p. 22

There is also the n.cat option that can be passed to SummaryStats to denote variables with only a few unique values as categorical, even though the data type is numerical. The n.cat option sets the definition of “a few”. Numerical variables with n.cat or fewer unique values are interpreted as categorical variables by SummaryStats. Here, define all numerical variables with 7 or fewer unique values in a data frame as categorical solely for the purpose of calculating the appropriate summary statistics.

n.cat option, Section 2.2.7, p. 39

lessR Input: Summary stats with n.cat option to define categorical variables
> SummaryStats(n.cat=7)
or
> ss(n.cat=7)

If a variable with numerical values is interpreted as categorical according to the definition of n.cat, SummaryStats displays a message regarding its interpretation. Listing 5.4 shows this message for the variable HealthPlan, which has three numeric values, each corresponding to a different health plan.

>>> Variable is numeric, but only has 3 <= n.cat = 7 levels, so treat as categorical. To obtain the numeric summary, decrease n.cat to indicate a lower number of unique values such as with function: set. Perhaps make this variable a factor with R factor function.

Listing 5.4 A numerical variable with a small number of categories will be treated as a categorical variable.

The same n.cat parameter also applies to BarChart and Histogram. The value of n.cat can also be applied to all subsequent function calls if set with the function set.

set function, Section 2.2.7, p. 39
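The classification rule described for n.cat can be mimicked in base R, as in this sketch. The helper treat_as_categorical and its argument n_cat are hypothetical names, and the data are invented. The code only illustrates the rule as stated in the text, not how lessR implements it.

```r
# Hypothetical helper mimicking the rule described for n.cat:
# factors and character variables are categorical, and a numeric variable is
# treated as categorical when it has n_cat or fewer unique non-missing values
treat_as_categorical <- function(x, n_cat) {
  if (is.factor(x) || is.character(x)) return(TRUE)
  is.numeric(x) && length(unique(x[!is.na(x)])) <= n_cat
}

# Hypothetical variables
health_plan <- c(1, 2, 3, 2, 1, 3, 3, 1)   # three numeric codes
salary <- c(43000, 51500, 58200, 61000, 66750, 72400, 81000, 128000)

plan_is_cat   <- treat_as_categorical(health_plan, n_cat = 7)          # 3 <= 7
salary_is_cat <- treat_as_categorical(salary, n_cat = 7)               # 8 > 7
factor_is_cat <- treat_as_categorical(factor(health_plan), n_cat = 7)  # factor

# For a categorical variable, summarize with frequencies and proportions
freq <- table(health_plan)
prop <- prop.table(freq)

print(c(plan = plan_is_cat, salary = salary_is_cat, factor = factor_is_cat))
print(freq)
print(round(prop, 3))
```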

5.3.5 Summary Statistics of a Numerical Variable by Categories

Sometimes the summary statistics of a numerical variable are of interest for each group defined by a categorical variable.

Scenario: Compute summary statistics for each value of a second variable
How does Salary vary by department? Display the summary statistics of Salary for each level of Dept.

For the statistics that summarize the sample, use the by option, usually for a categorical variable with relatively few unique values. The analysis of the summary statistics of Salary as they vary across the five different departments is shown in Listing 5.5. This version of the output is for the brief form of SummaryStats, specified by ss.brief, or by adding brief=TRUE in the call to SummaryStats.

lessR Input: Summary statistics (brief) for each level of a second variable
> ss.brief(Salary, by=Dept)

Salary, Annual Salary (USD) by Dept, Department Employed ------------------------------

max ACCT

n miss

Listing 5.5 Brief version of summary statistics of Salary for each level of Dept.
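A comparable table of statistics by group can be sketched in base R with split and sapply. The data frame emp, the helper brief, and all values are hypothetical. Only the level name ACCT appears in the listing, and the other department names are invented for illustration.

```r
# Hypothetical data frame; not the Employee data
emp <- data.frame(
  Salary = c(41000, 52000, 47500, 88000, 61000, 59500, 73000, 66000, 45000),
  Dept   = factor(c("ACCT", "ACCT", "ACCT", "SALE", "SALE", "SALE",
                    "MKTG", "MKTG", "MKTG"))
)

# Hypothetical helper: brief statistics for one group of values
brief <- function(x) {
  c(n      = sum(!is.na(x)),
    miss   = sum(is.na(x)),
    mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# Split Salary by Dept, summarize each group, and put one group per row
by_dept <- t(sapply(split(emp$Salary, emp$Dept), brief))

print(round(by_dept, 1))
```

Each row of by_dept corresponds to one level of Dept, in the alphabetical order of the factor levels.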

Now compare each of the summary statistics across the different groups, one group for each value of the categorical variable. Of course, this comparison applies only to this particular data set for the 37 employees. For the more relevant population means, do an inferential comparison of the means. Invoke a t-test of the mean difference for two group means and, for multiple group means such as in Listing 5.5, a one-way analysis of variance, or one-way ANOVA.

one-sample t-test, Section 6.2, p. 124

one-way ANOVA, Section 7.2, p. 150
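As a preview of the inferential comparisons, the following base R sketch applies t.test to two groups and aov to all groups. It uses base R functions rather than the lessR functions covered in the referenced sections, and the data frame emp is hypothetical, not the Employee data.

```r
# Hypothetical data frame; not the Employee data
emp <- data.frame(
  Salary = c(41000, 52000, 47500, 88000, 61000, 59500, 73000, 66000, 45000),
  Dept   = factor(c("ACCT", "ACCT", "ACCT", "SALE", "SALE", "SALE",
                    "MKTG", "MKTG", "MKTG"))
)

# Two groups: t-test of the mean difference (Welch version by default)
two <- droplevels(subset(emp, Dept %in% c("ACCT", "SALE")))
tt  <- t.test(Salary ~ Dept, data = two)

# More than two groups: one-way analysis of variance
fit       <- aov(Salary ~ Dept, data = emp)
anova_tab <- summary(fit)[[1]]

print(tt)
print(anova_tab)
```

With so few invented values per group the tests have little power, so the output illustrates only the form of the analysis, not a substantive result.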