4.2 One Categorical Variable

Consider first the description of the values of a categorical variable. To illustrate these analyses, we return to the Employee data set.

Employee data table, Figure 1.7, p. 21

> mydata <- Read("Employee", format="lessR")

The data set contains both numerical variables and categorical variables.

format="lessR" option, Section 2.3.5, p. 44

4.2.1 Describe the Sample with a Bar Chart and Statistics

The primary lessR function for the analysis of the sample values of a categorical variable is BarChart, abbreviated bc, which provides both a graph and a numerical analysis. The graph is a bar chart of the frequencies of the variable. The numerical analysis provides a table of these frequencies with the corresponding proportions and an accompanying inferential analysis in the form of a chi-square test. The BarChart function enhances the standard R function barplot for graphics and also accesses the table, addmargins, and chisq.test R functions for the provided statistics.

BarChart function: Create a bar chart and summary statistics.
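As a rough base R sketch of the pieces that BarChart draws on, the following calls barplot, table, addmargins, and chisq.test directly. It is not the lessR implementation. The vector dept is a hypothetical stand-in for the Dept variable, rebuilt from the counts listed in Section 4.2.4.

```r
# Rebuild a Dept-like vector from the counts reported in Section 4.2.4
# (dept is a hypothetical stand-in for the Dept variable)
dept <- rep(c("ACCT", "ADMN", "FINC", "MKTG", "SALE"),
            times = c(5, 6, 4, 6, 15))

freq <- table(dept)              # frequency of each category
freq_total <- addmargins(freq)   # same table with a Sum entry appended
test <- chisq.test(freq)         # goodness-of-fit test of equal probabilities

# Bar chart of the frequencies
barplot(freq, xlab = "Department Employed", ylab = "Frequency")

print(freq_total)
print(test)
```

BarChart gathers these separate steps into a single function call.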

The function BarChart invokes the lessR function SummaryStats for the summary statistics of a categorical variable. If the graph is not of interest, then obtain the BarChart numerical output directly with SummaryStats. To analyze the variable as categorical with SummaryStats, the variable’s data type must be either an R factor, the R data type for categorical variables, or a numerical variable with a small number of unique values as determined by the setting of n.cat.

SummaryStats function: Compute summary statistics.

n.cat option, Section 2.2.7, p. 39

Ideally define each categorical variable in the data frame as an R factor, either by default after reading non-numeric data values or with the factor function. The BarChart function, however, does not enforce this criterion, as it analyzes any variable as a categorical variable. So BarChart generally is only useful for the analysis of the values of a variable with relatively few unique values.

factor function, Section 3.3.2, p. 59
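A minimal base R sketch of the two kinds of variable described above follows. All data values are hypothetical. The count of unique values is the kind of quantity that a threshold such as n.cat would be compared against; how lessR performs that comparison internally is not shown here.

```r
# Character values become categorical once converted to a factor
dept <- factor(c("SALE", "ACCT", "SALE", "MKTG", "ACCT"))
is.factor(dept)    # TRUE
levels(dept)       # "ACCT" "MKTG" "SALE", alphabetical by default

# A numeric variable with few unique values can also be treated as categorical
rating <- c(1, 3, 2, 3, 3, 1, 2)      # hypothetical 3-point rating
n_unique <- length(unique(rating))    # number of distinct values
n_unique           # 3
```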

Scenario: Generate a bar chart for a categorical variable

For a variable with a relatively low number of unique data values, generate a bar chart to show the frequency of occurrence of each category, and also a corresponding frequency table.

Consider the number of people who work in each of the five areas of employment as indicated by the value of the variable Dept. The data values that record the department in which an employee works form a categorical variable, composed of discrete, non-numeric categories such as ACCT for Accounting and SALE for Sales.

Employee data table, Figure 1.7, p. 21

80 Categorical Variables

lessR Input: Bar chart

> BarChart(Dept)

or

> bc(Dept)

The resulting bar chart is shown in Figure 4.1. Because the variable labels were already read into R, the variable label for Dept by default appears as the title of the graph. As is true of almost all standard R graphics, including for lessR functions, a title can be manually created with the main option, such as main="My Title".

variable labels, Section 2.4, p. 45

[Bar chart: Frequency by Dept category, labeled "Department Employed"]

Figure 4.1 Default bar chart for a single categorical variable.

The statistics from BarChart in Listing 4.1 provide the frequencies for each of the categories, here the five categories of Dept. Also provided are the corresponding overall sample size and the proportions. By default R lists the order of the categories alphabetically. This ordering can be changed with the R factor function.

factor function, Section 3.3.2, p. 60
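To illustrate the reordering, here is a small base R sketch. The vector dept is a hypothetical stand-in for Dept, rebuilt from the counts in Section 4.2.4, and the custom order is chosen only for illustration.

```r
# Hypothetical stand-in for the Dept variable
dept <- rep(c("ACCT", "ADMN", "FINC", "MKTG", "SALE"),
            times = c(5, 6, 4, 6, 15))

# Default: levels are ordered alphabetically
default_order <- levels(factor(dept))

# Custom order set with the levels argument of factor (illustrative choice)
dept_f <- factor(dept, levels = c("SALE", "ADMN", "MKTG", "ACCT", "FINC"))
table(dept_f)      # frequencies now listed in the custom order
```

Any table or bar chart built from dept_f then follows the specified order.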

--- Dept, Department Employed ---

ACCT ADMN FINC MKTG SALE

Listing 4.1 Summary statistics from BarChart.

Far more people work in the Sales department, 15, than in any of the other four departments. The corresponding sample proportion is 0.417, so 41.7% of all the employees in this sample work in Sales.
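The proportions can be checked by hand from the counts given in Section 4.2.4. This is a base R sketch, and the object names are illustrative.

```r
# Counts for each department, as listed in Section 4.2.4
counts <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)

n <- sum(counts)                # number of employees with a recorded Dept
props <- round(counts / n, 3)   # sample proportion for each category
n                               # 36
props                           # SALE is 0.417, FINC is 0.111
```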


4.2.2 Generalize Beyond the Sample with Inferential Analysis

The output of BarChart includes the inferential analysis of the sample frequencies in terms of the chi-squared hypothesis test. The underlying motivation of the inferential analysis is the generalization of the sample results to the population as a whole from which that sample was obtained. In this sample more people are in Sales than any other category. Would this same qualitative pattern of differences likely occur in a second sample of 37 different employees from the same population of employees from which the first sample was obtained?

chi-square goodness-of-fit test: Inferential test of a relationship between the values of a categorical variable.

Maybe in another sample Marketing would be the most frequently occurring group instead of Sales. Particularly for small samples the results are notoriously unstable from sample to sample, but hopefully the same qualitative pattern would be obtained in a new sample. The inferential analysis addresses the question whether the results are a true reflection of the underlying population values. Do the results persist over repeated samples, or are they a sampling fluke observed in this sample, but not the next?

Here the analysis is a test of the assumption that the cell frequencies are consistent with those expected under the null hypothesis of equal probabilities of group membership. Even if the null hypothesis of no association is true, the sample frequencies and associated probabilities will likely not be equal to the exact values expected under the assumption of equal probabilities. Given this assumption, are the obtained frequencies and their corresponding sample probabilities, the proportions, a “reasonable” outcome or not? If this assumption of no association is true, then the sample cell frequencies should be reasonably close to their expected values, which here are based on equal population proportions.

null hypothesis of equal group probabilities: Membership in each group is equally likely.

In this particular sample the sample probability of randomly selecting an employee in Sales is 0.417, much larger than the sample probability of 0.111 for Finance. This large discrepancy lends some credibility to the belief that the population probabilities are not equal, and suggests that Sales has the most employees. The large differences between the obtained sample probabilities indicate that this outcome has a low probability given the assumption of the null hypothesis.

Scenario: Inferential analysis of the proportions of Departmental membership

In this particular sample, the department with the most membership is Sales, with a proportion of 41.7% of all employees. Does this pattern likely exist in the population? Or is this sample result a sampling fluke with the reality that the probabilities of employment in all the departments are equal?

Are the probabilities equal in the population as a whole? Simply looking at the sample proportions and observing how discrepant they are from each other, however, cannot answer this question. This assessment requires the application of formal probability theory in the form of a p-value, the probability of the outcome at least as extreme given a true null hypothesis. In this example the null hypothesis is the assumption of equal probabilities for membership in the five groups. The result from BarChart, or SummaryStats upon which BarChart relies, follows.

p-value: Probability of the result at least as extreme given a true null hypothesis.

Chi-squared test of null hypothesis of equal probabilities

Chisq = 10.94444, df = 4, p-value = 0.0272


If the sample proportions were all exactly equal to each other, then the chi-square value would be zero. Even if the true proportions are all equal, however, the sample proportions will generally not be equal. The issue is whether the chi-square statistic is so much larger than zero that the hypothesis of equal probabilities is no longer tenable. For this sample, the computed chi-square statistic is 10.944. What is the probability of obtaining a chi-square statistic as large as 10.944 or larger in this situation, assuming that the true probabilities are all equal?

probabilities: Evaluate if the probabilities of each category are equal.
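The chi-square statistic can be reproduced by hand from the counts in Section 4.2.4. The sketch below uses the standard goodness-of-fit formula, the sum over categories of (observed - expected)^2 / expected, with equal expected frequencies under the null hypothesis. The object names are illustrative.

```r
# Observed counts for the five departments (Section 4.2.4)
observed <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)

# Under equal probabilities each category expects n / k observations
k <- length(observed)
expected <- rep(sum(observed) / k, k)    # 36 / 5 = 7.2 for each category

# Chi-square statistic: squared deviations scaled by the expected counts
chisq <- sum((observed - expected)^2 / expected)
chisq                                    # 10.94444
```

If every observed count equaled 7.2, each term of the sum, and so the statistic, would be zero.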

The probability of the outcome, or a more deviant outcome, given the assumption of the null hypothesis is the p-value, here reported as 0.027. The accepted value for what defines a low probability, the alpha level or α, is commonly set at 0.05. Define any probability value below α = 0.05 as sufficiently unusual given the underlying assumption of the null hypothesis. The low probability of such an outcome, given the assumption, leads the analyst to conclude that the underlying assumption of the null hypothesis is likely not true.

alpha level: The definition of a low probability that specifies an unusual event.

Equal probabilities test: p-value = 0.027 < α = 0.05, reject H0
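The p-value follows from the upper tail of the chi-square distribution with 4 degrees of freedom. This base R sketch uses pchisq; the object names are illustrative.

```r
chisq <- 10.94444    # statistic reported in the output above
df    <- 4           # number of categories minus one
alpha <- 0.05        # conventional definition of a low probability

# Probability of a statistic this large or larger if the null is true
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
round(p_value, 4)    # 0.0272

# Decision rule: reject the null hypothesis when p-value < alpha
reject_null <- p_value < alpha
reject_null          # TRUE
```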

If the assumption of equal group probabilities is true, then the probability of a sample with group proportions so discrepant from equality as to yield a chi-square of 10.944 or larger is only 0.027. We conclude that the null hypothesis of equal population probabilities of departmental membership is likely false. Some departments have more employees than others. Informally we may conclude that Sales has the most employees because that pattern matches the sample result, but a precise evaluation of this latter statement requires yet another analysis beyond our scope.

Note that the p-value does not inform us as to the probability that the null hypothesis of equal population probabilities of group membership is true. Rather it informs us that if the population probabilities are equal, then the sample result is a low probability event. From this formal probability result we take the additional step and conclude that the null hypothesis is unlikely. The true population probabilities are unlikely to be equal. This conclusion, however, is qualitative with the word “unlikely” not precisely defined with a numerical probability even though the p-value is a precise quantitative result.

4.2.3 Available Options for the One-variable Bar Chart

The chart shown in Figure 4.1 is the default chart produced by BarChart, but there are many more possibilities provided by the available parameter options. These different options are described next, organized into groups according to their functionality.

Variable specifications. The variable plotted occupies the first position in the list of options. If the variable is in the data frame mydata, then the name of the corresponding data frame need not be specified. Otherwise, specify the relevant data frame with the data option.

data option: The name of the input data frame.

Colors. The default colors are from the current color theme, with a default of colors="blue" as specified by the set function. Colors of individual components of the graph can also be changed according to the col.fill, col.stroke, col.bg, and col.grid options.

color theme, Section 1.4.1, p. 16

color options, Table 1.1, p. 17

To color each bar individually, specify a list of colors instead of a single color, usually one color for each bar. Consider a bar chart for Gender with values Female and Male. Specify plum and tan, respectively, for these values with no background color.

> BarChart(Gender, col.fill=c("plum","tan"), col.bg="transparent")


Whenever multiple values are specified for a single reference, the values must be enclosed by the combine function, c. Non-numeric character constants are included within quotes.

c function, Section 1.3.6, p. 15

In addition to the color themes, there are three provided color palettes for the bar colors specified with the option colors. These palettes are "rainbow", "terrain", and "heat". Randomly choose a color from the specified palette with random.col=TRUE.

colors option: Three more available color palettes.
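Base R has palette functions with matching names: rainbow, terrain.colors, and heat.colors. That the lessR palettes correspond to these functions is an assumption here, not something the text states. The sketch below generates one color per bar from each palette and draws a single color at random, in the spirit of random.col.

```r
# One color per bar from each base R palette function
# (assumed, not confirmed, to underlie the lessR palette names)
n_bars <- 5
pal_rainbow <- rainbow(n_bars)
pal_terrain <- terrain.colors(n_bars)
pal_heat    <- heat.colors(n_bars)

# Randomly choose a single color from a palette
set.seed(1)                          # fixed seed only for reproducibility
one_color <- sample(pal_heat, 1)

# Counts from Section 4.2.4, with one palette color per bar
counts <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)
barplot(counts, col = pal_terrain)
```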

random.col option: Specify random colors.

General appearance. By default the bars of the bar graph are vertical. To display the bars as horizontal, set horiz=TRUE. The grid lines are by default behind the bars. To print the grid lines over the bars, set over.grid=TRUE. For a vertical bar graph, add more space between the top of the highest bar and the border by increasing the default value of addtop=1. To increase the default gap between the bars for a single variable plot, add a value to the default value of gap=.2. Frequencies are plotted by default. Set prop=TRUE to display proportions instead of counts.

horiz option: Orientation of graph.

addtop option: More vertical space.

gap option: Change the bar gap.

prop option: Specify proportions.

Producing a bar graph requires the work of many R functions upon which BarChart relies, particularly the R function barplot. Most of the parameter options for any of these constituent functions ultimately used to produce a graph can also be passed directly to BarChart. Some of the relevant options from the R graphics function par are col.main, col.axis, and col.lab to specify the colors of the title, axes, and axis labels, respectively. To control the size of the axis labels, use cex.axis and cex.names for the size of the axis names, with the magnification factor of 1 as the default. Also, the graph’s margins can be set with the mar option, as explained by entering ?par at the command prompt.

par function: R graphics settings.

Other options that can be passed to BarChart are from the base R function barplot. To fill the bars with shading lines instead of solid colors use density, specified in lines per inch. Use angle to change the angle from the default of 45 degrees.

barplot function: Standard R function that underlies BarChart.

Figure 4.2 presents an example of the same analysis as from Figure 4.1, but with some BarChart options activated. The use of main specifies a custom title.

main option: The title of a graph.

lessR Input: Bar chart with several options

> BarChart(Dept, horiz=TRUE, col.grid="transparent", density=18,
           main="Bar Chart of Employees in Each Department")
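For comparison, a roughly similar chart can be sketched with the base R barplot function alone, using the counts from Section 4.2.4. This only illustrates the barplot options named above (horiz, density, angle, main, cex.axis, cex.names). It does not reproduce the lessR styling, such as the grid or the color theme.

```r
# Counts for each department (Section 4.2.4)
counts <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)

# Horizontal bars filled with shading lines and a custom title
mids <- barplot(counts,
                horiz = TRUE,        # horizontal orientation
                density = 18,        # shading lines per inch
                angle = 45,          # angle of the shading lines
                main = "Bar Chart of Employees in Each Department",
                xlab = "Frequency",
                cex.axis = 0.9,      # size of the numeric axis labels
                cex.names = 0.9,     # size of the category names
                las = 1)             # category names printed horizontally
```

barplot returns the midpoints of the bars, stored here in mids, which can be used to add annotations.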

There are many available options to configure a specific graph. Find more examples at the end of the web page that results from ?BarChart. These examples can be manually copied from the web page and then pasted into the R console, or run automatically with example(BarChart). Presumably the default value for each of these options is reasonable in most situations, but regardless, re-specify as you wish, using the version of the bar graph in Figure 4.1 or Figure 4.2, or any version you can design from the available options.

example function: Run each example in the posted manual for the specified function.

4.2.4 A Bar Graph Directly from the Counts

In some situations the counts, frequencies, of each category are available, but not the original data.

Counts Manually Entered

First consider the situation in which the counts for each category are available, but manually entered into R.


[Horizontal bar chart titled "Bar Chart of Employees in Each Department": one bar per Dept category, labeled "Department Employed"]

Figure 4.2 Bar chart for a single categorical variable with some activated options.

Scenario: Generate a bar chart directly from the counts

The frequencies of occurrence of each of the values of the categorical variable Dept are available, but not the raw data from which the counts were obtained. Enter these counts into R and then generate the bar chart.

Construct a bar graph with BarChart from the frequencies of occurrence of each value, such as from the following frequencies.

              ACCT  ADMN  FINC  MKTG  SALE
Frequencies:     5     6     4     6    15

To construct the bar chart, define two different vectors. The first vector contains the counts. As is always true in R, specify a list of multiple values wrapped within the combine function c. The second vector contains the names of each category, and uses the R names function to associate the category names with the category counts. Then call BarChart to graph the counts, which results in the same graph as shown in Figure 4.1.

c function, Section 1.3.6, p. 15

names function: Provide names for the values of a vector or columns of a data frame.


lessR Input: Bar chart of one variable directly from counts

> Counts <- c(5, 6, 4, 6, 15)
> names(Counts) <- c("ACCT", "ADMN", "FINC", "MKTG", "SALE")
> BarChart(Counts, xlab=label(Dept))

The lessR function label retrieves the variable label and here sets it equal to the label for the horizontal axis. If variable labels do not exist, or to specify another axis label, include the desired character string in quotes for the xlab argument.

variable labels, Section 2.4.1, p. 46

label function: Manually retrieve a variable label.
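The role of the names function can be seen without lessR. In this base R sketch the axis label is typed in directly as a character string, and barplot stands in for BarChart.

```r
# Counts entered by hand, then named by category
Counts <- c(5, 6, 4, 6, 15)
names(Counts) <- c("ACCT", "ADMN", "FINC", "MKTG", "SALE")

Counts["SALE"]    # a named element can be looked up by its category

# The names become the bar labels; the axis label is given as a string
barplot(Counts, xlab = "Department Employed", ylab = "Frequency")
```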

Counts Read Directly from a File

The counts can also be placed into a file and then directly read into R as a standard data file. Consider the data table in Listing 4.2 with two variables, Dept and Count. This information was read into R with the usual Read function.

> mydata
  Dept Count
1 ACCT     5
2 ADMN     6
3 FINC     4
4 MKTG     6
5 SALE    15

Listing 4.2 Counts for each category as read directly from a csv file.

To inform BarChart that the values read are counts, invoke the count.levels option, set equal to the name of the categorical variable to which the counts pertain.

count.levels option: The name of the variable for which the listed counts pertain.

lessR Input: Generate a bar chart directly from counts read as data

> BarChart(Count, count.levels=Dept)

The figure generated by this R input is identical to the bar graph generated from the data values in Figure 4.1 .
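A base R analogue of this workflow is sketched below. The csv contents are supplied inline as text so that the example is self-contained; the values repeat the counts from Section 4.2.4, and the object names are hypothetical. The Count column becomes the bar heights and the Dept column supplies the category names.

```r
# Inline stand-in for a csv file of counts
csv_text <- "Dept,Count
ACCT,5
ADMN,6
FINC,4
MKTG,6
SALE,15"

counts_df <- read.csv(text = csv_text)

# Pair each count with its category name
counts <- setNames(counts_df$Count, as.character(counts_df$Dept))

barplot(counts, xlab = "Department Employed", ylab = "Frequency")
```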

4.2.5 Describe the Sample with a Pie Chart and Statistics

An alternative to the bar chart of one variable is the pie chart in which each slice of the pie represents a frequency of occurrence. In general the bar chart is considered easier to read than the pie chart, but the pie chart remains popular.

pie chart: Frequency plot of the values of a categorical variable according to the size of the slices of a pie.


Scenario: Generate a pie chart

Generate a pie chart from the data, the recorded values of the categorical variable Dept, the department at which an employee works.

The lessR function for the pie chart is PieChart, with abbreviation pc.

PieChart function: Generate a pie chart and summary statistics.

lessR Input: Pie chart

> PieChart(Dept)

or

> pc(Dept)

The default lessR pie chart in gray scale is shown in Figure 4.3 for the categorical variable Dept.

[Pie chart titled "Department Employed": one slice per Dept category]

Figure 4.3 Gray scale pie chart for categorical variable Dept.
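A comparable gray scale pie chart can be sketched with the base R pie function and the counts from Section 4.2.4. This is an approximation for illustration, not the lessR rendering. The share of each slice is also computed, because slice sizes are harder to judge by eye than bar heights.

```r
# Counts for each department (Section 4.2.4)
counts <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)

# One gray shade per slice
pie(counts, col = gray.colors(length(counts)), main = "Department Employed")

# Percentage of the pie that each slice occupies
slice_share <- round(100 * counts / sum(counts), 1)
slice_share       # SALE occupies 41.7 percent
```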

4.2.6 Available Options for the Pie Chart

Different options for the PieChart function are described next, organized into groups according to their functionality.

Variable specifications. The variable plotted occupies the first position in the list of options. If the variable is in the data frame mydata, then the name of the corresponding data frame need not be specified. Otherwise, specify the relevant data frame with data.

Colors. The usual color themes do not apply to PieChart, except for the gray scale defined by colors="gray". There are also three more color palettes defined by R. To access these palettes set colors in the call to PieChart to either "rainbow", "heat", or "terrain". Note that these palettes are not part of the system setting level from set, but instead are specific to only BarChart and PieChart.

set function for color themes, Section 1.4.1, p. 16


Use col.fill to color each individual slice with its own color. Consider a pie chart for Gender with values Female and Male. In this example, specify salmon3 and seashell3, respectively, for these values.

> PieChart(Gender, col.fill=c("salmon3","seashell3"))

To specify multiple values for a single reference, present the values with the combine function, c. Include non-numeric character constants within quotes.

c function, Section 1.3.6, p. 15