4.4 Onward to the Third Dimension

The lessR functions BarChart and SummaryStats process one or two variables. Generating a cross-tabulation table of the joint frequencies of three variables requires moving directly to the R table function upon which BarChart relies.

table function: Generate a cross-tabulation table.

4.4.1 Example 1: Employee Data Set

To illustrate, consider the three categorical variables in the employee data set: Dept, Satisfaction, and Gender.

Employee data table, Figure 1.7, p. 21

Scenario Analyze the relation of three categorical variables

What is the relationship of the variables Department of employment, Satisfaction, and Gender in the Employee data set?

First read the data into the R data frame mydata.

> mydata <- Read("Employee", format="lessR")

Because R does not directly reference variables in the data table by their names, use the R with function to indicate the name of the data frame so that the data frame name only has to be entered once.

lessR Input Create a three-way cross-tabulation table

> with(mydata, table(Dept, Satisfaction, Gender))

94 Categorical Variables

with function: Identify the data frame for standard R functions.

The with function allows the variable names to be entered directly, which becomes convenient when multiple variables are referenced in a single function call. The result is Listing 4.8.

[Listing 4.8 output: a table of Dept by Satisfaction (low, med, high) for Gender = F, followed by the same table for Gender = M.]

Listing 4.8 Cross-tabulation table for three categorical variables.

Listing 4.8 presents a three-way cross-tabulation table to display information from three dimensions with two-dimensional tables. Each table of joint frequencies displays the variables Dept and Satisfaction at one level of Gender. The particular form of the resulting tables depends on the entered order of the variables in the call to the table function, here with Gender entered last.

three-way cross-tabulation table: Joint frequencies of three categorical variables.
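As a sketch of how the order of the variables shapes the output, the following uses R's built-in mtcars data as a stand-in for the Employee data, an assumption made only so that the example runs without a data file. The variables cyl, gear, and am play the roles of the three categorical variables. The base R function ftable, not used in the text, flattens the same counts into a single two-dimensional display.

```r
# Stand-in data: mtcars ships with R. Its variables cyl, gear, and am
# replace Dept, Satisfaction, and Gender for illustration only.

# The last variable listed (am) defines the separate two-way tables
tab1 <- with(mtcars, table(cyl, gear, am))
print(tab1)

# Same counts, but now cyl is last, so there is one am-by-gear
# table for each level of cyl
tab2 <- with(mtcars, table(am, gear, cyl))
print(tab2)

# Without with(), the data frame name is repeated for every variable
tab3 <- table(mtcars$cyl, mtcars$gear, mtcars$am)

# ftable() flattens the three-way table into one two-dimensional display
print(ftable(tab1))
```

Reordering the variables changes only the layout of the tables, not the joint frequencies themselves.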

At least in this sample we see from the first table of joint frequencies that women tend to have medium to high Satisfaction. Compare the two tables to observe that women are more prominent in the Marketing department than men. Men are concentrated in the Sales department with a deep level of dissatisfaction. Women also are well represented in the Sales department, albeit with a tendency to be more satisfied.

We can also view this information graphically. A graphical analysis of three categorical variables is the mosaic chart, an extension of the stacked bar chart for two variables to the multidimensional equivalent of three or more dimensions. The mosaic chart breaks up a square into regions with the area of each region proportional to a frequency, either from a cell or a margin of a joint frequency table. The stacked bar chart for two variables essentially accomplishes this division of each bar in the bar chart. The mosaic chart extends this pattern to more variables.

mosaic chart: Plot of joint frequencies based on rectangular regions.

Although R contains a function for producing mosaic plots called mosaicplot, an improved version exists called mosaic in the vcd package. The abbreviation vcd is for Visualizing Categorical Data. Because vcd is a contributed package, it must first be installed with the install.packages function. Before accessing the mosaic function, load the package with the library function.

mosaic function: Produce a mosaic chart.

install.packages function, Section 1.2.3, p. 5

library function, Section 1.3.1, p. 7

The mosaic function generates the mosaic chart for Dept, Satisfaction, and Gender that corresponds to the three-way joint frequencies illustrated in Listing 4.8. To specify the variables, begin the variable list with a tilde, ~, found in the upper left-hand corner of the standard keyboard, followed by the three variable names separated by plus signs. To highlight cells of the resulting graph that indicate a potential relationship, turn on the shade option. Specify the data frame with the data parameter, here set to mydata.

lessR Input Mosaic three-way association plot

> library(vcd)
> mosaic(~ Dept + Satisfaction + Gender, shade=TRUE, data=mydata)
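For readers without the Employee data at hand, the same kind of call can be tried on a data set that ships with R. The sketch below assumes the built-in three-way table HairEyeColor (variables Hair, Eye, and Sex) as a stand-in. It first draws the chart with the base R function mosaicplot and then, only if the vcd package happens to be installed, with mosaic.

```r
# Stand-in data: HairEyeColor is a three-way table (Hair, Eye, Sex)
# included with R. It replaces the Employee variables for illustration only.
tab <- HairEyeColor

# Base R version: mosaicplot() with shading based on residuals
mosaicplot(~ Hair + Eye + Sex, data = tab, shade = TRUE,
           main = "Base R mosaic plot")

# vcd version, run only if the package is installed
if (requireNamespace("vcd", quietly = TRUE)) {
  vcd::mosaic(~ Hair + Eye + Sex, data = tab, shade = TRUE)
}
```

Both functions accept a one-sided formula with a table as the data argument, so the formula notation carries over unchanged.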


[Figure 4.5 image: mosaic chart with a legend of Pearson residuals and the p-value.]

Figure 4.5 Default mosaic chart for three categorical variables.

The variable Dept is on the left side of the graph, Satisfaction on top, and Gender on the right side. The darker areas represent larger deviations from the expectation of the null hypothesis that there is no relation among the three variables plotted. The mosaic chart reveals that the largest deviations from the null are the low Satisfaction for men in the Sales department and the high Satisfaction of women in Marketing.
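The deviations from the null hypothesis can also be computed directly. The following is a minimal sketch, again assuming the built-in HairEyeColor table as a stand-in for the Employee data. It takes the null hypothesis to be mutual independence of the three variables, computes the Pearson residual of each cell, and uses the base R summary function, which for a table reports a chi-square test of independence of all factors. Whether this matches the exact computation behind the shaded chart is an assumption.

```r
# Stand-in data: built-in three-way table of Hair, Eye, and Sex
tab <- HairEyeColor
n <- sum(tab)

# Expected counts if the three variables were mutually independent:
# n times the product of the three marginal proportions
p_hair <- margin.table(tab, 1) / n
p_eye  <- margin.table(tab, 2) / n
p_sex  <- margin.table(tab, 3) / n
expected <- n * outer(outer(p_hair, p_eye), p_sex)

# Pearson residual for each cell: large absolute values mark the
# cells that depart most from independence
pearson <- (tab - expected) / sqrt(expected)
print(round(pearson, 2))

# summary() of a table runs the chi-square test of independence
test <- summary(tab)
print(test)

# Decision rule: reject the null hypothesis if p-value < alpha
alpha <- 0.05
reject_null <- test$p.value < alpha
print(reject_null)
```

The sum of the squared Pearson residuals is the chi-square statistic, which ties the shading of individual cells to the overall test.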

The graph also presents the p-value for the null hypothesis of no relationship.

test of no relationship: p-value = 0.0103 < α = 0.05, so reject H0

If the null hypothesis is true, the low p-value indicates that an unusual event occurred. So conclude that the null hypothesis is likely not true, and that the variables are related as discussed.

4.4.2 Example 2: Survivors of the Titanic

The next use of the mosaic function is our first use of an R formula, which provides a means for specifying a model, a functional relationship among variables. The expression of a model includes a variable of interest called the response variable, or outcome variable or dependent variable. This variable is explained in terms of one or more other variables called explanatory variables or predictor variables.

R formula: An R expression that specifies a model.

response variable: The variable in a model that is explained by the remaining variables.

predictor variable: A variable used to predict or explain the response variable.


Scenario Analyze survivors on the Titanic by Class of travel and Sex

Specify a model to explain one variable, Survived, with values Yes or No, in terms of two other categorical variables, the Class in which the person was traveling, 1st, 2nd, 3rd, or Crew, and his or her Sex, Male or Female.

The general form of a formula follows, here illustrated for two explanatory variables, where Y is the response variable of interest and X1 and X2 are the explanatory variables.

Y ~ X1 + X2
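A formula is an ordinary R object that can be created and inspected without any data. The sketch below uses the placeholder names Y, X1, and X2 from the general form, together with a one-sided formula of the kind used for the mosaic chart in the previous example. The object names two_sided and one_sided are introduced here only for illustration.

```r
# Two-sided formula: the response variable is to the left of the tilde
two_sided <- Y ~ X1 + X2

# One-sided formula: no response variable
one_sided <- ~ X1 + X2

# Both are objects of class "formula"
print(class(two_sided))

# Names of the variables that appear in each formula
vars_two <- all.vars(two_sided)
vars_one <- all.vars(one_sided)
print(vars_two)
print(vars_one)

# terms() records whether a response is present: 1 if so, 0 if not
has_response_two <- attr(terms(two_sided), "response")
has_response_one <- attr(terms(one_sided), "response")
print(c(two_sided = has_response_two, one_sided = has_response_one))
```

Functions that accept a formula can use this information to decide whether one variable is to be explained by the others.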

Models usually have anywhere from one to five or six explanatory variables. The model from the previous example had no response variable, which is why the formula began with the tilde with no variable in front of it. This lack of a response variable implies that the purpose of the analysis is not to explain the values of one variable in terms of others, but just to visualize the resulting joint frequencies. Place a variable in front of the tilde to define a response variable, in which case the rectangles in the mosaic chart reflect the corresponding frequencies of this response variable.

In this example we look at the different classifications of the survivors of the RMS Titanic, the grand passenger ship that struck an iceberg on its maiden voyage on April 14, 1912, and sank in the North Atlantic. The data, in the form of cross-tabulation tables, are available as the data set called Titanic in the R datasets package, which is automatically loaded into memory when an R session starts. To view the variables and their levels as shown in Table 4.2, reference the corresponding help file for Titanic, that is, enter ?Titanic.

Table 4.2 Categorical variables and their values for the Titanic data.

No  Name      Levels
1   Class     1st, 2nd, 3rd, Crew
2   Sex       Male, Female
3   Age       Child, Adult
4   Survived  No, Yes
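The same information can be read from the data set itself. As a brief sketch, the following lists the variables and levels of the built-in Titanic table and collapses it over Age to obtain the three-way table of Class, Sex, and Survived. The object names n_total and class_sex_surv are chosen here for illustration.

```r
# Titanic is a four-way table of counts in the datasets package
print(dimnames(Titanic))

# Total number of people counted in the table
n_total <- sum(Titanic)
print(n_total)

# Collapse over Age: keep dimensions 1 (Class), 2 (Sex), and 4 (Survived)
class_sex_surv <- margin.table(Titanic, c(1, 2, 4))

# ftable() shows the three-way table in one two-dimensional display
print(ftable(class_sex_surv))
```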

The goal is to account for, or explain, who survived on the basis of their Class of travel and their Sex, that is, Gender. To implement the formula for this analysis, place the variable Survived to the left of the tilde. Class and Sex are the explanatory variables. In this situation the mosaic plot shades each cell according to the proportion who survived and the proportion who did not. We use a dark gray, gray42, to represent those who survived and a lighter gray, gray85, for those who did not. The values of the variable Survived, No and Yes, are ordered alphabetically, so the first color listed is for those who did not survive.


lessR Input Mosaic plot for relating one variable in terms of two others

> library(vcd)
> mosaic(Survived ~ Class + Sex, highlighting_fill=c("gray85","gray42"), data=Titanic)

The mosaic plot, the graphical version of the cross-tabulation table, is given in Figure 4.6 .

[Figure 4.6 image: mosaic chart of Survived by Class, including Crew, and Sex.]

Figure 4.6 Mosaic chart for survival on the Titanic such that the darker gray indicates the proportion of the survivors.

One characteristic of the data revealed by the graph is the length of the top edge of the boxes for women in 1st, 2nd, and 3rd class. There were proportionally more women in 1st class than in 3rd class, a pattern necessarily reversed for the men. Almost all of the women in 1st class, and most in 2nd class, survived, but not so for the women traveling in 3rd class. Regardless of the class of travel, more men died than survived. Still, traveling 1st class was an advantage because the proportion of men who did not survive in 1st class was smaller. Only a small percentage of the crew were women, but unlike their much more numerous male counterparts, most survived.

3-way cross-tabulation table, Listing 4.8, p. 94

The mosaic plot is an excellent visualization tool for viewing the relationship among multiple categorical variables, particularly compared to the alternative of staring at the numbers in, for example, a three-way cross-tabulation table.
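The proportions read from the chart can be checked numerically. The following sketch uses the base R functions margin.table and prop.table on the built-in Titanic table. The object names tab, surv, and women_share are chosen here for illustration.

```r
# Three-way table of Class, Sex, and Survived, collapsed over Age
tab <- margin.table(Titanic, c(1, 2, 4))

# Proportion who did and did not survive within each
# Class-by-Sex group (each group sums to 1 over Survived)
surv <- prop.table(tab, margin = c(1, 2))
print(round(surv[, , "Yes"], 2))

# Proportion of women among those aboard in each class
class_sex <- margin.table(Titanic, c(1, 2))
women_share <- prop.table(class_sex, margin = 1)[, "Female"]
print(round(women_share, 2))
```

The first display corresponds to the dark gray share of each rectangle in the chart and the second to the relative widths of the rectangles for women.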


Worked Problems

The psychology department at a public university was interested in understanding more about their in-state and out-of-state students. The origin of their current students was classified as in-state, out-of-state-USA, and international. Also available was each student’s gender and choice of major of either psychology as a social science or as a biological science.

The data are available at http://lessRstats.com/data/psych.csv

1 One categorical variable.
(a) Show with statistics and a bar chart how many students fit each of the three classifications regarding origin.
(b) Do the chi-square test of equal proportions. What is the conclusion?

2 Two categorical variables.
(a) Is there a relation between gender and origin of student? Show with descriptive statistics and a bar chart.
(b) Analyze with the chi-square test of independence. What is the conclusion?

3 Three categorical variables. Is there a relation between gender, origin of student, and choice of major? Show with descriptive statistics and a bar chart.