Two Categorical Variables

4.3 Two Categorical Variables

4.3.1 The Bar Chart and Joint Frequencies

The BarChart function for two categorical variables calculates and displays the associated joint frequencies and then constructs the bar chart from those frequencies.

Scenario Generate a bar chart for two variables
To show the relation between two categorical variables, generate a bar chart for the department of employment and job satisfaction.

The primary variable plotted is always the first value passed to BarChart. If there is a second variable to plot, then either place it in the second position, or specify it with the by option, usually still in the second position although not necessarily. The levels of each categorical variable are listed in alphabetical order by default. For Gender, that means that the values are listed as Female and then Male. The values of Satisfaction, however, are ordered and so Satisfaction should be defined as an ordinal variable with data values ordered from low to med to high. As this ordering is not alphabetical, employ the functions Transform and factor to specify the desired ordering and also to define Satisfaction as ordinal.

by option: Specify a second variable.

order levels of a factor, Section 3.3.2, p. 60

lessR Input Bar chart for two categorical variables with ordered levels
> mydata <- Transform(Satisfaction=factor(Satisfaction,
      levels=c("low","med","high"), ordered=TRUE))
> BarChart(Gender, by=Satisfaction)
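The effect of the factor specification can be checked in base R alone, without lessR. The following is a minimal sketch on a small made-up vector; the values in satisfaction are hypothetical and are not the employee data. It contrasts the default alphabetical ordering of the levels with the ordering requested by levels and ordered=TRUE.

```r
# Hypothetical responses, for illustration only (not the employee data)
satisfaction <- c("med", "low", "high", "low", "med")

# Default: the levels of a factor are sorted alphabetically
default_levels <- levels(factor(satisfaction))

# Explicit levels plus ordered=TRUE define an ordinal variable
sat_ordered <- factor(satisfaction,
                      levels = c("low", "med", "high"),
                      ordered = TRUE)

print(default_levels)            # alphabetical order
print(levels(sat_ordered))       # order as specified
print(is.ordered(sat_ordered))   # TRUE for an ordinal factor
```

Because sat_ordered is ordinal, comparisons such as sat_ordered[2] < sat_ordered[1] (low versus med) are meaningful, which is not the case for an unordered factor.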

Figure 4.4 gives an example of a two-variable bar graph. Because variable labels are present, they appear instead of the shorter variable names. Some of the text output of BarChart follows. First reported are the variable names in the analysis and, if present, the accompanying variable labels.

Gender, Male or Female by Satisfaction, Degree of Satisfaction with Work Environment

88 Categorical Variables

Figure 4.4 Default bar chart for two categorical variables, one variable with ordered values. (Horizontal axis: Male or Female, with values F and M; legend: Satisfaction with Work.)

The focus of analysis for the relation of two categorical variables is the joint frequency, the count of how many times two specific values in the same row of the data file, one for each of the variables, occur together. All of the joint frequencies for each pair of values are presented in the cross-tabulation table, the table of the joint frequencies of the values of two or more categorical variables, shown in Listing 4.3.

joint frequency: Frequency of occurrence of the combination of two values of a categorical variable.

cross-tabulation table: Table of joint frequencies.

Joint and Marginal Frequencies
------------------------------

              Gender
Satisfaction   F   M  Sum
        low    2  11   13
        med    7   4   11
        high   8   3   11
        Sum   17  18   35

Listing 4.3 Cross-tabulation table, the joint and marginal frequencies from BarChart.

For example, two women reported low Satisfaction, but 11 men reported the same low Satisfaction.

The cross-tabulation table from BarChart also contains the marginal frequencies. These marginal frequencies are the frequencies of each of the two variables considered in isolation from the other. In Listing 4.3 the marginal frequencies appear in the row or column labeled Sum.

marginal frequency: Row or column sum of a cross-tabulation table.

The number in the bottom right corner of Listing 4.3, 35, is the sum of the row sums and, equivalently, the sum of the column sums, the grand total. It is the total number of cases in the entire sample that have data values recorded for both specified variables. In this data set there are data for 37 employees, so two employees have at least one data value missing for either Gender or Satisfaction or both.

grand total: The total number of observations in the table.
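The marginal frequencies and the grand total can be reproduced from the six joint frequencies with base R. This is only a sketch: the name counts and the use of addmargins are choices made for this illustration, not part of the BarChart output.

```r
# Joint frequencies (rows: Satisfaction, columns: Gender)
counts <- matrix(c(2, 11,
                   7,  4,
                   8,  3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Satisfaction = c("low", "med", "high"),
                                 Gender = c("F", "M")))

row_sums <- rowSums(counts)    # marginal frequencies of Satisfaction
col_sums <- colSums(counts)    # marginal frequencies of Gender
grand_total <- sum(counts)     # cases with both values recorded

# addmargins appends the Sum row and the Sum column to the counts
with_margins <- addmargins(counts)
print(with_margins)

# Employees with at least one of the two values missing, out of 37
n_missing <- 37 - grand_total
print(n_missing)
```

The grand total is the same whether the row sums or the column sums are added, which is why a single number occupies the bottom right corner of the table.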

4.3.2 Generalize beyond the Sample with Inferential Analysis

As with the inferential analysis for one categorical variable, the inferential analysis for two categorical variables is also based on the chi-square statistic. For two categorical variables the test evaluates their independence according to the null hypothesis that the variables are unrelated, or independent. Is the tendency to be classified in a particular category of the first variable related to a specific classification in the second variable?

independent events: Occurrence of one event unrelated to the probability of the occurrence of another.

chi-square test of independence: Test to evaluate if two categorical variables are related.

Scenario Evaluate the relationship between two categorical variables
Gender has two categories, Male and Female. Do Males tend to be more or less satisfied with their job at this company than Females? Satisfaction here is assessed with responses of 1, 2, and 3, so both variables are categorical.

The test is of the relation between Gender and Satisfaction. If the variables are unrelated, that is, independent, then knowing a person’s Gender conveys no information regarding his or her perceived level of Satisfaction. If the variables are related, then one of the Genders has a stronger tendency to be Satisfied than the other Gender.

Before performing the inferential analysis of the hypothesis test, first look at the sample results. As seen from the bar graph and from the table, in this sample at least, men tended more to be dissatisfied and women tended toward more satisfaction. As always with inferential analysis, the question regards the extent to which the relationship observed in the sample generalizes to the population as a whole from which the sample was obtained.

The BarChart function for two variables also yields the chi-square test for independence. The chi-square test of independence is an example of an inferential test. This inferential analysis appears in Listing 4.4 .

Chi-square Analysis
-------------------
Number of cases in analysis: 35
Number of variables: 2

Test of independence:
Chisq = 9.300699, df = 2, p-value = 0.0096

Listing 4.4 Inferential chi-square test of the null hypothesis of no relation between Gender and Satisfaction.

null hypothesis: Assumption of no relationship between the variables.
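To see where the chi-square statistic comes from, it can be computed by hand from the joint frequencies. The sketch below uses only base R and the standard formula for the test of independence; the variable names are chosen for this illustration, and this is not a description of how BarChart computes the value internally.

```r
# Joint frequencies (rows: Satisfaction, columns: Gender)
counts <- matrix(c(2, 11, 7, 4, 8, 3), nrow = 3, byrow = TRUE,
                 dimnames = list(Satisfaction = c("low", "med", "high"),
                                 Gender = c("F", "M")))

n <- sum(counts)

# Expected count under independence: (row sum * column sum) / n
expected <- outer(rowSums(counts), colSums(counts)) / n

# Chi-square statistic: squared deviations scaled by the expected counts
chisq <- sum((counts - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

print(round(expected, 2))
print(chisq)
print(df)
```

If the observed counts exactly equaled the expected counts, every term in the sum would be zero, so the size of the statistic reflects how far the sample departs from independence.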

In this situation, the obtained chi-square statistic is 9.30. The assumption on which the test is based is the null hypothesis, denoted as H0. If the cell proportions reflect no relationship between the two variables, then the chi-square statistic would be exactly zero. In real data, however, there is sampling error. Even when the null hypothesis of independence is true, the chi-square statistic is virtually always larger than zero. The question is how much larger than zero is reasonable if the null hypothesis of no relationship is true.

sampling error: Impact of the randomness inherent in any one sample.

Assess how large the chi-square statistic is in terms of the probability of getting a chi-square value as large or larger than what was obtained in this sample, assuming that there is no relation between Gender and Satisfaction. The p-value provides the answer. The p-value reflects the probability of what was obtained, which is then compared to the usual cutoff value that defines a low probability, the alpha level, α = 0.05.

p-value: If the null hypothesis is true, the probability of obtaining a result as deviant or more than the obtained result.

alpha level function, Section 4.2.2, p. 82

If the p-value is larger than α, then the probability is sufficiently high that the obtained result is considered consistent with the null hypothesis. If the p-value is smaller than α, an unlikely event occurred assuming that no relation actually exists, so reject the hypothesis of no relation as implausible.

test of no relationship: p-value = 0.0096 < α = 0.05, so reject H0
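The decision rule can be sketched in base R. The p-value is the upper-tail probability of the chi-square distribution with 2 degrees of freedom, here obtained with pchisq from the reported statistic and then cross-checked with base R's chisq.test. The names p_value and reject_h0 are illustrative only.

```r
chisq <- 9.300699   # statistic reported in Listing 4.4
df <- 2
alpha <- 0.05

# Probability of a chi-square value at least this large if H0 is true
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
reject_h0 <- p_value < alpha

print(round(p_value, 4))
print(reject_h0)

# Cross-check with base R's test applied to the joint frequencies
counts <- matrix(c(2, 11, 7, 4, 8, 3), nrow = 3, byrow = TRUE)
result <- chisq.test(counts)
print(result$p.value)
```

The comparison p_value < alpha is the entire decision rule; what it does not provide is the probability that the null hypothesis itself is true.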

If the null hypothesis of no relation between Gender and Satisfaction is true, then the low p-value of 0.0096 indicates that an unlikely event occurred. We conclude that the variables are related such that men are more dissatisfied with this work environment than women, not just in this sample, but in the population as a whole. The sample results generalize. Note, however, that although the p-value is precisely computed, it is a conditional probability in that it specifies the probability of the results if the null hypothesis is true. From the low p-value we conclude that the variables are likely related, but the probability of this relationship is not known. The p-value is quantitative, but the conclusion that the null is unlikely is qualitative. We do not know the probability that the null hypothesis of no relation is true.

4.3.3 Available Options for the Two-variable Bar Chart

General appearance. In addition to the option for displaying the bars horizontally, the bars at each level of the first variable may be displayed side by side instead of the default of stacked on top of each other. To display the bars side by side, set beside=TRUE.

Colors. Set the color of the bars individually with col.fill. Randomly choose a color from the specified palette with random.col=TRUE. With two variables, the usual color theme from the set function does not apply, but R defines three color palettes, which for BarChart only apply to two-variable bar charts. Access these palettes by setting colors as a BarChart option to "rainbow", "heat", or "terrain". The most vivid color palettes are rainbow and heat.

Legend. A two-variable graph generates a legend, which indicates the color of the bars of the by variable when plotted at each level of the first variable. By default the legend appears to the right of the main graph. Change the location of the legend to somewhere on the graph itself with legend.loc, which can assume one of the following values: "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", and "center". To change to a horizontal orientation, set legend.horiz=TRUE. By default the labels in the legend are the values of the by variable. To specify custom values provide a list of values to labels.legend. A list must always be specified with the combine function c, such as labels.legend=c("Label 1", "Label 2").

c function, Section 1.3.6, p. 15

Text output. The cell frequency can be divided by the total sample size or the corresponding row or column total. To obtain the corresponding three tables of proportions, set brief=FALSE.
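For readers who want to experiment with the same ideas outside lessR, the appearance, color, and legend choices have rough analogues in the base R barplot function. The sketch below is base R only and is not BarChart: its argument names (beside, col, legend.text, args.legend) belong to barplot and differ from the lessR option names described above. The palettes come from the base R functions heat.colors and terrain.colors.

```r
# Joint frequencies (rows: Satisfaction, columns: Gender)
counts <- matrix(c(2, 11, 7, 4, 8, 3), nrow = 3, byrow = TRUE,
                 dimnames = list(Satisfaction = c("low", "med", "high"),
                                 Gender = c("F", "M")))

pdf(NULL)  # null device, so the sketch runs without opening a window

# Stacked bars (the default): one bar per Gender, segments by Satisfaction
mid_stacked <- barplot(counts, col = heat.colors(3), xlab = "Gender",
                       legend.text = TRUE,
                       args.legend = list(x = "topleft",
                                          title = "Satisfaction"))

# Side-by-side bars with a horizontal legend at the top
mid_beside <- barplot(counts, beside = TRUE, col = terrain.colors(3),
                      xlab = "Gender", legend.text = TRUE,
                      args.legend = list(x = "top", horiz = TRUE,
                                         title = "Satisfaction"))

invisible(dev.off())
```

Remove the pdf(NULL) and dev.off() lines to draw the charts on the screen instead.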


Satisfaction is a categorical variable with ordered categories. Accordingly, BarChart plots the three levels of Satisfaction, from low to medium to high, as an ordered progression of a single hue.

4.3.4 A Bar Graph Directly from the Counts

As with the bar graph of a single variable, the bar graph of two variables can be constructed directly from the counts.

Scenario Construct a bar graph of two variables from the counts
Given a table of joint frequencies for two variables, construct the bar graph.

Consider the table of counts, Table 4.1.

Table 4.1 Joint frequencies.

              Gender
Satisfaction   F   M
        low    2  11
        med    7   4
        high   8   3

Enter the table of joint frequencies directly into R with the R matrix function. With byrow=TRUE, the counts are entered row by row, with the specifications of 3 rows and 2 columns according to nrow=3 and ncol=2. Use the R colnames and rownames functions to provide the names of the categories. The label on the horizontal axis is the name of the column variable. In this example, Gender is the column variable as specified with the xlab option. Call BarChart with the matrix of counts. The title of the legend is the name of the row variable, Satisfaction.

colnames function: Name the column values.

rownames function: Name the row values.

lessR Input Bar chart of two variables from counts
> Counts <- matrix(c(2,11, 7,4, 8,3), nrow=3, ncol=2, byrow=TRUE)
> colnames(Counts) <- c("F", "M")
> rownames(Counts) <- c("low", "med", "high")
> BarChart(Counts, xlab="Gender", legend.title="Satisfaction")

The result of the preceding four lines of R code is the same graph that appears in Figure 4.4 with one difference. With the counts entered directly, BarChart is unaware that the corresponding counts represent ordinal data with the categories of Satisfaction ordered from low to med to high. The result is that the bar chart displays the levels of Satisfaction in a different hue for each level, as opposed to an ordered progression of hues when constructed from the data in which the levels of Satisfaction are ordered according to the factor function.

factor function, Section 1.6.3, p. 22
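The role of byrow=TRUE when entering the counts is easy to overlook. The following base R sketch, with illustrative names by_row and by_col, enters the same six counts both ways to show that the default column-by-column fill would place the counts in the wrong cells.

```r
values <- c(2, 11, 7, 4, 8, 3)
labels <- list(c("low", "med", "high"), c("F", "M"))

# Filled row by row: each pair of counts becomes one Satisfaction level
by_row <- matrix(values, nrow = 3, ncol = 2, byrow = TRUE)
dimnames(by_row) <- labels

# Default fill is column by column, which scrambles this table
by_col <- matrix(values, nrow = 3, ncol = 2)
dimnames(by_col) <- labels

print(by_row)
print(by_col)
```

Only by_row matches Table 4.1, with 11 men reporting low Satisfaction; in by_col that cell holds 4.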


4.3.5 Cell Proportions

The purpose of BarChart for two variables is to provide their plot as well as the corresponding joint frequencies. The information provided by the joint frequencies, however, can be analyzed several different ways in terms of the way in which the sample proportions are calculated. The SummaryStats function provides these additional analyses.

by option: Specify a second variable.

lessR Input Analysis of sample proportions for two categorical variables
> SummaryStats(Gender, by=Satisfaction)

The first set of sample proportions, in Listing 4.5, are the sample probabilities that the values of Gender and Satisfaction for a randomly sampled person jointly fall in each of the 6 cells of joint frequencies.

Cell Proportions and Marginals
------------------------------

              Gender
Satisfaction      F      M    Sum
        low   0.057  0.314  0.371
        med   0.200  0.114  0.314
        high  0.229  0.086  0.314
        Sum   0.486  0.514  1.000

Listing 4.5 Overall cell proportions from SummaryStats.

Each cell proportion is the corresponding joint frequency divided by the entire sample size. For example, two women reported low Satisfaction, which is 2/35 = 0.057 or 5.7%. Eleven men reported low Satisfaction, which is 11/35 = 0.314 or 31.4% of all of the 35 employees represented in these data. We also see, for example, that 48.6% of all 35 employees are women and 51.4% are men.
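The overall cell proportions can be reproduced from the joint frequencies with the base R function prop.table. This is a sketch of the arithmetic only, not the SummaryStats implementation, and the names cell_props and with_margins are chosen for this illustration.

```r
# Joint frequencies (rows: Satisfaction, columns: Gender)
counts <- matrix(c(2, 11, 7, 4, 8, 3), nrow = 3, byrow = TRUE,
                 dimnames = list(Satisfaction = c("low", "med", "high"),
                                 Gender = c("F", "M")))

# Each joint frequency divided by the grand total
cell_props <- prop.table(counts)

# Append the marginal proportions as a Sum row and a Sum column
with_margins <- addmargins(cell_props)

print(round(with_margins, 3))
```

Because every count is divided by the same grand total, the six cell proportions sum to 1.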

Another way to compute the cell proportions is to express each cell value as the ratio of the corresponding joint frequency divided by the corresponding column marginal sum. If our interest is to compare Satisfaction levels for men and women, then this table in Listing 4.6 is relevant because we can see how the probabilities of different levels of Satisfaction change for women and men. These proportions represent sample conditional probabilities. Each probability for each level of Satisfaction depends upon, is conditioned upon, the respective event that the person is a woman or that the person is a man. For example, if the employee in this sample is a woman, then the probability that she reports low satisfaction is 0.118.

conditional probability: Probability of one event assuming the occurrence of another event.

Proportions within Each Column
------------------------------

              Gender
Satisfaction      F      M
        low   0.118  0.611
        med   0.412  0.222
        high  0.471  0.167
        Sum   1.000  1.000

Listing 4.6 Column proportions from SummaryStats.

The table that presents the proportions within each row is based on the same logic as the previous table, but now focuses on the conditional probabilities with the Satisfaction level as the given, or conditioned, information. The result is in Listing 4.7. If a person has low Satisfaction, then the probability that the employee is a woman is only 0.154, but rises to 0.846 for men.

Proportions within Each Row
---------------------------

              Gender
Satisfaction

Listing 4.7 Row proportions from SummaryStats.
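Both sets of conditional proportions follow from the joint frequencies by dividing each count by its column sum or by its row sum. A minimal base R sketch, using prop.table with its margin argument rather than SummaryStats, and with the illustrative names col_props and row_props:

```r
# Joint frequencies (rows: Satisfaction, columns: Gender)
counts <- matrix(c(2, 11, 7, 4, 8, 3), nrow = 3, byrow = TRUE,
                 dimnames = list(Satisfaction = c("low", "med", "high"),
                                 Gender = c("F", "M")))

# Within each column: proportion of each Satisfaction level given Gender
col_props <- prop.table(counts, margin = 2)

# Within each row: proportion of each Gender given the Satisfaction level
row_props <- prop.table(counts, margin = 1)

print(round(col_props, 3))
print(round(row_props, 3))
```

The same cell answers a different question in each table: for women with low Satisfaction, col_props gives the proportion of women who report low Satisfaction, whereas row_props gives the proportion of low-Satisfaction employees who are women.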