6.2 Evaluate a Single Group Mean

To begin a research study, the researcher chooses a research design from which to obtain the measurements of people, or whatever is the unit of analysis. For the designs considered here the data values are either from one sample or are from different samples, which are then compared with each other. The simplest research design is that of a single sample, the measurements of a single variable of interest. One characteristic often of interest is the mean. What is the average score on the class midterm? What percentage of people will vote for a particular candidate? What is the average score on an anxiety scale administered to clients about to undergo a particular type of therapy?

research design: The procedures for collecting and analyzing one or more samples of data.

6.2.1 The Population Mean and a Sample Mean

The focus of the analysis is not the sample statistics per se, but the corresponding population values of the variable of interest. For the analysis of the mean, the goal is to apply inferential statistics, to use information gleaned from the sample to evaluate the mean of the population from which the sample was drawn. So the population mean of the class midterm would be the mean if the same teaching conditions by which the class was taught were extended to many, many thousands of similar students. The usually hypothetical population values are denoted with Greek letters. Refer to the population mean of the variable of interest by the Greek letter µ.

inferential statistics: Analysis of sample data that provides conclusions regarding the true population values.

The sample results are not of general interest because their values are unique to that specific sample. Usually only a single sample is taken for the group of interest. Hypothetically, more random samples of the same size from the same population could be taken. The key issue is that every random sample will yield different sample statistics. Every sample mean to some extent reflects the value of the true underlying mean, µ, and also the influence of random sampling error.

For example, flip a fair coin 10 times and you might get 6 heads. Flip the same coin another 10 times and you might get 4 heads. Which sample result is true? The answer is neither of them. What is true is the population value, the long-run average over an indefinitely large number of flips. For a fair coin that long-run average is 50% heads. The sample percentage fluctuates around this value from sample to sample, more noticeably for small samples.
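The fluctuation of sample results around the population value can be illustrated with a short simulation. The following is a minimal sketch in base R, not part of lessR. The seed, the number of repeated samples, and the two sample sizes are arbitrary choices made only for illustration.

```r
# Simulate repeated samples of fair-coin flips (hypothetical illustration).
set.seed(123)        # arbitrary seed, for reproducibility only
n_samples <- 1000    # number of repeated samples of each size

# Proportion of heads in each sample of 10 flips and of 1000 flips
p_small <- rbinom(n_samples, size = 10,   prob = 0.5) / 10
p_large <- rbinom(n_samples, size = 1000, prob = 0.5) / 1000

# Both sets of sample proportions center near the population value of 0.5,
# but the small samples fluctuate far more from sample to sample.
cat("10 flips:   mean =", round(mean(p_small), 3),
    " sd =", round(sd(p_small), 3), "\n")
cat("1000 flips: mean =", round(mean(p_large), 3),
    " sd =", round(sd(p_large), 3), "\n")
```

No single sample proportion is the "true" one. Each reflects the population value of 0.5 plus random sampling error, and that error shrinks as the sample size grows.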

Consider the responses to the Mach IV scale, in particular the responses to the 7th item, named m07 in the data set: There is no excuse for lying to someone else. The responses are numerically encoded from 0 to 5 with a 0 indicating Strongly Disagree and a 5 Strongly Agree. When considered as part of Mach IV, this item is reverse scored so that Disagreement is consistent with a high Machiavellianism score. For the analysis of this single item, however, leave the responses unmodified to simplify the discussion of the results.

Mach IV scale, Listing 1.8, p. 27

reverse score, Section 3.4.1, p. 63

The analysis of m07 begins with a description of the sample results provided by the Histogram function. First read the data from the internal lessR data set Mach4, as well as the item content in the form of variable labels.

Histogram function, Section 5.2, p. 100

variable labels, Section 2.4, p. 45

> mydata <- Read("Mach4", format="lessR")

First try the default histogram for m07 .

> Histogram(m07)

Means, Compare Two Samples 125

Default histograms of Likert data are usually problematic because of the relatively small number of scale points that assess the underlying continuous variable, the extent of agreement with the item. In this example the default histogram of m07 has a bin width of only 0.5. Instead the bin width should be the increment between the successive scale values, here 1. Also, the histogram bars should be centered over the corresponding scale points. To accomplish these objectives, provide explicit values for the bin.start and bin.width options.

> Histogram(m07, bin.start=-.5, bin.width=1)

The histogram of the responses of the 351 respondents is shown in Figure 6.1.

Figure 6.1 Histogram for m07, the 7th Mach IV item, not reverse scored.

The histogram indicates that the responses vary across the possible range of values from 0 to 5. Values below 2.5 indicate Disagreement and values above 2.5 indicate Agreement. The summary statistics from Histogram in Figure 6.1 indicate that the sample mean of 2.8 is close to but larger than the midpoint of 2.5. In this particular sample the mean is 0.3 units above the dividing point of Disagree and Agree.
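The effect of starting the bins at -0.5 with a width of 1 can be seen with the base R hist function. This is only a sketch of the binning idea, not the lessR implementation. The responses vector is a small set of made-up values, not the Mach4 data.

```r
# Hypothetical responses on a 0-to-5 Likert scale (not the Mach4 data)
responses <- c(0, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 5, 5)

# Bin edges start at -0.5 and advance by 1, so each bar is centered
# over exactly one scale point: -0.5 to 0.5, 0.5 to 1.5, ..., 4.5 to 5.5
edges <- seq(from = -0.5, to = 5.5, by = 1)
h <- hist(responses, breaks = edges, plot = FALSE)

h$mids     # bar centers: the scale points 0 through 5
h$counts   # number of responses at each scale point
```

With bins of width 0.5 instead, every other bin would necessarily be empty for integer responses, which is why the default histogram is misleading here.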

--- m07, There is no excuse for lying to someone else. --- n miss

Listing 6.1 Summary statistics for m07, the 7th Mach IV item.

The question of interest here is that of inferential statistics. Does this result of a sample mean larger than the midpoint of 2.5 generalize to the population? That is, is the true average response to m07 also in the Agreement region? If additional samples from the same population were obtained, would the mean generally be in the Agree region?

Considered from another perspective, if the true population mean is right at the midpoint of µ = 2.5, then about half the sample means would be below 2.5 and half would be above 2.5. The fact that the sample mean is larger than 2.5 does not by itself imply that the true mean is larger than 2.5. Analysis of the underlying population value provides us with some probability information to evaluate the true value of the mean. There are two primary forms of inferential analysis, the confidence interval about the statistic of interest, and the hypothesis test of a specified value of the corresponding population value.

6.2.2 Inferential Analysis of the Mean

Apply this inferential analysis to the mean.

Scenario: Inferential analysis of the mean

The response to the Mach IV items follows a six-point Likert scale from 0 for Strongly Disagree to 5 for Strongly Agree. Evaluate the population value of the mean level of endorsement for the 7th item on the Mach IV scale. Base the hypothesis test on the scale value that separates the region of disagreement from agreement, 2.5.

Accomplish the lessR inferential analysis of a population mean with the function ttest, or its abbreviation tt. This function provides both classical forms of inference, the confidence interval about the sample mean and, if an hypothesized value is provided, the corresponding hypothesis test. Also provided is an indicator of effect size, the detected magnitude of the difference. An evaluation of the normality of the population from which the data values are sampled is provided for use when the sample size is small.

effect size: The magnitude of the difference between the sample and hypothesized results.

Use the mu0 option to specify the reference value for the hypothesis test. If mu0 is not specified, the confidence interval is calculated without the accompanying hypothesis test. The confidence level is set with the conf.level option. The default value is 0.95.

mu0 option: Null hypothesis value.

The specifications of the null and alternative hypotheses depend on the deviations from the specified null value that are of interest. For the two-tailed test in which deviations in either direction from the null value are of interest, the rejection region of the test consists of values that are much larger than the null value and values that are much smaller. The two-tailed test is the default analysis, as specified by the default value "two.sided" for the alternative option.

two-tailed test: Rejection region consists of both + and − deviations from the null value.

For the one-tailed test, deviations from the reference value of the null hypothesis are of interest only in one direction from the null value. The issue is not if the researcher prefers a large positive or a large negative deviation, but rather what the researcher is willing to interpret. For example, a consumer protection agency interested in assessing a gas mileage claim by an automobile manufacturer does not care if the average fuel mileage betters the claim. Instead the agency is focused only on deviations significantly below the claimed mileage. If only deviations in one direction are of interest to interpret, then specify a one-tailed test, with the rejection region on only one side of the null value. Specify the alternative hypothesis according to alternative="less" or alternative="more".

one-tailed test: Rejection region lies only on one side of the null value.

alternative option: Specify a one- or two-tailed test.
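The practical difference between the two kinds of test shows up in how the p-value is computed from the same t-value. The sketch below uses the base R function pt rather than lessR. The t-value of -1.8 and the sample size of 25 are hypothetical numbers chosen only to illustrate the gas mileage situation.

```r
# Hypothetical values for illustration: a t-value of -1.8 from a sample of 25
t_value <- -1.8
df <- 25 - 1

# Two-tailed: deviations in either direction count against the null
p_two <- 2 * pt(-abs(t_value), df)

# One-tailed, lower tail only: only values below the null count
p_less <- pt(t_value, df)

c(two_tailed = p_two, lower_tail = p_less)
```

When the deviation falls in the direction named by the alternative, the one-tailed p-value is half of the two-tailed p-value. With these assumed numbers the lower-tail p-value falls below 0.05 while the two-tailed p-value does not, which is why the choice must be made from what the researcher is willing to interpret, before seeing the data.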


In this example the population mean value of interest is 2.5. For a two-tailed test the null hypothesis is H0: µ = 2.5. The alternative hypothesis is H1: µ ≠ 2.5. Specify the corresponding two-tailed test as follows.

lessR Input: One-sample t-test

> ttest(m07, mu0=2.5)

or

> tt(m07, mu0=2.5)

The first part of the output of ttest is the summary statistics of the variable of interest, the response variable. The Histogram output in Listing 6.1 includes these statistics. Included in this report is the sample mean of 2.8.

Next is an assessment of the normality of the population data from which the sample data values are obtained. If the sample size is greater than 20 or 30, or lower values if the population data values are not too skewed, then the central limit theorem ensures an approximately normally distributed sample mean across many hypothetical repeated samples from the same population, each sample of the same size. This normality is needed to justify the use of the t-distribution as the basis of the statistical inference.

central limit theorem: The sample mean is approximately normally distributed unless a small sample is taken from a non-normal population.

In this example the sample size is 351, well beyond the threshold of 30. Accordingly, ttest informs us that tests of normality are not needed, as shown in Listing 6.2.

------ Normality Assumption ------
Sample mean is normal because n>30, so no test needed.

Listing 6.2 Consideration of the normality of the response to m07, the 7th Mach IV item.
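The central limit theorem can be checked informally by simulation. The following base R sketch draws repeated samples of size 351 from a deliberately non-normal, bimodal population on the 0-to-5 scale. The response probabilities are assumed values invented for this illustration, not estimates from the Mach4 data, and the seed and number of replications are arbitrary.

```r
# Hypothetical bimodal population of responses on the 0-to-5 scale
set.seed(2024)                                   # arbitrary seed
values <- 0:5
probs  <- c(0.15, 0.25, 0.10, 0.10, 0.25, 0.15)  # assumed probabilities

mu    <- sum(values * probs)                     # population mean
sigma <- sqrt(sum((values - mu)^2 * probs))      # population sd

n <- 351                                         # size of each sample
sample_means <- replicate(2000,
  mean(sample(values, size = n, replace = TRUE, prob = probs)))

# The sample means cluster around mu with a spread close to sigma / sqrt(n)
c(mean_of_means = mean(sample_means), mu = mu)
c(sd_of_means = sd(sample_means), theory = sigma / sqrt(n))
```

Plotting the simulated means, for example with hist(sample_means), should show a roughly bell-shaped distribution even though the individual responses are bimodal.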

The beginning of the inferential analysis follows in Listing 6.3 , of which the first two lines are preliminary information for the analysis.

------ Inference ------
t-cutoff: tcut = 1.967
Standard Error of Mean: SE = 0.08

Listing 6.3 Preliminary information for statistical inference, both the hypothesis test and the confidence interval.

The first value is the t-cutoff for the specified confidence level, of which the default is 95%. This value can be changed from the default of conf.level=0.95. The t-cutoff specifies the range of sampling variability of the t-statistic over repeated samples, in terms of estimated standard errors. The value is around 2 except for very small samples, and always larger than 1.96, which is the baseline established by the normal distribution. Here the value of the cutoff for the two-tailed test is 1.967.

conf.level option: The confidence level.

t-cutoff: Positive and negative values define the range of sampling variability.

The standard error is the standard deviation of the sample mean over these repeated hypothetical samples and sets the baseline for the extent that the statistic fluctuates from sample to sample. The more fluctuation, the more error is likely to be in the estimation of the true mean, µ. The “magic” is that this information can be assessed from the information in only a single sample, the sample of data subject to analysis by the ttest function. Here the standard error of the sample mean is 0.08, which is always less than the standard deviation of the data, here 1.47.

standard error: Standard deviation of a statistic over usually hypothetical repeated samples.
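Both preliminary values can be reproduced by hand. This is a minimal base R sketch, not the lessR code itself. It uses the rounded sample size and standard deviation given for m07 in Section 6.2.3, and the variable names are chosen here for illustration.

```r
# Summary statistics for m07 as given in Section 6.2.3 (rounded values)
n <- 351      # sample size
s <- 1.47     # sample standard deviation

# Estimated standard error of the mean
se <- s / sqrt(n)

# t-cutoff for a 95% confidence level: 2.5% in each tail, df = n - 1
conf_level <- 0.95
tcut <- qt(1 - (1 - conf_level) / 2, df = n - 1)

round(c(SE = se, tcut = tcut), 3)
```

Dividing by the square root of the sample size is what makes the standard error so much smaller than the standard deviation of the data. The cutoff sits slightly above the normal-distribution baseline of qnorm(0.975), about 1.96.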

The core of the inferential analysis follows, beginning with Listing 6.4. As specified by the obtained t-value, the corresponding sample mean of 2.8 is a considerable 3.497 estimated standard errors from 2.5.

t-value of the mean: Number of estimated standard errors the sample mean is from the hypothesized mean.

Hypothesized Value H0: mu = 2.5
Hypothesis Test of Mean: t-value = 3.497, df = 350, p-value = 0.001

Listing 6.4 Hypothesis test that the true mean response is 2.5 for m07, the 7th Mach IV item.

Given the degrees of freedom of df = 350, one less than the sample size of 351, the corresponding p-value is 0.001. The p-value is the probability of obtaining a sample mean that is at least 0.27 units away from the hypothesized mean of 2.5, here in either direction, if the null hypothesis is true. If 2.5 is the true mean, the result of a sample mean of 2.8 is quite unlikely.

p-value: Probability of obtaining a sample statistic as deviant or more deviant from the null hypothesized value assuming a true null.

How low can the p-value be before the null hypothesis is considered unlikely and is therefore rejected? The definition of “unlikely” is the given value of α, usually 0.05, but sometimes 0.01 or 0.10. Here the statistical decision follows.

statistical decision: The rejection or not of the null hypothesis.

Difference from 2.5: p-value = 0.001 < α = 0.05, so reject H0

Reject the null hypothesis as unlikely. A significant difference from 2.5 has been detected. Note that the probability of the truth of the null is not known. What is known is that if it is true, then an unlikely event occurred, so the null is probably not true. One limitation of the hypothesis test is that we do know the p-value, but we do not know the more useful probability of the truth of the null hypothesis. The p-value can be computed to as many decimal digits as desired, but our understanding of the likelihood of the null hypothesis is qualitative. We conclude that the null hypothesis is unlikely without any more specific probability information.

significant difference: A likely difference has been detected between the true mean and the hypothesized mean.
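The hypothesis test can be traced step by step in base R. This sketch uses the rounded summary statistics for m07 from Section 6.2.3, so the t-value comes out near 3.44 rather than the 3.497 that ttest obtains from the unrounded data. The decision is the same. The variable names are illustrative only.

```r
# Rounded summary statistics for m07 (Section 6.2.3) and the null value
n   <- 351
m   <- 2.77
s   <- 1.47
mu0 <- 2.5

se      <- s / sqrt(n)          # estimated standard error of the mean
t_value <- (m - mu0) / se       # standard errors between m and mu0

# Two-tailed p-value: area in both tails beyond |t| with n - 1 df
p_value <- 2 * pt(-abs(t_value), df = n - 1)

alpha  <- 0.05
reject <- p_value < alpha       # TRUE means reject H0

round(c(t = t_value, p = p_value), 3)
reject
```

The p-value is computed under the assumption that the null hypothesis is true, which is why it cannot be read as the probability that the null hypothesis is true.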

Listing 6.5 presents the confidence interval. Given our conclusion that the true mean is not 2.5, what is it? The randomness inherent in the sampling process prevents a precise answer. What is possible is to provide an interval that likely includes the true mean, µ.

Margin of Error for 95% Confidence Level: 0.15
95% Confidence Interval for Mean: 2.62 to 2.93

Listing 6.5 Confidence interval for m07, the 7th Mach IV item.

The 95% confidence interval is from 2.62 to 2.93. The confidence interval is the range of values that likely contains the true mean, µ, at the specified level of confidence, 95%. Here the true average response is in the Agreement region.

confidence interval: Range of values that likely contains the population value of interest.
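The margin of error is the t-cutoff multiplied by the standard error, and the interval is the sample mean plus and minus that margin. The base R sketch below works from the rounded summary statistics in Section 6.2.3, so its bounds can differ from Listing 6.5 in the second decimal place. The variable names are illustrative.

```r
# Rounded summary statistics for m07 (Section 6.2.3)
n <- 351
m <- 2.77
s <- 1.47

se     <- s / sqrt(n)               # estimated standard error of the mean
tcut   <- qt(0.975, df = n - 1)     # t-cutoff for 95% confidence
margin <- tcut * se                 # margin of error

ci <- c(lower = m - margin, upper = m + margin)

round(margin, 2)
round(ci, 2)
```

The lower bound lies above 2.5, which is the interval's way of expressing the same conclusion as the rejection of the null hypothesis.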

The information from the confidence interval is consistent with that of the hypothesis test. The confidence interval indicates that the true mean response on the six-point scale from 0 to 5 for m07 is likely between 2.62 and 2.93, all values of which are larger than the null value of 2.5. Consistent with this information, the hypothesis test indicates that the mean value of 2.5 is unlikely, a value outside of the confidence interval.

Also of interest is the effect size. The p-value indicates if a difference between the population value and the hypothesized value has been detected, but provides no information regarding the extent of this difference. To report a significant p-value, that is, a p-value less than α, without reporting an effect size is to inform the reader that there is a difference, but with no indication as to its size.

effect size, Section 6.2.2, p. 126

The ttest function reports effect size in two different metrics. First, consider the units of measurement of the variable of interest. The difference of the sample mean from the hypothesized value of the mean estimates this difference, and can be extended to the lower and upper bounds of the accompanying confidence interval.

Another indicator of effect size is in standardized units, the distance between the sample and hypothesized means divided by the standard deviation of the data. This standardized indicator is Cohen’s d after Jacob Cohen (1969). Standardization is particularly useful when the original measurement unit is arbitrary, such as for Likert responses including the responses to the items on the Mach IV scale.

Mach IV data, Listing 1.8, p. 27

The function ttest reports both indicators of effect size, as shown in Listing 6.6 . The raw distance of the sample from the hypothesized mean is 0.27 units on the six-point scale. The corresponding standardized value is 0.19, that is, the sample mean is 0.19 standard deviations above the hypothesized mean of 2.5.

------ Effect Size ------
Distance of sample mean from hypothesized: 0.27
Standardized Distance, Cohen’s d: 0.19

Listing 6.6 Effect size for the distance of the sample mean of m07, the 7th Mach IV item, from the hypothesized value of 2.5.
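Both effect sizes follow from simple arithmetic on the summary statistics. The base R sketch below uses the rounded values from Section 6.2.3, which give a standardized distance of about 0.18. The 0.19 in Listing 6.6 comes from the unrounded data. The variable names are illustrative.

```r
# Rounded summary statistics for m07 (Section 6.2.3) and the null value
m   <- 2.77
s   <- 1.47
mu0 <- 2.5

diff_raw <- m - mu0        # distance in the original scale units
d        <- diff_raw / s   # Cohen's d: distance in standard deviations

round(c(diff = diff_raw, d = d), 2)
```

Unlike the t-value, Cohen's d divides by the standard deviation of the data rather than the standard error, so it does not grow with the sample size.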

To assist in the understanding of the distribution of the variable of interest, and the resulting effect size, ttest also provides a smoothed plot of the distribution, a density plot, shown in Figure 6.2. Included in the plot are vertical lines that represent the sample mean of the distribution and the corresponding hypothesized mean. The two indicators of effect size are also displayed. The bottom horizontal axis displays the original metric in which the variable is measured. The top horizontal axis displays the metric of Cohen’s d, the standard deviation of the data.

density plot, Section 5.5, p. 114

For the responses to the Mach IV m07, the population mean is apparently in the Agree region, so that on average the respondents agree that lying to others is bad. However, the sample mean response is only 0.27 units above the midpoint of 2.5 on the six-point scale from 0 to 5. Even the upper end of the 95% confidence interval is below 3, with a value of 2.93. The histogram in Figure 6.1 and the density plot in Figure 6.2 indicate the responses vary over the complete range of data values. Further, the effect size is rather small. The true mean is in the Agree region, but the bimodal shape of the distribution implies that the mean is not an effective summary of all the data values.


Figure 6.2 Density plot of m07, the 7th item on the Mach IV scale, with sample mean and hypothesized mean and two effect sizes, not reverse scored.

6.2.3 Inference from Summary Statistics

The t-test can also be conducted directly from the three summary statistics that form the basis of the test. Just specify a value for n , m , and s , the sample size, the sample mean, and the sample standard deviation, respectively.

> ttest(n=351, m=2.77, s=1.47, mu0=2.5)

Invoking this statement results in the same output as shown previously, up to the rounding of the supplied summary statistics, except for the portions of the output that depend directly on the availability of the data. The density curves cannot be plotted, nor is a test for normality of the data performed.