Department of Statistics STA1501 Descrip (1)

CONTENTS

  ORIENTATION

STUDY UNIT 1

  1.1 Introduction

  What is Statistics?

  1.2 Types of Data and Information

  1.3 Self-correcting Exercises for Unit 1

  1.4 Solutions to Self-correcting Exercises for Unit 1

  1.5 Learning Outcomes

  1.6 Study Unit 1: Summary

STUDY UNIT 2

  2.2 Graphical and Tabular Techniques to describe Nominal Data

  2.3 Graphical Techniques to Describe Interval data

  2.4 Describing the Relationship between Two Variables and Describing Time Series Data

  2.5 Self-correcting Exercises for Unit 2

  2.6 Solutions to Self-correcting Exercises for Unit 2

  2.7 Learning Outcomes

  2.8 Study Unit 2: Summary

STUDY UNIT 3

  3.2 Graphical Excellence and Graphical Deception

  3.3 Presenting Statistics: Written Reports and Oral Representations

  3.4 Measures of Central Location

  3.5 Measures of Variability

  3.6 Self-correcting Exercises for Unit 3

  3.7 Solutions to Self-correcting Exercises for Unit 3

  3.8 Learning Outcomes

  4.7 Solutions to Self-correcting Exercises for Unit 4

  4.8 Learning Outcomes

  4.9 Study Unit 4: Summary

STUDY UNIT 5

  5.2 Methods of Collecting Data and Sampling

  5.3 Sampling Plans

  5.4 Sampling and Nonsampling Errors

  5.5 Self-correcting Exercises for Unit 5

  5.6 Solutions to Self-correcting Exercises for Unit 5

  5.7 Learning Outcomes

  5.8 Study Unit 5: Summary

STUDY UNIT 6

  6.2 A basis for probability

  6.3 Sophisticated methods and rules in probability theory

  6.4 The rule of Bayes

  6.5 Learning Outcomes

  6.6 Study Unit 6: Summary

STUDY UNIT 7

  7.1 Introduction

  7.2 Discrete probability distributions

  7.3 Bivariate distributions

  7.4 Binomial distribution

  7.5 Poisson distribution

  7.6 Learning Outcomes

  7.7 Study Unit 7: Summary

STUDY UNIT 8

  8.1 Introduction

  8.2 Continuous probability distributions

STUDY UNIT 9

  9.1 Introduction

  9.2 Sampling distribution of the mean

  9.3 Sampling distribution of a proportion

  9.4 Sampling distribution of the difference between two means

  9.5 Self-Correcting Exercises for Unit 9

  9.6 Solutions to Self-Correcting Exercises for Unit 9

  9.7 Learning Outcomes

  9.8 Study Unit 9: Summary

STUDY UNIT 10

  10.1 Introduction

  10.2 Concepts of Estimation

  10.3 Estimating the Population Mean when the Population Standard Deviation is Known

  10.4 Selecting the Sample Size

  10.5 Self-correcting Exercises for Unit 10

  10.6 Solutions to Self-correcting Exercises for Unit 10

  10.7 Learning Outcomes

  10.8 Study Unit 10: Summary

STUDY UNIT 11

  11.1 Introduction

  11.2 Concepts of Hypothesis Testing

  11.3 Testing the Population Mean when the Population Standard Deviation is Known

  11.4 Calculating the Probability of a Type II error

  11.5 The Road Ahead

  11.6 Self-Correcting Exercises for Unit 11

  11.7 Solutions to Self-Correcting Exercises for Unit 11

  11.8 Learning Outcomes

  11.9 Study Unit 11: Summary

STUDY UNIT 12

  12.1 Introduction

ORIENTATION

Introduction

  Welcome to STA1501. This module consists of the first half of the first-year statistics course for students in the College of Economic and Management Sciences. The two modules form an integrated whole and are focused on the following objective: to collect, organise, analyse and interpret data for the purpose of making better decisions.

  First−year Statistics

  This is where you are

  The first part of the module covers “Descriptive Statistics”, which is down to earth and practical, with the focus on the presentation of data. The first step is to think carefully about the type of variable that each measurement represents. This is extremely important, as the type dictates what you can or can’t do in the rest of your data analysis. We will also consider the collection of data (which, in the social sciences and in business applications, most often involves administering questionnaires and/or surveys, where sampling plays an important role). Between the collection of data and the ultimate goal of analysing the data lies the very important step of organising and summarising the data. So, in this module we discuss how to organise and summarise the gathered information intelligibly and efficiently.

  a success p. These critical values are called parameters. We most often don’t know what the values of the parameters are and thus we cannot “utilise” these distributions (i.e. use the mathematical formula to draw a probability density graph or compute specific probabilities) unless we somehow estimate these unknown parameters. It makes perfect logical sense that to estimate the value of an unknown population parameter, we compute a corresponding or comparable characteristic of the sample.

  The objective is to draw inference about a population (a complete set of data) based on the limited information contained in a sample. In dictionary terms, inference is the act or process of inferring; to infer means to conclude or judge from premises or evidence; meaning to derive by reasoning. In general, the term implies a conclusion based on experience or knowledge. More specifically in statistics, we have as evidence the limited information contained in the outcome of a sample and we want to conclude something about the unknown population from which the sample was drawn. The set of principles, procedures and methods that we use to study populations by making use of information obtained from samples is called statistical inference.

Learning objectives

  There are very specific outcomes for this module which we list below. Throughout your study of this module you must come back to this page, sit back and reflect upon these outcomes, think them through, digest them into your system and feel confident in the end that you have mastered them.

  • Analyse data considering different types of data and how they relate to relevant graphical and tabular presentations, e.g. pie charts, bar charts, histograms, stem-and-leaf displays, line charts, scatter diagrams and box-and-whisker plots.

  • Analyse data by calculating accurate numerical measures of central location, variability, relative standing and linear relationship.

  • Differentiate between simple random sampling, stratified random sampling and cluster sampling, and implement a sampling plan for a given research problem with an awareness of the effect of sampling errors.

  • Describe the different concepts and laws of probability and apply definitions of joint, marginal and conditional probability.

  • Apply the complement, multiplication and addition rules and probability trees for calculation of more complex events and calculate complicated events from the probabilities of related events.

  • Understand the role of probability in decision making and the application in basic statistical inference.

  • Describe random variables and the probabilities associated with them in the form of a table, formula or graph and also in terms of their parameters, usually the expected value and the variance.

  • Describe different probability distributions as either discrete or continuous and know the parameters of expected value and variance.

The prescribed textbook

  For this module you must study twelve chapters from the prescribed textbook:

  Keller, G. (2009) Managerial Statistics, 8th edition, South-Western, Cengage Learning.

  Chapter 8: CONTINUOUS PROBABILITY DISTRIBUTION
  Chapter 9: SAMPLING DISTRIBUTIONS
  Chapter 10: INTRODUCTION TO ESTIMATION
  Chapter 11: INTRODUCTION TO HYPOTHESIS TESTING
  Chapter 12: INFERENCE ABOUT A POPULATION

The study guide

  The study guide is exactly what its name implies: a guide through the textbook in a systematic way. The textbook will focus on the theoretical contents of the module and we have tried not to duplicate material from the textbook in the guide. For each separate study unit you should first study the work in the textbook and utilise the guide to assess your progress, test your knowledge and prepare for the examination. In other words, the study guide will provide you with an opportunity to apply your knowledge of the material that is covered in the textbook. This study guide serves as an interactive workbook, where spaces are provided for your convenience. Should you so prefer, you are welcome to write and reference your solutions in your own book or file, if the space we supply is insufficient or not to your liking.

Study units and workload

  We realise that you might feel overwhelmed by the volumes and volumes of printed matter that you have to absorb as a student! How do you eat an elephant? Bite by bite! We have divided the twelve chapters of the textbook into 12 study units or “sessions”. Make very sure about the sections in each study unit since some sections of the textbook are not included and we do not want you frustrated by working through unnecessary work. The study units vary in length but you should try to spend on average 12 hours on each unit. Practically everybody should be able to do statistics. It depends on the amount of TIME you spend on the subject. Regular contact with statistics will ensure that your study becomes personally rewarding.

Try to work through as many of the exercises as possible

  Doing exercises on your own will not only enhance your understanding of the work, but it will give you confidence as well. Feedback is given immediately after the activity to help you check whether you understand the specific concept. The activities are designed (i.e. specific exercises are selected) so

  In a paper by Sue Gordon (1995) from the University of Sydney, the following metaphor is given:

  “The learning of statistics is like building a road. It’s a wonderful road, it will take you to places you did not think you could reach. But when you have constructed one bit of road you cannot sit back and think ‘Oh, that’s a great piece of road!’ and stop at that. Each bit leads you on, shows the direction to go, opens the opportunity for more road to be built. And furthermore, the part of the road that you built a few weeks ago, that you thought you were finished with, is going to develop potholes the instant you turn your back on it. This is not to be construed as failure on your part, this is not inadequacy. This is just part of road building. This is what learning statistics is about: go back and repair, go on and build, go back and repair.”

  A few logistical problems

Decimal comma or point?

  We realise that in the South African schooling system commas are used to indicate the decimal digit values. You have been penalised at school for using a point. Now we sit between two fires: the school system and common practice in calculators and computers! Most computer packages use the decimal point (ignoring the option to change it) and Keller (the author) also uses the decimal point in our textbook (Managerial Statistics). Thus, we shall use the decimal point in our study guide, assignments and the examination.

Role of computers and statistical calculators

  The emphasis in the textbook is well beyond the arithmetic of calculating statistics and the focus is on the identification of the correct technique, interpretation and decision making. This is achieved with a flexible design giving both manual calculations and computer steps.

  using a computer is that you can do calculations for larger and more realistic data sets. Whether you use a computer program or a statistical calculator as a tool for your calculations is irrelevant to us. However, the emphasis in this module will always be on the interpretation and on how to articulate the results in report writing. CD Appendixes and a Study Guide are provided in PDF format on the CD-ROM (included in the textbook). Although it will not be to your disadvantage if you do not use the CD, we encourage you to try your best to have at least a few sessions on a computer. Statistical software makes Statistics exciting, so play around on the computer should you have access!

Key Terms/Symbols

  Sampling distribution; Central limit theorem; Sampling distribution of the sample mean; Standard error of the mean; Normal approximation of the binomial distribution; Continuity correction factor; Sampling distribution of the sample proportion; Standard error of the proportion; Sampling distribution of the difference between two sample means; Standard error of the difference between two means.

STUDY UNIT 1

1.1 Introduction

  The objective of Statistics is to draw conclusions about a population based on the limited information contained in a sample. In other words, statistics is a method of converting data into information.

READ THROUGH

  Keller Chapter 1 What is Statistics?

  Introduction

  1.1 Key Statistical concepts

  1.2 Statistical Applications in Business

  1.3 Statistics and the Computer

  You need not panic if not all the new terms make sense to you, nor do you need to remember them all at this stage! As we proceed chapter by chapter and you start applying the different techniques, you will understand more. In study unit 5 you will learn most of what you need to know about data collection and sampling to obtain optimum information, and you will learn that there are good and bad ways of obtaining a sample.

  Activity 1.1

  State whether the following statements are correct or incorrect. (a) When the purpose of the statistical inference is to draw a conclusion about a population, the

  significance level, as a measure of reliability, measures how frequently the conclusion will be

  Feedback

  (a) Incorrect (b) Correct (c) Incorrect (d) Correct (e) Incorrect (f) Incorrect (g) Correct (h) Incorrect (i) Correct

READ THROUGH

  Keller Chapter 1 What is Statistics?

  Appendix 1.A Instructions for the CD-ROM
  Appendix 1.A Instruction to Microsoft Excel

  At this stage we only want you to be aware of Microsoft Excel and of the fact that the CD contains an additional statistical software add-in for Excel which will enable you to do all the statistical procedures that are covered in the textbook. You can always come back to these pages when you need to know something more about Excel. We do not expect you to master MINITAB – mainly because it would imply additional costs to obtain

1.2 Types of Data and Information

STUDY

  Keller Chapter 2 Graphical and Tabular Descriptive Techniques

  2.1 Types of Data and Information

  Always remember that the type of data dictates what you can or can’t do in the rest of your data analysis. (Turn to the inside cover of your textbook and notice that even the condensed overall summary is organised according to data types.) Knowing about variable types and collecting data is almost like a chicken-and-egg situation! Which one comes first? It is extremely important to think carefully about the type of variable that each measurement represents, because it could influence the manner in which the measurements will be obtained. For example, suppose you compile a questionnaire where the respondent can tick one of the following age categories:

  Age      Please Tick
  20–25
  25–35
  35–45
  45–65

  The resulting data will be considered as ordinal measurements (i.e. if ages are artificially grouped into categories). But, if the true age of each respondent can be determined to the nearest day, the data can be considered as individual points on an interval scale and we have the strongest level of measurement.
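  As a rough illustration of the point above, the sketch below (in Python, which is not part of the software prescribed for this module) maps an exact age onto the questionnaire's categories. It assumes that each category includes its lower limit and excludes its upper limit, a convention the questionnaire itself does not state, and the function name `age_category` is hypothetical.

```python
# Questionnaire categories from the text, as (lower, upper) limits.
# Assumption: lower <= age < upper, so that an age of exactly 25
# falls into one category only.
CATEGORIES = [(20, 25), (25, 35), (35, 45), (45, 65)]


def age_category(age):
    """Return the ordinal category label for an exact age, or None."""
    for lower, upper in CATEGORIES:
        if lower <= age < upper:
            return f"{lower}-{upper}"
    return None  # age falls outside the listed categories


exact_ages = [22.4, 25.0, 31.7, 44.9]
print([age_category(a) for a in exact_ages])
```

  Respondents aged 25.0 and 31.7 land in the same category: the grouped data keep the ordering of the ages but lose the exact distances between them, which is why the categories are treated as ordinal.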

Feedback

  (a) No. With nominal data, the observations are sorted into categories with no particular order to the categories. Since we may presume that “Better off” is ranked higher than “The same” and that “The same” is ranked higher than “Worse off”, I would say that this is an example of ordinal measurements.

  (b) Yes. In general, categories are called mutually exclusive if each individual or object appears in only one category. It is clear that each individual can fall in only one of the three categories. You cannot be “better off” and “worse off” at the same time.

  (c) Yes. In general, categories are exhaustive if the categories cover all possible outcomes. In other words, each individual must appear in at least one of the categories – almost like asking “Well, what else could your financial state be?” So, I would say yes.

  Activity 1.4

  At the end of an escorted travel bus vacation to Cape Town, the tour operator asks the vacationers to respond to the questions listed below. For each question, determine whether the possible responses are interval, nominal, or ordinal.

  (a) How many escorted vacations have you taken prior to this one? ....................................................

  ........................................................................................................................................................

  (b) Do you feel that your stay in Cape Town was sufficiently long? .......................................................

  Question 2

  A manufacturer of children’s toys claims that less than 3% of his products are defective. When 500 toys were drawn from a large production run, 5 were found to be defective.

  (a) Define the
      (i) population of interest
      (ii) sample
      (iii) parameter
      (iv) statistic

  (b) Does the value 3% refer to the parameter or the statistic? Why?

  (c) Does the value 5 refer to the parameter or the statistic? Why?

  (d) Explain briefly how the statistic can be used to make inferences about the parameter to test the claim.

  Question 3

  (a) The Human Resources Director of a large insurance company wishes to develop an employee health benefits package and decides to select 400 employees from a list of all employees in order to study their preferences for the various components of a potential package. The 400 employees who will participate in this study constitute the __________.

  (b) The Human Resources Director of a large hospital in Gauteng wishes to develop an employee health benefits package and decides to select 300 employees from a list of all employees in order to study their preferences for the various components of a potential package. Information obtained from the sample will be used to draw conclusions about the true population __________.

  Question 4

  A businessman who is running for the vacant City Mayor seat with 25 000 registered voters conducts

  Question 5

  (a) Provide one example of nominal data.
  (b) Provide one example of ordinal data.
  (c) Provide one example of interval data.

1.4 Solutions to Self-correcting Exercises for Unit 1

  Question 1

  (a) Descriptive statistics deals with methods of organizing, summarizing, and presenting data in a convenient and informative way.

  (b) Statistical inference is the process of estimation, prediction or decision making about a population, based on sample data.

  (c) The confidence level is a measure of reliability that measures the proportion of times that an estimating procedure will be correct.

  (d) The significance level is a measure of reliability that measures how frequently the conclusion about a population will be wrong in the long run.

  (e) A population is the group of all items of interest to a statistics practitioner. Populations are frequently very large and may, in fact, be infinitely large.

  (f) A sample is a set of data drawn from the population.

  Question 2

  (a) (i) The complete production run of toys.
      (ii) The 500 toys drawn from the production run.
      (iii) The proportion of the production run toys that are defective.
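  The sample statistic for this question can be computed directly from the figures given in the question. A minimal sketch in Python (the variable names are illustrative only):

```python
sample_size = 500   # toys drawn from the production run
defective = 5       # defective toys found in the sample

# The statistic: the proportion of defective toys in the sample.
sample_proportion = defective / sample_size
print(f"Sample proportion defective: {sample_proportion:.2%}")
```

  This sample proportion is the value that would be compared with the manufacturer's claim about the whole production run.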

  Question 3

  (a) sample (b) parameters

  Question 4

  (a) The political choices of the 25 000 registered voters.
  (b) The political choices of the 500 registered voters interviewed.
  (c) It is a statistic, since it is a summary measure that is computed from the sample.

  Question 5

  (a) Nominal data example: Political party affiliation for American voters recorded using the code: 1 = Democrat, 2 = Republican and 3 = Independent.

  (b) Ordinal data example: Response to a market research survey measured on the Likert scale using the code: 1 = Strongly agree, 2 = Agree, 3 = Neutral, 4 = Disagree and 5 = Strongly disagree.

  (c) Interval data example: Temperature on the rugby field during the Super Twelve competition.

  Key terms: descriptive statistics, statistical inference, confidence level, significance level, population

1.5 Learning Outcomes

  Use the following learning outcomes as a checklist after you have completed this study unit to evaluate the knowledge you have acquired.

  Can you

  • define the following terms?
    - population
    - sample
    - parameter
    - statistic
    - statistical inference

  • describe the different types of variables?
    - nominal
    - ordinal
    - interval

1.6 Study Unit 1: Summary

  I. Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions.

  II. There are two types of statistics:

  A. Descriptive statistics are procedures used to organize and summarize data.

  B. Inferential statistics involve taking a sample from a population and making estimates about a population based on the sample results.

  1. A population is an entire set of individuals or objects of interest or the measurements obtained from all individuals or objects of interest.

  2. A sample is part of the population.

  III. There are two types of variables:

  A. A qualitative variable is categorical or nonnumeric.

  1. Usually we are interested in the number or percentage of observations in each category.

  2. Qualitative data are usually summarized in graphs and bar charts.

  B. There are two types of quantitative variables and they are usually reported numerically.

  1. Discrete variables can assume only certain values, and there are usually gaps between values.

  2. An interval variable can assume any value within a specified range.

  IV. There are two levels of measurement:

  A. With the nominal level, the data are sorted into categories with no particular order to the categories.

  B. The ordinal level of measurement presumes that one classification is ranked higher than another.

STUDY UNIT 2

2.1 Introduction

  We emphasised in study unit 1 how extremely important it was to carefully think about the type of variable that each measurement of a data set represents because the type dictates what you can or can’t do in the rest of your data analysis. Consciously remind yourself to think about the data type whenever you are busy doing something with data. The final mind map we are working towards (after completion of Statistics 1) and which you have to make part of yourself, is given on the inside cover of your textbook.

2.2 Graphical and Tabular Techniques to Describe Nominal Data

STUDY

  Keller Chapter 2 Graphical and Tabular Descriptive Techniques

  2.2 Graphical and Tabular Techniques to Describe Nominal Data

  You will not find one chart in the study guide (or textbook) that looks “manually drawn” simply because all printed matter needs to be in electronic format for the production process. In pre-computer days statisticians produced the same pie and bar charts manually and you should not feel discouraged if you do not have the software to produce them, as long as you understand how to construct them.

  Activity 2.1

  Voters participating in an election exit poll in Minnesota (USA) were asked to state their political party affiliation. Coding the data as 1 for Republican, 2 for Democrat, and 3 for Independent, the data collected were as follows:

  Construct a frequency bar chart.

  Feedback

  Manually

  We need a frequency table before we can draw a frequency bar chart.

  [Frequency table; columns: Tally marks, Frequency]

  Using Excel

  [Frequency bar chart; vertical axis: Frequency]

  Activity 2.2

  Car buyers were asked to indicate the car dealer they believed offered the best overall service. The four choices were Carriage Motors (C), Marco Toyota (M), Triangle Auto (T) and University Chevrolet (U). The following data were obtained:

TCCCCMTCUUMCMTCMMCMU TCCTUMMCCTTUCUTMMCUT

  Construct a pie chart. Which car dealer offered the best overall service?

  Feedback

  It seems that Carriage Motors offered the best overall service.
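  The frequency table behind the pie chart can be checked with a short tally. The sketch below is written in Python purely for illustration (the textbook itself works with Excel); it counts the 40 responses listed in the activity and converts each count to a proportion and to the angle of its slice.

```python
from collections import Counter

# Responses from Activity 2.2: C = Carriage Motors, M = Marco Toyota,
# T = Triangle Auto, U = University Chevrolet.
responses = "TCCCCMTCUUMCMTCMMCMU" "TCCTUMMCCTTUCUTMMCUT"

counts = Counter(responses)
total = sum(counts.values())

print("Dealer  Frequency  Proportion  Slice (degrees)")
for dealer, frequency in counts.most_common():
    proportion = frequency / total
    # Each slice takes the same share of the 360 degrees of the pie.
    print(f"{dealer:<7} {frequency:>9}  {proportion:>10.3f}  {proportion * 360:>15.1f}")
```

  Carriage Motors has the largest count (14 of the 40 responses), which agrees with the feedback above.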

  Major of Graduate Number of graduates Accounting

  (a) Draw a pie chart of the number of graduates.
  (b) Draw a frequency bar chart.
  (c) Which graph do you favour and why?

  Feedback

  (a) Pie chart

  (b) Frequency bar chart


  Activity 2.4

  Complete the following sentences:

  1. Bar and pie charts are graphical techniques for __________ data. Bar charts focus the attention on the __________ of the occurrences of the categories and pie charts emphasise the __________ of occurrences of each category.

  2. If we wish to emphasise the relative frequencies for nominal data, we draw a __________.

  3. One of the advantages of a pie chart is that it clearly shows that the total of all the categories of the pie adds to __________.

  Feedback

  1. Bar and pie charts are graphical techniques for nominal data. Bar charts focus the attention on the frequency of the occurrences of the categories, and pie charts emphasise the proportion of occurrences of each category.

  2. If we wish to emphasise the relative frequencies for nominal data, we draw a pie chart.

  3. One of the advantages of a pie chart is that it clearly shows that the total of all the categories of the pie adds to 100%.

2.3 Graphical Techniques to Describe Interval Data

STUDY

Frequency distribution

  The first problem is to decide on the number of class intervals. There is an inverse relationship between the number of classes and the average frequency per interval. This means that a large number of classes could cause the frequencies of most classes to be small and that a small number of classes could cause the frequencies of most classes to be large. Too many intervals would not really summarise the data, whilst too few intervals would cause us to lose information. This is why Sturges’ formula comes in handy to determine the number of classes.

  The number of classes = 1 + 3.3log(n). (Table 2.6 of the textbook gives a rough summary of this rule.)
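  To get a feel for what Sturges' formula suggests, the following Python sketch evaluates it for a few sample sizes. It assumes that "log" in the formula is the base-10 logarithm; the function name `sturges_classes` and the sample sizes are chosen for illustration only.

```python
import math


def sturges_classes(n):
    """Suggested number of classes for n observations: 1 + 3.3*log10(n)."""
    return 1 + 3.3 * math.log10(n)


for n in (25, 50, 200, 1000):
    k = sturges_classes(n)
    print(f"n = {n:>4}: 1 + 3.3 log(n) = {k:.2f}, so about {round(k)} classes")
```

  The result is only a guideline: it is rounded to a whole number and can be adjusted so that the class limits come out conveniently.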

  The second step is to compute the size of the intervals or the interval width (some textbooks talk about interval length). Since the total number of intervals must include all the data from the smallest observation to the largest observation, it seems logical that the class width can be calculated using

  class width = (largest value – smallest value) / (number of intervals).

  Please note that this answer is seldom an integer, but then we compromise and manipulate the result by rounding to an integer. The aim is to find an interval width that would also result in a convenient midpoint, especially if the histogram is drawn manually.

  The final step is to write down the classes. Keep in mind that we work with an interval type variable and we wouldn’t like to have gaps between the classes but on the other hand we also won’t like the classes to be ambiguous. This means there should be no doubt where an observation belongs. The classes should be mutually exclusive and no observation can be classified into two different intervals. Make sure that the smallest as well as the largest observation each falls into a class.
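  The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the width is rounded up to the next whole number, the first class starts at a convenient value chosen by the user, and the function name `build_classes` is hypothetical. The example uses the smallest and largest ages that appear in Activity 2.6 (21 and 59).

```python
import math


def build_classes(smallest, largest, number_of_classes, start=None):
    """Return (width, classes), where classes are (lower, upper) pairs.

    Each class is read as "lower and less than upper".
    """
    raw_width = (largest - smallest) / number_of_classes
    width = math.ceil(raw_width)          # round up to a whole number
    if start is None:
        start = smallest
    classes = [(start + i * width, start + (i + 1) * width)
               for i in range(number_of_classes)]
    # Both the smallest and the largest observation must fall into a class.
    if not (classes[0][0] <= smallest and largest < classes[-1][1]):
        raise ValueError("classes do not cover the data; adjust start or width")
    return width, classes


# Four classes, starting at the convenient value 20 (an assumption).
width, classes = build_classes(21, 59, 4, start=20)
print("width =", width)
for lower, upper in classes:
    print(f"{lower} and less than {upper}")
```

  Writing each class as "lower and less than upper" leaves no gaps between classes and no doubt about where a value on a boundary belongs.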

  Table 2.5 in Keller actually means

Class limits

  0 and less than 15

  15 and less than 30

  30 and less than 45

  ↓

  105 and less than 120

  In table 2.5 of Keller (frequency distribution of long-distance bills), he conveniently described the intervals as “0 to 15”, “15 to 30”, etc., but strictly speaking we cannot mark the class limits as 0 – 15, 15 – 30, 30 – 45, etc. because they could be interpreted ambiguously. (The limits are themselves possible data values, so we would not be sure where a long-distance bill of exactly 15 was classified.)

  More advanced statistical packages avoid the confusion by picking the class midpoints halfway between the two class limits and use these values on the horizontal axis.
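  The "and less than" reading of the class limits can be made concrete with a small Python sketch. The default values mirror the limits listed above for Table 2.5 (eight classes of width 15, starting at 0); the function name `find_class` is hypothetical and the bill amounts are made-up test values.

```python
def find_class(value, lower_limit=0, width=15, number_of_classes=8):
    """Return the (lower, upper) class containing value, or None.

    A class contains its lower limit but not its upper limit.
    """
    index = int((value - lower_limit) // width)
    if value < lower_limit or index >= number_of_classes:
        return None
    lower = lower_limit + index * width
    return (lower, lower + width)


for bill in (14.99, 15, 29.5, 119.99, 120):
    print(bill, "->", find_class(bill))
```

  A bill of exactly 15 is placed in the class "15 and less than 30" and nowhere else, which removes the ambiguity described above.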

  Activity 2.5

  Complete the following sentences:

  1. A frequency distribution counts the number of observations that fall into each of a series of intervals, called __________, that cover the complete __________.

  2. Although the frequency distribution provides information about how the numbers in the data set are distributed, the information is more easily understood and imparted by drawing a __________.

  3. The number of class intervals we select in a frequency distribution depends on the number of __________.

  4. Select the correct option: The relative frequency of a class is computed by
     (a) dividing the frequency of the class by the class width
     (b) dividing the frequency of the class by the total number of observations in the data set
     (c) dividing the frequency of the class by the number of classes

  5. A modal class is the class that includes the __________.

  6. When ogives or histograms are constructed, the __________ axis must show the true zero or origin.

  10. A histogram is said to be positively skewed when it has a __________.

  11. The stem-and-leaf display reveals (far more, far less) information relative to individual values than does the histogram.

  Feedback

  1. A frequency distribution counts the number of observations that fall into each of a series of intervals, called classes, that cover the complete range of observations.

  2. Although the frequency distribution provides information about how the numbers in the data set are distributed, the information is more easily understood and imparted by drawing a histogram.

  3. The number of class intervals we select in a frequency distribution depends on the number of observations in the data set.

  4. The relative frequency of a class is computed by dividing the frequency of the class by the total number of observations in the data set.

  5. A modal class is the class that includes the largest number of observations.

  6. When ogives or histograms are constructed, the vertical axis must show the true zero or origin.

  7. According to Sturges’ rule, an indication of the number of class intervals in a frequency distribution equals 1 + 3.3 log(n), where n is the size of the data set.

  8. A bimodal histogram is one with two peaks, not necessarily equal in height.

  9. A histogram is said to be symmetric if, when we draw a vertical line down the centre of the histogram, the two sides are identical in shape and size.

  10. A histogram is said to be positively skewed when it has a long tail extending to the right.

  Activity 2.6

  The ages of a sample of 25 salespersons are as follows:

  (a) Draw a stem-and-leaf display.
  (b) Draw a histogram with four classes.
  (c) Draw a histogram with six classes.

  Feedback

  (a) Stem-and-leaf display

  (b) Histogram with four classes

  Interval width = (largest value – smallest value) / (number of intervals) = (59 − 21) / 4 = 38 / 4 ≈ 10

  [Frequency table; columns: Class limits, Number of salespersons]

  (c) Histogram with six classes


  Activity 2.7

  Feedback

  (a) Using statistical software

  [Ogive: Cumulative Relative Frequency (vertical axis) against Ages (years) (horizontal axis)]

  (b) 0.24

  (c) The proportion of salespersons who are more than 40 years of age = 1 − 0.68 = 0.32 or 32%.

  (d) The proportion of salespersons who are between 40 and 50 years of age = 0.92 − 0.64 = 0.28 or

2.4 Describing the Relationship between Two Variables and Describing Time Series Data

STUDY

  Keller Chapter 2 Graphical and Tabular Descriptive Techniques

  Similarly, Keller conveys the idea that a graphical display of the joint study of two numerical variables should give you a feeling of the relationship between them. (When we have bivariate data on two numerical variables, we can also compute something additional which is a measure of the relationship between them. This will be formally treated in chapter 4 (and 17), where the relationship between them is quantified.)

  Activity 2.8

  (a) The graphical technique used to describe the relationship between ________ is the scatter diagram.

  (b) Time series data are often graphically depicted on a ________, which is a plot of the variable of interest over time.

  (c) A line chart is created by plotting the value of the variable on the ________ axis and the time periods on the ________ axis.

  (d) In order to describe how two variables are related, the two most important characteristics revealed by the scatter diagram are the ________ and ________ of the relationship.

  (e) Data can be classified according to whether the observations are measured at the same time or whether they represent measurements at successive points in time. The former are called ________ data and the latter, ________ data.

  (f) To evaluate two categorical variables at the same time, a ________, also called ________ or ________, should be developed.

  Feedback

  (d) In order to describe how two variables are related, the two most important characteristics revealed by the scatter diagram are the strength and direction of the relationship.

  (e) Data can be classified according to whether the observations are measured at the same time, called cross-sectional data, or whether they represent measurements at successive points in time, called time-series data.

  (f) To evaluate two categorical variables at the same time, a contingency table, also called a cross-classification table or cross-tabulation table, should be developed.

  Activity 2.9

  A professor of economics wants to study the relationship between income and education. A sample of 10 individuals is selected at random. The data below show their income (in R10 000) and education (in years).

  Education  12  14  10  11  13   8  10  15  13  12
  Income     25  31  20  24  28  15  21  35  29  27

  a. Draw a scatter diagram for the data with the income on the vertical axis.

  b. Describe the relationship between income and education.

  Feedback

  (a)

  [Figure: scatter diagram with Income on the vertical axis]
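
  The direction and strength that a scatter diagram reveals can also be checked numerically. The sketch below uses the ten observations of Activity 2.9 and computes the coefficient of correlation by hand; this measure is only treated formally later (Keller chapter 4), so the calculation is offered here as an illustration and not as part of the expected answer. The function name is an assumption.

```python
from math import sqrt

education = [12, 14, 10, 11, 13, 8, 10, 15, 13, 12]   # years
income = [25, 31, 20, 24, 28, 15, 21, 35, 29, 27]     # in R10 000


def pearson_r(x, y):
    """Sample coefficient of correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(education, income)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:.3f} ({direction} relationship)")
```

  A value of r close to +1 corresponds to points that lie close to an upward-sloping straight line, i.e. income tends to rise with years of education in this sample.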

  Activity 2.10

  A grocery store’s monthly sales (in thousands of dollars) for the last year were as follows:

  Month   1   2   3   4   5   6    7    8    9  10  11  12
  Sales  78  74  83  87  85  93  100  105  103  89  78  94

  Construct a line chart for these data.

  Feedback
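
  A line chart joins the points (month, sales) with line segments. As a rough sketch, the code below builds those points from the data of Activity 2.10, computes the month-to-month changes that the slopes of the segments represent, and prints a crude text version of the chart. The text version is rotated (time runs down the page instead of along the horizontal axis), and all names are illustrative.

```python
months = list(range(1, 13))
sales = [78, 74, 83, 87, 85, 93, 100, 105, 103, 89, 78, 94]   # thousands of dollars

# The points that a line chart would join with line segments
points = list(zip(months, sales))

# Change from each month to the next (the slope of each segment)
changes = [later - earlier for earlier, later in zip(sales, sales[1:])]

peak_month = months[sales.index(max(sales))]
low_month = months[sales.index(min(sales))]

# Rotated text plot: one row per month, marker position reflects sales
for month, value in points:
    print(f"{month:>2} | {' ' * (value - 70)}* {value}")

print("highest sales in month", peak_month, "and lowest in month", low_month)
```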

2.5 Self-correcting Exercises for Unit 2

  Question 1

  The number of defective items produced by a machine and recorded for the last 25 days are as follows:

  Question 2

  The grades on a statistics exam for a sample of 40 students are as follows:

  (a) Construct a stem-and-leaf display for these data.

  (b) Construct a frequency distribution and relative frequency distribution for these data, using seven class intervals.

  (c) Construct a relative frequency histogram for these data.

  (d) Describe briefly what the histogram and the stem-and-leaf display tell you about the data.

  (e) Construct a cumulative frequency and a cumulative relative frequency distribution.

  (f) What proportion of the grades is less than 60?

  (g) What proportion of the grades is more than 70?

  (h) Construct an ogive and estimate the proportion of grades that are between 80 and 90.

  Question 3

  After the midyear examinations at a residential university, a sample of 200 BCom students was taken. Students were asked whether they went barhopping the weekend before the midyear examinations started or spent the weekend studying, and whether they did well or poorly in the midyear examinations. The following table contains the results.

  Did Well in Midyear

  Did Poorly in Midyear

  Studied for Exam

  Went Barhopping

  (a) Of those in the sample who went barhopping the weekend before the midyear examination, what

  (e) If the sample is a good representation of the population, what percentage of the students in the population can we expect to spend the weekend studying and do poorly?

  (f) If the sample is a good representation of the population, what percentage of those who spend the weekend studying can we expect to do poorly in the midyear examination?

  (g) If the sample is a good representation of the population, what percentage of those who did poorly in the midyear examination can we expect to have spent the weekend studying?

2.6 Solutions to Self-correcting Exercises for Unit 2

  Question 1

  Class Limits Frequency Relative Frequency

  (a) Stem-and-leaf display for the data:

  (b) Frequency distribution and relative frequency distribution for these data, using seven class intervals:

  (c) Relative frequency histogram for the data

  [Figure: relative frequency histogram; vertical axis: Relative Frequency, horizontal axis: Grade, from 40 to 100]

  (d) The distribution of the data is symmetrical and bell-shaped, with 67.5% of the observations between 50 and 80.

  (e) Cumulative frequency and a cumulative relative frequency distribution:

  Classes Cumulative Frequency Cumulative Relative Frequency

  (f) 0.35

  (g) Proportion of the grades more than 70 = 1 − 0.625 = 0.375

  (h) Ogive:

  [Figure: ogive of cumulative relative frequency against grade; surviving point labels: 0.825, 0.950, 1.000]

  Key terms: frequency table, pie chart, bar chart, stem-and-leaf display, histogram, ogive, univariate data, bivariate data, contingency table, clustered bar chart, scatterplot

2.7 Learning Outcomes

  Use the following learning outcomes as a checklist after you have completed this study unit to evaluate the knowledge you have acquired.

  Can you

  • compile and interpret a frequency table for nominal data?
  • present and interpret nominal data graphically using the following?
    - a pie chart
    - a bar chart
  • compile and interpret a frequency table for interval data?
  • present and interpret interval data graphically using the following?
    - a stem-and-leaf display
    - a histogram
    - an ogive
  • describe the difference between univariate and bivariate data?
  • compile and interpret a contingency table for bivariate nominal data?
  • present and interpret bivariate nominal data graphically using a
    - clustered bar chart?
  • present and interpret bivariate interval data graphically using a
    - scatterplot?

2.8 Study Unit 2: Summary

  I. A frequency table is a grouping of qualitative data into mutually exclusive classes showing the number of observations in each class.

  II. A relative frequency table shows the fraction of the total number of observations that falls in each class.

  III. A bar chart is a graphic representation of a frequency table.

  IV. A pie chart shows the proportion each distinct class represents of the total number of observations.

  V. A frequency distribution is a grouping of data into mutually exclusive classes showing the number of observations in each class.

  A. The steps in constructing a frequency distribution are as follows:

  1. Decide on the number of classes.

  2. Determine the class interval.

  3. Set the individual class limits.

  4. Tally the raw data into classes.

  5. Count the number of tallies in each class.

  B. The class frequency is the number of observations in each class.

  C. The class interval is the difference between the lower limits of two consecutive classes.

  D. The class midpoint is halfway between the lower limits of two consecutive classes.

  VI. A relative frequency distribution shows the percentage of observations in each class.

  VII. There are three methods for graphically portraying a frequency distribution.

  A. A histogram portrays the frequency of each class in the form of a rectangle.

  B. A frequency polygon consists of line segments connecting the points formed by the intersection of the class midpoint and the class frequency.

  C. A cumulative frequency distribution shows the number or percentage of observations below given values.
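
  The steps for constructing a frequency distribution listed under V can be sketched as follows. The data set is hypothetical and invented purely for illustration, as are the chosen number of classes, class interval and starting limit. The sketch also assumes that each class includes its lower limit and excludes its upper limit; other conventions are possible.

```python
# Hypothetical data, invented purely for illustration
data = [23, 27, 31, 35, 38, 41, 42, 44, 47, 52, 55, 58]

k = 4        # step 1: number of classes (chosen by hand here)
width = 10   # step 2: class interval
lower = 20   # step 3: lower limit of the first class
classes = [(lower + i * width, lower + (i + 1) * width) for i in range(k)]

# Steps 4 and 5: tally and count (lower limit included, upper limit excluded)
freq = [sum(1 for x in data if lo <= x < hi) for lo, hi in classes]

n = len(data)
rel_freq = [f / n for f in freq]                      # relative frequencies
midpoints = [(lo + hi) / 2 for lo, hi in classes]     # class midpoints

cum_freq = []                                         # cumulative frequencies
running = 0
for f in freq:
    running += f
    cum_freq.append(running)

for (lo, hi), f, r, c in zip(classes, freq, rel_freq, cum_freq):
    print(f"{lo}-{hi}: frequency {f}, relative {r:.3f}, cumulative {c}")
```

  The class frequencies are the heights of the histogram rectangles, the midpoints and frequencies give the points of a frequency polygon, and the cumulative frequencies give the points of an ogive.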

STUDY UNIT 3

3.1 Introduction

  In the previous study unit you learned about the appropriate graphical and tabular techniques for nominal as well as interval data. The emphasis was on the techniques as such, and we did not elaborate on the pitfalls. It is important always to remember that the motivation behind graphs is that they add flavour and interest to data organisation. Graphical presentations most often catch a reader’s attention and are usually more easily interpreted than tables, but keep in mind that they never create new information. Graphs can lead one to conclusions that are more extreme than the pure facts of a table, and they can even lead to misinterpretations. Whenever you are in a decision-making situation, you should train yourself to see through the visual image to the underlying set of facts. The proper (and safe) way to read a graph of any kind is to think carefully about the scales on the vertical and horizontal axes, because “blowing up” a scale can make differences look greater. The cheapest way to “lie with statistics” is to deceive with a graph! To stress the importance of graphical excellence and the danger of possible graphical deception, Keller devotes a whole chapter (however short it might be!) to this topic.

3.2 Graphical Excellence and Graphical Deception STUDY

  Keller Chapter 3 Art and Science of Graphical Presentations

  3.1 Graphical Excellence

  Activity 3.1

  Select the correct option.

  Question 1

  You are less likely to be misled by a graph if you
  (a) focus your attention on the numerical values that the graph represents
  (b) avoid being influenced by the graph’s caption
  (c) ignore the scale used on the axes
  (d) do both (a) and (b)

  Question 2

  Possible methods of graphical deception include
  (a) a graph without a scale on one of the axes
  (b) stretching or shrinking of the vertical or the horizontal axis
  (c) a graph’s caption that influences the impression of the viewer
  (d) only absolute changes in value, rather than percentage changes, are reported
  (e) all of the above.

  Feedback

  Question 1

  ANSWER: option (d)

  Activity 3.2

  A municipality in Gauteng decided to fund construction of a playground at a local park. A childhood development research team, studying playground utilization, surveyed parents of toddlers as they exited the enclosed playground area. The following table shows, for five different play activities, the number of toddlers who played more than ten minutes at each activity. 80 parents reported the activities of 100 male toddlers and 70 female toddlers during a sunny day.

  Activity

  Male Toddlers

  Female Toddlers

  (a) Create a cluster bar chart showing, for each play activity, the fraction of all male toddlers (as a percentage of the total) who played on the activity for more than ten minutes, as compared to the fraction of female toddlers (as a percentage of the total). (Note that Americans call a “Seesaw” a “Teeter-Totter”.)

  (b) Define a toddler-play-unit as an instance of a toddler playing more than ten minutes on a single activity. Create a bar chart displaying the total number of male toddler-play-units for the playhouse and sandbox versus the total number of units for the slide and swing.

  Feedback

  (a) Cluster bar chart showing, for each play activity, the fraction of all male toddlers who played on each activity for more than ten minutes, as compared to the fraction of female toddlers.

  [Figure: cluster bar chart, “Percentage of Male vs. Female Toddlers Playing More Than Ten Minutes”]

  (b) Bar chart displaying the total number of male toddler-play-units for the playhouse and sandbox versus the total number of units for the slide and swing.

  [Figure: bar chart, “Number of Male Toddler-Play-Units”; horizontal axis: Activity Combination]

  Activity 3.3

  In its 2000 report, a company presented the following data regarding its sales (in millions of rand) and net income (in millions of rand) over the last five years.

  Net Income 1.6 5.2 4.1 2.4 7.1

  The following cluster bar chart could represent these data:

  [Figure: cluster bar chart, “Bar Chart for Sales and Net Income”]

Feedback

  An unscrupulous statistician could provide a cluster bar chart only for 1996, 1997 and 2000. It would then appear that there has been steady growth in sales and income over the years, because the declines in sales and income in 1998 and 1999 would not be evident as shown below:

  [Figure: cluster bar chart, “Bar Chart for Sales and Net Income”, showing 1996, 1997 and 2000 only]

  Activity 3.4

  Cardiac patients arriving at the emergency room of a hospital usually receive a single dose of medication (containing aspirin) within 15 minutes of admission. The following graph visualises the number of cardiac patients receiving a single dose of aspirin within 15 minutes of admission to the emergency room.

  [Figure: bar graph, “Aspirin Dose Within Fifteen Minutes of Admission”; vertical axis: Frequency]

Feedback

  As read from the graph, the various aspirin type counts are: 20 for baby, 40 for Ecotrin, 25 for others, and 15 for none. A bar graph showing the respective aspirin counts accurately must include a zero point on the vertical scale, as shown below:

  [Figure: bar graph, “Aspirin Dose Within Fifteen Minutes of Admission”, with the vertical scale starting at zero; horizontal axis: Aspirin Type]

  (b) The original bar graph, with its vertical scale beginning at 10, displays bars that distort the arithmetic relationship between the aspirin type counts. Compare, for example, Baby aspirin and Ecotrin in the original graph. The bar height (or area) representing the count for Baby aspirin seems to be 1/3 the height (or area) of the bar representing Ecotrin. Yet the numerical value for Baby aspirin (20) is 1/2 that of Ecotrin (40). One might ask: why does this graphical distortion matter when the actual figures can be read from the vertical scale of the original graph? There are two reasons: (1) the visual features of a graphical display often create a more powerful, lingering impression than the data points, and (2) when the data values read from the scale of a graph conflict with the impression created by its visual features, a person might take the value read from the scale as correct (leaving the visual feature incorrect) or the visual feature as correct (leaving the scale incorrect).
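
  The size of the distortion can be worked out directly. In the sketch below, the drawn height of a bar is its count minus the value at which the vertical scale starts; with a baseline of 10 the Baby aspirin bar (count 20) is one third as tall as the Ecotrin bar (count 40), whereas with a baseline of zero it is one half as tall, which matches the counts. The function name is illustrative.

```python
def visual_ratio(count_a: float, count_b: float, baseline: float) -> float:
    """Ratio of the drawn bar heights when the vertical axis starts at `baseline`."""
    return (count_a - baseline) / (count_b - baseline)


baby, ecotrin = 20, 40

true_ratio = baby / ecotrin                    # ratio of the actual counts
distorted = visual_ratio(baby, ecotrin, 10)    # vertical scale starting at 10
honest = visual_ratio(baby, ecotrin, 0)        # vertical scale starting at zero

print(f"counts: {true_ratio:.2f}, axis from 10: {distorted:.2f}, axis from 0: {honest:.2f}")
```

  Only a baseline of zero makes the ratio of bar heights equal to the ratio of the counts for every pair of bars.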

3.3 Presenting Statistics: Written Reports and Oral Representations READ THROUGH

  Keller Chapter 3 Art and Science of Graphical Presentations

  3.3 Presenting Statistics: Written Reports and Oral Presentations

  This section gives valuable tips on what you should do in case you need to write a report or give an oral presentation. Although this section is valuable in most work situations where statistics is applied, we will not examine you explicitly on it.

3.4 Measures of Central Location STUDY

  Keller Chapter 4 Numerical Descriptive Techniques

  4.1 Measures of Central Location

  It is extremely important that you know how to compute the sample mean, $\bar{x}$, and that you feel comfortable with the mathematical expression $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$. The mean plays an important part in many of the statistical analyses you will encounter in Statistical Inference I, i.e. in STA1502.
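
  As a small worked illustration of the sample mean, the sketch below reuses the ten income values from Activity 2.9 (in R10 000). It also checks two arithmetic properties of the mean: the deviations from the mean sum to zero, and the mean times the number of observations gives back the total.

```python
income = [25, 31, 20, 24, 28, 15, 21, 35, 29, 27]   # Activity 2.9, in R10 000

n = len(income)
mean = sum(income) / n                    # (1/n) * sum of all observations

deviations = [x - mean for x in income]   # each observation minus the mean

print("mean =", mean)
print("sum of deviations =", sum(deviations))
print("mean * n =", mean * n, "and total =", sum(income))
```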

  Activity 3.5

  Question 2

  Which measure of central location is meaningful when the data are nominal?
  (a) The arithmetic mean
  (b) The geometric mean
  (c) The median
  (d) The mode

  Question 3

  Which measure of central location is appropriate whenever we wish to find the average growth rate or rate of change in a variable over time?

  (a) The arithmetic mean
  (b) The geometric mean
  (c) The median
  (d) The mode

  Question 4

  Which of the following statements about the arithmetic mean is only true in special cases?
  (a) The sum of the deviations from the mean is zero.
  (b) Half of the observations are on either side of the mean.
  (c) The mean is a measure of the middle (centre) of a distribution.
  (d) The value of the mean times the number of observations equals the sum of all observations.

  Question 5

  Question 6

  Which of the following statements is true? In a positively-skewed distribution,

  (a) the median equals the mean
  (b) the median is less than the mean
  (c) the median is larger than the mean
  (d) the mean, median and mode are equal

  Question 7

  Which of the following statements about the median is not true?
  (a) It is more affected by extreme values than the mean.
  (b) It is a measure of central tendency.
  (c) It is equal to the observation that falls in the middle when all observations are placed in ascending or descending order.
  (d) It is equal to the mode in a bell-shaped “normal” distribution.

  Question 8

  Which of the following summary measures is the easiest to compute?
  (a) The mean
  (b) The median
  (c) The mode
  (d) All of the above

  Feedback

  Activity 3.6

  Feedback

  It is easier to rewrite the 25 observations in the following format before we compute the measures:

  (ii) median = 1 pet

  (The median is the average of the 12th and 13th values, and has the value of (1 + 1)/2 = 1.)

  (iii) mode = 1 pet

  (Value with highest frequency)

  (b) (i) The “average” number of pets owned was 1.80 pets.

  (ii) Half the families own at most one pet, and the other half own at least one pet.

  (iii) The most frequent number of pets owned was one pet.

  Activity 3.7

  Suppose you make a two-year investment of R5,000 and it grows by 100% to R10,000 during the first year. During the second year, however, the investment suffers a 50% loss, from R10,000 back to R5,000.

  Feedback

  (a) The arithmetic mean: $\bar{R} = \dfrac{R_1 + R_2}{2} = \dfrac{1 + (-0.5)}{2} = 0.25$, i.e. 25%

  (b) The geometric mean: $R_g = \sqrt{(1 + R_1)(1 + R_2)} - 1 = \sqrt{(1 + 1)(1 + (-0.5))} - 1 = 0$

  (c) The value of the arithmetic mean is misleading. Because there was no change in the value of the investment from the beginning to the end of the two-year period, the “average” compounded rate of return is in effect 0%, and this is the value of the geometric mean. The geometric mean makes more sense.
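
  The comparison in Activity 3.7 can be reproduced with a few lines of code. The sketch below writes the two yearly returns as decimals (1.0 for a gain of 100% and −0.5 for a loss of 50%); the function names are illustrative.

```python
from math import prod


def arithmetic_mean_return(returns):
    """Ordinary average of the yearly rates of return."""
    return sum(returns) / len(returns)


def geometric_mean_return(returns):
    """n-th root of the product of the growth factors (1 + R), minus 1."""
    growth = prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


returns = [1.0, -0.5]   # year 1: +100%, year 2: -50%

am = arithmetic_mean_return(returns)
gm = geometric_mean_return(returns)
final_value = 5000 * prod(1 + r for r in returns)   # value after two years

print(f"arithmetic mean = {am:.2%}, geometric mean = {gm:.2%}, final value = R{final_value:,.0f}")
```

  Compounding the geometric mean over the two years reproduces the final value of the investment, which the arithmetic mean does not.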

3.5 Measures of Variability STUDY

  Keller Chapter 4 Numerical Descriptive Techniques

  4.2 Measures of Variability

  In chapter 2 Keller conveys the idea that a graphical display of a univariate data set (i.e. one numerical variable at a time) can give you an immediate feeling of how the variable behaves. We can almost “see” what the average or central value of the variable is, as well as what its spread is. In this chapter Keller formally defines these measures. The measure of spread quantifies the variability.

  It is extremely important that you know how to compute the sample variance, $s^2$, and that you

  Activity 3.8

  Question 1

  Select the correct option: If two data sets have the same range,

  (a) the distance from the smallest to largest observations in both sets will be the same
  (b) the smallest and largest observations are the same in both sets
  (c) both sets will have the same standard deviation
  (d) both sets will have the same interquartile range

  Question 2

  Select the correct option: The Empirical Rule states that the percentage of measurements in a data set (provided that the data set has a bell-shaped distribution) that fall within two standard deviations of its mean is approximately

  (a) 68%
  (b) 75%
  (c) 95%
  (d) 99%

  Question 3

  Which of the following summary measures is affected most by outliers? (a) The median