Analysis of variance (ANOVA)
ANOVA is a statistical technique that assesses potential differences in a scale-level dependent variable across a nominal-level variable having two or more categories. For example, an ANOVA can examine potential differences in IQ scores by country (US vs. Canada vs. Italy vs. Spain). ANOVA, developed by Ronald Fisher in 1918, extends the t-test and the z-test, which have the limitation of only allowing the nominal-level variable to have two categories. This test is also called the Fisher analysis of variance.
ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. ANOVAs are useful for comparing (testing) three or more means (groups or variables).
There are three classes of models used in the analysis of variance, and these are outlined here.
Fixed-effects models
The fixed-effects model (class I) of analysis of variance applies to situations in which the experimenter applies one or more treatments to the subjects of the experiment to see whether the response
variable values change. This allows the experimenter to estimate the ranges of response variable values that the treatment would generate in the population as a whole.
Random-effects models
The random-effects model (class II) is used when the treatments are not fixed. This occurs when the various factor levels are sampled from a larger population. Because the levels themselves are random variables, some assumptions and the method of contrasting the treatments (a multi-variable generalization of simple differences) differ from the fixed-effects model.
Mixed-effects models
A mixed-effects model (class III) contains experimental factors of both fixed and random-effects types, with appropriately different interpretations and analysis for the two types.
Example: Teaching experiments could be performed by a college or university department to find a good introductory textbook, with each text considered a treatment. The fixed-effects model would compare a list of candidate texts. The random-effects model would determine whether important differences exist among a list of
randomly selected texts. The mixed-effects model would compare the (fixed) incumbent texts to randomly selected alternatives. Defining fixed and random effects has proven elusive, with
competing definitions arguably leading toward a linguistic quagmire.
Characteristics of ANOVA
ANOVA is used in the analysis of comparative experiments, those in which only the difference in outcomes is of interest. The statistical significance of the experiment is determined by a ratio of two
variances. This ratio is independent of several possible alterations to the experimental observations: Adding a constant to all observations does not alter significance. Multiplying all observations by a
constant does not alter significance. So the ANOVA statistical significance result is independent of constant bias and scaling errors, as well as of the units used to express the observations. In the era of mechanical calculation, it was common to subtract a constant from all observations (when equivalent to dropping leading digits) to simplify data entry. This is an example of data coding.
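To illustrate this invariance, here is a minimal sketch (assuming Python with scipy installed; the three group samples are made-up numbers for the demonstration) showing that the F statistic, and hence significance, is unchanged by linear data coding:

```python
# Minimal sketch: the one-way ANOVA F statistic is unchanged by adding
# a constant to all observations or multiplying them all by a constant
# (linear data coding). Assumes scipy is installed; the three groups
# below are made-up numbers.
from scipy import stats

groups = [[812, 795, 801], [756, 770, 761], [842, 838, 851]]
f_raw, p_raw = stats.f_oneway(*groups)

# Code the data: subtract 750 (akin to dropping leading digits to
# simplify data entry) and rescale by 0.1.
coded = [[(x - 750) * 0.1 for x in g] for g in groups]
f_coded, p_coded = stats.f_oneway(*coded)

print(f_raw, f_coded)  # identical up to floating-point rounding
print(p_raw, p_coded)
```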
Example:
An experiment is performed at a company to compare three different types of food. Three types of food – Chinese, Italian, and Mexican – are each tried on four randomly selected days. The productivity of each day, measured by the number of items produced, is recorded, and the results are given in the table below:

Chinese   Italian   Mexican
857       701       824
801       753       847
795       781       881
842       776       865
1) Is it possible to conclude from this information that the mean number of produced items differs for at least two of the three types of food? Use α = .05.
2) Explain what the p-value found in part 1 means.
3) Which type(s) of food seem to be best?
4) Which type(s) of food seem to be worst?
Solution:
1) The parameters of interest are μ1, the mean number of items produced over all days when Chinese food is eaten; μ2, the mean number of items produced over all days when Italian food is eaten; and μ3, the mean number of items produced over all days when Mexican food is eaten.
H0: μ1 = μ2 = μ3
Ha: at least two of the three means are unequal.
The decision rule: accept Ha if the calculated p-value is less than 0.05.
Test statistic, F = (variability among the sample means) / (variability due to chance)
From the StatCrunch calculations, the F-value = 10.66 and the p-value = 0.0042. Since the p-value < .05, Ha is accepted.
We can conclude that, at the 0.05 level of significance, the mean number of items produced differs for at least two of the three types of food.
2) If the mean productivity were the same for all three types of food, that is, if the null hypothesis were true, then the probability of observing three sample means as varied as, or more varied than, those obtained in this experiment would be 0.0042. In other words, there is only about a 0.42% chance that the sample means would have such diverse values if all the population means were equal. This is the reason that the alternative hypothesis was accepted.
3) The results obtained via the multiple comparison tests conducted on the mean numbers of items produced for each of the three types of food are shown in the table below. Interpretations are stated in terms of which mean is largest, because "best" here means largest.
Comparison            Value of t   p-value   Interpretation
Chinese vs. Italian     2.81        0.020     Chinese > Italian
Chinese vs. Mexican    -1.77        0.110     NS
Italian vs. Mexican    -4.58        0.001     Mexican > Italian
From these analyses it is clear that Italian food is certainly not the best in terms of worker productivity. Mexican food may be best, but perhaps even Chinese food could be.
4) From the same analyses we can see that Italian food is the worst in terms of worker productivity.
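The F and p values above come from StatCrunch. As a cross-check, the same one-way ANOVA and unadjusted pairwise t-tests can be sketched in Python (assuming scipy is installed); because StatCrunch's multiple-comparison procedure may pool variances or adjust p-values differently, the pairwise results from this sketch will be close to, but not necessarily identical to, the table above.

```python
# Minimal sketch of the one-way ANOVA for the food-type example,
# assuming scipy is available. Data are the productivity values
# from the table above (four randomly selected days per food type).
from itertools import combinations
from scipy import stats

food = {
    "Chinese": [857, 801, 795, 842],
    "Italian": [701, 753, 781, 776],
    "Mexican": [824, 847, 881, 865],
}

# One-way ANOVA: H0 is that all three population means are equal.
f_stat, p_value = stats.f_oneway(*food.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Unadjusted pairwise two-sample t-tests (StatCrunch's multiple
# comparison procedure may pool variances or adjust p differently).
for (name_a, a), (name_b, b) in combinations(food.items(), 2):
    t, p = stats.ttest_ind(a, b)
    print(f"{name_a} vs. {name_b}: t = {t:.2f}, p = {p:.3f}")
```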
Coefficient of Correlation
Correlation: The relationship between variables is referred to as correlation. Correlation is a number that can be used to describe the relationship between two variables. Simple correlation is the related variation between any two variables, while multiple correlation is the related variation among three or more variables. Two variables are correlated when they vary in such a way that the higher and lower values of one variable correspond to the higher and lower values of the other variable. They may also be correlated when the higher values of one variable correspond to the lower values of the other.
Coefficient of Correlation
The coefficient of correlation, r, called the linear correlation coefficient, measures the strength and direction of a linear relationship between two variables. It is also called the Pearson product-moment correlation coefficient. The algebraic method of measuring correlation is called the coefficient of correlation.
Types of correlation coefficients include:
Pearson product-moment correlation coefficient, also known
as r, R, or Pearson's r, a measure of the strength and direction of the linear relationship between two variables that is defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations.
Intraclass correlation, a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups; describes how strongly units in the same group resemble each other.
Rank correlation, the study of relationships between rankings of different variables or different rankings of the same variable
Spearman's rank correlation coefficient, a measure of how well the relationship between two variables can be described by a monotonic function
Kendall tau rank correlation coefficient, a measure of the portion of ranks that match between two data sets.
Goodman and Kruskal's gamma, a measure of the strength of association of cross-tabulated data when both variables are measured at the ordinal level.
Types of Correlation
1. Positive Correlation
A positive correlation is a correlation in the same direction.
2. Negative Correlation
A negative correlation is a correlation in the opposite direction.
3. Partial Correlation
The correlation is partial if we study the relationship between two variables keeping all other variables constant.
Example:
The relationship between yield and rainfall at a constant temperature is a partial correlation.
4. Linear Correlation
When a change in one variable results in a constant change in the other variable, we say the correlation is linear. When there is a linear correlation, the points plotted will lie on a straight line.
Example:
Consider the variables with the following values:
X: 10 20 30 40 50
Y: 20 40 60 80 100
Here, there is a linear relationship between the variables: there is a ratio of 1:2 at all points. Also, if we plot them, the points will lie on a straight line.
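Since Y here is exactly 2X, any correlation measure should report a perfect positive linear relationship. A quick check, as a sketch assuming Python with numpy installed:

```python
import numpy as np

x = [10, 20, 30, 40, 50]
y = [20, 40, 60, 80, 100]  # exactly 2 * x: a perfect linear relationship

# corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
print(np.corrcoef(x, y)[0, 1])  # 1.0
```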
By direction, correlations are of three types:
Positive correlation
Negative correlation
No correlation
When the values of one variable increase with an increase in the other variable, it is a positive correlation. On the other hand, if the values of one variable decrease as the values of the other variable increase, it is a negative correlation. There may also be no change in one variable with any change in the other variable; in this case, the two variables are said to have no correlation.
Correlation Symbol
The symbol for correlation is r.
Correlation Formula
The formula for correlation is as follows:

r = [N∑XY − (∑X)(∑Y)] / √([N∑X² − (∑X)²] [N∑Y² − (∑Y)²])

Where,
X = each first score and Y = each second score (the two variables)
N = number of pairs of values
∑XY = sum of the products of the first and second scores
∑X = sum of the first scores
∑Y = sum of the second scores
∑X² = sum of the squared first scores
∑Y² = sum of the squared second scores
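The raw-score formula translates directly into code. Below is a minimal sketch in Python; pearson_r is a hypothetical helper name chosen for illustration, not a library function:

```python
import math

def pearson_r(x, y):
    """Pearson correlation via the raw-score formula:
    r = [N*sum(XY) - sum(X)*sum(Y)] /
        sqrt([N*sum(X^2) - sum(X)^2] * [N*sum(Y^2) - sum(Y)^2])
    """
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den
```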
Positive Correlation
A positive correlation is a relationship between two variables in which both variables move in the same direction: as one variable decreases, the other variable also decreases, and vice versa. When the values of two variables x and y move in the same direction, the correlation is said to be positive. That is, in a positive correlation, when there is an increase in x, there will be an increase in y also; similarly, when there is a decrease in x, there will be a decrease in y also.
Positive Correlation Example
When price increases, supply also increases; when price decreases, supply decreases.
Positive Correlation Graph (figure omitted)
Strong Positive Correlation
A strong positive correlation has variables that change together, with the points on the graph lying close together, nearly forming a line.
Weak Positive Correlation
A weak positive correlation has variables that change together, but the points on the graph are dispersed.
Negative Correlation
In a negative correlation, as the values of one of the variables increase, the values of the second variable decrease; or as the values of one variable decrease, the values of the other variable increase. When the values of two variables x and y move in opposite directions, we say the correlation is negative. That is, in a negative correlation, when there is an increase in x, there will be a decrease in y; similarly, when there is a decrease in x, there will be an increase in y.
Negative Correlation Example
When price increases, demand decreases; when price decreases, demand increases. So price and demand are negatively correlated.
Perfect Negative Correlation
The closer the correlation coefficient is to either -1 or +1, the stronger the relationship between the two variables. A perfect negative correlation of -1.0 indicates that, for every member of the sample, a higher score on one variable is associated with a lower score on the other variable.
Solved Example Question:
Determine the correlation value for the given set of X and Y values:

X Values   Y Values
21         2.5
23         3.1
37         4.2
19         5.6
24         6.4
33         8.4
Solution:
First, count the number of pairs of values: N = 6.

Next, determine the values of XY, X², and Y² for each pair:

X Value   Y Value   XY      X²     Y²
21        2.5       52.5    441    6.25
23        3.1       71.3    529    9.61
37        4.2       155.4   1369   17.64
19        5.6       106.4   361    31.36
24        6.4       153.6   576    40.96
33        8.4       277.2   1089   70.56

Then determine the sums ∑X, ∑Y, ∑XY, ∑X², and ∑Y²:
∑X = 157
∑Y = 30.2
∑XY = 816.4
∑X² = 4365
∑Y² = 176.38

Substituting into the formula:
r = [N∑XY − (∑X)(∑Y)] / √([N∑X² − (∑X)²] [N∑Y² − (∑Y)²])
r = [6(816.4) − (157)(30.2)] / √([6(4365) − (157)²] [6(176.38) − (30.2)²])
r = 157 / √((1541)(146.24))
r = 0.33
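Using the pearson_r helper sketched earlier (a hypothetical function, not part of any library), the hand computation can be verified:

```python
x = [21, 23, 37, 19, 24, 33]
y = [2.5, 3.1, 4.2, 5.6, 6.4, 8.4]

print(round(pearson_r(x, y), 2))  # 0.33
```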
Regression Line
Definition: The regression line is the line that best fits the data, such that the overall distance from the line to the points (variable values) plotted on a graph is the smallest. In other words, a line used to minimize the squared deviations of predictions is called a regression line.
Regression is concerned with the study of relationships among variables. The aim of regression (or regression analysis) is to build models for prediction and for making other inferences. Regression may treat two variables or more than two variables.
The regression line is usually written as Ŷ = a + bX. The general properties of the regression line Ŷ = a + bX are given below:
o We know that Ȳ = a + bX̄. This shows that the line passes through the means X̄ and Ȳ.
o The sum of errors is equal to zero. The regression equation is Ŷ = a + bX, and the sum of deviations of observed Y from estimated Ŷ is
∑(Y − Ŷ) = ∑(Y − a − bX) = ∑Y − na − b∑X = 0, since ∑Y = na + b∑X.
When ∑(Y − Ŷ) = 0, it follows that ∑Y = ∑Ŷ.
In the table below, the xi column shows scores on an aptitude test, and the yi column shows statistics grades. The last two rows show the sums and mean scores that we will use to conduct the regression analysis.
Student   xi    yi    (xi − x̄)   (yi − ȳ)   (xi − x̄)²   (yi − ȳ)²   (xi − x̄)(yi − ȳ)
1         95    85     17          8          289          64          136
2         85    95      7         18           49         324          126
3         80    70      2         -7            4          49          -14
4         70    65     -8        -12           64         144           96
5         60    70    -18         -7          324          49          126
Sum      390   385                            730         630          470
Mean      78    77
The regression equation is a linear equation of the form ŷ = b0 + b1x. To conduct a regression analysis, we need to solve for b0 and b1. The computations are shown below.
b1 = Σ[(xi − x̄)(yi − ȳ)] / Σ[(xi − x̄)²] = 470/730 = 0.644
b0 = ȳ − b1x̄ = 77 − (0.644)(78) = 26.768
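As a cross-check, here is a minimal sketch (assuming Python with numpy installed) that recomputes the slope and intercept from the deviation scores and confirms that the residuals sum to zero, as stated in the properties above:

```python
# Minimal sketch: least-squares line for the aptitude-test example,
# assuming numpy is available.
import numpy as np

x = np.array([95, 85, 80, 70, 60])  # aptitude test scores
y = np.array([85, 95, 70, 65, 70])  # statistics grades

# Slope and intercept from the deviation-score formulas.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)  # approx. 0.644 and 26.78 (the text's 26.768 uses the rounded slope)

# Property check: the residuals sum to (essentially) zero,
# so sum(Y) equals sum(Y-hat).
y_hat = b0 + b1 * x
print(np.sum(y - y_hat))  # ~0 up to floating-point error
```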