Business Statistics Unit 9

Chi-Square Analysis and Testing, Analysis of Variance (ANOVA)

Introduction


In this unit you are introduced to another new statistical technique called Chi-square analysis and testing.  There are two possible tests we can use:  one is called a goodness-of-fit test and the other is called a contingency table test.  The Chi-square tests do not require a normal distribution and are often used when the shape of the population distribution is unknown.  Chi-square tests are appropriate when you have nominal data.  Nominal data is the most basic form of data where data is classified into categories with no order.  Examples of nominal data include gender, state of residence, brand of toilet paper you use, or the type of car you drive.

Later in this unit you will learn about Analysis of Variance (ANOVA).  The Analysis of Variance test (ANOVA) is used to compare the means between three or more normal populations to determine if they are equal or not.

Basic Thoughts and Terms


A goodness-of-fit test is used to determine under hypothesis testing, if our observed outcomes from an experiment are different from the expected outcomes from the experiment.  For example, we can use a goodness-of-fit test to determine if a new fast food restaurant had an impact on the sales of surrounding existing fast food restaurants after it opened.  In this case, our expected outcomes would be that no change has occurred in terms of sales at the existing fast food restaurants, and our observed outcomes are the actual sales at all fast food restaurants, including the new one.

The contingency table test is used to determine if two or more variables are independent of each other, or that one has a statistically significant effect on the other.  For example, if we wanted to determine the taste preferences of men and women related to a new food product, we could use a contingency table test to determine if gender was a factor related to taste preference or not.

The Goodness-Of-Fit Test


For a goodness-of-fit test, we use the familiar five step process that we have used in earlier units.  For each step however, some of the procedures are different.

  1. State the null and alternative hypothesis.  For the null hypothesis (H0 ), it is usually stated as there has been no change or preference and that the population being studied fits a specified pattern or distribution.  The alternative hypothesis (H1) is that the population does not fit a specified pattern or preference, or that there has been a change.

  2. Find the critical value.  To find the critical value for a chi-square distribution using the tables in the back of your book, we must first find the degrees of freedom.  For this test, the degrees of freedom are equal to k - 1, where k is the number of possible outcomes in an experiment.  The possible outcomes are often found by looking at the different categories the data has been classified into.  Once you have the degrees of freedom, you use the significance value and use the chi-square table to find the critical value.  Note:  you will only need to use the right columns in the table.  For example, the .05, .01, and .10 columns are often used.  All chi-square tests are considered one-tail tests.  It is also a good idea to state your decision rule to guide you in what you should do after you calculate your test statistic.  The same basic format is used.  Reject H0 in favor of H1 if our test statistic X2 (calculated X2 value in step 3) is > the critical value from step 2, otherwise fail to reject H0.  The decision rule for testing will always use the > symbol.

  3. Compute the test statistic.  The test statistic for chi-square goodness-of-fit testing can be found in Formula Box 9.1.

Formula Box 9.1.  The Chi-Square Goodness-Of-Fit Formula.

Chi-square goodness of fit formula

O = an observed frequency from the sample

E = the expected (hypothesized) frequency if the null hypothesis is true.

  1. Make the decision.  If our calculated value for the test statistic in step 3 is greater than the critical value found in step 2, we reject the null hypothesis (H0).  If our test statistic value is less than the critical value, we fail to reject the null hypothesis (H0).

  2. Summarize the resultsFor this step you once again state what decision should be made in relation to the question that was posed in the problem.

Ready to try a practice problem?  Lets go to Practice Exercise 9.1.

Practice Exercise 9.1.  The Goodness-Of-Fit Test.


The president of a small local college is interested in determining if the number of college graduates from the college's four major schools has changed during the last 10 years.  In the past,  the school of art had 15% of the graduates, the school of business had 25% of the graduates, the school of education had 40%, and the school of engineering had 20% of the graduates.  The number of recent graduates from each of the four schools is given below:

Art

30

Business

55

Education

25

Engineering

40

 At the .05 level of significance, does a goodness-of-fit test indicate that there has been a change in the number of graduates by each school in the last 10 years?

  1. State the null and alternative hypothesis.  The null hypothesis is stated as:  H0 :  There has been no change in the past 10 years in the number of graduates from each school.  The alternative hypothesis is stated as:  H1:  There has been a change.

  2. Find the critical value.  Significance (α) = .05.  The critical value is based upon k - 1 degrees of freedom.  There are four possible outcomes so k = 4.  Our degrees of freedom equal 4 - 1 = 3.  The critical value with 3 degrees of freedom and .05 significance equals 7.81.  The decision rule:  Reject H0 in favor of H1 if the test statistic is > 7.81.  Otherwise fail to reject H0.

  3. Compute the test statistic.  Here we use the information given above along with formula 9.1 to solve.  Notice that the expected (E) column is based upon the past percentages for each school, multiplied by the total number of graduates.

Observed (O) Expected (E) O - E (O - E)2 (O - E)2 / E
30 22.50 7.50 56.25 2.50
55 37.50 17.50 306.25 8.1667
25 60.00 -35.00 1225 20.4167
40 30.00 10.00 100 3.3333

Total  

34.4167

The calculated test statistic value is 34.4167.

  1. Make the decision.  Since our test statistic value of 34.4167 is greater than our critical value of 7.81, we reject H0

  2. Summarize the results.  This indicates that a statistically significant change has occurred in the number of graduates by school for the college.


By now you may have noticed that the calculations for the goodness-of-fit test can be quite time consuming although they are not difficult to do.  Is there a better way?  Sure!  Let's use Excel to find the same answer to Practice Exercise 9.1.

Practice Exercise 9.2.  Using Excel 2007 or 2003 for the Goodness-Of-Fit Test.


The president of a small local college is interested in determining if the number of college graduates from the college's four major schools has changed during the last 10 years.  In the past,  the school of art had 15% of the graduates, the school of business had 25% of the graduates, the school of education had 40%, and the school of engineering had 20% of the graduates.  The number of recent graduates from each of the four schools is given below:

Art

30

Business

55

Education

25

Engineering

40

 At the .05 level of significance, does a goodness-of-fit test indicate that there has been a change in the number of graduates by each school in the last 10 years?

  1. State the null and alternative hypothesis.  The null hypothesis is stated as:  H0 :  There has been no change in the past 10 years in the number of graduates from each school.  The alternative hypothesis is stated as:  H1:  There has been a change.

  2. Find the critical value.  Significance (α) = .05.  The critical value is based upon k - 1 degrees of freedom.  There are four possible outcomes so k = 4.  Our degrees of freedom equal 4 - 1 = 3.  The critical value with 3 degrees of freedom and .05 significance equals 7.81.  The decision rule:  Reject H0 in favor of H1 if the test statistic is > 7.81.  Otherwise fail to reject H0.

  3. Compute the test statistic.  Here is where we can use Excel to make our job easier.  First of all make sure you have a blank spreadsheet opened in Excel.

Art 30 22.5 2.500000
Business 55 37.5 8.166667
Education 25 60 20.416667
Engineering 40 30 3.333333
      34.416667
  1. Make the decision.  Since our test statistic value of 34.4167 is greater than our critical value of 7.81, we reject H0

  2. Summarize the results.  This indicates that a statistically significant change has occurred in the number of graduates by school for the college.

Although Excel does make our calculations easier to complete, it is not able to directly compute the Chi-Square test statistic from a set of data variables.  There is a command available in Excel called CHITEST which is able to compute a P value for a Chi-square goodness-of-fit test only.  However in the past many students attempted to use this test for a contingency table test and also became confused about the P value.  We are not using this method to find the Chi-square value.

The Contingency Table Test


The contingency table test is used to determine whether or not two or more variables are independent.  Let's review the five step process and look at the changes for the contingency table test.

  1. State the null and alternative hypothesis.  For the null hypothesis (H0 ), it is usually stated as the populations being examined are independent of each other.  In other words, for the factor being examined, all of the populations are independent of each other.  The alternative hypothesis (H1) is that the populations are not independent of each other.

  2. Find the critical value.  To find the critical value for a chi-square distribution, we must first find the degrees of freedom.  For this test, the degrees of freedom are found by using the following formula:  df = (r - 1)(c - 1).  The letter r represents the number of rows in the table and the value c represents the number of columns.  Note:  you will only need to use the right columns in the table.  For example, the .05, .01, and .10 columns are often used.  All chi-square tests are considered one-tail tests.  It is also a good idea to state your decision rule to guide you in what you should do after you calculate your test statistic.  The same basic format is used.  Reject H0 in favor of H1 if our test statistic X2 (calculated X2 value in step 3) is > the critical value from step 2, otherwise fail to reject H0.  The decision rule for testing will always use the > symbol.

  3. Compute the test statistic.  The test statistic for chi-square contingency table testing can be found in Formula Box 9.2.  In Formula Box 9.3, the formula for finding the expected values (E) for each cell of a contingency table test is given.

Formula Box 9.2.  The Chi-Square Contingency Table Test.

Chi-square contingency table formula

Notice that the formula for the contingency table test is the same one used for the goodness-of-fit test.  The difference in our procedure is related to finding the expected values (E).  In the goodness-of-fit test, the value of E for each cell was determined from prior percentages given in the problem.  For a contingency table test, the value of E for each cell is determined by using Formula 9.3 below.

Formula Box 9.3.  The Chi-Square Expected Value Formula for a Contingency Table Test.

Expected value formula for E in a contingency table test

This formula is applied by taking the totals from each row and column in a contingency table, multiplying them together, and dividing by the grand total of all rows and columns.

  1. Make the decision.  If our calculated value for the test statistic in step 3 is greater than the critical value found in step 2, we reject the null hypothesis (H0).  If our test statistic value is less than the critical value, we fail to reject the null hypothesis (H0).

  2. Summarize the results.  Explain your decision here.

It's time to practice solving a contingency table problem.

Practice Exercise 9.3.  The Chi-Square Contingency Table Test.


A survey based upon a random sample of 1000 people was taken by a political research company interested in finding out the public's opinion of the federal deficit and what should be done about it.  People were asked if the government should reduce the deficit, increase the deficit, or if they had no opinion.  The results of the study by gender is given below.

Gender Reduce the Deficit Increase the Deficit No Opinion Total
Female 250 220 70 540
Male 310 120 30 460
Total 560 340 100 1000

At the .05 level, test the hypothesis that the opinion of those surveyed in regards to the deficit is independent of gender.

  1. The null hypothesis is stated as:  H0 :  The opinion about the budget deficit is independent of gender.  The alternative hypothesis is stated as:  H1:  The variables are not independent.

  2. The critical value is based upon (r - 1)(c - 1) degrees of freedom.  We have two rows of data and three columns of data.  Therefore the degrees of freedom equal:  (2 - 1)(3 - 1) = 2.  The critical value with a significance of .05 and 2 degrees of freedom equals 5.99.  The decision rule:  Reject H0 in favor of H1 if the test statistic is > 5.99.  Otherwise fail to reject H0.

  3. Compute the test statistic.  Here we use the information given above along with formulas 9.2 and 9.3 to solve.  Our first step is to create a table and list all of the observed values.  You simply take the observed values from the previous table, and insert them into the table below starting with the first column of data and moving to the right.

Observed (O) Expected (E) O - E (O - E)2 (O - E)2 / E
250        
310        
220        
120        
70        
30        

Our next step is to calculate the expected values (E).  In order to do this, you use formula 9.3 for each cell.  Notice below that the values for each cell have been inserted into the formula.

Observed (O) Expected (E) O - E (O - E)2 (O - E)2 / E
250 (540)(560) / 1000 = 302.40      
310 (460)(560) / 1000 = 257.60      
220 (540)(340) / 1000 = 183.60      
120 (460)(340) / 1000 = 156.40      
70 (540)(100) / 1000 = 54      
30 (460)(100) / 1000 = 46      

The next step we take is to subtract the expected values (E) from the observed values (O), square the result, and then divide it by E for each cell.

Observed (O) Expected (E) O - E (O - E)2 (O - E)2 / E
250 (540)(560) / 1000 = 302.40 -52.40 2745.76 9.0799
310 (460)(560) / 1000 = 257.60 52.40 2745.76 10.6590
220 (540)(340) / 1000 = 183.60 36.40 1324.96 7.2166
120 (460)(340) / 1000 = 156.40 -36.40 1324.96 8.4716
70 (540)(100) / 1000 = 54 16.00 256.00 4.7407
30 (460)(100) / 1000 = 46 -16.00 256.00 5.5652

Total

45.733
  1. Make the decision.  Since our test statistic value of 45.733 is greater than our critical value of 5.99, we reject H0.

  2. Summarize the results.  By rejecting the null hypothesis, this indicates that people's opinion about the deficit is not independent of gender.


So once again manually calculating the contingency table test is not difficult, but it is time consuming.  Can Excel help us?  Yes, let's take a look at this problem again but this time we will use Excel to find our test statistic value.

Practice Exercise 9.4.  Using Excel 2007 or 2003 to Solve a Contingency Table Problem.


A survey based upon a random sample of 1000 people was taken by a political research company interested in finding out the public's opinion of the federal deficit and what should be done about it.  People were asked if the government should reduce the deficit, increase the deficit, or if they had no opinion.  The results of the study by gender is given below.

Gender Reduce the Deficit Increase the Deficit No Opinion Total
Female 250 220 70 540
Male 310 120 30 460
Total 560 340 100 1000

At the .05 level, test the hypothesis that the opinion of those surveyed in regards to the deficit is independent of gender.

  1. The null hypothesis is stated as:  H0 :  The opinion about the budget deficit is independent of gender.  The alternative hypothesis is stated as:  H1:  The variables are not independent.

  2. The critical value is based upon (r - 1)(c - 1) degrees of freedom.  We have two rows of data and three columns of data.  Therefore the degrees of freedom equal:  (2 - 1)(3 - 1) = 2.  The critical value with a significance of .05 and 2 degrees of freedom equals 5.99.  The decision rule:  Reject H0 in favor of H1 if the test statistic is > 5.99.  Otherwise fail to reject H0.

  3. Compute the test statistic.  Now we can use Excel to make our calculations easier.  First open Excel and have a blank worksheet ready.

Observed (O) Expected (E) (O - E)2 / E
250    
310    
220    
120    
70    
30    

You will need to once again calculate your expected values (E) using formula 9.3.  Place those values in your spreadsheet in column B.

Observed (O) Expected (E) (O - E)2 / E
250 302.4 9.0799
310 257.6 10.6590
220 183.6 7.2166
120 156.4 8.4716
70 54 4.7407
30 46 5.5652
Observed (O) Expected (E) (O - E)2 / E
250 302.4 9.0799
310 257.6 10.6590
220 183.6 7.2166
120 156.4 8.4716
70 54 4.7407
30 46 5.5652
    45.7330
  1. Make the decision.  Since our test statistic value of 45.733 is greater than our critical value of 5.99, we reject H0

  2. Summarize the results.  This indicates that people's opinion about the deficit is not independent of gender.

Analysis of Variance

Introduction


The Analysis of Variance test (ANOVA) is used to compare the means between three or more normal populations to determine if they are equal or not.  We will use Excel to complete all of our calculations since doing the calculations manually is very time consuming.

Basic Thoughts and Terms


Analysis of Variance uses the F distribution as its statistical test.  Tables for the F distribution can be found in the back of your textbook.  The F distribution has the following characteristics:

  1. Since the F distribution values are based upon degrees of freedom, there is a different F distribution for different degrees of freedom in the numerator and denominator.

  2. There is an infinite number of values for the F distribution from 0 to infinity.

  3. The F distribution is positively skewed to the right.

  4. As the values of X increase, the curve gets closer to the x axis but it never intersects it.

There are several different ways that ANOVA can be used to compare populations.  For our purposes here, we will be conducting tests based upon a single factor.  This is called a one-way ANOVA test.  Using statistical software packages like Excel or SPSS enable us to conduct more sophisticated tests using two or more factors without difficulties.

ANOVA testing requires that the populations under study are normally distributed, the samples are obtained randomly from independent populations, and the population variances are assumed to be equal.  The number of samples from each independent population can be different.  In other words, the value of n for each sample can be different.  The interesting part of this test is that even though it is a test of means, variances are actually used in the calculations.

Procedural Steps Using Hypothesis Testing


We will once again use the five step hypothesis testing process for ANOVA testing.

  1. State the null and alternative hypothesis.  For ANOVA the null hypothesis is stated as:  H0:  All population means are equal.  For example, if we are testing four populations, then we would state:  H0: µ1 = µ2 = µ3 = µ4.  Or we can simply state:  All population means are equal.  The alternative hypothesis is stated as:  H1:  Not all population means are equal.  It is important to note that ANOVA testing does not tell you which mean does not equal the other means, or the combination of means that are not equal.  It will only tell you that at least one mean does not equal the other means.

  2. Find the critical value.  The best way to find the critical value for an ANOVA problem, is to extract the number from your Excel output.  It is found under the column titled F Crit.  To find the critical F value manually, we must first calculate the degrees of freedom for our numerator and our denominator.  The degrees of freedom for the numerator is equal to k - 1 where k is the number of samples.  For example, if we have independent samples from five populations, k equals 5 and our degrees of freedom equal 4.  The degrees of freedom for our denominator is equal to T - k, where T is the total number of items or variables in all samples, and k is the number of samples.  So if we have 5 independent populations, our k value is 5.  If each sample contained 6 items, then our value for T equals 30 (6 + 6 + 6 + 6 + 6 + 6).  The F table has column values for the numerator and row values for the denominator.  Simply take your numerator value and your denominator value to find the appropriate F value.  For example, if our numerator is equal to 10 and our denominator is equal to 15, our F value from the table is equal to 2.54.  HINT:  Excel automatically provides you in its output the critical value located under the column called F Crit.  The same basic format is used for the decision rule.  Reject H0 in favor of H1 if our test statistic F (calculated F value in step 3) is > the critical value from step 2, otherwise fail to reject H0.  The decision rule for ANOVA testing will always use the > symbol.

  3. Compute the test statistic.  The test statistic for ANOVA testing can be found in Formula Box 9.4.

Formula Box 9.4.  Test Statistic for ANOVA Testing.

Formula for the test statistic for ANOVA

As you can see from the formula, we will be making two sets of computations.  One set is for the numerator and is concerned with the estimated standard error between the independent populations, and the other set of computations is concerned with the estimated standard error within each independent population.

You should stop reading further and skip to the Excel instructions, unless you intend on finding the values manually.  It is STRONGLY suggested that you let Excel do the work for you which saves about 15 - 20 minutes per problem!

Both of these formulas will need to be broken down into formulas that we can use to find our values.  Let's start with the estimated standard error within the populations, which is the denominator of our formula in Formula Box 9.4.  The formula in Formula Box 9.5 shows the computations needed.

Formula Box 9.5.  The Estimated Standard Error Within.

The within formula for ANOVA

The formula indicates that for each independent population you have, you take each item or data value, subtract the mean for each set of items or data values from each item or data value, square it, and add them together.  Once again T is equal to the total number of items or data values in all samples, and k is equal to the number of independent populations that we have samples from.

At this point it is not unusual for this formula to look very intimidating and confusing.  That's OK!  We are going to continue our discussion by looking at the formula for the numerator, our estimated standard error between.  Once we are able to apply these formulas to a problem, it will become easier for you to understand how they are used.

Formula Box 9.6.  The Estimated Standard Error Between.

The between formula for ANOVA

Wow!  What a formula!  It basically states that for each independent population sample, take the of sample items (n) and multiply it by the squared difference between each sample mean and the grand mean.  Then add them together and divide by the number of samples (k) minus 1.  But wait, what is a grand mean?  The grand mean is simply the mean of all of your sample means, weighted by the number of items in each sample.  The grand mean has its own formula which you must first calculate in order to find the estimated standard error between.

Formula Box 9.7.  The Grand Mean.

The grand mean formula for ANOVA

Let's interpret the formula for the grand mean.  It states that for each sample, take the number of items in the sample and multiply it by the mean for the sample.  You do this for each sample you have and then add them together.  Finally you divide by the total number of items from all samples.  The value k in the formula represents the number of samples from independent populations that you have.

So in order to find the values for our test statistic indicated in Formula Box 9.4, we must first calculate the estimated standard error within and the estimated standard error between.  In order to calculate the estimated standard error between, we must first calculate the grand mean.

Once we have all of our values, we place them into our test statistic for F.  Then we can proceed to step 4.

  1. Make the Statistical Decision.  At this stage we compare our critical value of F from step number 2 with our calculated value of F from step 3.  Using the decision rule, we make our decision to reject or fail to reject H0.

  2. Summarize the Results.  This is where we state what we are going to do as a result of our analysis.

In the past, many students have panicked over the thought of manually calculating ANOVA.  It was very time consuming and often prone to math errors and mistakes.  Excel is great at ANOVA problems so let's look at using it!

Using Excel to Calculate ANOVA


Practice Exercise 9.5.  Using Excel to Calculate ANOVA.


Excel 2007

A research study was conducted looking at the ages of community college students in Michigan.  A random sample of student ages was obtained from three colleges being studied.  The ages can be found in the table below.

Kellogg Community College Kalamazoo Valley Community College Jackson Community College
24 18 31
33 36 45
19 29 23
41 20 26
22   19

Test the hypothesis at the .05 level of significance that there is no difference in ages between the three community colleges.

  1. State the null and alternative hypothesis.  Our null hypothesis is:  H0  = µ1 = µ2 = µ3.  The alternative hypothesis is:  H1:  Not all population means are equal.

  2. Find the critical value.  Our numerator value equals 3 - 1 = 2.  Our denominator value equals 14 - 3 = 11.  Our critical F value from the table equals 3.98.  You can also get the critical value from your Excel output.  The decision rule:  Reject H0 in favor of H1 if our calculated test statistic value of F is > 3.98, otherwise fail to reject H0.

  3. Compute the test statistic.  Now instead of using the formula, we use Excel.  Open a blank worksheet in Excel.  Copy the three columns of data from the table above, including the table headings into the blank worksheet.  Your first column should start in cell A1 and have the title Kellogg Community College.  Adjust your column widths as needed.

From the Excel menu ribbon, choose Data, Data Analysis, ANOVA:Single Factor

Excel will then ask for the data input range.  Using your mouse highlight the entire range of data excluding your data labels at the top.  Your input range should be A2:C6.  For the Grouped by box, click in the circle for Columns.

In the Alpha box, enter the significance level .05.  Under Output Options, click in the circle next to Output Range, and type in A10 in the box next to Output Range.  Now click OK.

Your output from the analysis should appear starting in cell A10 and it should look like the example below.

Anova: Single Factor          
           
SUMMARY            
Groups Count Sum Average Variance    
Column 1 5 139 27.8 81.7    
Column 2 4 103 25.75 69.58333    
Column 3 5 144 28.8 101.2