Business Statistics Unit 10

Regression and Correlation Analysis

Introduction


In this final unit you are introduced to two powerful techniques that are used to determine if a relationship exists between two or more sets of data and the strength of that relationship, called regression analysis and correlation analysis.  We will not use the hypothesis testing method as indicated in your textbook to solve problems.  Instead we will concentrate on specific items that measure correlation, significance, and the relationship between the variables.

Basic Thoughts and Terms


Regression analysis is a statistical technique used to develop an equation which best represents the pattern of a set of data variables.  A regression equation contains a dependent variable, which is the variable we are trying to estimate.  The values for this variable are usually plotted on the y axis of a graph, and the symbol used to represent the dependent variable is y, also called y prime.  The regression equation also contains one or more independent variables that influence or have a statistical relationship with the dependent variable y.  The independent variable is given the symbol x for a simple regression analysis.  If we have more than one independent variable then we are conducting multiple regression analysis and each independent variable x has a number associated with it.

There are many types of regression analysis associated with a single independent variable, or multiple variables.  For our purposes we will examine simple linear regression which is a type of regression analysis using one dependent variable and one independent variable.  The term linear is used to describe the relationship between the dependent and independent variables.  A linear relationship is one that tends to approximate a straight line.

A positive relationship occurs when the variables move together in the same direction (increasing or decreasing).  A negative relationship occurs when the variables move in the opposite direction (one increases while the other decreases).

An important thought to keep in mind whenever we look at the relationship between two or more variables is to consider that the relationship between the variables may be statistically significant, or it may just be coincidence.  For example, it has been shown that in many European countries, the birth rate of human babies is higher in countries with larger populations of storks.  Does this mean that storks increase the birth rate of human babies?  NO!  This is called an accidental or spurious relationship.  In statistical research we look for logical relationships.  Looking at our birth rate situation in Europe, what if we found out that in those countries with higher birth rates, there was a greater number of female residents between the ages of 18 - 40 than in other European countries?  This could be an example of a logical relationship since there is a logical connection between the number of female residents in that specific age group and birth rates.

Simple Linear Regression


The first step associated with conducting a simple linear regression analysis is to obtain a random sample of data values for our independent and dependent variables.  Next we often graph the relationship on a simple graph to determine if we may have a linear relationship and if so, is it a positive or negative linear relationship.  A positive relationship is one where both variables ( x and y) increase.  A negative relationship is indicated when the value of one variable increase as the other decreases (i.e. x increases while y decreases).

Take a look at the data in table 10.1 below.  This data is from a random sample of employees measuring the relationship between the level of training and production output.

Table 10.1.  Employee Training and Output.

Employee Number Hours of Training (x) Output per Hour (y)
1 6 14
2 10 25
3 20 48
4 14 35
5 8 15
6 12 28
7 24 54
8 16 41
9 14 37
10 22 50

We can graph the data from the table above using Excel and plot the data points.  Notice in Figure 10.1 that the data points plotted approximate a diagonal rising line.  This indicates that the data is linear and has a positive relationship.  As the number of hours of employee training increases, the output per hour also increases.

Figure 10.1.  Graph of Employee Training and Output.

Graph of employee training and output

A simple linear regression equation consists of a formula which represents the various data values.  Formula Box 10.1 contains the formula for a simple linear regression line.

Formula Box 10.1.  A Simple Linear Regression Line.

Simple Linear Regression Line Formula

The symbol y or y^ represents the value of the dependent variable y.  The symbol a represents the y intercept or the value of y when x is equal to zero.  The symbol a also represents the constant, or a numerical value that is given when you conduct a regression analysis in Excel.  The symbol b is the slope of the regression line.  It is a numerical value that when positive indicates a positively sloped line and when negative indicates a negatively sloped line.  Finally the symbol x represents a value of the independent variable.

For each given value of x, we are able to compute the value of y using several formulas that calculate the value of b, our slope, and a, the constant.  The formula for b, the slope, is given in formula box 10.2.

Formula Box 10.2.  The Slope of a Regression Line.

Slope of a regression line formula

Does this formula look a little intimidating to you?  It does involve several steps and a lot of time to solve.  The formula for a, our constant can be found in formula box 10.3.

Formula Box 10.3.  The Constant of a Regression Line.

The Constant of a regression line formula

The value of y is equal to the mean of the y variables and the value of x is equal to the mean of the x variables.

We will not use these formulas to find our regression equation.  Instead we will use Excel to conduct a regression analysis and find our regression line for us.  In Practice Exercise 10.1, we will use Excel to find our regression line for the data in Table 10.1.  Before we move to using Excel to find our regression line, another important concept needs to be discussed.  Often combined with regression analysis, we wish to determine the strength of our variables used in our regression equation.  To measure the strength of the relationship between the variables described by a regression equation, we use correlation analysis.

Correlation Analysis


Correlation Analysis is used to measure the strength, closeness, and direction of a relationship between two or more variables.  The first test that can be conducted is called the Coefficient of Determination (r2).  The Coefficient of Determination test tells us the amount of variation in our variable y that is explained by the variation in our variable x.  It gives us a value from 0 to 1.00 that indicates how much influence the independent variable x has on our dependent variable y.  It does NOT tell use that x causes y.  There is a specific formula that can be used to calculate the coefficient of determination but for our purposes we will have Excel calculate it for us. 

The goal we have when we choose our independent variable (s) is that they can account for most of if not all of the variation in our dependent variable x.  Therefore the higher the r2 value, the better.

Another test that can be conducted is called the Coefficient of Correlation test.  This test as explained here is also known as the Pearson product moment correlation coefficient.  The Coefficient of Correlation (r) also called multiple r in Excel, tells us if we have a positive or negative correlation between the two variables x and y.  The value of r can be anywhere between -1.00 and 1.00.  If the value is 1.00 we have perfect positive correlation and if it is -1.00 we have perfect negative correlation.  The coefficient of correlation can be calculated by taking the square root of the coefficient of determination.  For our purposes, we will have Excel calculate the value for us.  Note that Excel does not provide the correct sign for r, only the value.  To find the correct sign for r you simply look at the value for b.  If b is positive, r is positive.  If b is negative, then r must also be negative.

Be fore we go any further, you need to understand the limitations and interpretation that is needed when using correlation analysis.  If a high r or r2 value exists, this indicates that:

  1. It is possible that there is a direct cause and effect relationship between the variables.  Simply stated, the variable x causes a change in y.

  2. It is also possible that the reverse is true.  The variable y causes x to occur or change.

  3. The relationship between the two variables is actually influenced and caused by a third variable.  Here is a great example of this.  A recent study was completed related to people getting sunburned at the beach.  It was found that people who consumed an excessive amount of alcohol (adult beverages) were more likely to get sunburned at the beach.  So, does consuming adult beverages cause people to get sunburned?  No!  The third variable here is one of the affects of the consumption of adult beverages:  forgetting to apply or reapply sunscreen.  This is the variable that influences the outcome.

  4. Many other variables exist that have not been documented or considered.  For example, what determines your grade in this class?  Is it simply the amount of time you spend studying, or are other factors present?  Other factors include your knowledge and experiences with math, your motivation and interest in the subject, how the subject is learned, the textbook used, the knowledge of the instructor, the environment you study in, etc.  These other variables may have a significant affect on your grade in addition to the time you spend studying.

  5. Finally we must acknowledge once again that the relationship could be a coincidence or accidental.  Common sense is often the best resource here to determine if a relationship is coincidental or not.

Other Common Tests Associated With Regression Analysis


When we conduct a regression and correlation analysis we are always concerned about the validity of our tests.  Statistical software packages like Excel make it easier for us to measure the validity of our analysis.  In regards to the possibility of error associated with our regression line, the standard error of the estimate is often calculated.  The standard error of the estimate is used to provide us with a measure of the spread or scatter around the regression line.  The smaller the value of the standard error, the more likely our regression equation and line represents the data values.  A formula exists that can be used to manually calculate the standard error, but we will rely upon Excel to calculate it for us.

Another popular test conducted on the regression equation is the t test for the value of b, which is the slope.  The t test for slope provides us with further evidence that the regression equation and the relationship between the two variables is valid.  Once again, the t test can be calculated manually or by having Excel calculate it for us.

Finally, a special ANOVA test is conducted which tests the validity of the relationship between x and y.  Some software program conduct this test as a t test of the correlation coefficient. This test examines the same relationship between the two variables as the t test.  However once we start using more than one independent variable the ANOVA test examines a slightly different hypothesis than the t test.

Practice Exercise 10.1.  Using Excel for Simple Linear Regression.


The information from Table 10.1 is reproduced below.  Using the data from the table construct a simple linear regression line using Excel.

Employee Number Hours of Training (x) Output per Hour (y)
1 6 14
2 10 25
3 20 48
4 14 35
5 8 15
6 12 28
7 24 54
8 16 41
9 14 37
10 22 50

Before we start it is important to note that Excel has several functions related to regression and correlation analysis.  The Excel functions LINEST, TREND, and CORREL are often used to separately analyze data.  For simplicity's sake we are going to use the basic regression option under data analysis which provides use with all of the output needed.

Excel 2007

  1. Open Excel and have a new blank spreadsheet available.  Copy the entire table above into the spreadsheet starting at cell A1.  Adjust your column widths as needed.
  2. Click on the Data Analysis folder, Data Analysis, and Regression.  Then click on OK.


  3. The Regression Box appears.  Enter the range for the y axis, which is the data for Output per Hour.  It should be the range C2:C11.  Now enter the range for the x axis.  It should be the range B2:B11.
  4. Click in the box next to the part labeled Confidence Level.  If it is not already listed, type in 95 for the confidence level.
  5. Move to the Output Options section.  Click in the circle next to the Output Range box, then enter A15 in the box.
  6. In the Residuals section, click on the box next to Residuals and Line Fit Plots.


     
  7. Now click OK.
  8. Your output should look like the output below.
SUMMARY OUTPUT                
                 
Regression Statistics              
Multiple R 0.983785288              
R Square 0.967833493              
Adjusted R Square 0.96381268              
Standard Error 2.678345214              
Observations 10              
                 
ANOVA                
  df SS MS F Significance F      
Regression 1 1726.711735 1726.711735 240.7058998 2.96578E-07      
Residual 8 57.38826467 7.173533084          
Total 9 1784.1            
                 
  Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.806491885 2.343044473 0.344206819 0.7395636 -4.596578355 6.209562125 -4.596578355 6.209562125
X Variable 1 2.321473159 0.149630559 15.51469947 2.96578E-07 1.97642447 2.666521847 1.97642447 2.666521847
                 
                 
                 
RESIDUAL OUTPUT                
                 
Observation Predicted Y Residuals            
1 14.73533084 -0.735330836            
2 24.02122347 0.978776529            
3 47.23595506 0.764044944            
4 33.3071161 1.692883895            
5 19.37827715 -4.378277154            
6 28.66416979 -0.664169788            
7 56.52184769 -2.52184769            
8 37.95006242 3.049937578            
9 33.3071161 3.692883895            
10 51.87890137 -1.878901373            

Graph of a regression line showing a linear relationship

NOTE:  The default graph type for regression analysis in Excel 2007 may not be a scatter plot graph.  If your graph is not the same type of graph, change the type of graph in Excel.

Excel 2003

  1. Open Excel and have a new blank spreadsheet available.  Copy the entire table above into the spreadsheet starting at cell A1.  Adjust your column widths as needed.
  2. Click on Tools, then Data Analysis, and Regression.  Then click on OK.
  3. The Regression Box appears.  Enter the range for the y axis, which is the data for Output per Hour.  It should be the range C2:C11.  Now enter the range for the x axis.  It should be the range B2:B11.
  4. Click in the box next to the part labeled Confidence Level.  If it is not already listed, type in 95 for the confidence level.
  5. Move to the Output Options section.  Click in the circle next to the Output Range box, then enter A15 in the box.
  6. In the Residuals section, click on the box next to Residuals and Line Fit Plots.
  7. Now click OK.
  8. Your output should look like the output below.
SUMMARY OUTPUT                
                 
Regression Statistics              
Multiple R 0.983785288              
R Square 0.967833493              
Adjusted R Square 0.96381268              
Standard Error 2.678345214