Correlation and Regression
Lecture 10



Correlation – measure of linear association between two variables

  1. Pearson’s correlation

    1. Key word is “linear”

Equation 2: r = cov(X, Y) / sqrt(var(X) · var(Y))

    1. cov shows how two variables vary together

    2. If X = Y, then

Equation 3: cov(X, X) = var(X), so r = var(X) / var(X) = 1

    1. var(X) is population variance

    2. If we estimate cov(X, Y), var(X), and var(Y) by population formulas, then

Equation 4: r = Σ(Xi – X̄)(Yi – Ȳ) / sqrt( Σ(Xi – X̄)² · Σ(Yi – Ȳ)² )

  2. Correlation ranges over [-1, +1]

    1. If r = -1, then the variables have a perfect negative linear relationship

Perfect Linear Negative Correlation

    2. If r = +1, then the variables have a perfect positive linear relationship

Perfect Positive Correlation

    3. If r = 0, then the variables have no linear relationship

Zero Correlation

    4. Note – an exponential function would have a high correlation, but it is not linear

Correlation of an Exponential Function
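The bullets above can be sketched numerically. A minimal Python example with invented data, computing Pearson's r directly from the covariance formula; note that the exponential series scores a high r even though the relationship is not linear:

```python
import numpy as np

# Invented data: x and an exponential function of x.
x = np.arange(1.0, 11.0)      # 1, 2, ..., 10
y = np.exp(0.3 * x)           # exponential in x, so not linear

def pearson_r(a, b):
    # r = cov(a, b) / sqrt(var(a) * var(b)), written with deviations from the mean
    ad, bd = a - a.mean(), b - b.mean()
    return (ad * bd).sum() / np.sqrt((ad ** 2).sum() * (bd ** 2).sum())

r = pearson_r(x, y)
print(round(r, 3))  # high, even though the relationship is not linear
```
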

  1. First step

    1. Construct a scatter diagram of the data

    2. What does the data look like?

  2. Define

    1. X is the independent variable

    2. Y is the dependent variable

    3. Define the function, Y = f(X)

      1. Y is a function of X

      2. Writing Y = f(X) asserts that X causes Y

      3. Correlation does not establish causality

  1. Example – Chest pains and shortness of breath may have a high positive correlation

    1. One does not cause the other

    2. Clogged arteries can cause these two symptoms

    3. Be very careful: correlation does not establish causality

An Example Showing Correlation without Causation

  1. Correlation t-statistic

    1. Assumptions

      1. Both variables are normally distributed

      2. Both have a linear relationship

    2. Hypothesis test

Equation 5: H0: ρ = 0  versus  Ha: ρ ≠ 0

    1. The t-test is:

Equation 6: t = r · sqrt(n – 2) / sqrt(1 – r²)

    1. Notation

      1. r is the correlation coefficient

      2. n is the number of observations

      3. df = n – 2

      4. The 2 is because correlation involves two variables
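A quick sketch of the t-test formula; the values r = 0.5 and n = 30 are invented for illustration:

```python
import math

# Invented values: a sample correlation of r = 0.5 from n = 30 paired observations.
r, n = 0.5, 30
df = n - 2                                     # two variables -> n - 2 degrees of freedom
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)  # t = r*sqrt(n-2)/sqrt(1-r^2)
print(round(t, 3))                             # compare to the critical t with 28 df
```
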

  1. Spearman Rank Correlation

    1. Reduces problems with outliers

    2. Can handle monotonic non-linear relationships, like exponential functions
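A minimal sketch of Spearman's idea, assuming no ties in the data: rank each variable, then take the Pearson correlation of the two rank vectors. A monotone exponential series (invented here) gets a perfect rank correlation:

```python
import numpy as np

def spearman_rho(a, b):
    # Rank each variable (assumes no ties), then take the
    # Pearson correlation of the two rank vectors.
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    rad, rbd = ra - ra.mean(), rb - rb.mean()
    return (rad * rbd).sum() / np.sqrt((rad ** 2).sum() * (rbd ** 2).sum())

x = np.arange(1.0, 11.0)
y = np.exp(0.3 * x)        # monotone but non-linear
print(spearman_rho(x, y))  # 1.0: the ranks line up perfectly
```
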


Regression Equation


  1. You are imposing a relationship onto the variables

Equation 7: Yi = b1 + b2·Xi + ui

    1. Yi is the dependent variable

      1. Value is obtained from Xi and ui

    2. Xi is the independent variable

    3. b1 and b2 are parameters

      1. These are estimated from the data

    4. ui is the random noise term, ui ~ N(0, σ²)

      1. Notation

        1. N is normally distributed

        2. Mean is 0

        3. Variance is σ²

      2. Having normally distributed noise allows us to calculate confidence intervals and perform hypothesis testing

      3. Estimates of b1 and b2 and predictions of Yi are influenced by the variance of the noise

Linear Regression

  1. We find the estimators b̂1 (Equation 8) and b̂2 (Equation 9) that minimize the errors for the data points

    1. Some errors are positive

    2. Other errors are negative

    3. We cannot add the errors because they may cancel

    4. We square the error terms to make them all positive

  2. Derivation

Starting with the equation

Equation 7: Yi = b1 + b2·Xi + ui

Solve for ui, which yields

Equation 10: ui = Yi – b1 – b2·Xi

Square the errors to make them positive

Equation 11: ui² = (Yi – b1 – b2·Xi)²

This is only for one data point. We want to minimize the total errors of all the data points.
     Sum over all the data points
     Define Sum of Squared Errors (SSE)

Equation 12: SSE = Σ ui² = Σ (Yi – b1 – b2·Xi)²

We want to find the minimum, thus we take the first partial derivatives with respect to the betas

Equation 13: ∂SSE/∂b1 = Σ 2(Yi – b1 – b2·Xi)(–1) = –2 Σ (Yi – b1 – b2·Xi)

The second step uses the Chain Rule from calculus.  The 2 can be moved in front of the summation because every term in the summation contains a 2.  Set the partial derivative to zero in order to find the minimum.

Equation 14: –2 Σ (Yi – b̂1 – b̂2·Xi) = 0

Now solve the equation for b1.  It is debatable when you should add hats to the estimators; I added them at this step, when the partial derivative was set to zero.

Equation 15: Σ (Yi – b̂1 – b̂2·Xi) = 0

Summation is a linear operator, so we can apply the summation to every term in the parentheses
     b̂1 is a constant that is summed n times
     b̂2 is a constant that multiplies all the X’s in the summation
     It can be brought to the front of the summation

Equation 16: Σ Yi – n·b̂1 – b̂2 Σ Xi = 0, which gives b̂1 = Ȳ – b̂2·X̄

The last step works because we substitute the average of Y and the average of X into the equation.  Repeat these steps to get the estimator for b2.

Equation 17: ∂SSE/∂b2 = –2 Σ Xi (Yi – b̂1 – b̂2·Xi)

Similarly, set the partial to zero and solve for b2,

Equation 18: Σ Xi (Yi – b̂1 – b̂2·Xi) = 0

We substitute the estimator for b1 (Equation 19, b̂1 = Ȳ – b̂2·X̄) into the equation, which yields

Equation 20: Σ Xi (Yi – Ȳ + b̂2·X̄ – b̂2·Xi) = 0

I did not break the last summation apart.  Solving for the estimator of b2 yields

Equation 21: b̂2 = Σ Xi (Yi – Ȳ) / Σ Xi (Xi – X̄)

  1. This is only good for one X variable. We can generalize least squares to Multiple Regression, where there are k parameters to estimate.

Equation 22: Yi = b1 + b2·X2i + b3·X3i + … + bk·Xki + ui
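The multiple-regression case is usually solved in matrix form. A short sketch with invented data (the true coefficients 1.0, 2.0, and –0.5 are assumptions chosen for illustration):

```python
import numpy as np

# Invented data: k = 3 parameters (intercept plus two X variables).
rng = np.random.default_rng(4)
n = 100
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x2 - 0.5 * x3 + rng.normal(0, 0.5, size=n)

# Design matrix: a column of ones for the intercept, then each X variable.
X = np.column_stack([np.ones(n), x2, x3])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
print(np.round(betas, 2))                      # close to [1.0, 2.0, -0.5]
```
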

  1. Example – Demand for Pepsi

    1. Q is quantity and P is market price

Equation 23: Qi = b1 + b2·Pi + ui

Equation 24: Q̂i = b̂1 + b̂2·Pi

    1. Use least squares to find betas

    2. I fitted a line through my data points that gives me the best fit
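A sketch of the estimators derived above, applied to invented demand data in the spirit of the Pepsi example (the true intercept 100 and slope –20 are assumptions, not values from the lecture):

```python
import numpy as np

# Invented demand data: quantity falls as price rises.
rng = np.random.default_rng(0)
P = np.linspace(0.5, 3.0, 30)                  # market price
Q = 100.0 - 20.0 * P + rng.normal(0, 2.0, 30)  # quantity plus noise

# Estimators from the derivation:
#   b2_hat = sum Pi*(Qi - Qbar) / sum Pi*(Pi - Pbar)
#   b1_hat = Qbar - b2_hat * Pbar
b2_hat = (P * (Q - Q.mean())).sum() / (P * (P - P.mean())).sum()
b1_hat = Q.mean() - b2_hat * P.mean()
print(round(b1_hat, 1), round(b2_hat, 1))      # near 100 and -20
```
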


Goodness of Fit


  1. The goodness-of-fit measure is R2.

Equation 29: R2 = 1 – SSE / SST = 1 – Σ(Yi – Ŷi)² / Σ(Yi – Ȳ)²

    1. If R2 = 0, then the regression equation has no fit
    2. If R2 = 1, then the regression equation has a perfect linear fit
    3. Also, if n = k, the system is exactly determined (an algebraic system), so the fit is perfect
    4. Problem – As the number of X variables increases, R2 never decreases
  2. Adjusted R2 – Penalize the goodness of fit when more variables are added

Equation 32: adjusted R2 = 1 – [SSE / (n – k)] / [SST / (n – 1)] = 1 – (1 – R2)·(n – 1) / (n – k)

    1. As the number of independent variables increases, the penalty increases, but the error could decrease if the new variables explain Y better.
    2. Sometimes the adjusted R2 (Equation 33) can be negative, indicating a very poor fit
    3. Note – Very important: the models being compared must have the same Y variable
    4. You cannot compare a model where the dependent variable is Y with one where it is ln(Y)
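The goodness-of-fit formulas can be sketched as follows, with invented data and the simple-regression estimators from earlier:

```python
import numpy as np

# Invented data and a simple regression fit (n = 50 observations, k = 2 parameters).
rng = np.random.default_rng(1)
n, k = 50, 2
X = np.linspace(0, 10, n)
Y = 5.0 - 1.5 * X + rng.normal(0, 2.0, size=n)

b2 = (X * (Y - Y.mean())).sum() / (X * (X - X.mean())).sum()
b1 = Y.mean() - b2 * X.mean()
Y_hat = b1 + b2 * X

SSE = ((Y - Y_hat) ** 2).sum()         # unexplained variation
SST = ((Y - Y.mean()) ** 2).sum()      # total variation
R2 = 1 - SSE / SST
adj_R2 = 1 - (SSE / (n - k)) / (SST / (n - 1))
print(round(R2, 3), round(adj_R2, 3))  # adjusted R2 is slightly smaller
```
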


Analysis of Variance (ANOVA)


In the context of regression, ANOVA is used to test hypotheses; the technique appears in many types of statistical analysis
Sum of Squared Total (SST) is defined as:

Equation 25: SST = Σ (Yi – Ȳ)²

Yi is the dependent variable in the regression.  The deviation (Yi – Ȳ) in Equation 26 is the total variation for observation i.
Sum of Squared Regression (SSR) is defined as:

Equation 27: SSR = Σ (Ŷi – Ȳ)²

This is the variation explained by the regression
Sum of Squared Errors (SSE), which was earlier defined as:

Equation 28: SSE = Σ (Yi – Ŷi)²

SSE is the amount of variation not explained by the regression equation.  Thus, SST = SSR + SSE, which is proved in the lecture.
We can use this information to calculate the R2 statistic, showing the relationship:

Equation 30: R2 = SSR / SST = 1 – SSE / SST

Problem – the more parameters added to the regression, the higher the R2; it can never decrease.

R2 = 1 if n = k, i.e. the number of parameters equals the number of observations

Now we need the degrees of freedom for each measure:

          Sum of Squared Regression (SSR)      df =k – 1

          Sum of Squared Errors (SSE)             df = n – k

          Sum of Squared Total (SST)               df = n – 1

We calculate the Mean Square (MS)

         Regression (MS) = SSR / (k – 1)

          Residual (MS) = SSE / (n – k)

          Total (MS) NA

  • Additional information

    • When you have a variable with a normal distribution

      • If you add it to or subtract it from other variables with a normal distribution, then the result is still normally distributed

      • Calculating a mean is a first moment

    • If you square a random variable with a standard normal distribution, then you get a chi-square distribution with one degree of freedom.

      • The squares relate to variances, which are called the second moment

      • All the Mean Squares are distributed as chi squares

  • F-distribution – can test a whole group of hypotheses or test a whole regression model

    • F- test can test many other things

    • The F-test is a ratio of two chi-squares

    • The F-test is a one-tailed test associated with the right-hand tail.

    • Squaring makes all terms positive

  • The F-distribution and test is as follows:

The F Distribution

The hypothesis test

     H0: Regression model does not explain the data, i.e. all the parameter estimates are zero

     Ha: Regression model does explain the data, i.e. at least one parameter estimate is not zero

First, we need the critical value: α = 0.05, df1 = 1, and df2 = 58

     In Excel, =FINV(0.05, 1, 58)

     Fc = 4.00

Excel calculates the ANOVA



                    df          SS          MS           F      Significance F
     Regression      1     33.06087    33.06087    6.489695       0.013524
     Residual       58     295.4732    5.094365
     Total          59     328.534




Calculate the F-value:

Equation 31: F = Regression MS / Residual MS = 33.06087 / 5.094365 ≈ 6.49

The computed F exceeds the Fc, so reject the H0, and conclude at least one parameter is not equal to zero.
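The F-calculation can be reproduced directly from the ANOVA table above:

```python
# Numbers taken from the ANOVA table above.
SSR, df1 = 33.06087, 1      # regression sum of squares, df = k - 1
SSE, df2 = 295.4732, 58     # residual sum of squares, df = n - k
MSR = SSR / df1             # regression mean square
MSE = SSE / df2             # residual mean square
F = MSR / MSE
Fc = 4.00                   # critical value from =FINV(0.05, 1, 58)
print(round(F, 2), F > Fc)  # 6.49 True -> reject H0
```
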
     How many observations? n = 10
     How many parameters? k = 4

     Degrees of freedom for error: df = 10 – 4 = 6
     Degrees of freedom for total: df = 10 – 1 = 9



                    df            SS             MS           F      Significance F
     Regression      3     5001.859635    1667.287     232.4265       1.35E-06
     Residual        6     43.04036468    7.173394
     Total           9     5044.9


Trend Regression


  1. Time series – data collected over time

    1. Could have patterns over time

Linear Time Series

Equation 34: Yt = b1 + b2·t + ut

    1. Trend variables always start at 1

    2. You never put in the year

    3. You can add powers of the trend

  1. An example

A Cubic Time Series Trend

Equation 35: Yt = b1 + b2·t + b3·t² + b4·t³ + ut

    1. Use adjusted R2 to find stopping point

    2. Choose the Regression with largest adjusted R2

    3. Note – the unadjusted R2 will always favor the largest regression
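The selection rule above can be sketched as follows; the quadratic data-generating process is invented for illustration, so the adjusted R2 should peak around degree 2:

```python
import numpy as np

# Invented series with a quadratic trend plus noise.
rng = np.random.default_rng(2)
t = np.arange(1, 41, dtype=float)  # trend variable starts at 1
y = 10 + 0.5 * t + 0.05 * t ** 2 + rng.normal(0, 2.0, size=t.size)

def adj_r2(y, y_hat, k):
    # Adjusted R2 = 1 - [SSE/(n-k)] / [SST/(n-1)]
    n = y.size
    SSE = ((y - y_hat) ** 2).sum()
    SST = ((y - y.mean()) ** 2).sum()
    return 1 - (SSE / (n - k)) / (SST / (n - 1))

# Fit trend polynomials of increasing degree; keep the largest adjusted R2.
for degree in (1, 2, 3):
    y_hat = np.polyval(np.polyfit(t, y, degree), t)
    print(degree, round(adj_r2(y, y_hat, degree + 1), 4))
```
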

  1. Exponential Regression

An Exponential Growth Trend

Equation 36: Yt = b1·e^(b2·t + ut)

    1. Transform the equation until it is linear in the parameters by taking the natural logarithm of both sides

Equation 37: ln(Yt) = ln(b1) + b2·t + ut

    1. Re-parameterize to avoid the natural log in the intercept

    2. Just take the natural logarithm of the dependent variable (but not the trend variable) and then run a standard linear regression
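That recipe can be sketched end to end with invented data (the true values 50 and 0.08 are assumptions chosen for illustration):

```python
import numpy as np

# Invented exponential growth series: Yt = 50 * exp(0.08*t + noise).
rng = np.random.default_rng(3)
t = np.arange(1, 31, dtype=float)
y = 50.0 * np.exp(0.08 * t + rng.normal(0, 0.05, size=t.size))

# Take logs, then run an ordinary linear regression of ln(y) on t.
ln_y = np.log(y)
b2 = (t * (ln_y - ln_y.mean())).sum() / (t * (t - t.mean())).sum()
a = ln_y.mean() - b2 * t.mean()    # a = ln(b1), the re-parameterized intercept
b1 = np.exp(a)
print(round(b1, 1), round(b2, 3))  # close to 50 and 0.08
```
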