Linear regression by hand

For this assignment, you will:

- simulate data on UW Calculus I grades and high school preparation
- fit a linear regression using lm()
- fit the same regression "by hand" using matrix algebra
- compare the coefficients, standard errors, and residual variance from the two approaches with the "truth" used to simulate the data

You will be walked through these steps. For each code chunk, fill in the code and change the chunk option from eval=FALSE to eval=TRUE once you have it working. At a minimum, you will need to complete the object assignments (<-) that have been left unfinished, but you may also occasionally need to add an extra line or two of code to tweak things like the row and column names of these objects.

Simulating data

We are studying the relationship between grades of UW students in Calculus I and their high school preparation. Let’s suppose we have measured these variables:

- high school math GPA (hs_math_gpa)
- whether the student took calculus in high school (hs_calculus)
- whether the student took precalculus at UW (uw_precalc)
- the student’s UW Calculus I grade (uw_calculus_gpa), which is our outcome

We don’t actually have these data, so we’re going to make up some semi-realistic numbers assuming there are \(n=2000\) students. I will fill in these parts for you, but you should study the functions and try to understand what they are doing.

set.seed(1234)
n <- 2000 # number of UW calc I students

# simulate high school math GPA
hs_math_gpa <- rnorm(n, mean = 3.5, sd = 0.5)
# truncate to 2.0-4.0
hs_math_gpa <- ifelse(hs_math_gpa < 2.0, 2.0,
                      ifelse(hs_math_gpa > 4.0, 4.0, hs_math_gpa))

# simulate previous calculus in high school:
# - 75% of people with HS math GPAs over 3.6 took HS calculus
# - 40% of people with HS math GPAs under 3.6 took HS calculus
# binomial data with 1 trial per student, probabilities as above
hs_calculus <- rbinom(n, size = 1,
                      prob = ifelse(hs_math_gpa >= 3.6, 0.75, 0.40))

# simulate precalculus at UW:
# - if no high school calculus, 70% take precalculus, regardless of grade
# - if high school calculus, 60% take precalculus if HS grade < 3.5,
#   and 25% otherwise
uw_precalc <- rbinom(n, size = 1,
                     prob = ifelse(hs_calculus == 0, 0.70,
                                   ifelse(hs_math_gpa < 3.5, 0.60, 0.25)))
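
As an optional sanity check (not part of the assignment), you can verify that the simulated proportions roughly match the probabilities described in the comments above; with \(n=2000\) they should be close to, but not exactly equal to, the target values.

# proportion taking HS calculus in each GPA group (targets: 0.75 and 0.40)
mean(hs_calculus[hs_math_gpa >= 3.6])
mean(hs_calculus[hs_math_gpa < 3.6])
# proportion taking UW precalculus among those without HS calculus (target: 0.70)
mean(uw_precalc[hs_calculus == 0])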

Our outcome is the UW Calculus I grade, which is either 0.0 or a value from 0.7 to 4.0 in 0.1 increments. Let’s suppose the following noisy relationship holds for each student \(i\), which we want to recover from our data:

\[\text{UW Calculus I grade}_i = 0.3 + 0.7 \cdot \text{HS math GPA}_i + 0.3 \cdot \text{HS calculus}_i + 0.1 \cdot \text{UW precalculus}_i + \varepsilon_i\] \[\varepsilon_i \sim \text{Normal}(\text{mean}=0, \text{SD}=0.5)\]

true_beta <- c("Intercept" = 0.3,
               "hs_math_gpa" = 0.7,
               "hs_calculus" = 0.3,
               "uw_precalc" = 0.1)
true_sigma <- 0.5
uw_calculus_gpa <- true_beta["Intercept"] +
    true_beta["hs_math_gpa"] * hs_math_gpa +
    true_beta["hs_calculus"] * hs_calculus +
    true_beta["uw_precalc"] * uw_precalc +
    rnorm(n, mean = 0, sd = true_sigma)
# we don't see exact GPAs, just nearest 0.1
uw_calculus_gpa <- round(uw_calculus_gpa, 1)
# truncate: cap grades at 4.0, send grades below 0.0 to 0.0,
# and bump grades between 0.0 and 0.7 up to 0.7
uw_calculus_gpa <- ifelse(uw_calculus_gpa > 4.0, 4.0,
                          ifelse(uw_calculus_gpa < 0.0, 0.0,
                                 ifelse(uw_calculus_gpa < 0.7, 0.7,
                                        uw_calculus_gpa)))
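
Optionally, you can take a quick look at the simulated grades to see the effect of the rounding and truncation:

# optional: a quick look at the simulated grades (not part of the assignment)
summary(uw_calculus_gpa)
mean(uw_calculus_gpa == 4.0)   # share of students capped at 4.0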

Linear regression with lm()

To get your grounding, first we’ll make a data frame with the relevant variables, and then fit the model using the lm() function you have already seen. Use the independent and dependent variables given above.

calculus_data <- data.frame(uw_calculus_gpa, hs_math_gpa, hs_calculus, uw_precalc)
head(calculus_data)
##   uw_calculus_gpa hs_math_gpa hs_calculus uw_precalc
## 1             1.9    2.896467           0          1
## 2             2.7    3.638715           1          0
## 3             3.3    4.000000           1          0
## 4             3.3    2.327151           1          1
## 5             3.6    3.714562           1          0
## 6             3.5    3.753028           0          1

You need to fill in this part:

# this uses the lm function seen in Week 1
calculus_lm <- lm(uw_calculus_gpa ~ hs_math_gpa + hs_calculus + uw_precalc,
                  data = calculus_data) 
summary(calculus_lm)
## 
## Call:
## lm(formula = uw_calculus_gpa ~ hs_math_gpa + hs_calculus + uw_precalc, 
##     data = calculus_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36849 -0.32935  0.00867  0.34229  1.34886 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.41583    0.09176   4.532 6.20e-06 ***
## hs_math_gpa  0.66332    0.02596  25.551  < 2e-16 ***
## hs_calculus  0.29518    0.02314  12.758  < 2e-16 ***
## uw_precalc   0.11374    0.02281   4.986 6.68e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.472 on 1996 degrees of freedom
## Multiple R-squared:  0.3416, Adjusted R-squared:  0.3406 
## F-statistic: 345.2 on 3 and 1996 DF,  p-value: < 2.2e-16

Background on the math of linear regression

For ordinary least squares linear regression, we encode our independent variables in a design matrix \(\mathbf{X}\) and our dependent variable (outcome) in a column vector \(\mathbf{y}\). In general, if we have \(n\) observations and \(p\) independent variables, \(\mathbf{X}\) is a matrix with \(n\) rows and \(p+1\) columns that looks like:

\[\mathbf{X} = \left[ \begin{array}{ccccc} 1 & x_{11} & x_{12} & \ldots & x_{1p} \\ 1 & x_{21} & x_{22} & \ldots & x_{2p} \\ 1 & x_{31} & x_{32} & \ldots & x_{3p}\\\ldots & \ldots & \ldots & \ldots & \ldots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{np} \end{array} \right]\]

Each of the columns \(x_{.1}, x_{.2}, \ldots, x_{.p}\) contains one of the independent variables.

The outcome vector \(\mathbf{y}\) looks like:

\[\mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ y_3 \\ \ldots \\ y_n \end{array} \right]\]

In matrix notation (using matrix multiplication), we model the relationship between \(\mathbf{y}\) and \(\mathbf{X}\) as \(\mathbf{y} = \mathbf{X} \cdot \mathbf{\beta} + \text{noise}\) where

\[ \mathbf{\beta} = \left[ \begin{array}{c} \beta_0 \\ \beta_1 \\ \beta_2 \\ \ldots \\ \beta_p \end{array} \right] \]

One can show that the estimate \(\hat{\mathbf{\beta}}\) that minimizes squared error between \(\mathbf{y}\) and fitted values \(\mathbf{X} \cdot \mathbf{\beta}\) is:

\[\hat{\mathbf{\beta}} = \left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}\]

where the multiplication is matrix multiplication, \((\cdot)^T\) is the transpose operator, and \((\cdot)^{-1}\) is the matrix inverse operator.
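
If you are curious where this comes from: \(\hat{\mathbf{\beta}}\) minimizes the sum of squared residuals \((\mathbf{y} - \mathbf{X} \cdot \mathbf{\beta})^T \cdot (\mathbf{y} - \mathbf{X} \cdot \mathbf{\beta})\), and setting the gradient with respect to \(\mathbf{\beta}\) equal to zero yields the normal equations

\[\mathbf{X}^T \cdot \mathbf{X} \cdot \hat{\mathbf{\beta}} = \mathbf{X}^T \cdot \mathbf{y}\]

which, when \(\mathbf{X}^T \cdot \mathbf{X}\) is invertible, can be solved by multiplying both sides by \(\left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1}\) to give the formula above.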

We estimate the variance of the residual noise \(\hat{\sigma}^2\) (mean squared error) as

\[\hat{\sigma}^2 = \frac{(\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}})^T \cdot (\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}})}{n - p - 1}\]

We get estimated covariances for linear regression coefficient estimates using the following formula:

\[\widehat{\text{Var}}(\hat{\mathbf{\beta}}) = \hat{\sigma}^2 \left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1}\]

We then take the square root of the diagonal of \(\widehat{\text{Var}}(\hat{\mathbf{\beta}})\) to get the standard errors for each coefficient. (The off-diagonal terms are estimated covariances between parameter estimates, which are closely related to the estimated correlations.)
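
If these formulas feel abstract, here is a tiny optional illustration (not part of the assignment, and the numbers are made up) showing that the matrix formula reproduces lm()’s coefficients on a five-observation example:

# tiny made-up example: 5 observations, 2 predictors plus an intercept
toy_x1 <- c(1, 2, 3, 4, 5)
toy_x2 <- c(0, 1, 0, 1, 1)
toy_y  <- c(2.1, 2.9, 4.2, 5.1, 6.0)
toy_X  <- cbind(1, toy_x1, toy_x2)   # design matrix with intercept column
toy_beta <- solve(t(toy_X) %*% toy_X) %*% t(toy_X) %*% toy_y
toy_beta                             # manual OLS estimates
coef(lm(toy_y ~ toy_x1 + toy_x2))    # should match up to numerical precision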

Setting up the design matrix and outcome

This may look intimidating if you haven’t seen it before, but fortunately we can break the calculation into small pieces and compute them one at a time in R.

First, make a numeric matrix X that looks like the matrix \(\mathbf{X}\) above, but customized for this problem where \(p=3\). Your first column should be a column of \(n\) 1’s (“intercept” term). Your second column should be the high school math GPA. Your third should be the high school calculus indicator. Your fourth should be the UW precalculus indicator. Make sure the columns are labeled.

# to make X, just cbind the variables together in the same
# order as the independent variables in the regression
# (so everything matches up)
X <- cbind(1, hs_math_gpa, hs_calculus, uw_precalc)
# the colnames for the existing variables are already okay
# but I still want to name the intercept column
colnames(X)[1] <- "Intercept"
colnames(X)
## [1] "Intercept"   "hs_math_gpa" "hs_calculus" "uw_precalc"

For clarity, you should make a variable y and store our outcome variable in it.

y <- uw_calculus_gpa

Compute matrix quantities

The term \(\left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1}\) appears in both the formula for \(\hat{\mathbf{\beta}}\) and the formula for \(\widehat{\text{Var}}(\hat{\mathbf{\beta}})\). Let’s call this quantity \(\mathbf{A}\). You can compute it using the matrix multiplication, matrix transpose, and matrix inversion functions from the slides. Some of these you can replace with the crossprod() function if you like (?crossprod for help).

# t(X) is the transpose, %*% is matrix multiplication,
# solve takes the inverse
A <- solve(t(X) %*% X)
# alternative solution: crossprod does the same thing
A <- solve(crossprod(X))
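
To convince yourself that A really is the inverse, an optional check is to multiply it by \(\mathbf{X}^T \cdot \mathbf{X}\) and confirm you get (numerically) the 4 by 4 identity matrix.

# optional check: A %*% (X^T X) should be the identity, up to rounding error
all.equal(A %*% crossprod(X), diag(ncol(X)), check.attributes = FALSE)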

Now, we want \(\hat{\mathbf{\beta}}\). Use \(\mathbf{A}\) and more matrix multiplication to compute this.

# this comes from looking at formula for beta hat and
# multiplying the additional terms needed
beta <- A %*% t(X) %*% y
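
For what it’s worth, explicitly forming the inverse is fine at this scale, but it is not how lm() computes the fit internally (lm() uses a QR decomposition, which is more numerically stable). An optional alternative that avoids the explicit inverse is to solve the normal equations directly; it should agree with your answer.

# alternative (optional): solve (X^T X) beta = X^T y without forming the inverse
beta_alt <- solve(crossprod(X), crossprod(X, y))
all.equal(beta, beta_alt)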

With \(\hat{\mathbf{\beta}}\), you can compute the residuals \(\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}}\), which go into your residual variance calculation.

# just filling in formula as above
residuals <- y - X %*% beta
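
Since you already fit the same model with lm(), an optional check is that these residuals agree with what lm() reports (up to tiny numerical differences).

# optional check against lm()'s residuals
all.equal(as.numeric(residuals), unname(resid(calculus_lm)))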

Now let’s calculate the estimated residual variance \(\hat{\sigma}^2\). This has \(n-p-1\) in the denominator, so let’s compute \(p\) while we’re at it. You already know \(p=3\), but instead of hard-coding this, use a function to compute it as the number of columns of the design matrix minus 1. (That way, if you had added or deleted a column, the math would still be correct!) Additionally, after you’ve calculated the residual variance, I’ve put as.numeric() at the end to convert it to a single scalar number instead of a 1 by 1 matrix.

p <- ncol(X) - 1

residual_var <- t(residuals) %*% residuals / (n - p - 1)
residual_var <- as.numeric(residual_var)

# alternative calculation # 1:
residual_var <- crossprod(residuals) / (n - p - 1)
residual_var <- as.numeric(residual_var)

# alternative calculation # 2:
residual_var <- sum(residuals^2) / (n - p - 1)

Next, find the estimated covariance matrix of the coefficient estimates \(\widehat{\text{Var}}(\hat{\mathbf{\beta}})\) using \(A\) and \(\hat{\sigma}^2\). (Note: we needed the as.numeric() step above because otherwise residual_var is a 1 by 1 matrix, and we would be trying to multiply two matrices whose dimensions don’t conform. as.numeric() converts it to a scalar that multiplies all the entries, which is what we want.)

# covariance matrix of estimated regression coefficients
# is just the estimated residual variance times solve(t(X) %*% X)
# we calculated earlier and stored as A
beta_covar <- residual_var * A
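
R computes this same matrix for a fitted model with the vcov() function, so an optional check is:

# optional check: compare with lm()'s coefficient covariance matrix
all.equal(beta_covar, vcov(calculus_lm), check.attributes = FALSE)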

Finally, we go from estimated covariance matrix \(\widehat{\text{Var}}(\hat{\mathbf{\beta}})\) to standard errors by taking the square root of its diagonal.

# diag takes the diagonal of the matrix, sqrt makes it go
# from variance to standard deviation
beta_SE <- sqrt(diag(beta_covar))

Comparing results

Now, let’s make a 4 row by 3 column matrix comparing the “true” values used to generate the data with the fitted parameters from two methods:

- your manual matrix-algebra calculation (beta)
- R’s lm() fit (the coefficient estimates in summary(calculus_lm))

Use the pander package to display a nicely formatted table. If everything went right, the two columns of fitted values should equal each other, and will be close to, but not exactly equal to, the truth. You may want to modify the rownames of beta_compare.

# truth is in true_beta
# estimate by hand is in beta
# the lm way is in the list summary(calculus_lm)
# as shown in Week 4
beta_compare <- cbind(true_beta,
                      beta,
                      summary(calculus_lm)[["coefficients"]][, "Estimate"])
colnames(beta_compare) <- c("Truth", "Manual", "lm")
library(pander)
pander(beta_compare, caption = "Comparison of linear regression parameters estimated manually with those from R's lm().")
Comparison of linear regression parameters estimated manually with those from R’s lm().

                Truth   Manual       lm
------------  -------  -------  -------
Intercept         0.3   0.4158   0.4158
hs_math_gpa       0.7   0.6633   0.6633
hs_calculus       0.3   0.2952   0.2952
uw_precalc        0.1   0.1137   0.1137

Now do the same thing for standard errors of the parameter estimates. The “true” covariance \(\text{Var}(\hat{\mathbf{\beta}})\) uses the same formula as the estimated covariance \(\widehat{\text{Var}}(\hat{\mathbf{\beta}})\) you calculated above, except substitute in \(\sigma^2\) in place of \(\hat{\sigma}^2\). You’ll need to take the square root of the diagonal to go from covariance matrix to individual standard errors.

# true covariance uses same formula as estimated covariance
# except we plug in the true variance. we have the true std deviation
# so we square that
true_covar <- true_sigma^2 * A
true_SE <- sqrt(diag(true_covar))
SE_compare <- cbind(true_SE,
                    beta_SE,
                    summary(calculus_lm)[["coefficients"]][, "Std. Error"])
colnames(SE_compare) <- c("Truth", "Manual", "lm")
pander(SE_compare, caption = "Comparison of standard errors of linear regression parameters estimated manually with those from R's lm().")
Comparison of standard errors of linear regression parameters estimated manually with those from R’s lm().

                 Truth    Manual        lm
------------  --------  --------  --------
Intercept       0.0972   0.09176   0.09176
hs_math_gpa     0.0275   0.02596   0.02596
hs_calculus    0.02451   0.02314   0.02314
uw_precalc     0.02416   0.02281   0.02281

Last, compare the “true” residual variance \(\sigma^2\) with the values of the estimated residual variance obtained from the manual and lm methods. This will be a 1 row by 3 column matrix. The estimated residual standard deviation in the lm version can be found in the summary() list object in a list entry called sigma. You can convert this to residual variance by squaring it.

resid_var_compare <- cbind(true_sigma^2,
                           residual_var,
                           summary(calculus_lm)[["sigma"]]^2)
colnames(resid_var_compare) <- c("Truth", "Manual", "lm")
pander(resid_var_compare, caption = "Comparison of residual variances of manual linear regression with that from R's lm().")
Comparison of residual variances of manual linear regression with that from R’s lm().

 Truth   Manual       lm
------  -------  -------
  0.25   0.2228   0.2228