Rebecca Ferrell
April 20, 2016
So far we've been manipulating data frames, making visuals, and summarizing. This got you pretty far! Now we get more in the weeds of programming. Today is all about types of data in R.
A data frame is really a list of vectors, where each vector is a column of the same length (number of rows). But data frames are not the only objects we want to work with in R: linear regression output, for example, is not a data frame. We need to learn about vectors, matrices, and lists to do additional things we can't express with dplyr syntax.
In R, we call a set of values a vector. We can create vectors by using the c function (“c” for combine or concatenate).
c(1, 3, 7, -0.5)
[1] 1.0 3.0 7.0 -0.5
Vectors have length:
length(c(1, 3, 7, -0.5))
[1] 4
When doing arithmetic operations on vectors, R handles these element-wise:
c(1, 2, 3) + c(4, 5, 6)
[1] 5 7 9
c(1, 2, 3, 4)^3 # exponentiation with ^
[1] 1 8 27 64
Common operations: *, /, exp = \( e^x \), log = \( \log_e(x) \)
If we work with vectors of different lengths, R will recycle the shorter one by repeating it to make it match up with the longer one:
c(0.5, 3) * c(1, 2, 3, 4)
[1] 0.5 6.0 1.5 12.0
c(0.5, 3, 0.5, 3) * c(1, 2, 3, 4) # same thing
[1] 0.5 6.0 1.5 12.0
A special case of recycling involves arithmetic with scalars (a single number). These are vectors of length 1 that are recycled to make a longer vector:
3 * c(-1, 0, 1, 2) + 1
[1] -2 1 4 7
Recycling doesn't work so well with vectors of incommensurate lengths:
c(1, 2, 3, 4) + c(0.5, 1.5, 2.5)
Warning in c(1, 2, 3, 4) + c(0.5, 1.5, 2.5): longer object length is not a
multiple of shorter object length
[1] 1.5 3.5 5.5 4.5
Try not to let R's recycling behavior catch you by surprise!
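One way to avoid surprises is to check lengths explicitly before combining vectors. The sketch below uses a hypothetical helper, safe_add, which is not part of base R; it assumes we only want to allow equal lengths or a scalar, and stops with an error otherwise.

```r
# Hypothetical helper (not part of base R): add two vectors only when
# they have the same length or one of them is a scalar (length 1).
safe_add <- function(x, y) {
  ok <- length(x) == length(y) || length(x) == 1 || length(y) == 1
  if (!ok) {
    stop("lengths differ: ", length(x), " vs ", length(y))
  }
  x + y
}

safe_add(c(1, 2, 3), c(4, 5, 6))   # same lengths: fine
safe_add(c(1, 2, 3), 10)           # scalar recycling: fine

# Mismatched lengths now raise an error instead of only a warning;
# tryCatch captures the error message so the script keeps running
result <- tryCatch(
  safe_add(c(1, 2, 3, 4), c(0.5, 1.5, 2.5)),
  error = function(e) conditionMessage(e)
)
result
```

This is stricter than R's own rule, which also allows silent recycling whenever the longer length is a multiple of the shorter one.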
Some functions operate on an entire vector and return one number rather than working element-wise:
sum(c(1, 2, 3, 4))
[1] 10
max(c(1, 2, 3, 4))
[1] 4
Some others: min, mean, median, sd, var; you've seen these used with dplyr::summarize.
Let's say we had some test scores and we wanted to put these on a standardized scale: \[ z_i = \frac{x_i - \text{mean}(x)}{\text{SD}(x)} \]
x <- c(97, 68, 75, 77, 69, 81, 80, 92, 50, 34, 66, 83, 62)
z <- (x - mean(x)) / sd(x)
round(z,2)
[1] 1.49 -0.23 0.19 0.31 -0.17 0.54 0.48 1.19 -1.30 -2.24 -0.35
[12] 0.66 -0.58
See also: scale
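As a quick check, the hand-computed z-scores can be compared with what scale returns. This is a minimal sketch; the names z_manual and z_scale are just illustrative.

```r
x <- c(97, 68, 75, 77, 69, 81, 80, 92, 50, 34, 66, 83, 62)

# Standardize by hand: subtract the mean, divide by the SD
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix with extra attributes,
# so convert it to a plain vector before comparing
z_scale <- as.vector(scale(x))

all.equal(z_manual, z_scale)   # TRUE if the two approaches agree

# Standardized values have mean 0 and SD 1 (up to rounding error)
mean(z_manual)
sd(z_manual)
```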
class or str will tell you what kind of vector you have. There are a few common types of vectors:
numeric: c(1, 10*3, 4, -3.14)
integer: 0:10
character: c("red", "blue", "yellow", "blue")
factor: factor(c("red", "blue", "yellow", "blue"))
logical: c(FALSE, TRUE, TRUE, FALSE)
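To confirm what kind of vector each of these is, we can call class on each one. The last line is an extra illustration of what happens when types are mixed in a single call to c.

```r
class(c(1, 10*3, 4, -3.14))                         # "numeric"
class(0:10)                                         # "integer"
class(c("red", "blue", "yellow", "blue"))           # "character"
class(factor(c("red", "blue", "yellow", "blue")))   # "factor"
class(c(FALSE, TRUE, TRUE, FALSE))                  # "logical"

# A vector holds only one type, so mixing types converts everything
# to the most flexible type present (here, character)
class(c(1, "a", TRUE))                              # "character"
```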
There are shortcuts for generating common kinds of vectors:
seq(-3, 6, by = 1.75)
[1] -3.00 -1.25 0.50 2.25 4.00 5.75
rep(c(-1, 0, 1), times = 3)
[1] -1 0 1 -1 0 1 -1 0 1
rep(c(-1, 0, 1), each = 3)
[1] -1 -1 -1 0 0 0 1 1 1
We can get a special case of numeric vectors using the : shortcut:
n <- 12
1:n
[1] 1 2 3 4 5 6 7 8 9 10 11 12
n:4
[1] 12 11 10 9 8 7 6 5 4
Character vectors come up when dealing with data like names, addresses, and IDs:
first_names <- c("Andre", "Beth", "Carly", "Dan")
class(first_names)
[1] "character"
Factors encode a (modest) number of levels, like for gender, experimental group, or geographic region:
gender <- factor(c("M", "F", "F", "M"))
gender
[1] M F F M
Levels: F M
Character data usually can't go directly into a statistical model, but factor data can. It has an underlying numeric representation:
as.numeric(gender)
[1] 2 1 1 2
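The numeric representation is the position of each value among the sorted levels, not the label itself. That matters when the labels look like numbers. A small sketch with a made-up factor called dose:

```r
# A factor whose labels look like numbers (illustrative values)
dose <- factor(c("10", "20", "10", "50"))

levels(dose)                     # "10" "20" "50"
as.numeric(dose)                 # level codes: 1 2 1 3
as.numeric(as.character(dose))   # the labels as numbers: 10 20 10 50
```

Converting to character first is the usual way to recover the original numbers.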
We make logical vectors by defining binary conditions to check for. For example, we can look at which of the first names has at least 4 letters:
name_lengths <- nchar(first_names) # number of characters
name_lengths
[1] 5 4 5 3
name_lengths >= 4
[1] TRUE TRUE TRUE FALSE
You can do math with logical vectors, because TRUE = 1 and FALSE = 0:
name_lengths >= 4
[1] TRUE TRUE TRUE FALSE
mean(name_lengths >= 4)
[1] 0.75
What did this last line do?
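Since TRUE counts as 1 and FALSE as 0, summing a logical vector counts the TRUEs and averaging it gives the proportion of TRUEs. A short sketch using the same names as above:

```r
first_names <- c("Andre", "Beth", "Carly", "Dan")
name_lengths <- nchar(first_names)

sum(name_lengths >= 4)    # number of names with at least 4 letters: 3
mean(name_lengths >= 4)   # proportion of such names: 3 / 4 = 0.75
```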
Suppose we are interested in which names have an even number of letters or whose second letter is “a”:
even_length <- (name_lengths %% 2 == 0)
# %% is modulo operator: gives remainder when dividing
even_length
[1] FALSE TRUE FALSE FALSE
second_letter_a <- (substr(first_names, start=2, stop=2) == "a")
# substr: substring (portion) of a char vector
second_letter_a
[1] FALSE FALSE TRUE TRUE
& is AND (both conditions must be TRUE to be TRUE):
even_length & second_letter_a
[1] FALSE FALSE FALSE FALSE
| is OR (at least one condition must be TRUE to be TRUE):
even_length | second_letter_a
[1] FALSE TRUE TRUE TRUE
! is NOT (switches TRUE and FALSE):
!(even_length | second_letter_a)
[1] TRUE FALSE FALSE FALSE
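These operators combine the way you would expect from logic. As one illustration (not something the examples above rely on), negating an OR gives the same result as AND-ing the two negations:

```r
first_names <- c("Andre", "Beth", "Carly", "Dan")
even_length <- (nchar(first_names) %% 2 == 0)
second_letter_a <- (substr(first_names, start = 2, stop = 2) == "a")

# NOT (A OR B) is the same as (NOT A) AND (NOT B)
not_either <- !(even_length | second_letter_a)
both_not   <- !even_length & !second_letter_a

not_either
both_not
identical(not_either, both_not)   # TRUE
```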
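We can subset the vector in a number of ways:
By position, giving the indices of the entries to keep:
first_names[c(1, 4)]
[1] "Andre" "Dan"
By position, giving negative indices for the entries to drop:
first_names[-c(1, 4)]
[1] "Beth" "Carly"
With a logical vector (TRUE = keep, FALSE = drop):
first_names[even_length | second_letter_a]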
[1] "Beth" "Carly" "Dan"
first_names[gender != "F"] # != is "not equal"
[1] "Andre" "Dan"
%in% lets you avoid typing a lot of logical ORs (|):
first_names %in% c("Andre", "Carly", "Dan")
[1] TRUE FALSE TRUE TRUE
which gives the indices of TRUEs in a logical vector:
which(first_names %in% c("Andre", "Carly", "Dan"))
[1] 1 3 4
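To see what %in% is saving us, here is the same test written out with == and |. The names long_way and short_way are just for illustration.

```r
first_names <- c("Andre", "Beth", "Carly", "Dan")

# Spelled out with == and |
long_way <- first_names == "Andre" |
  first_names == "Carly" |
  first_names == "Dan"

# Same test with %in%
short_way <- first_names %in% c("Andre", "Carly", "Dan")

identical(long_way, short_way)   # TRUE
first_names[short_way]           # "Andre" "Carly" "Dan"
```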
Missing values are coded as NA entries without quotes:
vector_w_missing <- c(1, 2, NA, 4, 5, 6, NA)
Even one NA “poisons the well”: you'll get NA out of your calculations unless you remove them manually or with the extra argument na.rm = TRUE (in some functions):
mean(vector_w_missing)
[1] NA
mean(vector_w_missing, na.rm=TRUE)
[1] 3.6
WARNING: you can't test for missing values by seeing if they “equal” (==) NA:
vector_w_missing == NA
[1] NA NA NA NA NA NA NA
But you can use the is.na function:
is.na(vector_w_missing)
[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE
mean(vector_w_missing[!is.na(vector_w_missing)])
[1] 3.6
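Because is.na returns a logical vector, it combines with sum, which, and mean to describe how much is missing and where. A brief sketch:

```r
vector_w_missing <- c(1, 2, NA, 4, 5, 6, NA)

sum(is.na(vector_w_missing))     # how many are missing: 2
which(is.na(vector_w_missing))   # positions of the missing entries: 3 7
mean(is.na(vector_w_missing))    # proportion missing: 2 / 7
```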
Sometimes we might get positive or negative infinity (\( \pm \infty \)) or NaN (Not A Number) from our calculations:
c(-2, -1, 0, 1, 2) / 0
[1] -Inf -Inf NaN Inf Inf
You can check for these using functions like is.finite or is.nan.
is.finite(c(-2, -1, 0, 1, 2) / 0)
[1] FALSE FALSE FALSE FALSE FALSE
is.nan(c(-2, -1, 0, 1, 2) / 0)
[1] FALSE FALSE TRUE FALSE FALSE
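It can help to see how these checks treat NA, NaN, and infinite values side by side. The vector special below is made up for illustration:

```r
special <- c(1, NA, NaN, Inf, -Inf)

is.na(special)        # FALSE  TRUE  TRUE FALSE FALSE  (NaN counts as missing)
is.nan(special)       # FALSE FALSE  TRUE FALSE FALSE  (NA is not NaN)
is.finite(special)    #  TRUE FALSE FALSE FALSE FALSE
is.infinite(special)  # FALSE FALSE FALSE  TRUE  TRUE
```

Note that is.finite is FALSE for NA and NaN as well as for infinite values, so it is not simply the opposite of is.infinite.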
Like with data frames, we can use head and tail to preview vectors:
head(letters) # letters is a built-in vector
[1] "a" "b" "c" "d" "e" "f"
head(letters, 10)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
tail(letters)
[1] "u" "v" "w" "x" "y" "z"
We can also index vectors by assigning names to the entries.
a_vector <- 1:26
names(a_vector) <- LETTERS # capital version of letters
head(a_vector)
A B C D E F
1 2 3 4 5 6
a_vector[c("R", "S", "T", "U", "D", "I", "O")]
R S T U D I O
18 19 20 21 4 9 15
Names are nice for subsetting because they don't depend on your data being in a certain order.
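Here is a small sketch of why that matters. The scores are made-up values used only for illustration: after sorting, position 3 refers to a different person, but the name still finds the right entry.

```r
# Made-up scores, named by person (illustrative values only)
scores <- c(Andre = 88, Beth = 92, Carly = 79, Dan = 85)

scores["Carly"]   # lookup by name
scores[3]         # lookup by position (happens to be Carly here)

# Reorder from highest to lowest; names travel with the values
shuffled <- sort(scores, decreasing = TRUE)

shuffled["Carly"]    # still Carly's score
names(shuffled)[3]   # position 3 now belongs to "Dan"
```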
Matrices extend vectors to two dimensions: rows and columns.
(a_matrix <- matrix(letters[1:6], nrow=2, ncol=3))
[,1] [,2] [,3]
[1,] "a" "c" "e"
[2,] "b" "d" "f"
(b_matrix <- matrix(letters[1:6], nrow=2, ncol=3, byrow=TRUE))
[,1] [,2] [,3]
[1,] "a" "b" "c"
[2,] "d" "e" "f"
We can also make matrices by binding vectors together.
(c_matrix <- cbind(c(1, 2), c(3, 4), c(5, 6)))
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
(d_matrix <- rbind(c(1, 2, 3), c(4, 5, 6)))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
We subset matrices using the same methods as with vectors, except we refer to [rows, columns]:
a_matrix[1, 2] # row 1, column 2
[1] "c"
a_matrix[1, c(2, 3)] # row 1, columns 2 and 3
[1] "c" "e"
If a matrix ends up having just one row or column after subsetting, by default R will make it into a vector. You can prevent this behavior using drop=FALSE.
a_matrix[, 1] # all rows, column 1, becomes a vector
[1] "a" "b"
a_matrix[, 1, drop=FALSE] # all rows, column 1, stays a matrix
[,1]
[1,] "a"
[2,] "b"
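The difference shows up in the dimensions. A quick check with dim and is.matrix:

```r
a_matrix <- matrix(letters[1:6], nrow = 2, ncol = 3)

dim(a_matrix)                            # 2 3
dim(a_matrix[, 1])                       # NULL: a plain vector has no dimensions
dim(a_matrix[, 1, drop = FALSE])         # 2 1: still a matrix

is.matrix(a_matrix[, 1])                 # FALSE
is.matrix(a_matrix[, 1, drop = FALSE])   # TRUE
```

Keeping the matrix structure matters when later code expects to call things like nrow or %*% on the result.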
Matrices can be numeric, integer, character, or logical, just like vectors (a factor loses its factor class when placed in a matrix). Also like vectors, they must be all the same data type.
(bad_matrix <- cbind(1:2, letters[1:2]))
[,1] [,2]
[1,] "1" "a"
[2,] "2" "b"
class(bad_matrix)
[1] "matrix"
In this case, everything was converted to character so as not to lose information.
We can access dimension names or name them ourselves:
rownames(bad_matrix) <- c("Harry", "Draco")
colnames(bad_matrix) <- c("Potions grade", "Quidditch position")
bad_matrix
Potions grade Quidditch position
Harry "1" "a"
Draco "2" "b"
bad_matrix["Draco", , drop=FALSE]
Potions grade Quidditch position
Draco "2" "b"
Matrices of the same dimensions can have math performed entry-wise with the usual arithmetic operators:
cbind(c_matrix, d_matrix) # look at side by side
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 3 5 1 2 3
[2,] 2 4 6 4 5 6
3 * c_matrix / d_matrix
[,1] [,2] [,3]
[1,] 3.0 4.5 5
[2,] 1.5 2.4 3
To do matrix transpositions, use t().
(e_matrix <- t(c_matrix))
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
To do proper matrix multiplication (not entry-wise), use %*%.
(f_matrix <- d_matrix %*% e_matrix)
[,1] [,2]
[1,] 22 28
[2,] 49 64
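To see what %*% is doing, we can rebuild one entry by hand: each entry of the product is a row of the first matrix times a column of the second, summed. A short sketch using the matrices defined above:

```r
c_matrix <- cbind(c(1, 2), c(3, 4), c(5, 6))
d_matrix <- rbind(c(1, 2, 3), c(4, 5, 6))   # 2 x 3
e_matrix <- t(c_matrix)                      # 3 x 2

# Inner dimensions must match: columns of the first, rows of the second
ncol(d_matrix) == nrow(e_matrix)             # TRUE

f_matrix <- d_matrix %*% e_matrix            # result is 2 x 2

# Entry [1, 1]: row 1 of d_matrix times column 1 of e_matrix, summed
sum(d_matrix[1, ] * e_matrix[, 1])           # 1*1 + 2*3 + 3*5 = 22
f_matrix[1, 1]                               # 22
```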
To invert an invertible square matrix, use solve().
(g_matrix <- solve(f_matrix))
[,1] [,2]
[1,] 1.777778 -0.7777778
[2,] -1.361111 0.6111111
f_matrix %*% g_matrix
[,1] [,2]
[1,] 1 -3.552714e-15
[2,] 0 1.000000e+00
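The tiny value -3.552714e-15 in place of 0 is floating-point rounding error, not a mistake in the inverse. A sketch of how to compare with a tolerance, plus the two-argument form of solve for a linear system; the right-hand side b is an arbitrary vector chosen for illustration.

```r
f_matrix <- rbind(c(22, 28), c(49, 64))
g_matrix <- solve(f_matrix)

# Exact equality is too strict for floating-point results;
# all.equal() compares up to a small tolerance
isTRUE(all.equal(f_matrix %*% g_matrix, diag(2)))   # TRUE

# solve(A, b) solves A x = b directly, without forming the inverse
b <- c(1, 2)                  # arbitrary right-hand side for illustration
x <- solve(f_matrix, b)
all.equal(as.vector(f_matrix %*% x), b)
```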
To extract the diagonal of a matrix or make a diagonal matrix (usually the identity matrix), use diag().
diag(2)
[,1] [,2]
[1,] 1 0
[2,] 0 1
diag(g_matrix)
[1] 1.7777778 0.6111111
A list is an object that can store multiple types of data.
(my_list <- list("first_thing" = 1:5, "second_thing" = matrix(8:11, nrow = 2), "third_thing" = lm(dist ~ speed, data = cars)))
$first_thing
[1] 1 2 3 4 5
$second_thing
[,1] [,2]
[1,] 8 10
[2,] 9 11
$third_thing
Call:
lm(formula = dist ~ speed, data = cars)
Coefficients:
(Intercept) speed
-17.579 3.932
You can access a list element by its name or number in [[]], or a $ with its name:
my_list[["first_thing"]]
[1] 1 2 3 4 5
my_list$first_thing
[1] 1 2 3 4 5
my_list[[1]]
[1] 1 2 3 4 5
If you use one bracket to access list elements, you get a sublist back. The double brackets get the actual element in that location in the list.
str(my_list[1])
List of 1
$ first_thing: int [1:5] 1 2 3 4 5
str(my_list[[1]])
int [1:5] 1 2 3 4 5
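The distinction matters as soon as you try to compute with the result. A minimal sketch, using a smaller version of my_list so it runs on its own:

```r
my_list <- list("first_thing" = 1:5,
                "second_thing" = matrix(8:11, nrow = 2))

class(my_list[1])     # "list": a sublist of length 1
class(my_list[[1]])   # "integer": the element itself

# Arithmetic works on the element...
sum(my_list[[1]])     # 15

# ...but not on the sublist: sum(my_list[1]) raises an error,
# because a list is not a numeric vector
```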
You can use vector-style subsetting to get a sublist:
length(my_list[c(1, 2)])
[1] 2
str(my_list[c(1, 2)])
List of 2
$ first_thing : int [1:5] 1 2 3 4 5
$ second_thing: int [1:2, 1:2] 8 9 10 11
str(my_list[[3]])
List of 12
$ coefficients : Named num [1:2] -17.58 3.93
..- attr(*, "names")= chr [1:2] "(Intercept)" "speed"
$ residuals : Named num [1:50] 3.85 11.85 -5.95 12.05 2.12 ...
..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
$ effects : Named num [1:50] -303.914 145.552 -8.115 9.885 0.194 ...
..- attr(*, "names")= chr [1:50] "(Intercept)" "speed" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:50] -1.85 -1.85 9.95 9.95 13.88 ...
..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:50, 1:2] -7.071 0.141 0.141 0.141 0.141 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:50] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "speed"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.14 1.27
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 48
$ xlevels : Named list()
$ call : language lm(formula = dist ~ speed, data = cars)
$ terms :Classes 'terms', 'formula' length 3 dist ~ speed
.. ..- attr(*, "variables")= language list(dist, speed)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "dist" "speed"
.. .. .. ..$ : chr "speed"
.. ..- attr(*, "term.labels")= chr "speed"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(dist, speed)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
$ model :'data.frame': 50 obs. of 2 variables:
..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
..- attr(*, "terms")=Classes 'terms', 'formula' length 3 dist ~ speed
.. .. ..- attr(*, "variables")= language list(dist, speed)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "dist" "speed"
.. .. .. .. ..$ : chr "speed"
.. .. ..- attr(*, "term.labels")= chr "speed"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(dist, speed)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
- attr(*, "class")= chr "lm"
names(my_list[[3]])
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
my_list[[3]][["coefficients"]]
(Intercept) speed
-17.579095 3.932409
(speed_beta <- my_list[[3]][["coefficients"]]["speed"])
speed
3.932409
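For model objects, R also provides accessor functions, so you don't have to remember the internal element names. A short sketch with coef, refitting the same model under the illustrative name fit:

```r
fit <- lm(dist ~ speed, data = cars)

coef(fit)            # same values as fit[["coefficients"]]
coef(fit)["speed"]   # just the slope

all.equal(coef(fit), fit[["coefficients"]])   # TRUE
```

Direct list access is still worth knowing, since not every element of a fitted model has its own accessor.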
summary(lm_object) is also a list with more information, which has the side effect of printing some output to the console:
summary(my_list[[3]]) # this prints output
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
summary(my_list[[3]])[["coefficients"]] # a matrix
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.579095 6.7584402 -2.601058 1.231882e-02
speed 3.932409 0.4155128 9.463990 1.489836e-12
(speed_SE <- summary(my_list[[3]])[["coefficients"]]["speed", "Std. Error"])
[1] 0.4155128
speed_CI <- speed_beta + c(-qnorm(0.975), qnorm(0.975)) * speed_SE
names(speed_CI) <- c("lower", "upper")
Now you can include these values in a Markdown document:
A 1 mph increase in speed is associated with a `r round(speed_beta, 1)` ft increase in stopping distance (95% CI: (`r round(speed_CI["lower"],1)`, `r round(speed_CI["upper"],1)`)).
A 1 mph increase in speed is associated with a 3.9 ft increase in stopping distance (95% CI: (3.1, 4.7)).
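The interval above uses normal quantiles (qnorm). For comparison, here is a sketch of the t-based interval, which uses the residual degrees of freedom and is what confint reports for lm fits; it should come out slightly wider. The names fit, normal_CI, t_CI, and builtin_CI are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
speed_beta <- coef(fit)["speed"]
speed_SE <- summary(fit)[["coefficients"]]["speed", "Std. Error"]

# Normal-approximation interval, as computed above
normal_CI <- speed_beta + c(-1, 1) * qnorm(0.975) * speed_SE

# t-based interval using the residual degrees of freedom (48 here)
t_CI <- speed_beta + c(-1, 1) * qt(0.975, df = fit[["df.residual"]]) * speed_SE

# confint() computes the t-based interval for us
builtin_CI <- confint(fit, "speed", level = 0.95)

normal_CI
t_CI
builtin_CI
```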
str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
length(cars)
[1] 2
length(cars$dist) # should be same as nrow(cars)
[1] 50
cars[1, ]
speed dist
1 4 2
cars[1:5, "speed", drop = FALSE]
speed
1 4
2 4
3 7
4 7
5 8
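These results follow from a data frame being a list of equal-length columns, so list tools and matrix-style indexing both work on it. A brief sketch with the built-in cars data:

```r
is.list(cars)          # TRUE: a data frame is a list of columns
names(cars)            # "speed" "dist"
sapply(cars, class)    # class of each column
sapply(cars, length)   # every column has length nrow(cars)

# List-style and matrix-style access reach the same column
identical(cars[["dist"]], cars[, "dist"])   # TRUE
identical(cars$dist, cars[["dist"]])        # TRUE
```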
Two ways of calculating the same thing: which do you like better?
Classic R:
mean(swiss[swiss$Agriculture > mean(swiss$Agriculture), "Fertility"])
dplyr:
library(dplyr)
swiss %>%
filter(Agriculture > mean(Agriculture)) %>%
select(Fertility) %>%
summarize(mean = mean(Fertility))
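If the classic R one-liner is hard to read, it can be unpacked into the same steps the dplyr pipeline makes explicit. A sketch using only base R; the intermediate names above_avg and fertility_subset are illustrative.

```r
# Step 1: logical vector marking rows with above-average Agriculture
above_avg <- swiss$Agriculture > mean(swiss$Agriculture)

# Step 2: keep the Fertility values for those rows
fertility_subset <- swiss[above_avg, "Fertility"]

# Step 3: average them
mean(fertility_subset)

# Same number as the one-line version
identical(
  mean(fertility_subset),
  mean(swiss[swiss$Agriculture > mean(swiss$Agriculture), "Fertility"])
)
```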
You can do interactive R tutorials in swirl that cover these structure basics. To set up swirl:
install.packages("swirl")
library("swirl")
swirl()
Choose R Programming, pick a tutorial, and follow directions.
To get out of swirl, type bye() in the middle of a lesson, or 0 in the menus.
At this point, tutorials 1-8 are appropriate.
For homework, you'll be filling in a template R Markdown file that will walk you through performing multiple linear regression “by hand” and comparing it with what you get using lm(). It will involve simulating data (which I will do for you), matrix math, column and row names, and accessing list elements.