Rebecca Ferrell
May 11, 2016
We have a vector of numbers, and we want to add 1 to each entry.
my_vector <- rnorm(1000000)
A for
loop works but is relatively slow:
for_start <- proc.time() # start the clock
new_vector <- rep(NA, length(my_vector))
for(position in 1:length(my_vector)) {
new_vector[position] <- my_vector[position] + 1
}
(for_time <- proc.time() - for_start) # time elapsed
user system elapsed
2.057 0.039 2.180
Recognize that we can instead use R's vector addition (with recycling):
vec_start <- proc.time()
new_vector <- my_vector + 1
(vec_time <- proc.time() - vec_start)
user system elapsed
0.057 0.007 0.064
for_time / vec_time
user system elapsed
36.087719 5.571429 34.062500
Vector/matrix arithmetic is implemented using fast, optimized functions that a for
loop can't compete with.
rowSums
, colSums
, rowMeans
, colMeans
give sums or averages over rows or columns of matrices/data frames(a_matrix <- matrix(1:12, nrow = 3, ncol = 4))
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
rowSums(a_matrix)
[1] 22 26 30
Remember when Goofus tried find the mean for each variable in the swiss
data? The most Gallant solution is to just use colMeans
without even thinking about pre-allocation or for
loops:
colMeans(swiss)
Fertility Agriculture Examination Education
70.14255 50.65957 16.48936 10.97872
Catholic Infant.Mortality
41.14383 19.94255
cumsum
, cumprod
, cummin
, cummax
give back a vector with cumulative quantities (e.g. running totals)cumsum(1:7)
[1] 1 3 6 10 15 21 28
pmax
and pmin
take a matrix or set of vectors, output the min or max for each position (after recycling):pmax(c(0, 2, 4), c(1, 1, 1), c(2, 2, 2))
[1] 2 2 4
Use rnorm
to randomly generate a vector of length 10 million
Using a for
loop and timing your code, calculate another vector of length 10 million that gives the running total
Repeat using cumsum
instead of a loop
Did you get the same result both ways? What's the difference in speed? Which is easier to read?
mean
:
dplyr::filter
:
readr::read_csv
:
Functions can encapsulate repeated actions such as:
NA
ggplot
s used in reportsAdvanced uses for functions (not covered in this class):
Let's look at a function that takes a vector as input and outputs a named vector of the first and last elements:
first_and_last <- function(x) {
first <- x[1]
last <- x[length(x)]
return(c("first" = first, "last" = last))
}
Test it out:
first_and_last(c(4, 3, 1, 8))
first last
4 8
What if I give first_and_last
a vector of length 1?
first_and_last(7)
first last
7 7
Of length 0?
first_and_last(numeric(0))
first
NA
Maybe we want it to be a little smarter.
Let's make sure we get an error message is the vector is too small:
smarter_first_and_last <- function(x) {
if(length(x) == 0L) { # specify integers with L
stop("The input has no length!")
} else {
first <- x[1]
last <- x[length(x)]
return(c("first" = first, "last" = last))
}
}
smarter_first_and_last(numeric(0))
Error in smarter_first_and_last(numeric(0)): The input has no length!
smarter_first_and_last(c(4, 3, 1, 8))
first last
4 8
If you type the function name without any parentheses or arguments, you can see its guts:
smarter_first_and_last
function(x) {
if(length(x) == 0L) { # specify integers with L
stop("The input has no length!")
} else {
first <- x[1]
last <- x[length(x)]
return(c("first" = first, "last" = last))
}
}
x
or na.rm
in my_new_func <- function(x, na.rm = FALSE) {...}
na.rm = FALSE
is example of setting a default value: if user doesn't say what na.rm
is, it'll be FALSE
x
, na.rm
values won't exist in R outside of the functionreturn()
. Could be a vector, list, data frame, another function, or even nothing
Maybe you want to know more detailed quantile information than summary
gives you with interpretable names. Here's a starting point:
quantile_report <- function(x, na.rm = FALSE) {
quants <- quantile(x, probs = c(0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.90, 0.95, 0.99), na.rm = na.rm)
names(quants) <- c("Bottom 1%", "Bottom 5%", "Bottom 10%", "Bottom 25%", "Median", "Top 25%", "Top 10%", "Top 5%", "Top 1%")
return(quants)
}
quantile_report(rnorm(10000))
Bottom 1% Bottom 5% Bottom 10% Bottom 25% Median
-2.335569889 -1.660062418 -1.275845494 -0.688747891 -0.007850475
Top 25% Top 10% Top 5% Top 1%
0.675108750 1.280428326 1.647617563 2.331385793
lapply
applys a function over a list of any kind (e.g. a data frame), and returns a list. This is a lot easier than preparing a for
loop!
lapply(swiss, FUN = quantile_report)
$Fertility
Bottom 1% Bottom 5% Bottom 10% Bottom 25% Median Top 25%
38.588 47.580 56.240 64.700 70.400 78.450
Top 10% Top 5% Top 1%
84.600 90.670 92.454
$Agriculture
Bottom 1% Bottom 5% Bottom 10% Bottom 25% Median Top 25%
4.190 15.650 17.360 35.900 54.100 67.650
Top 10% Top 5% Top 1%
76.820 84.810 87.952
$Examination
Bottom 1% Bottom 5% Bottom 10% Bottom 25% Median Top 25%
3.00 5.00 6.00 12.00 16.00 22.00
Top 10% Top 5% Top 1%
26.00 30.40 36.08
$Education
Bottom 1% Bottom 5% Bottom 10% Bottom 25% Median Top 25%
1.46 2.00 3.00 6.00 8.00 12.00
Top 10% Top 5% Top 1%
23.20 29.00 43.34
$Catholic
Bottom 1% Bottom 5% Bottom 10% Bottom 25% Median Top 25%
2.2052 2.4480 2.8320 5.1950 15.1400 93.1250
Top 10% Top 5% Top 1%
99.0000 99.6140 99.8666
$Infant.Mortality
Bottom 1% Bottom 5% Bottom 10% Bottom 25% Median Top 25%
12.778 15.600 16.420 18.150 20.000 21.700
Top 10% Top 5% Top 1%
23.680 24.470 25.818
A downside to lapply
is that lists are not always natural to work with. sapply
will simplify the list output by converting each list element to a column in a matrix:
sapply(swiss, FUN = quantile_report)
Fertility Agriculture Examination Education Catholic
Bottom 1% 38.588 4.190 3.00 1.46 2.2052
Bottom 5% 47.580 15.650 5.00 2.00 2.4480
Bottom 10% 56.240 17.360 6.00 3.00 2.8320
Bottom 25% 64.700 35.900 12.00 6.00 5.1950
Median 70.400 54.100 16.00 8.00 15.1400
Top 25% 78.450 67.650 22.00 12.00 93.1250
Top 10% 84.600 76.820 26.00 23.20 99.0000
Top 5% 90.670 84.810 30.40 29.00 99.6140
Top 1% 92.454 87.952 36.08 43.34 99.8666
Infant.Mortality
Bottom 1% 12.778
Bottom 5% 15.600
Bottom 10% 16.420
Bottom 25% 18.150
Median 20.000
Top 25% 21.700
Top 10% 23.680
Top 5% 24.470
Top 1% 25.818
There's also a function just called apply
that works over matrices or data frames. You tell it whether to apply the function to each row (MARGIN = 1
) or column (MARGIN = 2
).
apply(swiss, MARGIN = 2, FUN = quantile_report)
Fertility Agriculture Examination Education Catholic
Bottom 1% 38.588 4.190 3.00 1.46 2.2052
Bottom 5% 47.580 15.650 5.00 2.00 2.4480
Bottom 10% 56.240 17.360 6.00 3.00 2.8320
Bottom 25% 64.700 35.900 12.00 6.00 5.1950
Median 70.400 54.100 16.00 8.00 15.1400
Top 25% 78.450 67.650 22.00 12.00 93.1250
Top 10% 84.600 76.820 26.00 23.20 99.0000
Top 5% 90.670 84.810 30.40 29.00 99.6140
Top 1% 92.454 87.952 36.08 43.34 99.8666
Infant.Mortality
Bottom 1% 12.778
Bottom 5% 15.600
Bottom 10% 16.420
Bottom 25% 18.150
Median 20.000
Top 25% 21.700
Top 10% 23.680
Top 5% 24.470
Top 1% 25.818
Maybe you often want to bucket variables in your data into groups based on quantiles:
Person | Income | Income Bucket |
---|---|---|
1 | 8000 | 1 |
2 | 103000 | 3 |
3 | 12000 | 1 |
4 | 52000 | 2 |
5 | 150000 | 3 |
6 | 45000 | 2 |
There's already a function in R called cut
that does this, but you need to tell it cutpoints or the number of buckets. Let's make a convenience function that calls cut
using quantiles for splitting and returns an integer:
bucket <- function(x, quants = c(0.333, 0.667)) {
# set low extreme, quantile points, high extreme
new_breaks <- c(min(x)-1, quantile(x, probs = quants), max(x)+1)
# labels = FALSE will return integer codes instead of ranges
return(cut(x, breaks = new_breaks, labels = FALSE))
}
rando_data <- rnorm(100)
bucketed_rando_data <- bucket(rando_data, quants = c(0.05, 0.25, 0.5, 0.75, 0.95))
plot(x = bucketed_rando_data, y = rando_data,
main = "Buckets and values")
Let's say we have data where impossible values occur:
(school_data <- data.frame(school = letters[1:10], pct_passing_exam = c(0.78, 0.55, 0.91, -1, 0.88, 0.81, 0.90, 0.76, 999, 999), pct_free_lunch = c(0.33, 999, 0.25, 0.05, 0.12, 0.09, 0.22, -13, 0.21, 999)))
school pct_passing_exam pct_free_lunch
1 a 0.78 0.33
2 b 0.55 999.00
3 c 0.91 0.25
4 d -1.00 0.05
5 e 0.88 0.12
6 f 0.81 0.09
7 g 0.90 0.22
8 h 0.76 -13.00
9 i 999.00 0.21
10 j 999.00 999.00
x
, cutoff for low
, cutoff for high
NA
in the extreme placesremove_extremes <- function(x, low, high) {
x_no_low <- ifelse(x < low, NA, x)
x_no_low_no_high <- ifelse(x_no_low > high, NA, x)
return(x_no_low_no_high)
}
remove_extremes(school_data$pct_passing_exam, low = 0, high = 1)
[1] 0.78 0.55 0.91 NA 0.88 0.81 0.90 0.76 NA NA
dplyr
functions summarize_each
and mutate_each
take an argument funs()
. This will apply our function to every variable (besides school
) to update the columns in school_data
:
library(dplyr)
school_data %>%
mutate_each(funs(remove_extremes(x = ., low = 0, high = 1)),
-school)
school pct_passing_exam pct_free_lunch
1 a 0.78 0.33
2 b 0.55 NA
3 c 0.91 0.25
4 d NA 0.05
5 e 0.88 0.12
6 f 0.81 0.09
7 g 0.90 0.22
8 h 0.76 NA
9 i NA 0.21
10 j NA NA
In the data downloading demo, I had this block of code to fix ID values that were mangled like "12345678.000000"
:
CA_OSHPD_util <- CA_OSHPD_util %>%
# translating the regular expression pattern:
# \\. matches the location of the period.
# 0+ matches at least one zero and possibly more following that period.
# replacement for period + 0s is nothing (empty string)
mutate_each(funs(gsub(pattern = "\\.0+",
x = .,
replacement = "")),
# variables to fix
FAC_NO, HSA, HFPA)
dplyr
uses what is called non-standard evaluation that lets you refer to “naked” variables (no quotes around them) like FAC_NO, HSA, HFPA
. There are standard evaluation versions of dplyr
functions that use the quoted versions instead, which can sometimes be more convenient. These end in an underscore (_
).
Example converting character data to dates from the data downloading demo:
yearly_frames[[i]] <- yearly_frames[[i]] %>%
# cnames is a character vector of var names
# 4th and 5th variables are strings to become dates
mutate_each_(funs(mdy), cnames[4:5])
You can skip naming your function in dplyr
if you won't use it again. Code below will return the mean divided by the standard deviation for each variable in swiss
:
swiss %>%
summarize_each(funs( mean(., na.rm = TRUE) / sd(., na.rm = TRUE) ))
Fertility Agriculture Examination Education Catholic Infant.Mortality
1 5.615134 2.230597 2.066884 1.141785 0.9865478 6.846766
Like with dplyr
, you can use anonymous functions in lapply
, but a difference is you'll need to have the function
part at the beginning:
lapply(swiss, function(x) mean(x, na.rm = TRUE) / sd(x, na.rm = TRUE))
$Fertility
[1] 5.615134
$Agriculture
[1] 2.230597
$Examination
[1] 2.066884
$Education
[1] 1.141785
$Catholic
[1] 0.9865478
$Infant.Mortality
[1] 6.846766
Let's say you have a particular way you like your charts:
library(gapminder); library(ggplot2)
ggplot(gapminder %>% filter(country == "Afghanistan"),
aes(x = year, y = pop / 1000000)) +
geom_line(color = "firebrick") +
xlab(NULL) + ylab("Population (millions)") +
ggtitle("Population of Afghanistan since 1952") +
theme_minimal() + theme(text = element_text(family = "Times"), plot.title = element_text(hjust = 0, size = 20))
gapminder
variable?We can have the user input a character string for cntry
as an argument to the function to get subsetting and the title right:
gapminder_lifeplot <- function(cntry) {
ggplot(gapminder %>% filter(country == cntry),
aes(x = year, y = lifeExp)) +
geom_line(color = "firebrick") +
xlab(NULL) + ylab("Life expectancy") +
ggtitle(paste0("Life expectancy in ", cntry, " since 1952")) +
theme_minimal() + theme(text = element_text(family = "Times"), plot.title = element_text(hjust = 0, size = 20))
}
gapminder_lifeplot(cntry = "Turkey")
Now let's allow the user to say which variable they want plotted on the y-axis. First, let's think about how we can get the right labels for the axis and title. A named character vector to serve as a “lookup table” inside the function would work:
y_axis_label <- c("lifeExp" = "Life expectancy",
"pop" = "Population (millions)",
"gdpPercap" = "GDP per capita, USD")
title_text <- c("lifeExp" = "Life expectancy in ",
"pop" = "Population of ",
"gdpPercap" = "GDP per capita in ")
# example use:
y_axis_label["pop"]
pop
"Population (millions)"
ggplot
is usually looking for “naked” variables, but we can tell it to take them as quoted strings using aes_string
instead of aes
, which is handy when making functions:
gapminder_plot <- function(cntry, yvar) {
y_axis_label <- c("lifeExp" = "Life expectancy", "pop" = "Population (millions)", "gdpPercap" = "GDP per capita, USD")[yvar]
title_text <- c("lifeExp" = "Life expectancy in ", "pop" = "Population of ", "gdpPercap" = "GDP per capita in ")[yvar]
ggplot(gapminder %>% filter(country == cntry) %>% mutate(pop = pop / 1000000), aes_string(x = "year", y = yvar)) + geom_line(color = "firebrick") + xlab(NULL) + ylab(y_axis_label) + ggtitle(paste0(title_text, cntry, " since 1952")) + theme_minimal() + theme(text = element_text(family = "Times"), plot.title = element_text(hjust = 0, size = 20))
}
gapminder_plot(cntry = "Turkey", yvar = "pop")
Something not working as hoped? Try using debug
on a function, which will show you the world as perceived from inside the function:
debug(gapminder_plot)
Then when you've fixed your problem, use undebug
so that you won't go into debug mode every time you run it:
undebug(gapminder_plot)
DUE IN TWO WEEKS: download and analyze data from the first year of Seattle's Pronto! bike sharing program.
You'll be writing a loop to read in the data, functions to clean it up, and another function to visualize ridership over the first year.
There is some string processing needed, much of which you have already seen or can probably Google, but some will come in next week's lecture.