Rebecca Ferrell
April 13, 2016
Today we'll talk about dplyr: a package that does in R just about any calculation you've tried to do in Excel, but more transparently, reproducibly, and safely. Don't be the sad research assistant who made this mistake (Reinhart and Rogoff):

Recall last week we used the filter command to subset data like so:
library(dplyr)
library(gapminder)
Canada <- gapminder %>%
    filter(country == "Canada")
Excel analogue:

A common use case: filtering rows to those whose values fall in some set. The c() function (combine, concatenate) is how we make vectors in R, an important data type. (We'll talk more about vectors next week.)
former_yugoslavia <- c("Bosnia and Herzegovina", "Croatia", "Macedonia", "Montenegro", "Serbia", "Slovenia")
yugoslavia <- gapminder %>%
    filter(country %in% former_yugoslavia)
tail(yugoslavia, 2)
Source: local data frame [2 x 6]
   country continent  year lifeExp     pop gdpPercap
    (fctr)    (fctr) (int)   (dbl)   (int)     (dbl)
1 Slovenia    Europe  2002  76.660 2011497  20660.02
2 Slovenia    Europe  2007  77.926 2009245  25768.26
You can see which combinations of values occur in your data for particular columns using distinct:
gapminder %>% distinct(continent, year)
Source: local data frame [60 x 6]
       country continent  year lifeExp      pop gdpPercap
        (fctr)    (fctr) (int)   (dbl)    (int)     (dbl)
1  Afghanistan      Asia  1952  28.801  8425333  779.4453
2  Afghanistan      Asia  1957  30.332  9240934  820.8530
3  Afghanistan      Asia  1962  31.997 10267083  853.1007
4  Afghanistan      Asia  1967  34.020 11537966  836.1971
5  Afghanistan      Asia  1972  36.088 13079460  739.9811
6  Afghanistan      Asia  1977  38.438 14880372  786.1134
7  Afghanistan      Asia  1982  39.854 12881816  978.0114
8  Afghanistan      Asia  1987  40.822 13867957  852.3959
9  Afghanistan      Asia  1992  41.674 16317921  649.3414
10 Afghanistan      Asia  1997  41.763 22227415  635.3414
..         ...       ...   ...     ...      ...       ...
The dplyr package makes use of an operator (not native to R) called a pipe. We write pipes like %>% (Ctrl+Shift+M in RStudio). Pipes take the object on the left and apply the function on the right: x %>% f(y) = f(x, y). Read out loud: “and then…”
gapminder %>% filter(country == "Canada") %>% head(2)
Source: local data frame [2 x 6]
  country continent  year lifeExp      pop gdpPercap
   (fctr)    (fctr) (int)   (dbl)    (int)     (dbl)
1  Canada  Americas  1952   68.75 14785584  11367.16
2  Canada  Americas  1957   69.96 17010154  12489.95
Pipes save us typing and allow chaining like above, so we use them all the time when manipulating data frames.
take_this_data %>%
    do_first_thing(with = this_value) %>%
    do_next_thing(using = that_value) %>% ...
Stuff to the left of the pipe is passed to the first argument of the function on the right. Any other arguments are written inside the function call as usual.
If you ever find yourself piping into a function where the data is not the first argument, use . as a placeholder in the data argument instead:
yugoslavia %>% lm(pop ~ year, data = .)
We can also filter at random to work with a smaller dataset using sample_n or sample_frac.
set.seed(0413) # makes random numbers repeatable
yugoslavia %>% sample_n(size = 6, replace = FALSE)
Source: local data frame [6 x 6]
     country continent  year lifeExp     pop gdpPercap
      (fctr)    (fctr) (int)   (dbl)   (int)     (dbl)
1 Montenegro    Europe  1962  63.728  474528  4649.594
2 Montenegro    Europe  1982  74.101  562548 11222.588
3     Serbia    Europe  1962  64.531 7616060  6289.629
4   Slovenia    Europe  1952  65.570 1489518  4215.042
5     Serbia    Europe  1952  57.996 6860147  3581.459
6    Croatia    Europe  1997  73.680 4444595  9875.605
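A similar sketch uses sample_frac to draw a proportion of rows instead of a fixed number (here 10% of the rows; output not shown):
yugoslavia %>% sample_frac(size = 0.10, replace = FALSE)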
Along with filtering the data to see certain rows, we might want to sort it:
yugoslavia %>% arrange(year, desc(pop))
Source: local data frame [60 x 6]
                  country continent  year lifeExp     pop gdpPercap
                   (fctr)    (fctr) (int)   (dbl)   (int)     (dbl)
1                  Serbia    Europe  1952  57.996 6860147 3581.4594
2                 Croatia    Europe  1952  61.210 3882229 3119.2365
3  Bosnia and Herzegovina    Europe  1952  53.820 2791000  973.5332
4                Slovenia    Europe  1952  65.570 1489518 4215.0417
5              Montenegro    Europe  1952  59.164  413834 2647.5856
6                  Serbia    Europe  1957  61.685 7271135 4981.0909
7                 Croatia    Europe  1957  64.770 3991242 4338.2316
8  Bosnia and Herzegovina    Europe  1957  58.450 3076000 1353.9892
9                Slovenia    Europe  1957  67.850 1533070 5862.2766
10             Montenegro    Europe  1957  61.448  442829 3682.2599
..                    ...       ...   ...     ...     ...       ...
Not only can we limit rows, but we can limit columns (and put them in the order listed) using select. 
yugoslavia %>% select(country, year, pop) %>% head(4)
Source: local data frame [4 x 3]
                 country  year     pop
                  (fctr) (int)   (int)
1 Bosnia and Herzegovina  1952 2791000
2 Bosnia and Herzegovina  1957 3076000
3 Bosnia and Herzegovina  1962 3349000
4 Bosnia and Herzegovina  1967 3585000
We can instead drop columns with select using - signs:
yugoslavia %>% select(-continent, -pop, -lifeExp) %>% head(4)
Source: local data frame [4 x 3]
                 country  year gdpPercap
                  (fctr) (int)     (dbl)
1 Bosnia and Herzegovina  1952  973.5332
2 Bosnia and Herzegovina  1957 1353.9892
3 Bosnia and Herzegovina  1962 1709.6837
4 Bosnia and Herzegovina  1967 2172.3524
select() has a variety of helper functions like starts_with, ends_with, and contains, or you can give a range of contiguous columns as startvar:endvar. These are very useful if you have a “wide” data frame with column names following a pattern or ordering. See ?select.
(US Dept. of Education “College Scorecard” data: > 100 columns)
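A quick sketch of these helpers on the gapminder data (output omitted):
gapminder %>% select(starts_with("co")) %>% head(2)   # country, continent
gapminder %>% select(contains("exp")) %>% head(2)     # lifeExp (matching is case-insensitive)
gapminder %>% select(country:lifeExp) %>% head(2)     # contiguous range from country through lifeExp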
We can rename columns using select, but that drops everything that isn't mentioned:
yugoslavia %>%
    select(Life_Expectancy = lifeExp) %>%
    head(4)
Source: local data frame [4 x 1]
  Life_Expectancy
            (dbl)
1           53.82
2           58.45
3           61.93
4           64.79
yugoslavia %>%
    select(country, year, lifeExp) %>%
    rename(Life_Expectancy = lifeExp) %>%
    head(4)
Source: local data frame [4 x 3]
                 country  year Life_Expectancy
                  (fctr) (int)           (dbl)
1 Bosnia and Herzegovina  1952           53.82
2 Bosnia and Herzegovina  1957           58.45
3 Bosnia and Herzegovina  1962           61.93
4 Bosnia and Herzegovina  1967           64.79
Good column names are self-describing. Don't use inscrutable abbreviations to save typing: RStudio's autocomplete takes away the pain of long variable names (hit Tab while writing code).
Valid “naked” column names can contain upper or lowercase letters, numbers, periods, and underscores. They must start with a letter or period and not be a special reserved word (e.g. TRUE, if).
Names are case-sensitive: Year and year are not the same thing!
You can include spaces or use reserved words if you put backticks around the name. Spaces can be worth including when preparing data for ggplot2 or pander since you don't have to rename axes or table headings.
library(pander)
yugoslavia %>% filter(country == "Serbia") %>%
    select(year, lifeExp) %>%
    rename(Year = year, `Life Expectancy` = lifeExp) %>%
    head(5) %>%
    pander(style = "rmarkdown", caption = "Serbian life expectancy")
| Year | Life Expectancy | 
|---|---|
| 1952 | 57.996 | 
| 1957 | 61.685 | 
| 1962 | 64.531 | 
| 1967 | 66.914 | 
| 1972 | 68.700 | 
Table: Serbian life expectancy
A thing you do in spreadsheets: add a column of formulas next to your data and drag it down.

dplyr way: add new columns to a data frame using mutate. (Add new columns and drop old ones using transmute.)
yugoslavia %>% filter(country == "Serbia") %>%
    select(year, pop, lifeExp) %>%
    mutate(pop_million = pop / 1000000,
           life_exp_past_40 = lifeExp - 40) %>%
    head(5)
Source: local data frame [5 x 5]
   year     pop lifeExp pop_million life_exp_past_40
  (int)   (int)   (dbl)       (dbl)            (dbl)
1  1952 6860147  57.996    6.860147           17.996
2  1957 7271135  61.685    7.271135           21.685
3  1962 7616060  64.531    7.616060           24.531
4  1967 7971222  66.914    7.971222           26.914
5  1972 8313288  68.700    8.313288           28.700
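For comparison, a minimal transmute() sketch, which keeps only the columns it creates or lists:
yugoslavia %>% filter(country == "Serbia") %>%
    transmute(year = year,
              pop_million = pop / 1000000) %>%
    head(3)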
A common function used in mutate (and in R programming in general) is ifelse(). It takes a logical test and returns one value wherever the test is TRUE and a different value wherever it is FALSE.
yugoslavia %>%
    mutate(short_country = ifelse(country == "Bosnia and Herzegovina", "B and H", as.character(country))) %>%
    select(short_country, year, pop) %>%
    arrange(year, short_country) %>%
    head(3)
Source: local data frame [3 x 3]
  short_country  year     pop
          (chr) (int)   (int)
1       B and H  1952 2791000
2       Croatia  1952 3882229
3    Montenegro  1952  413834
summarize takes your rows of data and computes something across them: count how many rows there are, calculate the mean or total, etc. You can use any function that aggregates multiple values into a single one (like sd).
In a spreadsheet:

yugoslavia %>%
    filter(year == 1982) %>%
    summarize(n_obs = n(),
              total_pop = sum(pop),
              mean_life_exp = mean(lifeExp),
              range_life_exp = max(lifeExp) - min(lifeExp))
Source: local data frame [1 x 4]
  n_obs total_pop mean_life_exp range_life_exp
  (int)     (int)         (dbl)          (dbl)
1     5  20042685       71.2952          3.939
Maybe you need to calculate the mean and standard deviation of a bunch of columns. With summarize_each(), you put the functions to use in a funs() list, and the variables to compute over after that (like select syntax).
yugoslavia %>%
    filter(year == 1982) %>%
    summarize_each(funs(mean, sd),
                   lifeExp, pop)
Source: local data frame [1 x 4]
  lifeExp_mean pop_mean lifeExp_sd  pop_sd
         (dbl)    (dbl)      (dbl)   (dbl)
1      71.2952  4008537   1.602685 3237282
The special function group_by() changes how functions operate on the data, most importantly summarize. These functions are then computed within each group defined by the grouping variables, rather than over all rows at once. Typically the variables you group by will be integers, factors, or characters, not continuous real values.
Excel analogue: pivot tables

yugoslavia %>% group_by(year) %>%
    summarize(num_countries = n_distinct(country),
              total_pop = sum(pop),
              total_gdp_per_cap = sum(pop * gdpPercap) / total_pop) %>%
    head(5)
Source: local data frame [5 x 4]
   year num_countries total_pop total_gdp_per_cap
  (int)         (int)     (int)             (dbl)
1  1952             5  15436728          3029.794
2  1957             5  16314276          4187.491
3  1962             5  17099107          5256.578
4  1967             5  17878535          6655.827
5  1972             5  18579786          8730.215
Grouping can also be used with mutate or filter to give rank orders within a group, lagged values, and cumulative sums. Much more on window functions is in a vignette.
yugoslavia %>% select(country, year, pop) %>%
    filter(year >= 2002) %>% group_by(country) %>%
    mutate(lag_pop = lag(pop, order_by = year),
           pop_chg = pop - lag_pop) %>% head(4)
Source: local data frame [4 x 5]
Groups: country [2]
                 country  year     pop lag_pop pop_chg
                  (fctr) (int)   (int)   (int)   (int)
1 Bosnia and Herzegovina  2002 4165416      NA      NA
2 Bosnia and Herzegovina  2007 4552198 4165416  386782
3                Croatia  2002 4481020      NA      NA
4                Croatia  2007 4493312 4481020   12292
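The example above uses lag(); as another sketch, a grouped mutate can also compute within-group ranks and running totals (output not shown):
yugoslavia %>% select(country, year, pop) %>%
    arrange(country, year) %>%                # make sure rows are in year order within country
    group_by(country) %>%
    mutate(pop_rank = min_rank(pop),          # 1 = smallest population within each country
           cumulative_pop = cumsum(pop)) %>%  # running population total within each country
    head(4)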
With the Gapminder data, practice these kinds of analyses yourself using the verbs above: filter, mutate (with ifelse()), group_by, and summarize.
Excel equivalents for merging data frames: VLOOKUP, MATCH

We need to think about the following when we want to merge data frames A and B:
- A %>% inner_join(B): keep rows from A that match rows in B, columns from both A and B
- A %>% left_join(B): keep all rows from A, matched with B wherever possible (NA when not), columns from both A and B
- A %>% right_join(B): keep all rows from B, matched with A wherever possible (NA when not), columns from both A and B
- A %>% full_join(B): keep all rows from either A or B, matched wherever possible (NA when not), columns from both A and B
- A %>% semi_join(B): keep rows from A that match rows in B, columns from just A
- A %>% anti_join(B): keep rows from A that don't match a row in B, columns from just A
We say rows should match because they have some columns containing the same value. We list these in a by = argument to the join.
- by omitted: matches using all variables in A and B that have identical names
- by = c("var1", "var2", "var3"): matches on identical values of var1, var2, and var3 in both A and B
- by = c("Avar1" = "Bvar1", "Avar2" = "Bvar2"): matches identical values of the Avar1 variable in A to the Bvar1 variable in B, and the Avar2 variable in A to the Bvar2 variable in B
Note: if there are multiple matches, you'll get one row for each possible combination (except with semi_join and anti_join). A toy sketch of this syntax follows below.
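A toy sketch of the by = syntax (the little data frames here are made up for illustration):
df_A <- data_frame(person_id = c(1, 2, 3), score = c(10, 20, 30))
df_B <- data_frame(id = c(1, 2, 4), group = c("x", "y", "z"))
df_A %>% left_join(df_B, by = c("person_id" = "id"))
# all rows of df_A are kept; person_id 3 has no match in df_B, so its group is NA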
(Need to get more complicated? You'll want to learn SQL.)
We'll use data in the nycflights13 package. Install and load it:
# install.packages("nycflights13")
library(nycflights13)
It includes five tables, some of which contain missing data (NA):
- flights: flights leaving JFK, LGA, or EWR in 2013
- airlines: airline abbreviations
- airports: airport metadata
- planes: airplane metadata
- weather: hourly weather data for JFK, LGA, and EWR
Who manufactures the planes that flew to Seattle?
flights %>% filter(dest == "SEA") %>% select(tailnum) %>%
    left_join(planes %>% select(tailnum, manufacturer), by = "tailnum") %>%
    distinct(manufacturer)
Source: local data frame [6 x 2]
  tailnum       manufacturer
    (chr)              (chr)
1  N594AS             BOEING
2  N503JB   AIRBUS INDUSTRIE
3  N3ETAA                 NA
4  N712JB             AIRBUS
5  N508JB CIRRUS DESIGN CORP
6  N531JB      BARKER JACK L
Which airlines had the most flights to Seattle from NYC?
flights %>% filter(dest == "SEA") %>% select(carrier) %>%
    left_join(airlines, by = "carrier") %>%
    group_by(name) %>% tally() %>% arrange(desc(n))
Source: local data frame [5 x 2]
                    name     n
                  (fctr) (int)
1   Delta Air Lines Inc.  1213
2  United Air Lines Inc.  1117
3   Alaska Airlines Inc.   714
4        JetBlue Airways   514
5 American Airlines Inc.   365
Is there a relationship between departure delays and wind gusts?
library(ggplot2)
flights %>% select(origin, year, month, day, hour, dep_delay) %>%
    inner_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
    select(dep_delay, wind_gust) %>%
    # removing rows with missing values
    filter(!is.na(dep_delay) & !is.na(wind_gust)) %>% 
    ggplot(aes(x = wind_gust, y = dep_delay)) +
    geom_point() + geom_smooth()
flights %>% select(origin, year, month, day, hour, dep_delay) %>%
    inner_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
    select(dep_delay, wind_gust) %>%
    filter(!is.na(dep_delay) & !is.na(wind_gust) & wind_gust < 250) %>% 
    ggplot(aes(x = wind_gust, y = dep_delay)) +
    geom_smooth() + theme_bw(base_size = 16) +
    xlab("Wind gusts in departure hour (mph)") +
    ylab("Average departure delay (minutes)")
Some possible questions to investigate: for example, use the time zone information in airports to convert flight arrival times to NYC local time.
Warning! flights has 336776 rows, so if you do a sloppy join, you can end up with many matches per observation and have the data blow up.
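One simple guard is to compare row counts before and after the join to catch unintended duplication; a sketch, assuming you join flights to weather on the usual keys:
nrow(flights)
flights %>%
    left_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
    nrow()
# if the second count is much larger than the first, the join matched multiple weather rows per flight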
Pick something to look at in the nycflights13 data and write up a .Rmd file showing your investigation. Upload both the .Rmd file and the .html file to Canvas. You must use at least once: mutate, summarize, group_by, and joins. Include at least one formatted plot or table.
This time, include all your code in your output document, using comments and line breaks separating commands so that it is clear to a peer what you are doing. You must write up your observations in words as well.