Here are some excerpts of the reports students prepared for the Week 3 homework on dplyr practice using the nycflights13 data. All have been edited slightly for formatting or to consolidate code since in many cases there was exploratory data analysis that preceded these results.

library(ggplot2)
library(dplyr)
library(nycflights13)
library(pander)

1 Weather and departure delays

Many students explored analyses related to the precipitation patterns we looked at in class. Here are a couple of nice examples of those.

1.1 Visibility

This analysis involves changing the departure delay variable into units of hours, merging the weather data onto the flights to get visibility information (matching airport, day, and hour), and then looking at a non-parametric estimate of the mean departure delay as a function of visibility. This reveals an interesting pattern at NYC area airports: generally, better visibility is associated with shorter departure delays on average, but delays in lowest visibility conditions (under 1 miles, so “fog” or “mist” by international standards) are typically shorter than with slightly better visibility. This pattern could be worth looking into in more detail.

flights %>%
    mutate(dep_delay_hr = dep_delay/60) %>%
    select(origin, year, month, day, hour, dep_delay_hr) %>%
    inner_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
    select(dep_delay_hr, visib) %>%
    filter(!is.na(dep_delay_hr) & !is.na(visib)) %>% 
    ggplot(aes(x = visib, y = dep_delay_hr)) +
    geom_smooth() +
    theme_bw(base_size = 16) +
    xlab("Visibility (miles)") +
    ylab("Average departure delay (hours)") +
    ggtitle("Trend in mean departure delay by visibility") +
    theme(plot.title=element_text(size=12))

1.2 Pressure

This student noted that high air pressure is often associated with nice weather, and we do see that as pressure increases, departure delays are typically shorter in the nycflights13 data.

flights %>%
    select(origin, dest, year, month, day, hour, dep_delay, arr_delay) %>%
    inner_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
    select(dep_delay, pressure) %>%
    filter(!is.na(dep_delay) & !is.na(pressure)) %>% 
    ggplot(aes(x = pressure, y = dep_delay)) +
    geom_smooth() +
    theme_bw(base_size = 10) +
    xlab("Sea-level Pressure at departure hour (millibars)") +
    ylab("Average departure delay (minutes)") +
    ggtitle("Impact of Sea-level Pressure on Departure Delays")

2 NYC-SAN speeds by carrier

This analysis looks at the airlines that fly to San Diego from New York, calculates how fast these flights were, and compares them across carriers. There is more variation in flight speeds than one might expect on ths common route, and some airlines appear to be faster than others.

flights %>%
    filter(dest == 'SAN') %>%
    left_join(planes, by = "tailnum") %>%
    left_join(airlines, by = "carrier") %>%
    mutate(milesperminute = distance / air_time,
           milesperhour = milesperminute * 60) %>%
    ggplot(aes(x = carrier, y = milesperhour, fill = carrier)) +
    geom_boxplot() +
    guides(fill=FALSE) +
    ggtitle("New York-San Diego: Speeds by Carrier") +
    ylab("Miles Per Hour") +
    coord_flip() +
    theme(axis.title.y = element_blank(),
          panel.grid.minor=element_blank(),
          panel.grid.major=element_blank()) +
    scale_x_discrete(breaks=c("AA", "B6", "DL", "UA"),
                     labels=c("American Airlines", "JetBlue Airways",
                              "Delta Air Lines", "United Airlines"))

3 United’s delays by NYC airport

This table looks at United Airline’s track record for departure and arrival delays leaving from the three NYC airports. It turns out that almost half of all United Airlines flights passing through EWR depart late, which appears to be substantially higher than JFK and LGA.

flights %>%
    rename(destination = dest,
           departure_delay = dep_delay,
           arrival_delay = arr_delay,
           time_in_air = air_time) %>%
    select(-year, -dep_time, -arr_time, -tailnum,
           -flight, -hour, -minute, -day) %>%
    mutate(ind_delayed_dep = ifelse(departure_delay > 0, 1, 0),
           ind_delayed_arr = ifelse(arrival_delay > 0, 1, 0)) %>%
    left_join(airlines, by = "carrier") %>%
    group_by(origin, name) %>%
    filter(carrier == "UA") %>%
    summarize(n_obs = n(),
              per_delayed_dep = round(sum(ind_delayed_dep, na.rm=TRUE) / n(),2),
              per_delayed_arr = round(sum(ind_delayed_arr, na.rm=TRUE) / n(),2)) %>%
    rename(Airport = origin,
           `Carrier Name` = name,
           `Number of Flights` = n_obs,
           `Proportion with Departure Delays` = per_delayed_dep,
           `Proportion with Arrival Delays` = per_delayed_arr) %>%
    pander(style = "rmarkdown", split.tables = 200)
Airport Carrier Name Number of Flights Proportion with Departure Delays Proportion with Arrival Delays
EWR United Air Lines Inc. 46087 0.49 0.38
JFK United Air Lines Inc. 4534 0.33 0.37
LGA United Air Lines Inc. 8044 0.38 0.36

4 Departure delays by month using base functions

It makes sense that delays might vary by the time of year. This analysis uses base R functions to examine this. I’m showing the original way the student did this using tapply (which is a base R equivalent of group_by and summarize) with barplot and the delightful rainbow color-making function!, but I’m also presenting an alternative version using dplyr and ggplot2 so that you can see the similarities and differences in the approaches.

4.1 tapply/base

# Get rid of NAs
flights_2 <- flights %>%
    filter(!is.na(dep_delay) & !is.na(month))
# Make string for months
months_Str <- c("JAN","FEB","MAR","APR","MAY","JUN",
                "JUL","AUG","SEP","OCT","NOV","DEC")
#Pull mean departure delays by month
mean_dep_delay_month <- tapply(flights_2$dep_delay, flights_2$month, mean)
color <- rainbow(12)
barplot(mean_dep_delay_month,
        names.arg = months_Str,
        main= "Average delayed departure time by month",
        col=color,
        xlab="Month",
        ylab="Time (Minutes)")

4.2 dplyr/ggplot2

months_Str <- c("JAN","FEB","MAR","APR","MAY","JUN",
                "JUL","AUG","SEP","OCT","NOV","DEC")

flights %>%
    filter(!is.na(dep_delay) & !is.na(month)) %>%
    group_by(month) %>%
    summarize(mean_dep_delay = mean(dep_delay)) %>%
    # make month categorical by wrapping factor() around it
    # and telling it the underlying labels
    ggplot(aes(x = factor(month, labels = months_Str),
               y = mean_dep_delay)) +
    # rainbow colors on the bars, but not related to the values
    # so not inside an aes(). color = "black" does the outline
    geom_bar(stat = "identity", fill = rainbow(12), color = "black") +
    xlab("Month") +
    ylab("Time (minutes)") +
    ggtitle("Average delayed departure time by month")

5 How many seats are on a plane?

This analysis investigated the years aircraft were manufactured (year on the planes dataset, not to be confused with year on the flights data!) and how many seats were on the planes. It doesn’t look like newer planes are holding more people, and you can see from the horizontal “stripes” in the plot that the airlines were probably making the same models in the same sizes year after year. It is also slightly terrifying that a plane made before 1970 would be flown in 2013 — perhaps this is a data error? One change I made is to use distinct in subsetting the data for plotting, as we only need one row per year per maker per seat count observation. Without it, we get the same result, but the plot takes considerably longer to render because the flights dataset has hundreds of thousands of points.

data_manufacturer <- flights %>%
    select(carrier, tailnum) %>%
    left_join(airlines, by = "carrier") %>%
    inner_join(planes, by = "tailnum") %>%
    #let's group Airbus and Airbus Industries into one category
    mutate(makers = ifelse((manufacturer == "AIRBUS" | manufacturer == "AIRBUS INDUSTRIE"), "AIRBUS", as.character(manufacturer))) 

# set up a vector of the top 3 within a list
list <- c("AIRBUS", 
          "BOEING",
          "EMBRAER")

# how has the number of seats per plane changed over time?  
ggplot(data = data_manufacturer %>%
           # filter to just the top 3 using the "in" function 
           filter(makers %in% list) %>% 
           select(makers, year, seats) %>%
           # just need one row per maker-year-seat
           # (many duplicates on the data!)
           distinct(makers, year, seats),
       aes(x = year, y=seats, color = makers)) +
  geom_point() +
  theme_light() + xlab("years") + ylab("number of seats") +
  ggtitle("How has number of seats per plane changed over time?") 

6 STL-bound delays by time of day

This student looked at flights going to St. Louis from NYC on the largest carriers serving that route and which times of day had higher departure delays. As we might have guessed, flights leaving later in the day are more delayed. Southwest passengers in particular seem to have really suffered in taking afternoon and evening flights from NYC!

flights %>%
    filter (dest == "STL") %>%
    left_join(airlines) %>%
    mutate(dep_delay_hours = (dep_delay/60)) %>%
    filter(dep_delay_hours < 10 & carrier %in% c("AA", "EV", "WN", "MQ")) %>%
    ggplot(aes(x = hour, y = dep_delay_hours, color = name))+
    facet_wrap( ~ name) +
    geom_point () +
    xlab ("Hour of the Day") +
    ylab ("Departure Delay (Hours)") +
    geom_line(stat = "smooth", method = "loess" , aes(group = name, color = name)) +
    theme_bw() +
    scale_color_manual(name = "Airline",
                       values = c("American Airlines Inc." = "royalblue4",
                                  "DeltaAir Lines Inc." = "blue4",
                                  "Envoy Air" = "springgreen4",
                                  "ExpressJet Airlines Inc." = "yellow",
                                  "Southwest Airlines Co." = "orange",
                                  "United Air Lines Inc." = "skyblue")) +
    guides(color=FALSE) +
    ggtitle("Delays in flights to St. Louis by time of day")