Pronto! is Seattle’s bike sharing program, which launched in fall 2014. You’ve probably seen the green bike docks around campus. (It has also been in the news in the past few months.)
You will be using data from the 2015 Pronto Cycle Share Data Challenge. These are available for download as a 75 MB ZIP file from https://s3.amazonaws.com/pronto-data/open_data_year_one.zip. (If the download link isn’t working for whatever reason, post on the Canvas forums and Rebecca will link to her copy.) Once unzipped, the folder containing all the files is around 900 MB. The open_data_year_one folder contains a README.txt file that you should reference for documentation.
Questions for you to answer appear as quoted blocks of text. Put the code you used to address each question, and any comments you have, below its block. Remember the guiding principle: don’t repeat yourself!
Set your working directory to be the open_data_year_one folder. Then use the list.files() command to return a character vector giving all the files in that folder, and store it to an object called files_in_year_one. Then use vector subsetting on files_in_year_one to remove the entries for README.txt (which isn’t data) and for 2015_status_data.csv (which is massive and doesn’t have interesting information, so we’re going to exclude it). Thus, files_in_year_one should be a character vector with three entries.
# load the libraries
library(readr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(stringr)
# my Rmd file is in the same folder as the open_data_year_one folder
# so filepaths will be relative to this
# make my vector of filenames in the open_data_year_one folder
(files_in_year_one <- list.files("open_data_year_one"))
## [1] "2015_station_data.csv" "2015_status_data.csv" "2015_trip_data.csv"
## [4] "2015_weather_data.csv" "README.txt"
# remove the status data and README
(files_in_year_one <- files_in_year_one[-c(2, 5)])
## [1] "2015_station_data.csv" "2015_trip_data.csv" "2015_weather_data.csv"
Note: if you put your .Rmd file inside “open_data_year_one”, then you need to be careful with which entries of files_in_year_one you keep or drop. You don’t want to include the entries that go with your .Rmd file or anything that comes up when you knit! Depending on what you named your .Rmd file and how it appears alphabetically relative to the files we do want, you would maybe need to refer to different entries.
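One way to sidestep the position problem described in the note is to select files by name rather than by index. The sketch below is only an illustration: the vector all_files stands in for the output of list.files(), and the hw5.Rmd and hw5.html entries are made-up names for a homework file and its knitted output.

```r
# Hypothetical listing: the data files plus an .Rmd and its knitted output
all_files <- c("2015_station_data.csv", "2015_status_data.csv",
               "2015_trip_data.csv", "2015_weather_data.csv",
               "README.txt", "hw5.Rmd", "hw5.html")

# keep only the CSV files, whatever else is in the folder
csv_files <- all_files[grepl("\\.csv$", all_files)]

# drop the status data by name instead of by position
files_in_year_one <- setdiff(csv_files, "2015_status_data.csv")
files_in_year_one
```

Because nothing here depends on alphabetical order, adding or renaming unrelated files in the folder should not change the result.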
We want to read the remaining CSV files into data frames stored in a list called data_list. Preallocate this using data_list <- vector("list", length(files_in_year_one)).
data_list <- vector("list", length(files_in_year_one))
We would like the names of the list entries to be simpler than the file names. For example, we want to read the 2015_station_data.csv file into data_list[["station_data"]], and 2015_trip_data.csv into data_list[["trip_data"]]. So, you should make a new vector called data_list_names, built from files_in_year_one, giving the names of the list entries these CSV files will be read into. Use the substr function to keep the portion of the files_in_year_one entries starting from the sixth character (which will drop the 2015_ part) and stopping at the number of characters of each filename string, minus 4 (which will drop the .csv part).
(data_list_names <- substr(files_in_year_one,
                           start = 6,
                           stop = nchar(files_in_year_one) - 4))
## [1] "station_data" "trip_data" "weather_data"
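The substr approach relies on every file name having a five-character prefix and a four-character extension. As an alternative sketch (not what the assignment asks for), a regular expression can strip the year prefix and the extension by pattern instead of by character position:

```r
files_in_year_one <- c("2015_station_data.csv", "2015_trip_data.csv",
                       "2015_weather_data.csv")

# capture everything between the "NNNN_" prefix and the ".csv" suffix
data_list_names <- sub("^[0-9]{4}_(.*)\\.csv$", "\\1", files_in_year_one)
data_list_names
```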
Set the names for data_list using the names function and the data_list_names vector.
names(data_list) <- data_list_names
data_list
## $station_data
## NULL
##
## $trip_data
## NULL
##
## $weather_data
## NULL
Then, write a for loop that uses read_csv from the readr package to read in the CSV files listed in files_in_year_one, looping over seq_along(files_in_year_one). Store each of these files to its corresponding entry in data_list. The data download demo might be a helpful reference.
You will want to use the cache=TRUE chunk option for this chunk; otherwise you’ll have to wait for the data to get read in every single time you knit. You will also want to make sure you are using readr::read_csv and not base R’s read.csv, as readr’s version is much faster, gives you a progress bar, and won’t convert all character variables to factors automatically.
# read in the data in the open_data_year_one folder
# paste0 to get the filepaths right
for(i in seq_along(files_in_year_one)) {
  data_list[[i]] <- read_csv(paste0("open_data_year_one/", files_in_year_one[i]))
}
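The loop above builds each path with paste0 and a hard-coded slash. A small variation, shown here only as a sketch, is base R's file.path(), which is vectorized and inserts the separator for you:

```r
files_in_year_one <- c("2015_station_data.csv", "2015_trip_data.csv",
                       "2015_weather_data.csv")

# one path per file, relative to the folder holding the .Rmd
paths <- file.path("open_data_year_one", files_in_year_one)
paths
```

Each element of paths could then be passed to read_csv inside the same for loop.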
Run str on data_list and look at how the variables came in using read_csv. Most should be okay, but some of the dates and times may be stored as character rather than dates or POSIXct date-time values. We also have lots of missing values for gender in the trip data because users who are not annual members do not report gender.
First, patch up the missing values for gender in data_list[["trip_data"]]: if a user is a Short-Term Pass Holder, then put "Unknown" as their gender. Don’t make new objects, but rather modify the entries in data_list directly (e.g. data_list[["trip_data"]] <- data_list[["trip_data"]] %>% mutate(...)).
str(data_list)
## List of 3
## $ station_data:Classes 'tbl_df', 'tbl' and 'data.frame': 54 obs. of 7 variables:
## ..$ id : int [1:54] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ name : chr [1:54] "3rd Ave & Broad St" "2nd Ave & Vine St" "6th Ave & Blanchard St" "2nd Ave & Blanchard St" ...
## ..$ terminal : chr [1:54] "BT-01" "BT-03" "BT-04" "BT-05" ...
## ..$ lat : num [1:54] 47.6 47.6 47.6 47.6 47.6 ...
## ..$ long : num [1:54] -122 -122 -122 -122 -122 ...
## ..$ dockcount: int [1:54] 18 16 16 14 18 20 18 20 20 16 ...
## ..$ online : chr [1:54] "10/13/2014" "10/13/2014" "10/13/2014" "10/13/2014" ...
## $ trip_data :Classes 'tbl_df', 'tbl' and 'data.frame': 142846 obs. of 12 variables:
## ..$ trip_id : int [1:142846] 431 432 433 434 435 436 437 438 439 440 ...
## ..$ starttime : chr [1:142846] "10/13/2014 10:31" "10/13/2014 10:32" "10/13/2014 10:33" "10/13/2014 10:34" ...
## ..$ stoptime : chr [1:142846] "10/13/2014 10:48" "10/13/2014 10:48" "10/13/2014 10:48" "10/13/2014 10:48" ...
## ..$ bikeid : chr [1:142846] "SEA00298" "SEA00195" "SEA00486" "SEA00333" ...
## ..$ tripduration : num [1:142846] 986 926 884 866 924 ...
## ..$ from_station_name: chr [1:142846] "2nd Ave & Spring St" "2nd Ave & Spring St" "2nd Ave & Spring St" "2nd Ave & Spring St" ...
## ..$ to_station_name : chr [1:142846] "Occidental Park / Occidental Ave S & S Washington St" "Occidental Park / Occidental Ave S & S Washington St" "Occidental Park / Occidental Ave S & S Washington St" "Occidental Park / Occidental Ave S & S Washington St" ...
## ..$ from_station_id : chr [1:142846] "CBD-06" "CBD-06" "CBD-06" "CBD-06" ...
## ..$ to_station_id : chr [1:142846] "PS-04" "PS-04" "PS-04" "PS-04" ...
## ..$ usertype : chr [1:142846] "Annual Member" "Annual Member" "Annual Member" "Annual Member" ...
## ..$ gender : chr [1:142846] "Male" "Male" "Female" "Female" ...
## ..$ birthyear : int [1:142846] 1960 1970 1988 1977 1971 1974 1978 1983 1974 1958 ...
## $ weather_data:Classes 'tbl_df', 'tbl' and 'data.frame': 366 obs. of 21 variables:
## ..$ Date : chr [1:366] "10/13/2014" "10/14/2014" "10/15/2014" "10/16/2014" ...
## ..$ Max_Temperature_F : int [1:366] 71 63 62 71 64 68 73 66 64 60 ...
## ..$ Mean_Temperature_F : int [1:366] 62 59 58 61 60 64 64 60 58 58 ...
## ..$ Min_TemperatureF : int [1:366] 54 55 54 52 57 59 55 55 55 57 ...
## ..$ Max_Dew_Point_F : int [1:366] 55 52 53 49 55 59 57 57 52 55 ...
## ..$ MeanDew_Point_F : int [1:366] 51 51 50 46 51 57 55 54 49 53 ...
## ..$ Min_Dewpoint_F : int [1:366] 46 50 46 42 41 55 53 50 46 48 ...
## ..$ Max_Humidity : int [1:366] 87 88 87 83 87 90 94 90 87 88 ...
## ..$ Mean_Humidity : int [1:366] 68 78 77 61 72 83 74 78 70 81 ...
## ..$ Min_Humidity : int [1:366] 46 63 67 36 46 68 52 67 58 67 ...
## ..$ Max_Sea_Level_Pressure_In : num [1:366] 30 29.8 30 30 29.8 ...
## ..$ Mean_Sea_Level_Pressure_In: num [1:366] 29.8 29.8 29.7 29.9 29.8 ...
## ..$ Min_Sea_Level_Pressure_In : num [1:366] 29.6 29.5 29.5 29.8 29.7 ...
## ..$ Max_Visibility_Miles : int [1:366] 10 10 10 10 10 10 10 10 10 10 ...
## ..$ Mean_Visibility_Miles : int [1:366] 10 9 9 10 10 8 10 10 10 6 ...
## ..$ Min_Visibility_Miles : int [1:366] 4 3 3 10 6 2 6 5 6 2 ...
## ..$ Max_Wind_Speed_MPH : int [1:366] 13 10 18 9 8 10 10 12 15 14 ...
## ..$ Mean_Wind_Speed_MPH : int [1:366] 4 5 7 4 3 4 3 5 8 8 ...
## ..$ Max_Gust_Speed_MPH : num [1:366] 21 17 25 0 0 0 18 0 21 22 ...
## ..$ Precipitation_In : num [1:366] 0 0.11 0.45 0 0.14 0.31 0 0.44 0.1 1.43 ...
## ..$ Events : chr [1:366] "Rain" "Rain" "Rain" "Rain" ...
# make gender "Unknown" when the user is a short-term pass holder
data_list[["trip_data"]] <- data_list[["trip_data"]] %>%
  mutate(gender = ifelse(usertype == "Short-Term Pass Holder",
                         "Unknown", gender))
Now, use dplyr::mutate_each, functions from the lubridate package, and the factor function to fix any date/times, as well as to convert the usertype and gender variables in the trip data to factor variables. Don’t make new objects, but rather modify the entries in data_list directly.
# make the date-times valid:
# station_data: make online a date
data_list[["station_data"]] <- data_list[["station_data"]] %>%
  mutate_each(funs(mdy), online)
# trip_data: starttime, stoptime should be date-time
data_list[["trip_data"]] <- data_list[["trip_data"]] %>%
  mutate_each(funs(mdy_hm), starttime, stoptime)
# weather_data: make Date a date
data_list[["weather_data"]] <- data_list[["weather_data"]] %>%
  mutate_each(funs(factor), usertype, gender) -> .unused_placeholder
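A note for anyone running this with a newer dplyr: mutate_each() and funs() were deprecated in later releases, and across() (dplyr 1.0.0 and later) covers the same ground. The sketch below shows the equivalent conversion on a tiny made-up tibble; the two rows are invented for illustration and are not taken from the Pronto data.

```r
library(dplyr)
library(lubridate)

# toy stand-in for data_list[["trip_data"]]
trips <- tibble(
  starttime = c("10/13/2014 10:31", "10/13/2014 10:32"),
  stoptime  = c("10/13/2014 10:48", "10/13/2014 10:50"),
  usertype  = c("Annual Member", "Short-Term Pass Holder"),
  gender    = c("Male", "Unknown")
)

trips_fixed <- trips %>%
  mutate(across(c(starttime, stoptime), mdy_hm),   # character -> POSIXct
         across(c(usertype, gender), factor))      # character -> factor

str(trips_fixed)
```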
The terminal, to_station_id, and from_station_id columns in data_list[["station_data"]] and data_list[["trip_data"]] have a two- or three-character code followed by a hyphen and a numeric code. These character codes convey the broad geographic region of the stations (e.g. CBD is Central Business District, PS is Pioneer Square, ID is International District). Write a function called region_extract that can extract these region codes by taking a character vector as input and returning another character vector that just has these initial character codes. For example, if I run region_extract(x = c("CBD-11", "ID-01")), it should give me as output a character vector with first entry "CBD" and second entry "ID".
Note: if you cannot get this working and need to move on with your life, try writing your function to just take the first two characters using substr and use that.
# function to return the alpha part of a string before the hyphen
region_extract <- function(x) {
  # matches uppercase letters from the beginning, as many times as
  # needed, until it runs into some other kind of character
  beg_letters <- "^[A-Z]*"
  return(str_extract(x, beg_letters))
}
# test it out:
region_extract(x = c("CBD-11", "ID-01"))
## [1] "CBD" "ID"
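For comparison, here is a base R sketch of the same idea that deletes everything from the first hyphen onward instead of matching the leading letters. The function name region_extract_base is made up for this illustration. Unlike the stringr version, it returns the input unchanged when there is no hyphen at all.

```r
# hypothetical base R variant: drop the first hyphen and everything after it
region_extract_base <- function(x) {
  sub("-.*$", "", x)
}

region_extract_base(c("CBD-11", "ID-01", "UW-04"))
```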
Then on data_list[["station_data"]] and data_list[["trip_data"]], make new columns called terminal_region, to_station_region, and from_station_region using your region_extract function.
# station_data: get region from terminal
data_list[["station_data"]] <- data_list[["station_data"]] %>%
  mutate(terminal_region = region_extract(terminal))
# trip_data: get region from to_station_id and from_station_id
data_list[["trip_data"]] <- data_list[["trip_data"]] %>%
  mutate(to_station_region = region_extract(to_station_id),
         from_station_region = region_extract(from_station_id))
The Events column in data_list[["weather_data"]] mentions if there was rain, thunderstorms, fog, etc. On some days you can see multiple weather events. Add a column to this data frame called Rain that takes the value "Rain" if there was rain, and "No rain" otherwise. You will need to use some string parsing since "Rain" is not always at the beginning of the string (but again, if you are running short on time, just look for "Rain" at the beginning using substr as a working but imperfect approach). Then convert the Rain variable to a factor.
# if we see "Rain" in Events on weather_data, flag it
data_list[["weather_data"]] <- data_list[["weather_data"]] %>%
  mutate(Rain = ifelse(str_detect(Events, "Rain"),
                       "Rain",
                       "No rain")) %>%
  # a lot of days had no events recorded -- say "No rain" on these
  mutate(Rain = ifelse(is.na(Rain),
                       "No rain",
                       Rain)) %>%
  # make it a factor
  mutate(Rain = factor(Rain))
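The second mutate is needed because str_detect returns NA when Events is missing. As a sketch of an alternative, base R's grepl returns FALSE for NA input, so the flag can be built in one step. The events vector below is made up to cover the cases of interest (rain first, rain later in the string, missing, no rain); its values are not copied from the weather file.

```r
# invented examples of what an Events column might contain
events <- c("Rain", "Fog , Rain", NA, "Fog", "Rain-Thunderstorm")

# grepl() gives FALSE for NA, so missing events fall through to "No rain"
rain <- factor(ifelse(grepl("Rain", events), "Rain", "No rain"),
               levels = c("No rain", "Rain"))
rain
```

Setting the levels explicitly keeps "No rain" first even if one of the two values happens to be absent from the data.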
You have bike station region information now, and rainy weather information. Make a new data frame called trips_weather that joins data_list[["trip_data"]] with data_list[["weather_data"]] by trip start date so that the Rain column is added to the trip-level data (just the Rain column please, none of the rest of the weather info). You may need to do some date manipulation and extraction as seen in Week 5 slides to get a date variable from the starttime column that you can use in merging.
trips_weather <- data_list[["trip_data"]] %>%
  # make a column for just the date, in "Date" format
  mutate(Date = as.Date(starttime)) %>%
  # merge onto weather data, with just the Date and Rain columns
  left_join(data_list[["weather_data"]] %>%
              # use as.Date to make sure it ends up in "Date" format
              mutate(Date = as.Date(Date)) %>%
              # keep only the join key and the Rain flag
              select(Date, Rain),
            by = "Date")
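To see what the join is doing, here is a small base R sketch on made-up data: it pulls the calendar date out of a date-time and then looks up only the Rain value for that date, leaving every other weather column behind. The three trips and two weather rows are invented for illustration.

```r
# toy trips with POSIXct start times (UTC, as mdy_hm would produce)
trips <- data.frame(
  trip_id = 1:3,
  starttime = as.POSIXct(c("2014-10-13 10:31", "2014-10-13 23:59",
                           "2014-10-14 08:00"), tz = "UTC")
)

# toy weather table with an extra column we do NOT want to carry over
weather <- data.frame(
  Date = as.Date(c("2014-10-13", "2014-10-14")),
  Rain = factor(c("Rain", "No rain"), levels = c("No rain", "Rain")),
  Max_Temperature_F = c(71, 63)
)

# date part of the start time, then a lookup of Rain by matching dates
trips$Date <- as.Date(trips$starttime, tz = "UTC")
trips$Rain <- weather$Rain[match(trips$Date, weather$Date)]
trips
```

Like a left join, the match() lookup keeps every trip and gives NA for any trip whose date is missing from the weather table.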
Now for the grand finale. Write a function daily_rain_rides that takes as input:
- region_code: a region code (e.g. "CBD", "UW")
- direction: indicates whether we are thinking of trips "from" or "to" a region
and inside the function does the following:
- Filters the data to trips that came from stations with that region code or went to stations with that region code (depending on the values of direction and region_code). For example, if I say region_code = "BT" (for Belltown) and direction = "from", then I want to keep rows for trips whose from_station_region is equal to "BT".
- Makes a data frame called temp_df with one row per day counting how many trips were in region_code going direction. This should have columns for trip starting date, how many trips there were that day, and whether there was rain or not that day. You’ll need to use dplyr::group_by and summarize.
- Uses temp_df to make a ggplot scatterplot (geom_point) with trip starting date on the horizontal axis, number of trips on the vertical axis, and points colored "black" for days with no rain and "deepskyblue" for days with rain. Make sure the legend is clear and that the x axis is easy to understand without being overly labeled (control this with scale_x_date). The title of the plot should be customized to say which region code is shown and which direction is analyzed (e.g. “Daily rides going to SLU”) using paste0. Feel free to use whatever theming you like on the plot or other tweaks to make it look great.
- Returns the ggplot object with all its layers.
daily_rain_rides <- function(region_code, direction) {
  # error out early unless direction matches "to" or "from"
  direction <- match.arg(direction, c("to", "from"))
  # filter data conditionally on direction and region_code
  if(direction == "to") {
    temp_1 <- trips_weather %>%
      filter(to_station_region == region_code)
  }
  if(direction == "from") {
    temp_1 <- trips_weather %>%
      filter(from_station_region == region_code)
  }
  # summarize trips per day in that direction
  temp_df <- temp_1 %>%
    group_by(Date, Rain) %>%
    tally()
  # plot, colored by weather
  ggplot(data = temp_df,
         aes(x = Date,
             y = n,
             color = Rain,
             group = Rain)) +
    geom_point() +
    geom_smooth() +
    scale_x_date(name = "Date") +
    ylab("Number of rides") +
    scale_color_manual(name = "Weather",
                       values = c("black", "deepskyblue")) +
    ggtitle(paste0("Daily rides going ", direction,
                   " ", region_code)) +
    theme_minimal()
}
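In the spirit of "don't repeat yourself", the two nearly identical filter branches could be collapsed by building the column name from direction. The base R sketch below shows just the filter-and-count step on made-up trips; count_daily is a hypothetical helper name, and the plotting layers are left out.

```r
# toy trip-level data with region columns already extracted
trips <- data.frame(
  Date = as.Date(c("2014-10-13", "2014-10-13", "2014-10-14", "2014-10-14")),
  from_station_region = c("BT", "CBD", "BT", "BT"),
  to_station_region   = c("CBD", "BT", "SLU", "BT"),
  stringsAsFactors = FALSE
)

# hypothetical helper: daily trip counts for one region and direction
count_daily <- function(df, region_code, direction) {
  direction <- match.arg(direction, c("to", "from"))
  # "to" -> "to_station_region", "from" -> "from_station_region"
  region_col <- paste0(direction, "_station_region")
  kept <- df[df[[region_col]] == region_code, ]
  # one row per date with the number of trips in column n
  as.data.frame(table(Date = kept$Date), responseName = "n",
                stringsAsFactors = FALSE)
}

from_bt <- count_daily(trips, "BT", "from")
to_bt   <- count_daily(trips, "BT", "to")
from_bt
```

The same column-name trick works inside a dplyr pipeline as well, but the base version keeps the sketch free of package dependencies.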
Then, test out your function: make three plots using daily_rain_rides, trying out different values of the region code and direction to show it works.
daily_rain_rides("SLU", "from")
daily_rain_rides("CH", "to")
daily_rain_rides("UW", "to")