CSSS 508, Week 1

Rebecca Ferrell
March 30, 2016

Course information

Course goals

Develop intermediate data analysis skills in R
Learn programming basics
Drink reproducibility Kool-Aid
Pick up some handy statistical techniques

Logistics

Savery 117 on Wednesdays, 3:30-5:20
Format: lecture -> short break -> lab exercise (approximately)
Materials: http://rebeccaferrell.github.io/CSSS508
Office hours: TBD, will be in Padelford C-14A
Grading:
- Final grade: Cr/NC, 60% to get Cr
- Homework most weeks (70% of grade), combo of reading and programming
- Participation = peer grading the HW (30% of grade)
- Grades and discussion forums maintained on Canvas page

Peer grading, really?!

Yes, peer grading…

Write your reports better knowing others will see them
Learn new tricks from reading others' code, “appreciate the little things”
Randomly assigned peers, turnaround time: 1 week (due before next class)
You'll get a rubric and are expected to leave constructive comments
High scoring assignments will be adapted into keys (with your permission)
Talk to me if you want more feedback

Getting help on the Canvas forums

Don't be Goofus:

tried lm(y~x) but it iddn't work wat do i do

Instead, be Gallant:

y <- seq(1:10) + rnorm(10)
x <- seq(0:10)
model <- lm(y ~ x)

Running the block above gives me the following error, anyone know why?

Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'x')

Rough schedule

Intro to R, RStudio, and R Markdown
Making awesome plots with ggplot2
Summarizing and combining data: dplyr
R data structures
Data cleaning: import/export, tidyr, dates
Programming and functions
More programming and functions
String/text processing
Taking requests for final weeks!

Data sources

Do you have a source of data you'd like to see used in this class? Talk to me: rferrell@uw.edu

Must have no restrictions on public use
Pertinent to social science, public policy, global health
Examples:
- Census public use microdata sample
- City of Seattle, King County, State of WA
- World Health Organization, World Bank, United Nations
- Data made public from studies

Using R Markdown Within RStudio

R and RStudio

is a programming language and software environment for statistics. You will learn how to write R code in this course.

is a front-end for R. You do not have to use it to use R, but it will make your life easier:

Organizes your code, output, and plotting windows
Code auto-completion and syntax highlighting
GUI for viewing data, objects in your environment
Much easier to make R Markdown documents

Selling you on R Markdown

Killer feature of R/RStudio is ease of making R Markdown files:

Document analyses as you go with integrated text, code, and output
- No rerunning, recopying, repasting
- Easy for collaborators to understand
- Show as little or as much of the guts as you want
Make presentations (like this one!), webpages, reports in html, Word formats
Works with LaTeX for PDFs, more formatting control

RStudio

Open up RStudio now (CSSCR lab instructions), and choose File > New File > R Script. Let's get oriented with the interface:

Top left: code editor pane, viewing data (tabbed)
Bottom left: console for running code (> prompt)
Top right: list of objects in environment, code history tab
Bottom right: tabs for browsing files, viewing plots, managing packages, and viewing help files

You can change the layout in Preferences > Pane Layout.

Editing and running code

There are several ways to run R code in RStudio:

Highlight lines in the editor window and click Run or hit Cntl-Enter (Command-Enter on a Mac) to run them all
Type individual lines in the console and press Enter
In R Markdown documents, click within a code block (“chunk”) and click on a choice in the Chunks dropdown

The console will show the lines you ran followed by any printed output. If you mess up (e.g. leave off a parenthesis), R might show a + sign prompting you to finish the command:

> (11 - 2
+

Finish the command or hit Esc to get out of this.

R as a calculator

In the console, type 123 + 456 + 789 and hit Enter.

123 + 456 + 789

[1] 1368

Now in your blank R document in the editor, try typing the line sqrt(400) and either clicking Run or hitting Cntl-Enter.

sqrt(400)

[1] 20

Functions and help

sqrt is an example of a function in R.

If we didn't have a good guess as to what sqrt will do, we can type ?sqrt in the console and look at the Help panel on the right.

?sqrt

Arguments are the inputs to a function. In this case, the only argument to sqrt is a number or set of numbers x.

Making R Markdown documents

Let's try making an R Markdown file:

Choose File > New File > R Markdown…
Make sure HTML output is selected and click OK
Save the file somewhere, call it my_first_Rmd.Rmd
Click the Knit HTML button
Watch the progress in the R Markdown pane, then gaze upon your result! You may also open up the file in your computer's browser if you so desire.

YAML headers

The header of an .Rmd file is a YAML block (Yet Another Markup Language) code, and everything else is part of the main document.

---
title: "Untitled"
author: "Rebecca Ferrell"
date: "March 30, 2016"
output: html_document
---

To mess with global formatting, you can modify the header:

output:
  html_document:
    theme: readable

R Markdown syntax

bold/strong emphasis

italic/normal emphasis

Header

Subheader

Subsubheader

Block quote from famous person

**bold/strong emphasis**

*italic/normal emphasis*

# Header

## Subheader

### Subsubheader

> Block quote from
> famous person

More R Markdown syntax

Ordered lists
Are real easy
1. Even with sublists

Unordered lists
Are also real easy
- Also even with sublists

URLs are trivial

pictures aren't bad either

1. Ordered lists
1. Are real easy
  + Even with sublists

* Unordered lists
* Are also real easy
  + Also even with sublists

[URLs are trivial](http://www.uw.edu)

![pictures aren't bad either](http://depts.washington.edu/uwcreate/img/UW_W-Logo_smallRGB.gif)

Nerdy syntax

You can put some math $ y= \left( \frac{2}{3} \right)^2 $ right up in there.

\[ \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}_n \]

Or a sentence with code-looking font.

y <- 1:5
z <- y^2

You can put some math $y= \left( \frac{2}{3} \right)^2$ right up in there

$$\frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}_n$$

Or a sentence with `code-looking font`.

```
y <- 1:5
z <- y^2
```

Your turn

Now try editing and re-knitting my_first_Rmd.Rmd to give it some pizzazz: add sections, have it number the sections, make a list, throw in the UW logo, change the theme, insert a gratuitous equation…

R Markdown formatting caveat

To keep R Markdown dead-simple, it lacks some features you might occasionally want to use. Your options for fancier documents are:

Use HTML/CSS for custom formatting
Use LaTeX and .Rnw files instead of .Rmd

For day-to-day use, plain vanilla R Markdown does the job.

R code and R Markdown

Lines of R code make up what are called chunks. Code is sandwiched between sets of three backticks and {r}.

```{r}
summary(cars)
```

summary(cars)

     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00

Chunk options

Chunks have options that control what happens with its code, such as:

echo=FALSE: keeps R code from being shown in the document
results='hide': hides R's (non-plot) output from the document
warning=FALSE, message=FALSE: suppresses R warnings or messages from the document, e.g. about versions
eval=FALSE: show R code in the document without running it
cache=TRUE: saves results of running that chunk so if it takes a while, you won't have to re-run it each time you re-knit the document
fig.height=5, fig.width=5: modify the dimensions of any plots that are generated in the chunk (units are in inches)

Playing with chunk options

Try adding or changing the chunk options (separated by commas) for the two chunks in my_first_Rmd.Rmd and re-knitting to check what happens.

You can also name your chunks by putting something after the r before the chunk options.

```{r summarize_cars, echo=FALSE}
summary(cars)
```

After you name your chunks, look what happens in the dropdown on the bottom left of your editor pane.

In-line R code

Sometimes we want to insert a value directly into our text. We do that using code in single backticks starting off with r.

Four score and seven years ago is the same as `r 4*20 + 7` years.

Four score and seven years ago is the same as 87 years.

Maybe we've saved a variable in a chunk we want to reference in the text:

x <- sqrt(77) # <- is how we assign variables

The value of `x` rounded to the nearest two decimals is `r round(x, 2)`.

The value of x rounded to the nearest two decimals is 8.77.

What I just showed you is AMAZING

Having R dump values directly into your document protects you from silly mistakes:

Never wonder “how did I come up with this quantity?” ever again — just look at your formula in your .Rmd file
Consistency, no “find/replace” mishaps: reference a variable in-line throughout your document without manually updating if the calculation changes (e.g. reporting sample sizes)
You are more likely to make a typo in a “hard-coded” number than you are to write R code that somehow runs but gives you the wrong thing

Example: keeping document date current

In your YAML header, make the date come from R's Sys.time() function by changing:

date: "March 30, 2016"

to:

date: " `r Sys.time()` "

Fancier option: use format(Sys.Date(), format="%B %d, %Y") instead to take today's date and make it read nicely.

Data frames

What's up with cars?

In the sample R Markdown document you are working on, cars is a built-in data frame. You can look at it in a couple of ways:

View(cars): pops up a spreadsheet-looking viewer (“interactive” use only, don't put in R Markdown document!) or

head(cars) # prints first 6 rows, see tail() too

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

Tell me more about cars...

str(cars) # str[ucture]

'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

summary(cars)

     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00

Drawing (ugly) pictures of cars

hist(cars$speed)

plot of chunk unnamed-chunk-9

hist(cars$dist)

plot of chunk unnamed-chunk-10

Drawing (marginally less ugly) pictures of cars

hist(cars$dist,
     xlab = "Distance (ft)",
     main = "Observed stopping distances of cars")

plot of chunk unnamed-chunk-12

Doing a little car math

If you put an assignment <- in parentheses (), R will print the output of the assignment out for you in your document. Otherwise, it won't show the value.

( dist_mean <- mean(cars$dist) )

[1] 42.98

( speed_mean <- mean(cars$speed) )

[1] 15.4

Drawing (frankly still ugly) pictures of cars

plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Speeds and stopping distances of cars",
     pch = 16)
abline(h = dist_mean, col = "firebrick")
abline(v = speed_mean, col = "cornflowerblue")

plot of chunk unnamed-chunk-15

Swiss time

Let's switch gears to the swiss data frame built in to R.

First, use ?swiss to see what things mean.

Add chunks to your R Markdown document inspecting swiss, defining variables, doing some exploratory plots using hist or plot. You can experiment with colors and shapes.

Looking at everything at once

pairs(swiss, pch = 8, col = "violet", main = "Pairwise comparisons of Swiss variables")

Linear regression

( f_ae_reg <- lm(Fertility ~ Agriculture + Education, data = swiss) )


Call:
lm(formula = Fertility ~ Agriculture + Education, data = swiss)

Coefficients:
(Intercept)  Agriculture    Education  
   84.08005     -0.06648     -0.96276

Try running summary(f_ae_reg).

Installing packages

Let's make a table that looks a little less code-y in the output. To do this, we'll want to install a package called pander.

In the console: install.packages("pander", repos = "http://cran.us.r-project.org").

You can leave off the repos argument if you configure your default repository settings: Preferences > Packages > CRAN mirror. I suggest the Hutch.
Once you install a package, you don't need to re-install it until you update R.

Making tables

library(pander) # loads pander, do once in your session
pander(summary(swiss), style = "rmarkdown", split.tables = 120)

Fertility	Agriculture	Examination	Education	Catholic	Infant.Mortality
Min. :35.00	Min. : 1.20	Min. : 3.00	Min. : 1.00	Min. : 2.150	Min. :10.80
1st Qu.:64.70	1st Qu.:35.90	1st Qu.:12.00	1st Qu.: 6.00	1st Qu.: 5.195	1st Qu.:18.15
Median :70.40	Median :54.10	Median :16.00	Median : 8.00	Median : 15.140	Median :20.00
Mean :70.14	Mean :50.66	Mean :16.49	Mean :10.98	Mean : 41.144	Mean :19.94
3rd Qu.:78.45	3rd Qu.:67.65	3rd Qu.:22.00	3rd Qu.:12.00	3rd Qu.: 93.125	3rd Qu.:21.70
Max. :92.50	Max. :89.70	Max. :37.00	Max. :53.00	Max. :100.000	Max. :26.60

Data looks a little nicer this way

pander(head(swiss, 5), style = "rmarkdown", split.tables = 120)

	Fertility	Agriculture	Examination	Education	Catholic	Infant.Mortality
Courtelary	80.2	17	15	12	9.96	22.2
Delemont	83.1	45.1	6	9	84.84	22.2
Franches-Mnt	92.5	39.7	5	5	93.4	20.2
Moutier	85.8	36.5	12	7	33.77	20.3
Neuveville	76.9	43.5	17	15	5.16	20.6

Homework

Write up a .Rmd file showing some exploratory analyses of the Swiss fertility data. Upload both the .Rmd file and the .html file to Canvas.

Mix in-line R calculations, tables, R output, and plots with text describing the relationships you see. You must use in-line R calculations/references at least once (e.g. with functions like nrow, mean, sd, cor, median, min, quantile) and may not hard-code any numbers referenced in your text.

Your document should be pleasant for a peer to look at, with some organization using sections or lists, and all plots labeled clearly. Use chunk options echo and results to limit the code/output you show in the .html. Discussion of specific values should be summarized in sentences in your text — not as printed code and output — and rounded so as not to be absurdly precise.

Grading rubric

0: Didn't turn anything in
1: Turned in but low effort, ignoring many directions
2: Decent effort, followed directions with some minor issues
3: Nailed it