Practice using relational data

library(tidyverse)
library(nycflights13)
theme_set(theme_minimal())

Run the code below in your console to download this exercise as a set of R scripts.

usethis::use_course("cis-ds/data-wrangling-relational-data-and-factors")

For each exercise, use your knowledge of relational data and joining operations to compute a table or graph that answers the question. All questions use data frames from the nycflights13 package (if you have not previously installed it, do so using install.packages("nycflights13")).

Is there a relationship between the age of a plane and its departure delays?

Hint: all the data is from 2013.

Click for the solution

The first step is to calculate the age of each plane. To do that, use planes and the age variable:

(plane_ages <- planes %>%
  mutate(age = 2013 - year) %>%
  select(tailnum, age))
## # A tibble: 3,322 x 2
##    tailnum   age
##    <chr>   <dbl>
##  1 N10156      9
##  2 N102UW     15
##  3 N103US     14
##  4 N104UW     14
##  5 N10575     11
##  6 N105UW     14
##  7 N107US     14
##  8 N108UW     14
##  9 N109UW     14
## 10 N110UW     14
## # … with 3,312 more rows

The best approach to answering this question is a visualization. There are several different types of visualizations you could implement (e.g. scatterplot with smoothing line, line graph of average delay by age). The important thing is that we need to combine flights with plane_ages to determine for each flight the age of the plane. This is another mutating join. The best choice is inner_join() as this will automatically remove any rows in flights where we don’t have age data on the plane.

# smoothing line
flights %>%
  inner_join(y = plane_ages) %>%
  ggplot(mapping = aes(x = age, y = dep_delay)) +
  geom_smooth()
## Joining, by = "tailnum"
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9374 rows containing non-finite values (stat_smooth).

# line graph of average delay by age
flights %>%
  inner_join(y = plane_ages) %>%
  group_by(age) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = age, y = delay)) +
  geom_point() +
  geom_line()
## Joining, by = "tailnum"
## `summarise()` ungrouping output (override with `.groups` argument)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 row(s) containing missing values (geom_path).

In this situation, left_join() could also be used because ggplot() and mean(na.rm = TRUE) drop missing values (remember that left_join() keeps all rows from flights, even if we don’t have information on the plane).

flights %>%
  left_join(y = plane_ages) %>%
  ggplot(mapping = aes(x = age, y = dep_delay)) +
  geom_smooth()
## Joining, by = "tailnum"
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 61980 rows containing non-finite values (stat_smooth).

flights %>%
  left_join(y = plane_ages) %>%
  group_by(age) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = age, y = delay)) +
  geom_point() +
  geom_line()
## Joining, by = "tailnum"
## `summarise()` ungrouping output (override with `.groups` argument)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 row(s) containing missing values (geom_path).

The important takeaway is that departure delays do not appear to increase with plane age – in fact they seem to decrease slightly (though with an expanding confidence interval). Care to think of a reason why this may be so?

Add the location of the origin and destination (i.e. the lat and lon) to flights.

Click for the solution

This is a mutating join, and the basic function you need to use here is left_join(). We have to perform the joining operation twice since we want to create new variables based on both the destination airport and the origin airport. And because the name of the key variable differs between the data frames, we need to explicitly define how to join the data frames using the by argument:

flights %>%
  left_join(y = airports, by = c(dest = "faa")) %>%
  left_join(y = airports, by = c(origin = "faa"))
## # A tibble: 336,776 x 33
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 25 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   name.x <chr>, lat.x <dbl>, lon.x <dbl>, alt.x <dbl>, tz.x <dbl>,
## #   dst.x <chr>, tzone.x <chr>, name.y <chr>, lat.y <dbl>, lon.y <dbl>,
## #   alt.y <dbl>, tz.y <dbl>, dst.y <chr>, tzone.y <chr>

Notice that with this approach, we are joining all of the columns in airports. The instructions just asked for latitude and longitude, so we can create a copy of airports that only includes the necessary variables (lat and lon, plus the primary key variable faa) and join flights to that data frame:

airports_lite <- airports %>%
  select(faa, lat, lon)

flights %>%
  left_join(y = airports_lite, by = c(dest = "faa")) %>%
  left_join(y = airports_lite, by = c(origin = "faa"))
## # A tibble: 336,776 x 23
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 15 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   lat.x <dbl>, lon.x <dbl>, lat.y <dbl>, lon.y <dbl>

This is better, but now we have two sets of latitude and longitude variables in the data frame: one for the destination airport, and one for the origin airport. When we perform the second left_join() operation, to avoid duplicate variable names the function automatically adds generic .x and .y suffixes to the output to disambiguate them. This is nice, but we might want something more intuitive to explicitly identify which variables are associated with the destination vs. the origin. To do that, we override the default suffix argument with custom suffixes:

airports_lite <- airports %>%
  select(faa, lat, lon)

flights %>%
  left_join(y = airports_lite, by = c(dest = "faa")) %>%
  left_join(y = airports_lite, by = c(origin = "faa"), suffix = c(".dest", ".origin"))
## # A tibble: 336,776 x 23
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 15 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   lat.dest <dbl>, lon.dest <dbl>, lat.origin <dbl>, lon.origin <dbl>

Acknowledgements

Session Info

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.4 (2021-02-15)
##  os       macOS Big Sur 10.16         
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2021-05-25                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version date       lib source        
##  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
##  backports      1.2.1   2020-12-09 [1] CRAN (R 4.0.2)
##  blogdown       1.3     2021-04-14 [1] CRAN (R 4.0.2)
##  bookdown       0.22    2021-04-22 [1] CRAN (R 4.0.2)
##  broom          0.7.6   2021-04-05 [1] CRAN (R 4.0.4)
##  bslib          0.2.5   2021-05-12 [1] CRAN (R 4.0.4)
##  cachem         1.0.5   2021-05-15 [1] CRAN (R 4.0.2)
##  callr          3.7.0   2021-04-20 [1] CRAN (R 4.0.2)
##  cellranger     1.1.0   2016-07-27 [1] CRAN (R 4.0.0)
##  cli            2.5.0   2021-04-26 [1] CRAN (R 4.0.2)
##  colorspace     2.0-1   2021-05-04 [1] CRAN (R 4.0.2)
##  crayon         1.4.1   2021-02-08 [1] CRAN (R 4.0.2)
##  DBI            1.1.1   2021-01-15 [1] CRAN (R 4.0.2)
##  dbplyr         2.1.1   2021-04-06 [1] CRAN (R 4.0.4)
##  desc           1.3.0   2021-03-05 [1] CRAN (R 4.0.2)
##  devtools       2.4.1   2021-05-05 [1] CRAN (R 4.0.2)
##  digest         0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
##  dplyr        * 1.0.6   2021-05-05 [1] CRAN (R 4.0.2)
##  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.0.2)
##  evaluate       0.14    2019-05-28 [1] CRAN (R 4.0.0)
##  fansi          0.4.2   2021-01-15 [1] CRAN (R 4.0.2)
##  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.0.2)
##  forcats      * 0.5.1   2021-01-27 [1] CRAN (R 4.0.2)
##  fs             1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
##  generics       0.1.0   2020-10-31 [1] CRAN (R 4.0.2)
##  ggplot2      * 3.3.3   2020-12-30 [1] CRAN (R 4.0.2)
##  glue           1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
##  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.0.0)
##  haven          2.4.1   2021-04-23 [1] CRAN (R 4.0.2)
##  here           1.0.1   2020-12-13 [1] CRAN (R 4.0.2)
##  hms            1.1.0   2021-05-17 [1] CRAN (R 4.0.4)
##  htmltools      0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
##  httr           1.4.2   2020-07-20 [1] CRAN (R 4.0.2)
##  jquerylib      0.1.4   2021-04-26 [1] CRAN (R 4.0.2)
##  jsonlite       1.7.2   2020-12-09 [1] CRAN (R 4.0.2)
##  knitr          1.33    2021-04-24 [1] CRAN (R 4.0.2)
##  lifecycle      1.0.0   2021-02-15 [1] CRAN (R 4.0.2)
##  lubridate      1.7.10  2021-02-26 [1] CRAN (R 4.0.2)
##  magrittr       2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
##  memoise        2.0.0   2021-01-26 [1] CRAN (R 4.0.2)
##  modelr         0.1.8   2020-05-19 [1] CRAN (R 4.0.0)
##  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.0.0)
##  nycflights13 * 1.0.2   2021-04-12 [1] CRAN (R 4.0.2)
##  pillar         1.6.1   2021-05-16 [1] CRAN (R 4.0.4)
##  pkgbuild       1.2.0   2020-12-15 [1] CRAN (R 4.0.2)
##  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
##  pkgload        1.2.1   2021-04-06 [1] CRAN (R 4.0.2)
##  prettyunits    1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
##  processx       3.5.2   2021-04-30 [1] CRAN (R 4.0.2)
##  ps             1.6.0   2021-02-28 [1] CRAN (R 4.0.2)
##  purrr        * 0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
##  R6             2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
##  Rcpp           1.0.6   2021-01-15 [1] CRAN (R 4.0.2)
##  readr        * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)
##  readxl         1.3.1   2019-03-13 [1] CRAN (R 4.0.0)
##  remotes        2.3.0   2021-04-01 [1] CRAN (R 4.0.2)
##  reprex         2.0.0   2021-04-02 [1] CRAN (R 4.0.2)
##  rlang          0.4.11  2021-04-30 [1] CRAN (R 4.0.2)
##  rmarkdown      2.8     2021-05-07 [1] CRAN (R 4.0.2)
##  rprojroot      2.0.2   2020-11-15 [1] CRAN (R 4.0.2)
##  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.0.2)
##  rvest          1.0.0   2021-03-09 [1] CRAN (R 4.0.2)
##  sass           0.4.0   2021-05-12 [1] CRAN (R 4.0.2)
##  scales         1.1.1   2020-05-11 [1] CRAN (R 4.0.0)
##  sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
##  stringi        1.6.1   2021-05-10 [1] CRAN (R 4.0.2)
##  stringr      * 1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
##  testthat       3.0.2   2021-02-14 [1] CRAN (R 4.0.2)
##  tibble       * 3.1.1   2021-04-18 [1] CRAN (R 4.0.2)
##  tidyr        * 1.1.3   2021-03-03 [1] CRAN (R 4.0.2)
##  tidyselect     1.1.1   2021-04-30 [1] CRAN (R 4.0.2)
##  tidyverse    * 1.3.1   2021-04-15 [1] CRAN (R 4.0.2)
##  usethis        2.0.1   2021-02-10 [1] CRAN (R 4.0.2)
##  utf8           1.2.1   2021-03-12 [1] CRAN (R 4.0.2)
##  vctrs          0.3.8   2021-04-29 [1] CRAN (R 4.0.2)
##  withr          2.4.2   2021-04-18 [1] CRAN (R 4.0.2)
##  xfun           0.23    2021-05-15 [1] CRAN (R 4.0.2)
##  xml2           1.3.2   2020-04-23 [1] CRAN (R 4.0.0)
##  yaml           2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
## 
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library