Practice transforming college education (data)
library(tidyverse)
Run the code below in your console to download this exercise as a set of R scripts.
usethis::use_course("cis-ds/data-transformation")
The Department of Education collects annual statistics on colleges and universities in the United States. I have included a subset of this data from 2018-19 in the rcis
library from GitHub. To install the package, run the command remotes::install_github("cis-ds/rcis")
in the console.
remotes
library installed, you will get an error. Go back and install this first using install.packages("remotes")
, then run remotes::install_github("cis-ds/rcis")
.library(rcis)
##
## Attaching package: 'rcis'
## The following objects are masked from 'package:rcfss':
##
## add_ci, bechdel, cfss_notes, cfss_slides, err.rate.rf,
## err.rate.tree, logit2prob, mse, mse_vec, plot_ci, prob2logodds,
## prob2odds, xaringan, xaringan_wide
data("scorecard")
glimpse(scorecard)
## Rows: 1,732
## Columns: 14
## $ unitid <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009…
## $ name <chr> "Alabama A & M University", "University of Alabama at Birmin…
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ type <fct> "Public", "Public", "Public", "Public", "Public", "Public", …
## $ admrate <dbl> 0.9175, 0.7366, 0.8257, 0.9690, 0.8268, 0.9044, 0.8067, 0.53…
## $ satavg <dbl> 939, 1234, 1319, 946, 1261, 1082, 1300, 1230, 1066, NA, 1076…
## $ cost <dbl> 23053, 24495, 23917, 21866, 29872, 19849, 31590, 32095, 3431…
## $ netcost <dbl> 14990, 16953, 15860, 13650, 22597, 13987, 24104, 22107, 2071…
## $ avgfacsal <dbl> 69381, 99441, 87192, 64989, 92619, 71343, 96642, 56646, 5400…
## $ pctpell <dbl> 0.7019, 0.3512, 0.2536, 0.7627, 0.1772, 0.4644, 0.1455, 0.23…
## $ comprate <dbl> 0.2974, 0.6340, 0.5768, 0.3276, 0.7110, 0.3401, 0.7911, 0.69…
## $ firstgen <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381…
## $ debt <dbl> 15250, 15085, 14000, 17500, 17671, 12000, 17500, 16000, 1425…
## $ locale <fct> City, City, City, City, City, City, City, City, City, Suburb…
glimpse()
is part of the tibble
package and is a transposed version of print()
: columns run down the page, and data runs across. With a data frame with multiple columns, sometimes there is not enough horizontal space on the screen to print each column. By transposing the data frame, we can see all the columns and the values recorded for the initial rows.Type ?scorecard
in the console to open up the help file for this data set. This includes the documentation for all the variables. Use your knowledge of the dplyr
functions to perform the following tasks.
Generate a data frame of schools with a greater than 40% share of first-generation students
Click for the solution
filter(.data = scorecard, firstgen > .40)
## # A tibble: 356 × 14
## unitid name state type admrate satavg cost netcost avgfa…¹ pctpell compr…²
## <dbl> <chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 101189 Faul… AL Priv… 0.783 1066 34317 20715 54009 0.488 0.329
## 2 101365 Herz… AL Priv… 0.783 NA 30119 26680 54684 0.706 0.28
## 3 101541 Juds… AL Priv… 0.372 1020 32691 16827 52020 0.545 0.364
## 4 101587 Univ… AL Publ… 0.349 1041 21657 15514 58329 0.535 0.350
## 5 102270 Stil… AL Priv… 0.330 NA 25413 18352 43605 0.709 0.272
## 6 104717 Gran… AZ Priv… 0.769 NA 31213 21020 60741 0.454 0.408
## 7 106467 Arka… AR Publ… 0.947 NA 18358 10772 61812 0.361 0.384
## 8 107983 Sout… AR Publ… 0.651 1085 22579 14270 61650 0.487 0.436
## 9 110361 Cali… CA Priv… 0.783 1096 46261 24707 88335 0.453 0.630
## 10 110486 Cali… CA Publ… 0.807 NA 16660 5318 86760 0.619 0.430
## # … with 346 more rows, 3 more variables: firstgen <dbl>, debt <dbl>,
## # locale <fct>, and abbreviated variable names ¹avgfacsal, ²comprate
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Generate a data frame with the 10 most expensive colleges in 2018-19 based on net cost of attendance
Click for the solution
We could use a combination of arrange()
and slice()
to sort the data frame from most to least expensive, then keep the first 10 rows:
arrange(.data = scorecard, desc(netcost)) %>%
slice(1:10)
## # A tibble: 10 × 14
## unitid name state type admrate satavg cost netcost avgfa…¹ pctpell compr…²
## <dbl> <chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 192712 Manh… NY Priv… 0.355 NA 68686 54902 73863 0.129 0.876
## 2 111081 Cali… CA Priv… 0.253 NA 71382 50412 86760 0.248 0.614
## 3 136774 Ring… FL Priv… 0.639 NA 67325 49649 78435 0.284 0.672
## 4 164748 Berk… MA Priv… 0.514 NA 64436 49514 93870 0.166 0.652
## 5 247649 Land… VT Priv… 0.470 NA 73821 47373 59373 0.219 0.416
## 6 109651 Art … CA Priv… 0.708 NA 64316 47080 71523 0.283 0.685
## 7 135726 Univ… FL Priv… 0.271 1371 67249 46949 115353 0.143 0.831
## 8 194578 Prat… NY Priv… 0.555 1273 67703 45571 101079 0.198 0.704
## 9 165662 Emer… MA Priv… 0.334 1318 68350 45365 90747 0.161 0.821
## 10 143048 Scho… IL Priv… 0.570 1238 67058 44815 96102 0.192 0.699
## # … with 3 more variables: firstgen <dbl>, debt <dbl>, locale <fct>, and
## # abbreviated variable names ¹avgfacsal, ²comprate
## # ℹ Use `colnames()` to see all variable names
We can also use the slice_max()
function in dplyr
to accomplish the same thing in one line of code.
slice_max(.data = scorecard, order_by = netcost, n = 10)
## # A tibble: 10 × 14
## unitid name state type admrate satavg cost netcost avgfa…¹ pctpell compr…²
## <dbl> <chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 192712 Manh… NY Priv… 0.355 NA 68686 54902 73863 0.129 0.876
## 2 111081 Cali… CA Priv… 0.253 NA 71382 50412 86760 0.248 0.614
## 3 136774 Ring… FL Priv… 0.639 NA 67325 49649 78435 0.284 0.672
## 4 164748 Berk… MA Priv… 0.514 NA 64436 49514 93870 0.166 0.652
## 5 247649 Land… VT Priv… 0.470 NA 73821 47373 59373 0.219 0.416
## 6 109651 Art … CA Priv… 0.708 NA 64316 47080 71523 0.283 0.685
## 7 135726 Univ… FL Priv… 0.271 1371 67249 46949 115353 0.143 0.831
## 8 194578 Prat… NY Priv… 0.555 1273 67703 45571 101079 0.198 0.704
## 9 165662 Emer… MA Priv… 0.334 1318 68350 45365 90747 0.161 0.821
## 10 143048 Scho… IL Priv… 0.570 1238 67058 44815 96102 0.192 0.699
## # … with 3 more variables: firstgen <dbl>, debt <dbl>, locale <fct>, and
## # abbreviated variable names ¹avgfacsal, ²comprate
## # ℹ Use `colnames()` to see all variable names
Generate a data frame with the average SAT score for each type of college
Click for the solution
scorecard %>%
group_by(type) %>%
summarize(mean_sat = mean(satavg, na.rm = TRUE))
## # A tibble: 3 × 2
## type mean_sat
## <fct> <dbl>
## 1 Public 1126.
## 2 Private, nonprofit 1152.
## 3 Private, for-profit 1121.
Calculate for each school how many students it takes to pay the average faculty member’s salary and generate a data frame with the school’s name and the calculated value
Note: use the net cost of attendance.
Click for the solution
scorecard %>%
mutate(ratio = avgfacsal / netcost) %>%
select(name, ratio)
## # A tibble: 1,732 × 2
## name ratio
## <chr> <dbl>
## 1 Alabama A & M University 4.63
## 2 University of Alabama at Birmingham 5.87
## 3 University of Alabama in Huntsville 5.50
## 4 Alabama State University 4.76
## 5 The University of Alabama 4.10
## 6 Auburn University at Montgomery 5.10
## 7 Auburn University 4.01
## 8 Birmingham-Southern College 2.56
## 9 Faulkner University 2.61
## 10 Herzing University-Birmingham 2.05
## # … with 1,722 more rows
## # ℹ Use `print(n = ...)` to see more rows
Calculate how many private, nonprofit schools have a smaller net cost than Cornell University
Hint: the result should be a data frame with one row for Cornell University, and a column containing the requested value.
Report the number as the total number of schools
Click for the solution
scorecard %>%
filter(type == "Private, nonprofit") %>%
arrange(netcost) %>%
# use row_number() but subtract 1 since Cornell is not cheaper than itself
mutate(school_cheaper = row_number() - 1) %>%
filter(name == "Cornell University") %>%
glimpse()
## Rows: 1
## Columns: 15
## $ unitid <dbl> 190415
## $ name <chr> "Cornell University"
## $ state <chr> "NY"
## $ type <fct> "Private, nonprofit"
## $ admrate <dbl> 0.1085
## $ satavg <dbl> 1487
## $ cost <dbl> 73879
## $ netcost <dbl> 40126
## $ avgfacsal <dbl> 140166
## $ pctpell <dbl> 0.1683
## $ comprate <dbl> 0.9453
## $ firstgen <dbl> 0.154164
## $ debt <dbl> 13108
## $ locale <fct> City
## $ school_cheaper <dbl> 1043
Report the number as the percentage of schools
Click for the solution
scorecard %>%
filter(type == "Private, nonprofit") %>%
mutate(netcost_rank = percent_rank(netcost)) %>%
filter(name == "Cornell University") %>%
glimpse()
## Rows: 1
## Columns: 15
## $ unitid <dbl> 190415
## $ name <chr> "Cornell University"
## $ state <chr> "NY"
## $ type <fct> "Private, nonprofit"
## $ admrate <dbl> 0.1085
## $ satavg <dbl> 1487
## $ cost <dbl> 73879
## $ netcost <dbl> 40126
## $ avgfacsal <dbl> 140166
## $ pctpell <dbl> 0.1683
## $ comprate <dbl> 0.9453
## $ firstgen <dbl> 0.154164
## $ debt <dbl> 13108
## $ locale <fct> City
## $ netcost_rank <dbl> 0.9702326
Session Info
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23)
## os macOS Monterey 12.3
## system aarch64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2022-08-10
## pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.2.0)
## backports 1.4.1 2021-12-13 [2] CRAN (R 4.2.0)
## blogdown 1.10 2022-05-10 [2] CRAN (R 4.2.0)
## bookdown 0.27 2022-06-14 [2] CRAN (R 4.2.0)
## broom 1.0.0 2022-07-01 [2] CRAN (R 4.2.0)
## bslib 0.4.0 2022-07-16 [2] CRAN (R 4.2.0)
## cachem 1.0.6 2021-08-19 [2] CRAN (R 4.2.0)
## cellranger 1.1.0 2016-07-27 [2] CRAN (R 4.2.0)
## cli 3.3.0 2022-04-25 [2] CRAN (R 4.2.0)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.1)
## colorspace 2.0-3 2022-02-21 [2] CRAN (R 4.2.0)
## crayon 1.5.1 2022-03-26 [2] CRAN (R 4.2.0)
## DBI 1.1.3 2022-06-18 [2] CRAN (R 4.2.0)
## dbplyr 2.2.1 2022-06-27 [2] CRAN (R 4.2.0)
## digest 0.6.29 2021-12-01 [2] CRAN (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [2] CRAN (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.2.0)
## evaluate 0.16 2022-08-09 [1] CRAN (R 4.2.1)
## fansi 1.0.3 2022-03-24 [2] CRAN (R 4.2.0)
## fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.0)
## forcats * 0.5.1 2021-01-27 [2] CRAN (R 4.2.0)
## fs 1.5.2 2021-12-08 [2] CRAN (R 4.2.0)
## gargle 1.2.0 2021-07-02 [2] CRAN (R 4.2.0)
## generics 0.1.3 2022-07-05 [2] CRAN (R 4.2.0)
## ggplot2 * 3.3.6 2022-05-03 [2] CRAN (R 4.2.0)
## glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.0)
## googledrive 2.0.0 2021-07-08 [2] CRAN (R 4.2.0)
## googlesheets4 1.0.0 2021-07-21 [2] CRAN (R 4.2.0)
## gtable 0.3.0 2019-03-25 [2] CRAN (R 4.2.0)
## haven 2.5.0 2022-04-15 [2] CRAN (R 4.2.0)
## here 1.0.1 2020-12-13 [2] CRAN (R 4.2.0)
## hms 1.1.1 2021-09-26 [2] CRAN (R 4.2.0)
## htmltools 0.5.3 2022-07-18 [2] CRAN (R 4.2.0)
## httr 1.4.3 2022-05-04 [2] CRAN (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.2.0)
## jsonlite 1.8.0 2022-02-22 [2] CRAN (R 4.2.0)
## knitr 1.39 2022-04-26 [2] CRAN (R 4.2.0)
## lifecycle 1.0.1 2021-09-24 [2] CRAN (R 4.2.0)
## lubridate 1.8.0 2021-10-07 [2] CRAN (R 4.2.0)
## magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.2.0)
## modelr 0.1.8 2020-05-19 [2] CRAN (R 4.2.0)
## munsell 0.5.0 2018-06-12 [2] CRAN (R 4.2.0)
## pillar 1.8.0 2022-07-18 [2] CRAN (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.0)
## purrr * 0.3.4 2020-04-17 [2] CRAN (R 4.2.0)
## R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.0)
## rcfss * 0.2.5 2022-08-04 [2] local
## rcis * 0.2.5 2022-08-08 [2] local
## readr * 2.1.2 2022-01-30 [2] CRAN (R 4.2.0)
## readxl 1.4.0 2022-03-28 [2] CRAN (R 4.2.0)
## reprex 2.0.1.9000 2022-08-10 [1] Github (tidyverse/reprex@6d3ad07)
## rlang 1.0.4 2022-07-12 [2] CRAN (R 4.2.0)
## rmarkdown 2.14 2022-04-25 [2] CRAN (R 4.2.0)
## rprojroot 2.0.3 2022-04-02 [2] CRAN (R 4.2.0)
## rstudioapi 0.13 2020-11-12 [2] CRAN (R 4.2.0)
## rvest 1.0.2 2021-10-16 [2] CRAN (R 4.2.0)
## sass 0.4.2 2022-07-16 [2] CRAN (R 4.2.0)
## scales 1.2.0 2022-04-13 [2] CRAN (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.2.0)
## stringi 1.7.8 2022-07-11 [2] CRAN (R 4.2.0)
## stringr * 1.4.0 2019-02-10 [2] CRAN (R 4.2.0)
## tibble * 3.1.8 2022-07-22 [2] CRAN (R 4.2.0)
## tidyr * 1.2.0 2022-02-01 [2] CRAN (R 4.2.0)
## tidyselect 1.1.2 2022-02-21 [2] CRAN (R 4.2.0)
## tidyverse * 1.3.2 2022-07-18 [2] CRAN (R 4.2.0)
## tzdb 0.3.0 2022-03-28 [2] CRAN (R 4.2.0)
## utf8 1.2.2 2021-07-24 [2] CRAN (R 4.2.0)
## vctrs 0.4.1 2022-04-13 [2] CRAN (R 4.2.0)
## withr 2.5.0 2022-03-03 [2] CRAN (R 4.2.0)
## xfun 0.31 2022-05-10 [1] CRAN (R 4.2.0)
## xml2 1.3.3 2021-11-30 [2] CRAN (R 4.2.0)
## yaml 2.3.5 2022-02-21 [2] CRAN (R 4.2.0)
##
## [1] /Users/soltoffbc/Library/R/arm64/4.2/library
## [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────