Pipes in R
Pipes are an extremely useful tool from the magrittr package [1] that allow you to express a sequence of multiple operations. They can greatly simplify your code and make your operations more intuitive. However, they are not the only way to write your code and combine multiple operations. In fact, for many years the pipe did not exist in R. How else did people write their code?
Suppose we have the following assignment:
Using the penguins dataset, calculate the average body mass for Adelie penguins on different islands.
Okay, first let’s load our libraries and check out the data frame.
library(tidyverse)
library(palmerpenguins)
penguins
## # A tibble: 344 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torge… 39.1 18.7 181 3750
## 2 Adelie Torge… 39.5 17.4 186 3800
## 3 Adelie Torge… 40.3 18 195 3250
## 4 Adelie Torge… NA NA NA NA
## 5 Adelie Torge… 36.7 19.3 193 3450
## 6 Adelie Torge… 39.3 20.6 190 3650
## 7 Adelie Torge… 38.9 17.8 181 3625
## 8 Adelie Torge… 39.2 19.6 195 4675
## 9 Adelie Torge… 34.1 18.1 193 3475
## 10 Adelie Torge… 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
We can decompose the problem into a series of discrete steps:
- Filter penguins to only keep observations where the species is “Adelie”
- Group the filtered penguins data frame by island
- Summarize the grouped and filtered penguins data frame by calculating the average body mass
But how do we implement the code?
Intermediate steps
One option is to save each step as a new object:
penguins_1 <- filter(penguins, species == "Adelie")
penguins_2 <- group_by(penguins_1, island)
(penguins_3 <- summarize(penguins_2, body_mass = mean(body_mass_g, na.rm = TRUE)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## island body_mass
## <fct> <dbl>
## 1 Biscoe 3710.
## 2 Dream 3688.
## 3 Torgersen 3706.
Why do we not like doing this? We have to name each intermediate object. Here I just append a number to the end, but this is not good self-documentation. What should we expect to find in penguins_2? It would be nicer to have an informative name, but there isn’t a natural one. Then we have to remember what the data looks like at each intermediate step and remember to reference the correct one. What happens if we misidentify the data frame?
penguins_1 <- filter(penguins, species == "Adelie")
penguins_2 <- group_by(penguins_1, island)
(penguins_3 <- summarize(penguins_1, body_mass = mean(body_mass_g, na.rm = TRUE)))
## # A tibble: 1 x 1
## body_mass
## <dbl>
## 1 3701.
We don’t get the correct answer: because penguins_1 is not grouped, we get a single average across all Adelie penguins instead of one per island. Worse, we don’t get an explicit error message because the code, as written, works. R can execute this command for us and doesn’t know to warn us that we used penguins_1 instead of penguins_2.
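One way to catch this kind of mix-up is to inspect the grouping before summarizing. The following is only a sketch, assuming dplyr and palmerpenguins are installed; the stopifnot() guard is an illustrative addition, not part of the original workflow.

```r
library(dplyr)
library(palmerpenguins)

penguins_1 <- filter(penguins, species == "Adelie")
penguins_2 <- group_by(penguins_1, island)

# group_vars() returns the names of the grouping columns of a data frame
group_vars(penguins_1)  # no grouping columns: this is the ungrouped object
group_vars(penguins_2)  # grouped by island

# Illustrative guard: stop early if the input is not grouped
stopifnot(is_grouped_df(penguins_2))
summarize(penguins_2, body_mass = mean(body_mass_g, na.rm = TRUE))
```

A check like this turns a silent wrong answer into a visible error, but it is extra code to maintain for every intermediate object.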
Overwrite the original
Instead of creating intermediate objects, let’s just replace the original data frame with the modified form.
penguins <- filter(penguins, species == "Adelie")
penguins <- group_by(penguins, island)
(penguins <- summarize(penguins, body_mass = mean(body_mass_g, na.rm = TRUE)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## island body_mass
## <fct> <dbl>
## 1 Biscoe 3710.
## 2 Dream 3688.
## 3 Torgersen 3706.
This works, but still has a couple of problems. The original data frame is gone as soon as the first line runs. And what happens if I make an error in the middle of the operation? I need to rerun the entire operation from the beginning. With your own data sources, this means having to read in the .csv file all over again to restore a fresh copy.
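For a dataset that ships with a package, restoring a fresh copy is easier than rereading a file. A minimal sketch, assuming palmerpenguins is installed: the overwritten object lives in your workspace, while the package's own copy is untouched and can be retrieved with the :: operator.

```r
library(dplyr)
library(palmerpenguins)

# Overwrite the data frame, as in the example above
penguins <- filter(penguins, species == "Adelie")
nrow(penguins)  # only the Adelie rows remain

# The package's copy is unchanged; pull a fresh one with the :: operator
penguins <- palmerpenguins::penguins
nrow(penguins)  # back to the full dataset
```

This shortcut only applies to packaged data. For data you read from disk, you still need to rerun the import step.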
Function composition
We could nest all the function calls inside one another as a single expression and skip the intermediate assignments entirely.
summarize(
  group_by(
    filter(
      penguins,
      species == "Adelie"
    ),
    island
  ),
  body_mass = mean(body_mass_g, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## island body_mass
## <fct> <dbl>
## 1 Biscoe 3710.
## 2 Dream 3688.
## 3 Torgersen 3706.
But now we have to read the function from the inside out. Even worse, what happens if we cram it all into a single line?
summarize(group_by(filter(penguins, species == "Adelie"), island), body_mass = mean(body_mass_g, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## island body_mass
## <fct> <dbl>
## 1 Biscoe 3710.
## 2 Dream 3688.
## 3 Torgersen 3706.
This is not intuitive for humans. Again, the computer will handle it just fine, but if you make a mistake, debugging it will be a pain.
Back to the pipe
penguins %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## island body_mass
## <fct> <dbl>
## 1 Biscoe 3710.
## 2 Dream 3688.
## 3 Torgersen 3706.
Piping is the clearest syntax to implement, as it focuses on actions, not objects. Or as Hadley Wickham would say:
[I]t focuses on verbs, not nouns.
magrittr automatically passes the output from each line into the next line as its first argument.
This is how I explain the 'pipe' to #rstats newbies... pic.twitter.com/VdAFTLzijy
— We are R-Ladies (@WeAreRLadies) September 13, 2019
This is why tidyverse functions always accept a data frame as the first argument.
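To make the mechanics concrete: an expression of the form x %>% f(y) is evaluated as f(x, y). A minimal sketch, assuming dplyr and palmerpenguins are installed; the object names piped and nested are chosen here for illustration only.

```r
library(dplyr)
library(palmerpenguins)

# The pipe inserts its left-hand side as the first argument of the call
piped <- penguins %>% filter(species == "Adelie")

# The same call written without the pipe
nested <- filter(penguins, species == "Adelie")

# Both forms produce the same data frame
identical(piped, nested)
```

Because the data frame is always the first argument, each verb can be chained onto the previous one without naming the data again.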
Important tips for piping
- Remember though that you don’t assign anything within the pipes - that is, you should not use <- inside the piped operation. Only use this at the beginning if you want to save the output.
- Remember to add the pipe %>% at the end of each line involved in the piped operation. A good rule of thumb: RStudio will automatically indent lines of code that are part of a piped operation. If the line isn’t indented, it probably hasn’t been added to the pipe. If you have an error in a piped operation, always check to make sure the pipe is connected as you expect.
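Both tips can be seen in one short sketch, assuming dplyr and palmerpenguins are installed. The name adelie_mass is hypothetical: the assignment appears once, at the very beginning, and every line except the last ends with the pipe.

```r
library(dplyr)
library(palmerpenguins)

# Assign once, at the start of the piped operation
adelie_mass <- penguins %>%
  filter(species == "Adelie") %>%   # every intermediate line ends with %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))  # last line: no pipe

adelie_mass
```

If a %>% is missing at the end of a line, R treats the pipe as finished there and evaluates the remaining lines as separate statements, which usually fails with an "object not found" error.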
Session Info
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.4 (2021-02-15)
## os macOS Big Sur 10.16
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Chicago
## date 2021-05-25
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## blogdown 1.3 2021-04-14 [1] CRAN (R 4.0.2)
## bookdown 0.22 2021-04-22 [1] CRAN (R 4.0.2)
## bslib 0.2.5 2021-05-12 [1] CRAN (R 4.0.4)
## cachem 1.0.5 2021-05-15 [1] CRAN (R 4.0.2)
## callr 3.7.0 2021-04-20 [1] CRAN (R 4.0.2)
## cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.2)
## crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2)
## desc 1.3.0 2021-03-05 [1] CRAN (R 4.0.2)
## devtools 2.4.1 2021-05-05 [1] CRAN (R 4.0.2)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.2)
## evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.2)
## fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
## here 1.0.1 2020-12-13 [1] CRAN (R 4.0.2)
## htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.0.2)
## jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
## knitr 1.33 2021-04-24 [1] CRAN (R 4.0.2)
## lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
## magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
## memoise 2.0.0 2021-01-26 [1] CRAN (R 4.0.2)
## pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.0.2)
## pkgload 1.2.1 2021-04-06 [1] CRAN (R 4.0.2)
## prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0)
## processx 3.5.2 2021-04-30 [1] CRAN (R 4.0.2)
## ps 1.6.0 2021-02-28 [1] CRAN (R 4.0.2)
## purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
## remotes 2.3.0 2021-04-01 [1] CRAN (R 4.0.2)
## rlang 0.4.11 2021-04-30 [1] CRAN (R 4.0.2)
## rmarkdown 2.8 2021-05-07 [1] CRAN (R 4.0.2)
## rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.2)
## sass 0.4.0 2021-05-12 [1] CRAN (R 4.0.2)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
## stringi 1.6.1 2021-05-10 [1] CRAN (R 4.0.2)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
## testthat 3.0.2 2021-02-14 [1] CRAN (R 4.0.2)
## usethis 2.0.1 2021-02-10 [1] CRAN (R 4.0.2)
## withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.2)
## xfun 0.23 2021-05-15 [1] CRAN (R 4.0.2)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
[1] The basic %>% pipe is automatically imported as part of the tidyverse library. If you wish to use any of the extra tools from magrittr as demonstrated in R for Data Science, you need to explicitly load magrittr.