Piping

Jeff Stevens

2023-02-17

Review

Data wrangling

Piping

Set-up
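
The output below has the shape of a glimpse() call on the flights data. The original set-up code is not shown, so the following is only a sketch of what it might look like, assuming the {dplyr} and {nycflights13} packages are installed:

```r
# Assumed set-up: the slides do not show which packages were loaded
library(dplyr)         # select(), mutate(), glimpse()
library(nycflights13)  # provides the flights data set

# Compact overview: one line per column with its type and first values
glimpse(flights)
```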

Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Base R pipelines

myflights <- flights[c("year", "month", "day", "air_time", "distance", 
                       "hour", "minute")]
myflights$month <- as.character(myflights$month)
myflights$month <- ifelse(myflights$month < 10, 
                          paste0("0", myflights$month), myflights$month)
myflights$day <- ifelse(myflights$day < 10, 
                        paste0("0", myflights$day), myflights$day)
myflights$date <- paste(myflights$year, myflights$month, myflights$day, 
                        sep = "-")
myflights$speed <- myflights$distance / myflights$air_time * 60
myflights <- myflights[c("year", "month", "day", "date", "air_time", 
                         "distance", "speed", "hour", "minute")]

What do you like and dislike about this pipeline?
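
One detail worth checking in a pipeline like this: month is converted to character before the month < 10 test, and R compares a string with a number as strings. The base R sketch below uses made-up values to show the difference. The vector name month and the sprintf() alternative are illustrative and not part of the slides.

```r
# Illustrative month values, not taken from the flights data
month <- c(1L, 2L, 9L, 10L, 12L)

# Numeric comparison: every single-digit month is flagged
num_cmp <- month < 10

# Character comparison: 10 is coerced to "10" and the strings are compared,
# so "2" and "9" sort after "10" and are not flagged
chr_cmp <- as.character(month) < 10

# One possible alternative: zero-pad the integers directly
padded <- sprintf("%02d", month)

print(num_cmp)
print(chr_cmp)
print(padded)
```

If the padding matters downstream (for example, when sorting dates as text), comparing before converting, or padding with sprintf(), may be the safer choice.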

tidyverse pipelines

myflights2 <- flights |> 
  select(year:day, air_time, distance, hour, minute) |> 
  mutate(month = as.character(month),
         month = if_else(month < 10, paste0("0", month), as.character(month)),
         day = if_else(day < 10, paste0("0", day), as.character(day)),
         date = paste(year, month, day, sep = "-"),
         speed = distance / air_time * 60) |> 
  select(year:day, date, air_time, distance, speed, everything())

What do you like and dislike about this pipeline?

tidyverse pipelines

myflights3 <- flights |> 
  select(year:day, air_time, distance, hour, minute) |> 
  mutate(month = as.character(month),
         month = if_else(month < 10, paste0("0", month), as.character(month)),
         day = if_else(day < 10, paste0("0", day), as.character(day)),
         date = paste(year, month, day, sep = "-"),
         .after = day) |> 
  mutate(speed = distance / air_time * 60,
         .after = distance)

What do you like and dislike about this pipeline?
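
The two tidyverse versions differ mainly in how they position new columns: a final select() with everything(), or the .after argument of mutate(). A minimal sketch on a toy tibble (the object toy and its values are made up for illustration) suggests that both give the same column order:

```r
library(dplyr)

# Toy data standing in for a few flights columns (values are made up)
toy <- tibble(distance = c(100, 200), air_time = c(30, 40), hour = c(5, 6))

# Option 1: create the column, then reorder with select() and everything()
via_select <- toy |>
  mutate(speed = distance / air_time * 60) |>
  select(distance, speed, everything())

# Option 2: place the new column directly with .after
via_after <- toy |>
  mutate(speed = distance / air_time * 60, .after = distance)

names(via_select)
names(via_after)
```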

Pipeline comparison

identical(myflights, myflights2)
[1] TRUE
identical(myflights, myflights3)
[1] TRUE
identical(myflights2, myflights3)
[1] TRUE

Character counts

Pipeline     Characters
myflights    566
myflights2   423
myflights3   406

Pipes

  • Base R pipe |>
    • added in R 4.1.0; the _ placeholder followed in R 4.2.0
    • works with most base R and tidyverse functions
  • tidyverse pipe %>%
    • from the {magrittr} package
    • works with tidyverse verbs and most other functions
  • Hadley Wickham recommends using the base R pipe |>, so we’ll use that here.
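
For simple chains the two pipes behave the same way: each passes the left-hand result as the first argument of the next call. A minimal sketch, assuming R 4.1.0 or later and the {magrittr} package (the numbers are arbitrary):

```r
library(magrittr)  # provides %>%

# The same computation written three ways
nested   <- sum(sqrt(c(1, 4, 9)))            # nested calls, read inside out
native   <- c(1, 4, 9) |> sqrt() |> sum()    # base R pipe
magrittr <- c(1, 4, 9) %>% sqrt() %>% sum()  # tidyverse pipe

c(nested = nested, native = native, magrittr = magrittr)
```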

Piping basics

Start with the data object…

flights |> 
  select(year:dep_delay, origin) |> # include these columns
  select(-sched_dep_time) # exclude this column

Or use the data object as the first argument…

select(flights, year:dep_delay, origin) |> # include these columns
  select(-sched_dep_time) # exclude this column

But don’t repeat the data object after the first pipe

select(flights, year:dep_delay, origin) |> # include these columns
  select(flights, -sched_dep_time) # exclude this column
Error in `select()`:
! Can't subset columns with `flights`.
✖ `flights` must be numeric or character, not a <tbl_df/tbl/data.frame> object.
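
The error makes sense once you see what the pipe does to the call. The base R pipe is rewritten when the code is parsed, so quote() can show the call R would evaluate. In this sketch nothing is evaluated, so flights and select() are only symbols and no packages are needed (R 4.1.0 or later is assumed):

```r
# Correct form: the left-hand side becomes the first argument
ok <- quote(
  select(flights, year:dep_delay, origin) |> select(-sched_dep_time)
)

# Repeating the data object: the piped result is still inserted first,
# so flights lands in the second position and is read as a column selection
bad <- quote(
  select(flights, year:dep_delay, origin) |> select(flights, -sched_dep_time)
)

print(ok)
print(bad)
```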

Piping basics

As with any object, assigning the result does not print it to the console

myflights <- flights |> 
  select(year:dep_delay, origin) |>
  select(-sched_dep_time)

But omitting the assignment does

flights |> 
  select(year:dep_delay, origin) |>
  select(-sched_dep_time)
# A tibble: 336,776 × 6
    year month   day dep_time dep_delay origin
   <int> <int> <int>    <int>     <dbl> <chr> 
 1  2013     1     1      517         2 EWR   
 2  2013     1     1      533         4 LGA   
 3  2013     1     1      542         2 JFK   
 4  2013     1     1      544        -1 JFK   
 5  2013     1     1      554        -6 LGA   
 6  2013     1     1      554        -4 EWR   
 7  2013     1     1      555        -5 EWR   
 8  2013     1     1      557        -3 LGA   
 9  2013     1     1      557        -3 JFK   
10  2013     1     1      558        -2 LGA   
# … with 336,766 more rows

Piping basics

As does wrapping the assignment in parentheses

(myflights <- flights |> 
  select(year:dep_delay, origin) |>
  select(-sched_dep_time))
# A tibble: 336,776 × 6
    year month   day dep_time dep_delay origin
   <int> <int> <int>    <int>     <dbl> <chr> 
 1  2013     1     1      517         2 EWR   
 2  2013     1     1      533         4 LGA   
 3  2013     1     1      542         2 JFK   
 4  2013     1     1      544        -1 JFK   
 5  2013     1     1      554        -6 LGA   
 6  2013     1     1      554        -4 EWR   
 7  2013     1     1      555        -5 EWR   
 8  2013     1     1      557        -3 LGA   
 9  2013     1     1      557        -3 JFK   
10  2013     1     1      558        -2 LGA   
# … with 336,766 more rows
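
The printing behaviour comes from R's visibility rules rather than from the pipe itself: an assignment returns its value invisibly, and parentheses make it visible again. A small base R sketch using withVisible() and the built-in mtcars data (chosen here only so the example needs no packages):

```r
# Assignment alone: the value is returned invisibly, so nothing prints
silent <- withVisible(x <- mtcars |> head(3))

# Assignment wrapped in parentheses: the value becomes visible and prints
shown <- withVisible((x <- mtcars |> head(3)))

silent$visible
shown$visible
```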

Advanced piping

  • Sometimes, non-tidyverse functions don’t take the data object as the first argument

  • This requires a “placeholder” signaling where the data object goes

  • The placeholder for the |> pipe is _

  • The placeholder for the %>% pipe is .

Advanced piping

Base R pipe

mtcars |> 
  select(mpg, cyl) |> 
  lm(mpg ~ cyl)
Error in as.data.frame.default(data) : 
  cannot coerce class ‘"formula"’ to a data.frame

Use the placeholder to send the piped data to the data argument

mtcars |> 
  select(mpg, cyl) |> 
  lm(mpg ~ cyl, data = _)

Call:
lm(formula = mpg ~ cyl, data = select(mtcars, mpg, cyl))

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  

  • The placeholder only works when the argument is named (here, data = _)
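
A short sketch of that rule, assuming R 4.2.0 or later. The unnamed form is wrapped in parse() and try() so the example still runs, and the anonymous-function version is shown only as one possible workaround, not something from the slides:

```r
# Unnamed placeholder: rejected when the code is parsed
unnamed <- try(parse(text = "mtcars |> lm(mpg ~ cyl, _)"), silent = TRUE)
inherits(unnamed, "try-error")

# Named placeholder: parses and runs
fit_named <- mtcars |> lm(mpg ~ cyl, data = _)

# Possible alternative: pipe into an anonymous function, which can put
# the data wherever it is needed
fit_lambda <- mtcars |> (\(d) lm(mpg ~ cyl, data = d))()

all.equal(coef(fit_named), coef(fit_lambda))
```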

Advanced piping

tidyverse pipe

mtcars %>% 
  select(mpg, cyl) %>% 
  lm(mpg ~ cyl, data = .)

Call:
lm(formula = mpg ~ cyl, data = .)

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  
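
Unlike the base R placeholder, the {magrittr} dot does not have to be attached to a named argument. A minimal sketch comparing the named and positional forms (it relies on data being the second argument of lm()):

```r
library(magrittr)

# Named argument, as on the slide
fit_named <- mtcars %>% lm(mpg ~ cyl, data = .)

# Positional dot: the piped data fills the second argument of lm()
fit_positional <- mtcars %>% lm(mpg ~ cyl, .)

all.equal(coef(fit_named), coef(fit_positional))
```

Naming the argument is still the more readable choice, and it keeps the code close to the base R pipe version.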

Let’s code!

Piping [Rmd]