Mutating columns

Jeff Stevens

2023-02-15

Review

Data wrangling

Mutating columns

Mental model of mutating columns

Set-up

Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Changing and creating columns

dplyr::mutate()

Changing columns

Changing existing columns

Changing existing columns

mutate(flights, month = as.character(month))
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <chr> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013 1         1      517        515       2     830     819      11 UA     
 2  2013 1         1      533        529       4     850     830      20 UA     
 3  2013 1         1      542        540       2     923     850      33 AA     
 4  2013 1         1      544        545      -1    1004    1022     -18 B6     
 5  2013 1         1      554        600      -6     812     837     -25 DL     
 6  2013 1         1      554        558      -4     740     728      12 UA     
 7  2013 1         1      555        600      -5     913     854      19 B6     
 8  2013 1         1      557        600      -3     709     723     -14 EV     
 9  2013 1         1      557        600      -3     838     846      -8 B6     
10  2013 1         1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Changing existing columns

Conditional changes

ifelse()

mutate(flights, month = ifelse(month < 10, # conditional statement
                               paste0("0", month), # what to do if TRUE
                               as.character(month))) # what to do if FALSE
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <chr> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013 01        1      517        515       2     830     819      11 UA     
 2  2013 01        1      533        529       4     850     830      20 UA     
 3  2013 01        1      542        540       2     923     850      33 AA     
 4  2013 01        1      544        545      -1    1004    1022     -18 B6     
 5  2013 01        1      554        600      -6     812     837     -25 DL     
 6  2013 01        1      554        558      -4     740     728      12 UA     
 7  2013 01        1      555        600      -5     913     854      19 B6     
 8  2013 01        1      557        600      -3     709     723     -14 EV     
 9  2013 01        1      557        600      -3     838     846      -8 B6     
10  2013 01        1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Changing existing columns

Conditional changes

dplyr::if_else()

mutate(flights, month = if_else(month < 10,  # conditional statement
                                paste0("0", month),  # what to do if TRUE
                                as.character(month),  # what to do if FALSE
                                NA)) # what to do if missing
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <chr> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013 01        1      517        515       2     830     819      11 UA     
 2  2013 01        1      533        529       4     850     830      20 UA     
 3  2013 01        1      542        540       2     923     850      33 AA     
 4  2013 01        1      544        545      -1    1004    1022     -18 B6     
 5  2013 01        1      554        600      -6     812     837     -25 DL     
 6  2013 01        1      554        558      -4     740     728      12 UA     
 7  2013 01        1      555        600      -5     913     854      19 B6     
 8  2013 01        1      557        600      -3     709     723     -14 EV     
 9  2013 01        1      557        600      -3     838     846      -8 B6     
10  2013 01        1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Changing existing columns

Multiple changes

mutate(flights, 
       month = if_else(month < 10, paste0("0", month), as.character(month), NA),
       day = if_else(day < 10, paste0("0", day), as.character(day), NA)
)
# A tibble: 336,776 × 19
    year month day   dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <chr> <chr>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013 01    01         517        515       2     830     819      11 UA     
 2  2013 01    01         533        529       4     850     830      20 UA     
 3  2013 01    01         542        540       2     923     850      33 AA     
 4  2013 01    01         544        545      -1    1004    1022     -18 B6     
 5  2013 01    01         554        600      -6     812     837     -25 DL     
 6  2013 01    01         554        558      -4     740     728      12 UA     
 7  2013 01    01         555        600      -5     913     854      19 B6     
 8  2013 01    01         557        600      -3     709     723     -14 EV     
 9  2013 01    01         557        600      -3     838     846      -8 B6     
10  2013 01    01         558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Creating columns

Creating new columns

Creating new columns

mutate(flights, speed = distance / air_time * 60)
# A tibble: 336,776 × 20
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 10 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, speed <dbl>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Creating new columns

Move column when creating

mutate(flights, speed = distance / air_time * 60, .after = day)
# A tibble: 336,776 × 20
    year month   day speed dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
   <int> <int> <int> <dbl>    <int>        <int>   <dbl>   <int>   <int>   <dbl>
 1  2013     1     1  370.      517          515       2     830     819      11
 2  2013     1     1  374.      533          529       4     850     830      20
 3  2013     1     1  408.      542          540       2     923     850      33
 4  2013     1     1  517.      544          545      -1    1004    1022     -18
 5  2013     1     1  394.      554          600      -6     812     837     -25
 6  2013     1     1  288.      554          558      -4     740     728      12
 7  2013     1     1  404.      555          600      -5     913     854      19
 8  2013     1     1  259.      557          600      -3     709     723     -14
 9  2013     1     1  405.      557          600      -3     838     846      -8
10  2013     1     1  319.      558          600      -2     753     745       8
# … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Creating new columns

Conditionals with multiple outcomes

dplyr::case_when

mutate(flights, season = case_when(month %in% c(3:5) ~ "spring",
                                   month %in% c(6:8) ~ "summer",
                                   month %in% c(9:11) ~ "fall",
                                   month %in% c(12, 1:2) ~ "winter"),
       .after = day)
# A tibble: 336,776 × 20
    year month   day season dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
   <int> <int> <int> <chr>     <int>       <int>   <dbl>   <int>   <int>   <dbl>
 1  2013     1     1 winter      517         515       2     830     819      11
 2  2013     1     1 winter      533         529       4     850     830      20
 3  2013     1     1 winter      542         540       2     923     850      33
 4  2013     1     1 winter      544         545      -1    1004    1022     -18
 5  2013     1     1 winter      554         600      -6     812     837     -25
 6  2013     1     1 winter      554         558      -4     740     728      12
 7  2013     1     1 winter      555         600      -5     913     854      19
 8  2013     1     1 winter      557         600      -3     709     723     -14
 9  2013     1     1 winter      557         600      -3     838     846      -8
10  2013     1     1 winter      558         600      -2     753     745       8
# … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Creating new columns

Multiple columns

mutate(flights, speed = distance / air_time * 60, .after = dep_time) %>%
  mutate(month = if_else(month < 10, paste0("0", month), as.character(month), NA),
         day = if_else(day < 10, paste0("0", day), as.character(day), NA),
         date = paste(year, month, day, sep = "-"), 
         .after = day)
# A tibble: 336,776 × 21
    year month day   date  dep_t…¹ speed sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶
   <int> <chr> <chr> <chr>   <int> <dbl>   <int>   <dbl>   <int>   <int>   <dbl>
 1  2013 01    01    2013…     517  370.     515       2     830     819      11
 2  2013 01    01    2013…     533  374.     529       4     850     830      20
 3  2013 01    01    2013…     542  408.     540       2     923     850      33
 4  2013 01    01    2013…     544  517.     545      -1    1004    1022     -18
 5  2013 01    01    2013…     554  394.     600      -6     812     837     -25
 6  2013 01    01    2013…     554  288.     558      -4     740     728      12
 7  2013 01    01    2013…     555  404.     600      -5     913     854      19
 8  2013 01    01    2013…     557  259.     600      -3     709     723     -14
 9  2013 01    01    2013…     557  405.     600      -3     838     846      -8
10  2013 01    01    2013…     558  319.     600      -2     753     745       8
# … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​dep_time, ²​sched_dep_time, ³​dep_delay, ⁴​arr_time, ⁵​sched_arr_time,
#   ⁶​arr_delay

Creating new columns

Remove old columns

mutate(flights, date = paste(year, month, day, sep = "-"), 
       .before = 1, 
       .keep = "unused")
# A tibble: 336,776 × 17
   date   dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum
   <chr>    <int>   <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>  
 1 2013-…     517     515       2     830     819      11 UA        1545 N14228 
 2 2013-…     533     529       4     850     830      20 UA        1714 N24211 
 3 2013-…     542     540       2     923     850      33 AA        1141 N619AA 
 4 2013-…     544     545      -1    1004    1022     -18 B6         725 N804JB 
 5 2013-…     554     600      -6     812     837     -25 DL         461 N668DN 
 6 2013-…     554     558      -4     740     728      12 UA        1696 N39463 
 7 2013-…     555     600      -5     913     854      19 B6         507 N516JB 
 8 2013-…     557     600      -3     709     723     -14 EV        5708 N829AS 
 9 2013-…     557     600      -3     838     846      -8 B6          79 N593JB 
10 2013-…     558     600      -2     753     745       8 AA         301 N3ALAA 
# … with 336,766 more rows, 7 more variables: origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#   and abbreviated variable names ¹​dep_time, ²​sched_dep_time, ³​dep_delay,
#   ⁴​arr_time, ⁵​sched_arr_time, ⁶​arr_delay

Creating new columns

Keep only new column

mutate(flights, date = paste(year, month, day, sep = "-"), 
       .keep = "none")
# A tibble: 336,776 × 1
   date    
   <chr>   
 1 2013-1-1
 2 2013-1-1
 3 2013-1-1
 4 2013-1-1
 5 2013-1-1
 6 2013-1-1
 7 2013-1-1
 8 2013-1-1
 9 2013-1-1
10 2013-1-1
# … with 336,766 more rows

Working with multiple columns

Apply functions to multiple columns

mutate(flights, min_dep_time = min(dep_time, sched_dep_time, na.rm = TRUE), 
       .after = arr_time)
# A tibble: 336,776 × 20
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ min_d…⁴ sched…⁵ arr_d…⁶
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <int>   <dbl>
 1  2013     1     1      517        515       2     830       1     819      11
 2  2013     1     1      533        529       4     850       1     830      20
 3  2013     1     1      542        540       2     923       1     850      33
 4  2013     1     1      544        545      -1    1004       1    1022     -18
 5  2013     1     1      554        600      -6     812       1     837     -25
 6  2013     1     1      554        558      -4     740       1     728      12
 7  2013     1     1      555        600      -5     913       1     854      19
 8  2013     1     1      557        600      -3     709       1     723     -14
 9  2013     1     1      557        600      -3     838       1     846      -8
10  2013     1     1      558        600      -2     753       1     745       8
# … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​min_dep_time, ⁵​sched_arr_time,
#   ⁶​arr_delay

Apply functions to multiple columns

dplyr::rowwise()

rowwise(flights) %>% 
  mutate(min_dep_time = min(dep_time, sched_dep_time), .after = arr_time)
# A tibble: 336,776 × 20
# Rowwise: 
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ min_d…⁴ sched…⁵ arr_d…⁶
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <int>   <dbl>
 1  2013     1     1      517        515       2     830     515     819      11
 2  2013     1     1      533        529       4     850     529     830      20
 3  2013     1     1      542        540       2     923     540     850      33
 4  2013     1     1      544        545      -1    1004     544    1022     -18
 5  2013     1     1      554        600      -6     812     554     837     -25
 6  2013     1     1      554        558      -4     740     554     728      12
 7  2013     1     1      555        600      -5     913     555     854      19
 8  2013     1     1      557        600      -3     709     557     723     -14
 9  2013     1     1      557        600      -3     838     557     846      -8
10  2013     1     1      558        600      -2     753     558     745       8
# … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​min_dep_time, ⁵​sched_arr_time,
#   ⁶​arr_delay

Changing multiple columns

dplyr::across()

mutate(flights, across(contains("_time"), as.character))
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int> <chr>    <chr>        <dbl> <chr>   <chr>     <dbl> <chr>  
 1  2013     1     1 517      515              2 830     819          11 UA     
 2  2013     1     1 533      529              4 850     830          20 UA     
 3  2013     1     1 542      540              2 923     850          33 AA     
 4  2013     1     1 544      545             -1 1004    1022        -18 B6     
 5  2013     1     1 554      600             -6 812     837         -25 DL     
 6  2013     1     1 554      558             -4 740     728          12 UA     
 7  2013     1     1 555      600             -5 913     854          19 B6     
 8  2013     1     1 557      600             -3 709     723         -14 EV     
 9  2013     1     1 557      600             -3 838     846          -8 B6     
10  2013     1     1 558      600             -2 753     745           8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <chr>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Changing multiple columns

dplyr::across()

What if you need to pass arguments to your function?

mutate(flights, across(contains("_time"), ~ .x / 60))
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <dbl>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  
 1  2013     1     1     8.62       8.58       2    13.8    13.6      11 UA     
 2  2013     1     1     8.88       8.82       4    14.2    13.8      20 UA     
 3  2013     1     1     9.03       9          2    15.4    14.2      33 AA     
 4  2013     1     1     9.07       9.08      -1    16.7    17.0     -18 B6     
 5  2013     1     1     9.23      10         -6    13.5    14.0     -25 DL     
 6  2013     1     1     9.23       9.3       -4    12.3    12.1      12 UA     
 7  2013     1     1     9.25      10         -5    15.2    14.2      19 B6     
 8  2013     1     1     9.28      10         -3    11.8    12.0     -14 EV     
 9  2013     1     1     9.28      10         -3    14.0    14.1      -8 B6     
10  2013     1     1     9.3       10         -2    12.6    12.4       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Changing multiple columns

dplyr::across()

What if you need to pass arguments to your function?

print(mutate(flights, across(contains("_time"), ~ .x / 60)), n = 5)

Note

  • Start with ~
  • Replace where the column name should be with .x
  • Note dividing these numbers by 60 doesn’t make sense—just an example

Mental model of mutating columns

Let’s code!

Mutating data [Rmd]