Pivoting data

Jeff Stevens

2023-02-24

Review

How do I wrangle that?

  • return subset of rows based on position in data frame
  • return subset of rows based on column values
  • return subset of columns based on position in data frame
  • return subset of columns based on name
  • reorder rows by column values
  • reorder columns manually
  • create new columns
  • aggregate rows with summary functions

Mental model of data analysis

The problem

What’s different between these data sets?

What needs to happen to create data2 from data1?

data1
# A tibble: 4 × 4
     id  cond1  cond2  cond3
  <int>  <dbl>  <dbl>  <dbl>
1     1 0.197  0.871  0.623 
2     2 0.0414 0.0636 0.0596
3     3 0.880  0.523  0.366 
4     4 0.913  0.530  0.625 
data2
   id condition   response
1   1     cond1 0.19666041
2   1     cond2 0.87096339
3   1     cond3 0.62344725
4   2     cond1 0.04137505
5   2     cond2 0.06362335
6   2     cond3 0.05964197
7   3     cond1 0.88014147
8   3     cond2 0.52252628
9   3     cond3 0.36627275
10  4     cond1 0.91305613
11  4     cond2 0.52961833
12  4     cond3 0.62510585

Tidy data

Mental model of tidy data

Tidying data with {tidyr} and {dplyr}

What is tidy data?

  1. Each variable has its own column

  2. Each observation has its own row

  3. Each value has its own cell

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4b
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

Tidy data

  • Think about tidy from a model perspective

  • Tidyverse assumes tidy data

  • Easier to analyze and plot tidy data

  • But sometimes easier to store non-tidy data

Pivoting data

Pivoting data

Pivoting data

Wide data

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Why is table4a not tidy?

Wide data

Use pivot_longer()

pivot_longer(table4a, cols = c(`1999`, `2000`), 
             names_to = "year", values_to = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Long data

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Why is table2 not tidy?

Long data

Use pivot_wider()

pivot_wider(table2, id_cols = c("country", "year"), 
            names_from = type, values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Solving the problem

What code turns data1 into data2? And vice versa?

data1
# A tibble: 4 × 4
     id  cond1  cond2  cond3
  <int>  <dbl>  <dbl>  <dbl>
1     1 0.197  0.871  0.623 
2     2 0.0414 0.0636 0.0596
3     3 0.880  0.523  0.366 
4     4 0.913  0.530  0.625 
data2
   id condition   response
1   1     cond1 0.19666041
2   1     cond2 0.87096339
3   1     cond3 0.62344725
4   2     cond1 0.04137505
5   2     cond2 0.06362335
6   2     cond3 0.05964197
7   3     cond1 0.88014147
8   3     cond2 0.52252628
9   3     cond3 0.36627275
10  4     cond1 0.91305613
11  4     cond2 0.52961833
12  4     cond3 0.62510585

Let’s code!

Pivoting data [Rmd]