2023-03-06
What’s different between these data sets?
What is needed to create data2 from data1?
data1
# A tibble: 12 × 3
     val1   val2    val3
    <dbl>  <dbl>   <dbl>
 1 0.773  0.470  0.00431
 2 0.827  0.751  0.00923
 3 0.746  0.220  0.00814
 4 0.953  0.199  0.00767
 5 0.298  0.894  0.00221
 6 0.860  0.0149 0.00499
 7 0.0460 0.956  0.00779
 8 0.947  0.162  0.00875
 9 0.511  0.189  0.00986
10 0.712  0.0969 0.00862
11 0.944  0.370  0.00209
12 0.834  0.585  0.00420
data2
# A tibble: 12 × 3
   percent_val1 log_val2 val3   
          <dbl>    <dbl> <chr>  
 1         77.3  -0.756  4.3e-03
 2         82.7  -0.287  9.2e-03
 3         74.6  -1.51   8.1e-03
 4         95.3  -1.61   7.7e-03
 5         29.8  -0.113  2.2e-03
 6         86.0  -4.20   5.0e-03
 7          4.6  -0.0452 7.8e-03
 8         94.7  -1.82   8.7e-03
 9         51.1  -1.67   9.9e-03
10         71.2  -2.33   8.6e-03
11         94.4  -0.994  2.1e-03
12         83.4  -0.536  4.2e-03
Doubles are floating point numbers
Note
Floating point number: a number without a fixed number of digits after the decimal point
Floating point numbers ≈ scientific notation
Computer memory is limited, so numbers cannot be stored with infinite precision; most floating point values are stored as close approximations
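The imprecision described in the note can be seen directly in base R. This is a minimal sketch; the variable name `x` and the tolerance `1e-9` are chosen only for illustration.

```r
# Doubles are stored in binary, so many decimal fractions are only approximated.
x <- 0.1 + 0.2

x == 0.3                     # exact comparison: FALSE
isTRUE(all.equal(x, 0.3))    # comparison with a small tolerance: TRUE
abs(x - 0.3) < 1e-9          # the same idea written by hand: TRUE

# Asking for more digits exposes the approximation that default printing hides.
format(x, digits = 17)

# Scientific notation is the same idea: a few significant digits plus an exponent.
format(123456789, scientific = TRUE)
```

When comparing doubles, a tolerance-based check such as `all.equal()` is usually safer than `==`.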
As a reminder, we’ve already seen how to use dplyr::count()
count(flights, carrier)
# A tibble: 16 × 2
   carrier     n
   <chr>   <int>
 1 9E      18460
 2 AA      32729
 3 AS        714
 4 B6      54635
 5 DL      48110
 6 EV      54173
 7 F9        685
 8 FL       3260
 9 HA        342
10 MQ      26397
11 OO         32
12 UA      58665
13 US      20536
14 VX       5162
15 WN      12275
16 YV        601
We can also automatically sort by count.
count(flights, carrier, sort = TRUE)
# A tibble: 16 × 2
   carrier     n
   <chr>   <int>
 1 UA      58665
 2 B6      54635
 3 EV      54173
 4 DL      48110
 5 AA      32729
 6 MQ      26397
 7 US      20536
 8 9E      18460
 9 WN      12275
10 VX       5162
11 FL       3260
12 AS        714
13 F9        685
14 YV        601
15 HA        342
16 OO         32
And sum up totals instead of just counting rows, using the wt argument
count(flights, carrier, wt = distance)
# A tibble: 16 × 2
   carrier        n
   <chr>      <dbl>
 1 9E       9788152
 2 AA      43864584
 3 AS       1715028
 4 B6      58384137
 5 DL      59507317
 6 EV      30498951
 7 F9       1109700
 8 FL       2167344
 9 HA       1704186
10 MQ      15033955
11 OO         16026
12 UA      89705524
13 US      11365778
14 VX      12902327
15 WN      12229203
16 YV        225395
Remember that n() counts rows inside summarise()
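As a small sketch of how `n()` relates to `count()`, the example below uses a made-up stand-in for the flights data. `toy_flights` and its values are hypothetical and are not taken from the real data set.

```r
library(dplyr)

# Hypothetical miniature stand-in for the flights data (values are made up).
toy_flights <- tibble(
  carrier  = c("AA", "AA", "UA", "UA", "UA", "DL"),
  distance = c(1000, 500, 2000, 300, 700, 900)
)

by_carrier <- toy_flights |>
  group_by(carrier) |>
  summarise(
    n = n(),                         # rows per group, like count(carrier)
    total_distance = sum(distance)   # like count(carrier, wt = distance)
  )

by_carrier

# The shortcut gives the same per-carrier counts.
count(toy_flights, carrier)
```

Writing the summary out by hand is longer than `count()`, but it allows several summaries to be computed in one step.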
n_distinct() counts the distinct values within a group
flights |> 
  group_by(dest) |> 
  summarise(carriers = n_distinct(carrier))
# A tibble: 105 × 2
   dest  carriers
   <chr>    <int>
 1 ABQ          1
 2 ACK          1
 3 ALB          1
 4 ANC          1
 5 ATL          7
 6 AUS          6
 7 AVL          2
 8 BDL          2
 9 BGR          2
10 BHM          1
# … with 95 more rows
This trick can be used for any logical vector
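The code for the trick mentioned here did not survive in this text. One common reading, assumed for this sketch, is summing or averaging a logical vector inside `summarise()`, since `TRUE` counts as 1 and `FALSE` as 0. The toy data and the names `n_missing` and `prop_missing` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: NA in dep_time marks a flight with no recorded departure.
toy_flights <- tibble(
  dest     = c("ATL", "ATL", "ATL", "BOS", "BOS"),
  dep_time = c(517, NA, 554, NA, NA)
)

missing_by_dest <- toy_flights |>
  group_by(dest) |>
  summarise(
    n_missing    = sum(is.na(dep_time)),   # number of TRUE values
    prop_missing = mean(is.na(dep_time))   # share of TRUE values
  )

missing_by_dest
```

The same pattern works with any condition that returns `TRUE` or `FALSE`, for example `sum(distance > 1000)`.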
Mathematical operators are recycled across all elements in a vector
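Recycling is easiest to see when the shorter operand has more than one element. The following is a minimal base R sketch; the vector `x` is made up for illustration.

```r
x <- c(1, 2, 3, 4)

# A single number is reused for every element of x.
times_ten <- x * 10          # 10 20 30 40

# A length-2 vector is repeated to match the length of x: 0, 100, 0, 100.
shifted <- x + c(0, 100)     # 1 102 3 104

times_ten
shifted
```

If the longer length is not a multiple of the shorter one, R still recycles but emits a warning, which is usually a sign of a mistake.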
0:10 * 5
 [1]  0  5 10 15 20 25 30 35 40 45 50
 [1]  0.000000  3.162278  4.472136  5.477226  6.324555  7.071068  7.745967
 [8]  8.366600  8.944272  9.486833 10.000000
 [1] -0.3921567 -0.3017750 -2.4044192 -0.6672081 -0.3630729 -1.9276563
 [7] -1.0672688 -4.2340097 -0.8112143 -1.0013072
 [1] 0.8077234 1.1635879 0.1599809 0.8139120 0.5528058 0.6495491 1.2270613
 [8] 0.9685195 0.7673122 0.8240033
Control decimal places with round() (and significant digits with signif())
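A short sketch of the difference between rounding to decimal places and rounding to significant digits, using base R only. The values in `x` are made up for illustration.

```r
x <- c(123.456, 0.004567)

two_decimals <- round(x, 2)    # 2 decimal places: 123.46 and 0
to_tens      <- round(x, -1)   # negative digits round to tens: 120 and 0
two_signif   <- signif(x, 2)   # 2 significant digits: 120 and 0.0046

two_decimals
to_tens
two_signif
```

`round()` fixes the position relative to the decimal point, while `signif()` keeps a fixed number of leading digits regardless of magnitude.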
When numbers get too big, too small, or need other formatting, use format()
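The calls below sketch a few common uses of `format()`. The numbers are made up, and the arguments shown (`big.mark`, `scientific`, `digits`, `nsmall`) are only a subset of what the function accepts.

```r
big   <- 123456789
small <- 0.00000012345

with_commas <- format(big, big.mark = ",")                  # "123,456,789"
sci_big     <- format(big, scientific = TRUE, digits = 3)   # "1.23e+08"
sci_small   <- format(small, scientific = TRUE, digits = 2)
two_places  <- format(0.5, nsmall = 2)                      # "0.50"

with_commas
sci_big
sci_small
two_places
```

Note that `format()` returns character strings, not numbers, so it is best applied as a last step before display.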
If you need to bin numbers into ranges, use cut()
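As a sketch of how binning with `cut()` works, the vector below copies the twelve values printed in this section's output. The break points `c(0, 33, 66, 100)` are inferred from the printed levels, and the labels `"low"`, `"mid"`, `"high"` are hypothetical.

```r
# The twelve values shown in this section's output, typed in by hand.
x <- c(26.550866, 37.212390, 57.285336, 90.820779, 20.168193, 89.838968,
       94.467527, 66.079779, 62.911404, 6.178627, 20.597457, 17.655675)

# Each interval is open on the left and closed on the right by default.
binned <- cut(x, breaks = c(0, 33, 66, 100))
binned

# How many values fall in each bin.
table(binned)

# Custom labels can replace the interval notation.
labelled <- cut(x, breaks = c(0, 33, 66, 100), labels = c("low", "mid", "high"))
labelled
```

The result is a factor, so values outside the breaks become `NA` rather than raising an error.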
 [1] 26.550866 37.212390 57.285336 90.820779 20.168193 89.838968 94.467527
 [8] 66.079779 62.911404  6.178627 20.597457 17.655675
 [1] (0,33]   (33,66]  (33,66]  (66,100] (0,33]   (66,100] (66,100] (66,100]
 [9] (33,66]  (0,33]   (0,33]   (0,33]  
Levels: (0,33] (33,66] (66,100]
data1
# A tibble: 12 × 3
     val1   val2    val3
    <dbl>  <dbl>   <dbl>
 1 0.773  0.470  0.00431
 2 0.827  0.751  0.00923
 3 0.746  0.220  0.00814
 4 0.953  0.199  0.00767
 5 0.298  0.894  0.00221
 6 0.860  0.0149 0.00499
 7 0.0460 0.956  0.00779
 8 0.947  0.162  0.00875
 9 0.511  0.189  0.00986
10 0.712  0.0969 0.00862
11 0.944  0.370  0.00209
12 0.834  0.585  0.00420
data2
# A tibble: 12 × 3
   percent_val1 log_val2 val3   
          <dbl>    <dbl> <chr>  
 1         77.3  -0.756  4.3e-03
 2         82.7  -0.287  9.2e-03
 3         74.6  -1.51   8.1e-03
 4         95.3  -1.61   7.7e-03
 5         29.8  -0.113  2.2e-03
 6         86.0  -4.20   5.0e-03
 7          4.6  -0.0452 7.8e-03
 8         94.7  -1.82   8.7e-03
 9         51.1  -1.67   9.9e-03
10         71.2  -2.33   8.6e-03
11         94.4  -0.994  2.1e-03
12         83.4  -0.536  4.2e-03
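One possible way to get from data1 to data2 is sketched below. This is a guess based on the printed tables, not necessarily the code that produced them: it assumes val1 was multiplied by 100 and rounded to one decimal place, val2 was transformed with the natural log, and val3 was converted to text with `format()`. Only the first three rows are typed in, using the rounded values as printed, so results may differ slightly in the last digit from data2.

```r
library(dplyr)

# First three rows of data1, copied from the printed (rounded) values.
data1 <- tibble(
  val1 = c(0.773, 0.827, 0.746),
  val2 = c(0.470, 0.751, 0.220),
  val3 = c(0.00431, 0.00923, 0.00814)
)

data2 <- data1 |>
  mutate(
    percent_val1 = round(val1 * 100, 1),                         # proportion to percent
    log_val2     = log(val2),                                    # natural log (assumed)
    val3         = format(val3, digits = 2, scientific = TRUE)   # number to <chr>
  ) |>
  select(percent_val1, log_val2, val3)

data2
```

The `<chr>` type of val3 in data2 is the clue that a formatting function was applied, since formatting turns numbers into strings.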