2023-03-06
What’s different between these data sets?
What is needed to create data2 from data1?
data1
# A tibble: 12 × 3
val1 val2 val3
<dbl> <dbl> <dbl>
1 0.773 0.470 0.00431
2 0.827 0.751 0.00923
3 0.746 0.220 0.00814
4 0.953 0.199 0.00767
5 0.298 0.894 0.00221
6 0.860 0.0149 0.00499
7 0.0460 0.956 0.00779
8 0.947 0.162 0.00875
9 0.511 0.189 0.00986
10 0.712 0.0969 0.00862
11 0.944 0.370 0.00209
12 0.834 0.585 0.00420
data2
# A tibble: 12 × 3
percent_val1 log_val2 val3
<dbl> <dbl> <chr>
1 77.3 -0.756 4.3e-03
2 82.7 -0.287 9.2e-03
3 74.6 -1.51 8.1e-03
4 95.3 -1.61 7.7e-03
5 29.8 -0.113 2.2e-03
6 86.0 -4.20 5.0e-03
7 4.6 -0.0452 7.8e-03
8 94.7 -1.82 8.7e-03
9 51.1 -1.67 9.9e-03
10 71.2 -2.33 8.6e-03
11 94.4 -0.994 2.1e-03
12 83.4 -0.536 4.2e-03
Doubles are floating point numbers
Floating point numbers ≈ scientific notation
Computer memory is limited, so numbers cannot be stored with infinite precision; most floating point numbers are stored as close approximations
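A minimal sketch of what this imprecision looks like in practice, using base R only. The variable names are illustrative; the point is that `==` tests exact equality, while `all.equal()` compares within a small tolerance.

```r
# Mathematically sqrt(2)^2 is 2, but the stored double is only an approximation
x <- sqrt(2)^2

exact_match  <- x == 2                    # FALSE: the stored values differ slightly
close_enough <- isTRUE(all.equal(x, 2))   # TRUE: equal within a tolerance

# Printing with more digits than the default reveals the rounding error
format(x, digits = 17)
```

When comparing doubles, a tolerance-based check is usually safer than `==`.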
As a reminder, we’ve already seen how to use dplyr::count()
library(dplyr)
library(nycflights13)  # provides the flights data set
count(flights, carrier)
# A tibble: 16 × 2
carrier n
<chr> <int>
1 9E 18460
2 AA 32729
3 AS 714
4 B6 54635
5 DL 48110
6 EV 54173
7 F9 685
8 FL 3260
9 HA 342
10 MQ 26397
11 OO 32
12 UA 58665
13 US 20536
14 VX 5162
15 WN 12275
16 YV 601
We can also automatically sort by count.
count(flights, carrier, sort = TRUE)
# A tibble: 16 × 2
carrier n
<chr> <int>
1 UA 58665
2 B6 54635
3 EV 54173
4 DL 48110
5 AA 32729
6 MQ 26397
7 US 20536
8 9E 18460
9 WN 12275
10 VX 5162
11 FL 3260
12 AS 714
13 F9 685
14 YV 601
15 HA 342
16 OO 32
And sum a variable within each group with wt, instead of just counting rows
count(flights, carrier, wt = distance)
# A tibble: 16 × 2
carrier n
<chr> <dbl>
1 9E 9788152
2 AA 43864584
3 AS 1715028
4 B6 58384137
5 DL 59507317
6 EV 30498951
7 F9 1109700
8 FL 2167344
9 HA 1704186
10 MQ 15033955
11 OO 16026
12 UA 89705524
13 US 11365778
14 VX 12902327
15 WN 12229203
16 YV 225395
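To see what `wt` does, it may help to compare it with the longer `group_by()` plus `summarise()` form. This is a sketch on a small made-up tibble (`trips` is hypothetical, standing in for `flights`), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for flights
trips <- tibble(
  carrier  = c("AA", "AA", "UA", "UA", "UA"),
  distance = c(100, 250, 300, 50, 150)
)

# count() with wt sums distance within each carrier
by_count <- count(trips, carrier, wt = distance)

# The same totals written out with group_by() and summarise()
by_summarise <- trips |>
  group_by(carrier) |>
  summarise(n = sum(distance))
```

Both results have one row per carrier, with `n` holding the summed distance (350 for AA, 500 for UA).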
Remember: n() counts rows inside a summarise()
n_distinct() counts the distinct values of a variable within a group
flights |>
  group_by(dest) |>
  summarise(carriers = n_distinct(carrier))
# A tibble: 105 × 2
dest carriers
<chr> <int>
1 ABQ 1
2 ACK 1
3 ALB 1
4 ANC 1
5 ATL 7
6 AUS 6
7 AVL 2
8 BDL 2
9 BGR 2
10 BHM 1
# … with 95 more rows
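A side-by-side sketch of `n()` and `n_distinct()` on a small made-up tibble (`trips` and its values are hypothetical), assuming dplyr is installed. The two functions answer different questions: how many rows, versus how many different values.

```r
library(dplyr)

# Hypothetical toy data: five flights to two destinations
trips <- tibble(
  dest    = c("ATL", "ATL", "ATL", "AVL", "AVL"),
  carrier = c("DL",  "DL",  "FL",  "EV",  "EV")
)

summary_by_dest <- trips |>
  group_by(dest) |>
  summarise(
    flights  = n(),                  # number of rows per destination
    carriers = n_distinct(carrier)   # number of different carriers per destination
  )
```

Here ATL has 3 flights but only 2 carriers, and AVL has 2 flights from a single carrier.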
This trick can be used for any logical vector
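Assuming the trick in question is the common idiom of passing a logical vector to `sum()` or `mean()` (where `TRUE` counts as 1 and `FALSE` as 0), a minimal base R sketch looks like this. The `delay` vector is made up for illustration.

```r
# Hypothetical delays in minutes, with some missing values
delay <- c(5, NA, 30, 0, NA, 45)

n_missing     <- sum(is.na(delay))             # how many values are missing
share_missing <- mean(is.na(delay))            # what proportion is missing
n_long        <- sum(delay > 20, na.rm = TRUE) # how many delays exceed 20 minutes
```

Any expression that yields `TRUE`/`FALSE` values can be counted or averaged the same way.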
Mathematical operators are vectorised: they apply to every element of a vector, and a shorter operand (such as a single number) is recycled to match
0:10 * 5
[1] 0 5 10 15 20 25 30 35 40 45 50
[1] 0.000000 3.162278 4.472136 5.477226 6.324555 7.071068 7.745967
[8] 8.366600 8.944272 9.486833 10.000000
[1] -0.3921567 -0.3017750 -2.4044192 -0.6672081 -0.3630729 -1.9276563
[7] -1.0672688 -4.2340097 -0.8112143 -1.0013072
[1] 0.8077234 1.1635879 0.1599809 0.8139120 0.5528058 0.6495491 1.2270613
[8] 0.9685195 0.7673122 0.8240033
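A short base R sketch of vectorised arithmetic and recycling. The expressions here are illustrative; the `sqrt()` call gives the same values as the second printed vector above, though the expression originally used is not shown.

```r
x <- 0:10

times_five <- x * 5          # the single 5 is recycled across all 11 elements
roots      <- sqrt(x * 10)   # functions such as sqrt() also work element by element

# Recycling with a length-2 vector: 1 and 10 alternate across the four elements
alternating <- c(1, 2, 3, 4) * c(1, 10)
```

Recycling a vector longer than one element is easy to get wrong, so it is usually clearest to recycle only single values.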
Control decimal places with round(), or significant digits with signif()
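A quick base R sketch of the difference between rounding to decimal places and rounding to significant digits. The input values are arbitrary.

```r
x <- 123.456

one_decimal <- round(x, 1)    # 123.5: one digit after the decimal point
nearest_ten <- round(x, -1)   # 120: negative digits round to tens, hundreds, ...
two_signif  <- signif(x, 2)   # 120: keep two significant digits

# For small numbers the two approaches diverge
small_round  <- round(0.00431, 3)    # 0.004
small_signif <- signif(0.00431, 2)   # 0.0043
```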
When numbers get too big, too small, or need other formatting, use format()
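A few illustrative `format()` calls in base R, with arbitrary input values. Note that `format()` returns character strings, so the result is for display rather than further arithmetic.

```r
# Scientific notation with two significant digits
tiny <- format(0.00431, scientific = TRUE, digits = 2)   # "4.3e-03"

# Thousands separators for large numbers
big <- format(1234567.891, big.mark = ",")               # "1,234,568"

# Always show at least two decimal places
money <- format(0.5, nsmall = 2)                         # "0.50"
```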
If you need to bin numbers into ranges, use cut()
[1] 26.550866 37.212390 57.285336 90.820779 20.168193 89.838968 94.467527
[8] 66.079779 62.911404 6.178627 20.597457 17.655675
[1] (0,33] (33,66] (33,66] (66,100] (0,33] (66,100] (66,100] (66,100]
[9] (33,66] (0,33] (0,33] (0,33]
Levels: (0,33] (33,66] (66,100]
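A sketch of a `cut()` call consistent with the output above, using the printed values as input. The break points `c(0, 33, 66, 100)` are inferred from the level labels, and the `labels` variant is an added illustration.

```r
# The twelve values printed above
x <- c(26.550866, 37.212390, 57.285336, 90.820779, 20.168193, 89.838968,
       94.467527, 66.079779, 62.911404, 6.178627, 20.597457, 17.655675)

# Each value is assigned to the interval it falls in; intervals are closed on the right
bins <- cut(x, breaks = c(0, 33, 66, 100))

# Optional: replace the interval notation with readable labels
labelled <- cut(x, breaks = c(0, 33, 66, 100), labels = c("low", "mid", "high"))

table(bins)
```

The result is a factor, so it can be passed straight to `count()` or used as a grouping variable.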
data1
# A tibble: 12 × 3
val1 val2 val3
<dbl> <dbl> <dbl>
1 0.773 0.470 0.00431
2 0.827 0.751 0.00923
3 0.746 0.220 0.00814
4 0.953 0.199 0.00767
5 0.298 0.894 0.00221
6 0.860 0.0149 0.00499
7 0.0460 0.956 0.00779
8 0.947 0.162 0.00875
9 0.511 0.189 0.00986
10 0.712 0.0969 0.00862
11 0.944 0.370 0.00209
12 0.834 0.585 0.00420
data2
# A tibble: 12 × 3
percent_val1 log_val2 val3
<dbl> <dbl> <chr>
1 77.3 -0.756 4.3e-03
2 82.7 -0.287 9.2e-03
3 74.6 -1.51 8.1e-03
4 95.3 -1.61 7.7e-03
5 29.8 -0.113 2.2e-03
6 86.0 -4.20 5.0e-03
7 4.6 -0.0452 7.8e-03
8 94.7 -1.82 8.7e-03
9 51.1 -1.67 9.9e-03
10 71.2 -2.33 8.6e-03
11 94.4 -0.994 2.1e-03
12 83.4 -0.536 4.2e-03
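One way data2 could be produced from data1, sketched on the first three rows as printed above and assuming dplyr is installed. The column names suggest the transformations (multiply by 100, natural log, format as text), but this is a reconstruction rather than the original code. The printed data1 values are rounded, so the logs can differ from data2 in the third digit.

```r
library(dplyr)

# First three rows of data1, using the rounded values as displayed
data1 <- tibble(
  val1 = c(0.773, 0.827, 0.746),
  val2 = c(0.470, 0.751, 0.220),
  val3 = c(0.00431, 0.00923, 0.00814)
)

data2 <- data1 |>
  mutate(
    percent_val1 = val1 * 100,                               # proportion to percent
    log_val2     = log(val2),                                # natural log
    val3         = format(val3, scientific = TRUE, digits = 2)  # double to <chr>
  ) |>
  select(percent_val1, log_val2, val3)
```

The `<chr>` type of val3 in data2 is the giveaway that `format()` was involved, since it always returns text.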