Data Wrangling and Tidying
January 23, 2026
00. Quick review
01. How do I aggregate by collapsing?
02. How do I aggregate without collapsing?
03. How do I tidy data?
| Task | {dplyr} verb |
|---|---|
| Subset rows | filter() |
| Subset columns | select() |
| Sort | arrange() |
| Create a new variable | mutate() |
|> or %>%# A tibble: 276 × 4
track album duration_sec duration_min
<chr> <chr> <dbl> <dbl>
1 Come And Buy My Toys David Bowie 129 2.15
2 When I Live My Dream David Bowie 204 3.4
3 Love You Till Tuesday David Bowie 194 3.23
4 Uncle Arthur David Bowie 131 2.18
5 Rubber Band David Bowie 139 2.32
6 Sell Me A Coat David Bowie 182 3.03
7 There Is A Happy Land David Bowie 197 3.28
8 She's Got Medals David Bowie 146 2.43
9 We Are Hungry Men David Bowie 180 3
10 Maid Of Bond Street David Bowie 105 1.75
# ℹ 266 more rows
if_else()Syntax:
# A tibble: 276 × 3
track energy energy_level
<chr> <dbl> <chr>
1 Come And Buy My Toys 0.183 low
2 When I Live My Dream 0.19 low
3 Love You Till Tuesday 0.346 low
4 Uncle Arthur 0.337 low
5 Rubber Band 0.312 low
6 Sell Me A Coat 0.256 low
7 There Is A Happy Land 0.338 low
8 She's Got Medals 0.417 low
9 We Are Hungry Men 0.365 low
10 Maid Of Bond Street 0.326 low
11 Please Mr. Gravedigger 0.523 high
12 Join The Gang 0.481 low
13 Space Oddity 0.403 low
14 Letter to Hermione 0.264 low
15 God Knows I'm Good 0.436 low
16 Unwashed and Somewhat Slightly Dazed 0.755 high
17 Cygnet Committee 0.537 high
18 Memory of a Free Festival 0.396 low
19 Janine 0.664 high
20 An Occasional Dream 0.341 low
21 Wild Eyed Boy from Freecloud 0.3 low
22 The Man Who Sold the World 0.387 low
23 The Width of a Circle 0.43 low
24 All the Madmen 0.424 low
25 Black Country Rock 0.588 high
26 After All 0.154 low
27 Running Gun Blues 0.404 low
28 The Supermen 0.594 high
29 Saviour Machine 0.58 high
30 She Shook Me Cold 0.531 high
31 Life on Mars? 0.384 low
32 Changes 0.308 low
33 Oh! You Pretty Things 0.277 low
34 Queen Bitch 0.637 high
35 Kooks 0.278 low
# ℹ 241 more rows
02:00
| by album |
| by year |
| by time_signature |
| by key |
| by long_short |
Counting observations by group
# A tibble: 26 × 2
album n
<chr> <int>
1 "\"Heroes\"" 10
2 "1. Outside (The Nathan Adler Diaries: A Hyper Cycle)" 19
3 "Aladdin Sane" 10
4 "Black Tie White Noise" 12
5 "Blackstar" 7
6 "Buddha of Suburbia" 10
7 "David Bowie" 12
8 "David Bowie (aka Space Oddity)" 9
9 "Diamond Dogs" 11
10 "Earthling" 9
# ℹ 16 more rows
Counting observations by group
# A tibble: 2 × 2
long_short n
<chr> <int>
1 long 142
2 short 134
Other calculations?
Use group_by() and summarise() / summarize()
# A tibble: 26 × 2
album avg_energy
<chr> <dbl>
1 "\"Heroes\"" 0.672
2 "1. Outside (The Nathan Adler Diaries: A Hyper Cycle)" 0.654
3 "Aladdin Sane" 0.710
4 "Black Tie White Noise" 0.804
5 "Blackstar" 0.753
6 "Buddha of Suburbia" 0.667
7 "David Bowie" 0.340
8 "David Bowie (aka Space Oddity)" 0.455
9 "Diamond Dogs" 0.527
10 "Earthling" 0.835
# ℹ 16 more rows
You can add on a count using n()
# A tibble: 26 × 3
album avg_energy n
<chr> <dbl> <int>
1 "\"Heroes\"" 0.672 10
2 "1. Outside (The Nathan Adler Diaries: A Hyper Cycle)" 0.654 19
3 "Aladdin Sane" 0.710 10
4 "Black Tie White Noise" 0.804 12
5 "Blackstar" 0.753 7
6 "Buddha of Suburbia" 0.667 10
7 "David Bowie" 0.340 12
8 "David Bowie (aka Space Oddity)" 0.455 9
9 "Diamond Dogs" 0.527 11
10 "Earthling" 0.835 9
# ℹ 16 more rows
Other calculations too
# A tibble: 26 × 4
album avg_energy med_pop n
<chr> <dbl> <dbl> <int>
1 "\"Heroes\"" 0.672 35 10
2 "1. Outside (The Nathan Adler Diaries: A Hyper Cycl… 0.654 9 19
3 "Aladdin Sane" 0.710 43.5 10
4 "Black Tie White Noise" 0.804 21.5 12
5 "Blackstar" 0.753 41 7
6 "Buddha of Suburbia" 0.667 17.5 10
7 "David Bowie" 0.340 25 12
8 "David Bowie (aka Space Oddity)" 0.455 34 9
9 "Diamond Dogs" 0.527 37 11
10 "Earthling" 0.835 16 9
# ℹ 16 more rows
Grouping by two variables
# A tibble: 47 × 5
# Groups: album [26]
album long_short avg_energy med_pop n
<chr> <chr> <dbl> <dbl> <int>
1 "\"Heroes\"" long 0.430 34 3
2 "\"Heroes\"" short 0.775 35 7
3 "1. Outside (The Nathan Adler Diaries: A… long 0.755 9 11
4 "1. Outside (The Nathan Adler Diaries: A… short 0.516 6.5 8
5 "Aladdin Sane" long 0.731 41 5
6 "Aladdin Sane" short 0.690 47 5
7 "Black Tie White Noise" long 0.823 22 11
8 "Black Tie White Noise" short 0.592 21 1
9 "Blackstar" long 0.753 41 7
10 "Buddha of Suburbia" long 0.667 17.5 10
# ℹ 37 more rows
Important: If the column you are basing your calculation on includes NA, your result will be NA
Important: If the column you are basing your calculation on includes NA, your result will be NA
01:30
group_by() and mutate()Adds a new variable to bowie
# A tibble: 276 × 4
# Groups: album [26]
track album duration_sec total_duration
<chr> <chr> <dbl> <dbl>
1 Come And Buy My Toys David Bowie 129 1904
2 When I Live My Dream David Bowie 204 1904
3 Love You Till Tuesday David Bowie 194 1904
4 Uncle Arthur David Bowie 131 1904
5 Rubber Band David Bowie 139 1904
6 Sell Me A Coat David Bowie 182 1904
7 There Is A Happy Land David Bowie 197 1904
8 She's Got Medals David Bowie 146 1904
9 We Are Hungry Men David Bowie 180 1904
10 Maid Of Bond Street David Bowie 105 1904
11 Please Mr. Gravedigger David Bowie 156 1904
12 Join The Gang David Bowie 141 1904
13 Space Oddity David Bowie… 318 2775
14 Letter to Hermione David Bowie… 156 2775
15 God Knows I'm Good David Bowie… 202 2775
16 Unwashed and Somewhat Slightly Dazed David Bowie… 412 2775
17 Cygnet Committee David Bowie… 579 2775
18 Memory of a Free Festival David Bowie… 430 2775
19 Janine David Bowie… 205 2775
20 An Occasional Dream David Bowie… 180 2775
21 Wild Eyed Boy from Freecloud David Bowie… 293 2775
22 The Man Who Sold the World The Man Who… 241 2475
23 The Width of a Circle The Man Who… 492 2475
24 All the Madmen The Man Who… 343 2475
25 Black Country Rock The Man Who… 216 2475
26 After All The Man Who… 240 2475
27 Running Gun Blues The Man Who… 195 2475
28 The Supermen The Man Who… 221 2475
29 Saviour Machine The Man Who… 270 2475
30 She Shook Me Cold The Man Who… 257 2475
31 Life on Mars? Hunky Dory 235 2505
32 Changes Hunky Dory 217 2505
33 Oh! You Pretty Things Hunky Dory 193 2505
34 Queen Bitch Hunky Dory 200 2505
35 Kooks Hunky Dory 173 2505
# ℹ 241 more rows
Adds two new variables to bowie
# A tibble: 276 × 5
# Groups: album [26]
track album duration_sec total_duration percent_duration
<chr> <chr> <dbl> <dbl> <dbl>
1 Come And Buy My Toys Davi… 129 1904 0.0678
2 When I Live My Dream Davi… 204 1904 0.107
3 Love You Till Tuesday Davi… 194 1904 0.102
4 Uncle Arthur Davi… 131 1904 0.0688
5 Rubber Band Davi… 139 1904 0.0730
6 Sell Me A Coat Davi… 182 1904 0.0956
7 There Is A Happy Land Davi… 197 1904 0.103
8 She's Got Medals Davi… 146 1904 0.0767
9 We Are Hungry Men Davi… 180 1904 0.0945
10 Maid Of Bond Street Davi… 105 1904 0.0551
11 Please Mr. Gravedigger Davi… 156 1904 0.0819
12 Join The Gang Davi… 141 1904 0.0741
13 Space Oddity Davi… 318 2775 0.115
14 Letter to Hermione Davi… 156 2775 0.0562
15 God Knows I'm Good Davi… 202 2775 0.0728
16 Unwashed and Somewhat Sli… Davi… 412 2775 0.148
17 Cygnet Committee Davi… 579 2775 0.209
18 Memory of a Free Festival Davi… 430 2775 0.155
19 Janine Davi… 205 2775 0.0739
20 An Occasional Dream Davi… 180 2775 0.0649
21 Wild Eyed Boy from Freecl… Davi… 293 2775 0.106
22 The Man Who Sold the World The … 241 2475 0.0974
23 The Width of a Circle The … 492 2475 0.199
24 All the Madmen The … 343 2475 0.139
25 Black Country Rock The … 216 2475 0.0873
26 After All The … 240 2475 0.0970
27 Running Gun Blues The … 195 2475 0.0788
28 The Supermen The … 221 2475 0.0893
29 Saviour Machine The … 270 2475 0.109
30 She Shook Me Cold The … 257 2475 0.104
31 Life on Mars? Hunk… 235 2505 0.0938
32 Changes Hunk… 217 2505 0.0866
33 Oh! You Pretty Things Hunk… 193 2505 0.0770
34 Queen Bitch Hunk… 200 2505 0.0798
35 Kooks Hunk… 173 2505 0.0691
# ℹ 241 more rows
02:30
Task
Alter the bowie data frame to add a variable that calculates average tempo by album (without collapsing).
Bonus: Can you determine if the tempo of the song “Changes” is higher or lower than the “Hunky Dory” album average?

Is this data frame tidy?
# A tibble: 5 × 4
region cheese milk lactose
<chr> <dbl> <dbl> <dbl>
1 Mexico 110819 224890 22054
2 South Korea 43628 1945 8957
3 Japan 37428 1099 27981
4 Australia and NZ 22705 2045 37925
5 Canada 13343 1825 4184
What about now?
# A tibble: 15 × 3
region category metric_tons
<chr> <chr> <dbl>
1 Mexico cheese 110819
2 Mexico milk 224890
3 Mexico lactose 22054
4 South Korea cheese 43628
5 South Korea milk 1945
6 South Korea lactose 8957
7 Japan cheese 37428
8 Japan milk 1099
9 Japan lactose 27981
10 Australia and NZ cheese 22705
11 Australia and NZ milk 2045
12 Australia and NZ lactose 37925
13 Canada cheese 13343
14 Canada milk 1825
15 Canada lactose 4184
dairy_wide
# A tibble: 5 × 4
region cheese milk lactose
<chr> <dbl> <dbl> <dbl>
1 Mexico 110819 224890 22054
2 South Korea 43628 1945 8957
3 Japan 37428 1099 27981
4 Australia and NZ 22705 2045 37925
5 Canada 13343 1825 4184
dairy_long
# A tibble: 15 × 3
region category metric_tons
<chr> <chr> <dbl>
1 Mexico cheese 110819
2 Mexico milk 224890
3 Mexico lactose 22054
4 South Korea cheese 43628
5 South Korea milk 1945
6 South Korea lactose 8957
7 Japan cheese 37428
8 Japan milk 1099
9 Japan lactose 27981
10 Australia and NZ cheese 22705
11 Australia and NZ milk 2045
12 Australia and NZ lactose 37925
13 Canada cheese 13343
14 Canada milk 1825
15 Canada lactose 4184
dairy_wide
# A tibble: 5 × 4
region cheese milk lactose
<chr> <dbl> <dbl> <dbl>
1 Mexico 110819 224890 22054
2 South Korea 43628 1945 8957
3 Japan 37428 1099 27981
4 Australia and NZ 22705 2045 37925
5 Canada 13343 1825 4184
dairy_long
# A tibble: 15 × 3
region category metric_tons
<chr> <chr> <dbl>
1 Mexico cheese 110819
2 Mexico milk 224890
3 Mexico lactose 22054
4 South Korea cheese 43628
5 South Korea milk 1945
6 South Korea lactose 8957
7 Japan cheese 37428
8 Japan milk 1099
9 Japan lactose 27981
10 Australia and NZ cheese 22705
11 Australia and NZ milk 2045
12 Australia and NZ lactose 37925
13 Canada cheese 13343
14 Canada milk 1825
15 Canada lactose 4184
dairy_wide
# A tibble: 5 × 4
region cheese milk lactose
<chr> <dbl> <dbl> <dbl>
1 Mexico 110819 224890 22054
2 South Korea 43628 1945 8957
3 Japan 37428 1099 27981
4 Australia and NZ 22705 2045 37925
5 Canada 13343 1825 4184
dairy_long
# A tibble: 15 × 3
region category metric_tons
<chr> <chr> <dbl>
1 Mexico cheese 110819
2 Mexico milk 224890
3 Mexico lactose 22054
4 South Korea cheese 43628
5 South Korea milk 1945
6 South Korea lactose 8957
7 Japan cheese 37428
8 Japan milk 1099
9 Japan lactose 27981
10 Australia and NZ cheese 22705
11 Australia and NZ milk 2045
12 Australia and NZ lactose 37925
13 Canada cheese 13343
14 Canada milk 1825
15 Canada lactose 4184
03:00
Task
Pivot the nat_parks data frame longer so that year and visitors each make a column.
Hint: Pivot the year columns only. To specify them, you can use either of these structures cols = start:stop
cols = -c(column1, column2)
Hint
# A tibble: 5 × 4
region cheese milk lactose
<chr> <dbl> <dbl> <dbl>
1 Mexico 110819 224890 22054
2 South Korea 43628 1945 8957
3 Japan 37428 1099 27981
4 Australia and NZ 22705 2045 37925
5 Canada 13343 1825 4184
Try the extended exercises!
Come next week!
Email askdata@duke.edu
Cheatsheets