Data Wrangling and Tidying
September 18, 2025
00. Quick review
01. How do I aggregate by collapsing?
02. How do I aggregate without collapsing?
03. How do I tidy data?
| Task | {dplyr} verb |
|---|---|
| Subset rows | filter() |
| Subset columns | select() |
| Sort | arrange() |
| Create a new variable | mutate() |
Chaining verbs with the pipe: |> or %>%
# A tibble: 582 × 4
name album duration_ms duration_min
<chr> <chr> <dbl> <dbl>
1 Fortnight (feat. Post Malone) THE TORTU… 228965 3.82
2 The Tortured Poets Department THE TORTU… 293048 4.88
3 My Boy Only Breaks His Favorite Toys THE TORTU… 203801 3.40
4 Down Bad THE TORTU… 261228 4.35
5 So Long, London THE TORTU… 262974 4.38
6 But Daddy I Love Him THE TORTU… 340428 5.67
7 Fresh Out The Slammer THE TORTU… 210789 3.51
8 Florida!!! (feat. Florence + The Machine) THE TORTU… 215463 3.59
9 Guilty as Sin? THE TORTU… 254365 4.24
10 Who’s Afraid of Little Old Me? THE TORTU… 334084 5.57
# ℹ 572 more rows
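A pipeline along these lines would produce the output above. This is a sketch: the column names follow the slides, but the two rows here are copied from the printed tibble rather than loaded from the real tswift data.

```r
library(dplyr)

# Toy stand-in for the tswift data frame, using two rows from the slide.
tswift <- tibble(
  name        = c("Fortnight (feat. Post Malone)", "Down Bad"),
  album       = c("THE TORTURED POETS DEPARTMENT", "THE TORTURED POETS DEPARTMENT"),
  duration_ms = c(228965, 261228)
)

# mutate() creates a new column; 60000 ms = 1 minute.
tswift |>
  mutate(duration_min = duration_ms / 60000)
```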
Creating a new variable with if_else()
Syntax: if_else(condition, true, false)
# A tibble: 582 × 3
name energy energy_level
<chr> <dbl> <chr>
1 Fortnight (feat. Post Malone) 0.386 low
2 The Tortured Poets Department 0.428 low
3 My Boy Only Breaks His Favorite Toys 0.563 low
4 Down Bad 0.366 low
5 So Long, London 0.533 low
6 But Daddy I Love Him 0.72 high
7 Fresh Out The Slammer 0.483 low
8 Florida!!! (feat. Florence + The Machine) 0.573 low
9 Guilty as Sin? 0.428 low
10 Who’s Afraid of Little Old Me? 0.338 low
# ℹ 572 more rows
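A sketch of the if_else() call behind the output above. The rows are copied from the printed tibble; the 0.6 cutoff is inferred from it (0.573 prints as "low", 0.72 as "high") and may differ from the slide's actual threshold.

```r
library(dplyr)

tswift <- tibble(
  name   = c("Down Bad", "But Daddy I Love Him"),
  energy = c(0.366, 0.72)
)

# if_else(condition, value_if_true, value_if_false)
tswift |>
  mutate(energy_level = if_else(energy > 0.6, "high", "low"))
```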
- by album
- by long_short
- by release_date
Counting observations by group
# A tibble: 29 × 2
album n
<chr> <int>
1 1989 13
2 1989 (Deluxe) 19
3 1989 (Taylor's Version) 21
4 1989 (Taylor's Version) [Deluxe] 22
5 Fearless (International Version) 16
6 Fearless (Platinum Edition) 19
7 Fearless (Taylor's Version) 26
8 Live From Clear Channel Stripped 2008 8
9 Lover 18
10 Midnights 13
# ℹ 19 more rows
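The counts above come from count(), sketched here with toy rows in place of the real tswift data:

```r
library(dplyr)

tswift <- tibble(album = c("1989", "1989", "Lover"))

# count(album) is shorthand for group_by(album) |> summarise(n = n())
tswift |>
  count(album)
```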
Counting observations by group
# A tibble: 2 × 2
long_short n
<chr> <int>
1 long 180
2 short 402
Other calculations?
Use group_by() and summarise() / summarize()
# A tibble: 29 × 2
album avg_energy
<chr> <dbl>
1 1989 0.699
2 1989 (Deluxe) 0.625
3 1989 (Taylor's Version) 0.662
4 1989 (Taylor's Version) [Deluxe] 0.661
5 Fearless (International Version) 0.636
6 Fearless (Platinum Edition) 0.600
7 Fearless (Taylor's Version) 0.639
8 Live From Clear Channel Stripped 2008 0.599
9 Lover 0.545
10 Midnights 0.409
# ℹ 19 more rows
You can add on a count using n()
# A tibble: 29 × 3
album avg_energy n
<chr> <dbl> <int>
1 1989 0.699 13
2 1989 (Deluxe) 0.625 19
3 1989 (Taylor's Version) 0.662 21
4 1989 (Taylor's Version) [Deluxe] 0.661 22
5 Fearless (International Version) 0.636 16
6 Fearless (Platinum Edition) 0.600 19
7 Fearless (Taylor's Version) 0.639 26
8 Live From Clear Channel Stripped 2008 0.599 8
9 Lover 0.545 18
10 Midnights 0.409 13
# ℹ 19 more rows
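A sketch of the group_by() + summarise() pipeline behind the last two tables, again with invented toy rows; n() counts the rows in each group alongside the mean.

```r
library(dplyr)

tswift <- tibble(
  album  = c("1989", "1989", "Midnights"),
  energy = c(0.72, 0.68, 0.41)
)

# One row per album: the mean energy plus the number of songs.
tswift |>
  group_by(album) |>
  summarise(avg_energy = mean(energy), n = n())
```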
Group by multiple variables
# A tibble: 56 × 3
# Groups: album [29]
album long_short med_pop
<chr> <chr> <dbl>
1 1989 long 42
2 1989 short 53
3 1989 (Deluxe) long 55
4 1989 (Deluxe) short 52
5 1989 (Taylor's Version) long 69.5
6 1989 (Taylor's Version) short 73
7 1989 (Taylor's Version) [Deluxe] long 62
8 1989 (Taylor's Version) [Deluxe] short 62.5
9 Fearless (International Version) long 30
10 Fearless (International Version) short 31
# ℹ 46 more rows
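Grouping by two variables gives one summary row per combination. A sketch with toy rows; "popularity" is an assumed column name here, since the slides only show the summarised med_pop result.

```r
library(dplyr)

tswift <- tibble(
  album      = c("1989", "1989", "1989"),
  long_short = c("long", "short", "short"),
  popularity = c(42, 50, 56)
)

# One row per (album, long_short) pair.
tswift |>
  group_by(album, long_short) |>
  summarise(med_pop = median(popularity))
```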
Important: If the column you are basing your calculation on includes NA, your result will be NA
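The usual fix is na.rm = TRUE, which drops the missing values before the calculation. A minimal sketch:

```r
library(dplyr)

tswift <- tibble(
  album  = c("1989", "1989"),
  energy = c(0.7, NA)
)

# Without na.rm, any group containing NA summarises to NA:
tswift |> group_by(album) |> summarise(avg_energy = mean(energy))

# Dropping the NAs first gives the mean of the remaining values:
tswift |> group_by(album) |> summarise(avg_energy = mean(energy, na.rm = TRUE))
```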
group_by() and mutate()
Adds a new variable to tswift
# A tibble: 582 × 4
# Groups: album [29]
name album duration_ms total_duration
<chr> <chr> <dbl> <dbl>
1 Welcome To New York 1989 212600 2927864
2 Blank Space 1989 231826 2927864
3 Style 1989 231000 2927864
4 Out Of The Woods 1989 235800 2927864
5 All You Had To Do Was Stay 1989 193293 2927864
6 Shake It Off 1989 219200 2927864
7 I Wish You Would 1989 207440 2927864
8 Bad Blood 1989 211933 2927864
9 Wildest Dreams 1989 220440 2927864
10 How You Get The Girl 1989 247533 2927864
11 This Love 1989 250093 2927864
12 I Know Places 1989 195706 2927864
13 Clean 1989 271000 2927864
14 Welcome To New York 1989 (Delu… 212600 4125648
15 Blank Space 1989 (Delu… 231826 4125648
16 Style 1989 (Delu… 231000 4125648
17 Out Of The Woods 1989 (Delu… 235800 4125648
18 All You Had To Do Was Stay 1989 (Delu… 193293 4125648
19 Shake It Off 1989 (Delu… 219200 4125648
20 I Wish You Would 1989 (Delu… 207440 4125648
21 Bad Blood 1989 (Delu… 211933 4125648
22 Wildest Dreams 1989 (Delu… 220440 4125648
23 How You Get The Girl 1989 (Delu… 247533 4125648
24 This Love 1989 (Delu… 250093 4125648
25 I Know Places 1989 (Delu… 195706 4125648
26 Clean 1989 (Delu… 271000 4125648
27 Wonderland 1989 (Delu… 245560 4125648
28 You Are In Love 1989 (Delu… 267106 4125648
29 New Romantics 1989 (Delu… 230466 4125648
30 I Know Places - Voice Memo 1989 (Delu… 216333 4125648
31 I Wish You Would - Voice Memo 1989 (Delu… 107133 4125648
32 Blank Space - Voice Memo 1989 (Delu… 131186 4125648
33 Welcome To New York (Taylor's Version) 1989 (Tayl… 212600 4678327
34 Blank Space (Taylor's Version) 1989 (Tayl… 231833 4678327
35 Style (Taylor's Version) 1989 (Tayl… 231000 4678327
# ℹ 547 more rows
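With group_by() + mutate(), the group-wise sum is attached to every row instead of collapsing the data. A sketch with toy rows:

```r
library(dplyr)

tswift <- tibble(
  name        = c("Style", "Clean", "Lover"),
  album       = c("1989", "1989", "Lover"),
  duration_ms = c(231000, 271000, 221306)
)

# Every song keeps its row; total_duration repeats per album.
tswift |>
  group_by(album) |>
  mutate(total_duration = sum(duration_ms))
```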
Adds two new variables to tswift
# A tibble: 582 × 5
# Groups: album [29]
name album duration_ms total_duration percent_duration
<chr> <chr> <dbl> <dbl> <dbl>
1 Welcome To New York 1989 212600 2927864 0.0726
2 Blank Space 1989 231826 2927864 0.0792
3 Style 1989 231000 2927864 0.0789
4 Out Of The Woods 1989 235800 2927864 0.0805
5 All You Had To Do Was Stay 1989 193293 2927864 0.0660
6 Shake It Off 1989 219200 2927864 0.0749
7 I Wish You Would 1989 207440 2927864 0.0709
8 Bad Blood 1989 211933 2927864 0.0724
9 Wildest Dreams 1989 220440 2927864 0.0753
10 How You Get The Girl 1989 247533 2927864 0.0845
11 This Love 1989 250093 2927864 0.0854
12 I Know Places 1989 195706 2927864 0.0668
13 Clean 1989 271000 2927864 0.0926
14 Welcome To New York 1989… 212600 4125648 0.0515
15 Blank Space 1989… 231826 4125648 0.0562
16 Style 1989… 231000 4125648 0.0560
17 Out Of The Woods 1989… 235800 4125648 0.0572
18 All You Had To Do Was Stay 1989… 193293 4125648 0.0469
19 Shake It Off 1989… 219200 4125648 0.0531
20 I Wish You Would 1989… 207440 4125648 0.0503
21 Bad Blood 1989… 211933 4125648 0.0514
22 Wildest Dreams 1989… 220440 4125648 0.0534
23 How You Get The Girl 1989… 247533 4125648 0.0600
24 This Love 1989… 250093 4125648 0.0606
25 I Know Places 1989… 195706 4125648 0.0474
26 Clean 1989… 271000 4125648 0.0657
27 Wonderland 1989… 245560 4125648 0.0595
28 You Are In Love 1989… 267106 4125648 0.0647
29 New Romantics 1989… 230466 4125648 0.0559
30 I Know Places - Voice Memo 1989… 216333 4125648 0.0524
31 I Wish You Would - Voice M… 1989… 107133 4125648 0.0260
32 Blank Space - Voice Memo 1989… 131186 4125648 0.0318
33 Welcome To New York (Taylo… 1989… 212600 4678327 0.0454
34 Blank Space (Taylor's Vers… 1989… 231833 4678327 0.0496
35 Style (Taylor's Version) 1989… 231000 4678327 0.0494
# ℹ 547 more rows
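The second variable can reuse the first inside the same mutate() call. A sketch of the pipeline behind the output above, with two toy rows:

```r
library(dplyr)

tswift <- tibble(
  album       = c("1989", "1989"),
  duration_ms = c(212600, 231826)
)

# percent_duration: each song's share of its album's total run time.
tswift |>
  group_by(album) |>
  mutate(total_duration   = sum(duration_ms),
         percent_duration = duration_ms / total_duration)
```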
Task
Alter the tswift data frame to add a variable that calculates average acousticness by album (without collapsing).
Bonus: Can you determine if the song “Cruel Summer” is more or less acoustic than the Lover album average?

Is this data frame tidy?
# A tibble: 5 × 4
region cheese milk lactose
<chr> <dbl> <dbl> <dbl>
1 Mexico 110819 224890 22054
2 South Korea 43628 1945 8957
3 Japan 37428 1099 27981
4 Australia and NZ 22705 2045 37925
5 Canada 13343 1825 4184
What about now?
# A tibble: 15 × 3
region category metric_tons
<chr> <chr> <dbl>
1 Mexico cheese 110819
2 Mexico milk 224890
3 Mexico lactose 22054
4 South Korea cheese 43628
5 South Korea milk 1945
6 South Korea lactose 8957
7 Japan cheese 37428
8 Japan milk 1099
9 Japan lactose 27981
10 Australia and NZ cheese 22705
11 Australia and NZ milk 2045
12 Australia and NZ lactose 37925
13 Canada cheese 13343
14 Canada milk 1825
15 Canada lactose 4184
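The long version above can be produced from the wide one with pivot_longer(). A sketch using a two-row stand-in for the dairy data:

```r
library(tidyr)
library(dplyr)

dairy_wide <- tibble(
  region  = c("Mexico", "Canada"),
  cheese  = c(110819, 13343),
  milk    = c(224890, 1825),
  lactose = c(22054, 4184)
)

# The three product columns become a category/value pair of columns.
dairy_wide |>
  pivot_longer(cols = cheese:lactose,
               names_to  = "category",
               values_to = "metric_tons")
```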
dairy_wide
# A tibble: 5 × 4
region cheese milk lactose
<chr> <dbl> <dbl> <dbl>
1 Mexico 110819 224890 22054
2 South Korea 43628 1945 8957
3 Japan 37428 1099 27981
4 Australia and NZ 22705 2045 37925
5 Canada 13343 1825 4184
dairy_long
# A tibble: 15 × 3
region category metric_tons
<chr> <chr> <dbl>
1 Mexico cheese 110819
2 Mexico milk 224890
3 Mexico lactose 22054
4 South Korea cheese 43628
5 South Korea milk 1945
6 South Korea lactose 8957
7 Japan cheese 37428
8 Japan milk 1099
9 Japan lactose 27981
10 Australia and NZ cheese 22705
11 Australia and NZ milk 2045
12 Australia and NZ lactose 37925
13 Canada cheese 13343
14 Canada milk 1825
15 Canada lactose 4184
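Going the other direction, pivot_wider() turns each category into its own column. A sketch with a one-region stand-in for dairy_long:

```r
library(tidyr)
library(dplyr)

dairy_long <- tibble(
  region      = c("Mexico", "Mexico", "Mexico"),
  category    = c("cheese", "milk", "lactose"),
  metric_tons = c(110819, 224890, 22054)
)

# One column per category, one row per region.
dairy_long |>
  pivot_wider(names_from = category, values_from = metric_tons)
```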
Task
Pivot the nat_parks data frame longer so that year and visitors each make a column.
Hint: Pivot the year columns only. To specify them, you can use either cols = start:stop or cols = -c(column1, column2).
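The hint's structure can be sketched like this. The nat_parks columns aren't shown on the slides, so the "park" id column, the year columns, and the visitor counts below are all invented placeholders:

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for nat_parks; only the shape matters here.
nat_parks <- tibble(
  park   = c("Acadia", "Zion"),
  `2019` = c(100, 300),
  `2020` = c(200, 400)
)

# cols = -park keeps park as an id column and pivots everything else.
nat_parks |>
  pivot_longer(cols = -park,
               names_to  = "year",
               values_to = "visitors")
```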
Hint
# A tibble: 5 × 4
region cheese milk lactose
<chr> <dbl> <dbl> <dbl>
1 Mexico 110819 224890 22054
2 South Korea 43628 1945 8957
3 Japan 37428 1099 27981
4 Australia and NZ 22705 2045 37925
5 Canada 13343 1825 4184
Try the extended exercises!
Come next week!
Email askdata@duke.edu
Cheatsheets