Learn R Workshop Part II

Data Wrangling and Tidying

McCall Pitcher
Center for Data and Visualization Sciences

September 18, 2025

Questions we will answer today


00. Quick review

01. How do I aggregate by collapsing?

02. How do I aggregate without collapsing?

03. How do I tidy data?

00.
Quick review

Quick review

  • Load {tidyverse} at the top of your coding notebook
# load tidyverse
library(tidyverse)


  • Import data using read_csv()
# load data
tswift <- read_csv("data/taylor_swift_spotify.csv")
nat_parks <- read_csv("data/nat_parks_visitors.csv")

Quick review

  • Four key {dplyr} verbs
Task {dplyr} verb
Subset rows filter()
Subset columns select()
Sort arrange()
Create a new variable mutate()


  • “Pipe” your data frame to these verbs using  |>   or    %>%

Quick review

tswift_dur <- tswift |> 
  select(name, album, duration_ms) |> 
  mutate(duration_min = duration_ms / 60000)
  
tswift_dur
# A tibble: 582 × 4
   name                                      album      duration_ms duration_min
   <chr>                                     <chr>            <dbl>        <dbl>
 1 Fortnight (feat. Post Malone)             THE TORTU…      228965         3.82
 2 The Tortured Poets Department             THE TORTU…      293048         4.88
 3 My Boy Only Breaks His Favorite Toys      THE TORTU…      203801         3.40
 4 Down Bad                                  THE TORTU…      261228         4.35
 5 So Long, London                           THE TORTU…      262974         4.38
 6 But Daddy I Love Him                      THE TORTU…      340428         5.67
 7 Fresh Out The Slammer                     THE TORTU…      210789         3.51
 8 Florida!!! (feat. Florence + The Machine) THE TORTU…      215463         3.59
 9 Guilty as Sin?                            THE TORTU…      254365         4.24
10 Who’s Afraid of Little Old Me?            THE TORTU…      334084         5.57
# ℹ 572 more rows

Introducing if_else()

Syntax:

if_else(condition, true, false)


In action:

tswift <- tswift |> 
  mutate(energy_level = if_else(energy > .6, "high", "low"))
# A tibble: 582 × 3
   name                                      energy energy_level
   <chr>                                      <dbl> <chr>       
 1 Fortnight (feat. Post Malone)              0.386 low         
 2 The Tortured Poets Department              0.428 low         
 3 My Boy Only Breaks His Favorite Toys       0.563 low         
 4 Down Bad                                   0.366 low         
 5 So Long, London                            0.533 low         
 6 But Daddy I Love Him                       0.72  high        
 7 Fresh Out The Slammer                      0.483 low         
 8 Florida!!! (feat. Florence + The Machine)  0.573 low         
 9 Guilty as Sin?                             0.428 low         
10 Who’s Afraid of Little Old Me?             0.338 low         
# ℹ 572 more rows

00. Your Turn!

02:00

Task

Using an if_else() statement, create a new variable in tswift that indicates if a song is “long” or “short”. Name the variable long_short.


Songs are considered “long” if duration_ms is greater than 250000.

Hint

tswift <- tswift |> 
  mutate(energy_level = if_else(energy > .6,
                               "high",
                               "low"))

01.
How do I aggregate by collapsing?

How do I aggregate by collapsing?

  • Aggregation is gathering data to a higher level
  • Summarizes your data
  • What levels might we want to aggregate to in Taylor Swift’s data?
by album
by long_short
by release_date

How do I aggregate by collapsing?

Counting observations by group

# by album
tswift |> 
  count(album)
# A tibble: 29 × 2
   album                                     n
   <chr>                                 <int>
 1 1989                                     13
 2 1989 (Deluxe)                            19
 3 1989 (Taylor's Version)                  21
 4 1989 (Taylor's Version) [Deluxe]         22
 5 Fearless (International Version)         16
 6 Fearless (Platinum Edition)              19
 7 Fearless (Taylor's Version)              26
 8 Live From Clear Channel Stripped 2008     8
 9 Lover                                    18
10 Midnights                                13
# ℹ 19 more rows

How do I aggregate by collapsing?

Counting observations by group

# by long_short
tswift |> 
  count(long_short)
# A tibble: 2 × 2
  long_short     n
  <chr>      <int>
1 long         180
2 short        402

How do I aggregate by collapsing?

Other calculations?
Use group_by() and summarise()  /  summarize()

# average energy, by album
tswift |> 
  group_by(album) |> 
  summarise(avg_energy = mean(energy))
# A tibble: 29 × 2
   album                                 avg_energy
   <chr>                                      <dbl>
 1 1989                                       0.699
 2 1989 (Deluxe)                              0.625
 3 1989 (Taylor's Version)                    0.662
 4 1989 (Taylor's Version) [Deluxe]           0.661
 5 Fearless (International Version)           0.636
 6 Fearless (Platinum Edition)                0.600
 7 Fearless (Taylor's Version)                0.639
 8 Live From Clear Channel Stripped 2008      0.599
 9 Lover                                      0.545
10 Midnights                                  0.409
# ℹ 19 more rows

How do I aggregate by collapsing?

You can add on a count using n()

# average energy, by album
tswift |> 
  group_by(album) |> 
  summarise(avg_energy = mean(energy),
            n = n())
# A tibble: 29 × 3
   album                                 avg_energy     n
   <chr>                                      <dbl> <int>
 1 1989                                       0.699    13
 2 1989 (Deluxe)                              0.625    19
 3 1989 (Taylor's Version)                    0.662    21
 4 1989 (Taylor's Version) [Deluxe]           0.661    22
 5 Fearless (International Version)           0.636    16
 6 Fearless (Platinum Edition)                0.600    19
 7 Fearless (Taylor's Version)                0.639    26
 8 Live From Clear Channel Stripped 2008      0.599     8
 9 Lover                                      0.545    18
10 Midnights                                  0.409    13
# ℹ 19 more rows

How do I aggregate by collapsing?

Group by multiple variables

# median popularity, by album and long_short
tswift |> 
  group_by(album, long_short) |> 
  summarise(med_pop = median(popularity))
# A tibble: 56 × 3
# Groups:   album [29]
   album                            long_short med_pop
   <chr>                            <chr>        <dbl>
 1 1989                             long          42  
 2 1989                             short         53  
 3 1989 (Deluxe)                    long          55  
 4 1989 (Deluxe)                    short         52  
 5 1989 (Taylor's Version)          long          69.5
 6 1989 (Taylor's Version)          short         73  
 7 1989 (Taylor's Version) [Deluxe] long          62  
 8 1989 (Taylor's Version) [Deluxe] short         62.5
 9 Fearless (International Version) long          30  
10 Fearless (International Version) short         31  
# ℹ 46 more rows

How do I aggregate by collapsing?

Important: If the column you are basing your calculation on includes NA, your result will be NA

cereal_data
# A tibble: 5 × 3
  cereal         brand         bowls
  <chr>          <chr>         <dbl>
1 Cheerios       General Mills     3
2 Lucky Charms   General Mills     3
3 Frosted Flakes Kellogg's        NA
4 Fruit Loops    Kellogg's         4
5 Cocoa Krispies Kellogg's         1
# total bowls by brand
cereal_data |> 
 group_by(brand) |> 
 summarise(total = sum(bowls))
# A tibble: 2 × 2
  brand         total
  <chr>         <dbl>
1 General Mills     6
2 Kellogg's        NA

How do I aggregate by collapsing?

Important: If the column you are basing your calculation on includes NA, your result will be NA

cereal_data
# A tibble: 5 × 3
  cereal         brand         bowls
  <chr>          <chr>         <dbl>
1 Cheerios       General Mills     3
2 Lucky Charms   General Mills     3
3 Frosted Flakes Kellogg's        NA
4 Fruit Loops    Kellogg's         4
5 Cocoa Krispies Kellogg's         1
# total bowls by brand, excluding NA
cereal_data |> 
 group_by(brand) |> 
 summarise(total = sum(bowls, 
                       na.rm = TRUE))
# A tibble: 2 × 2
  brand         total
  <chr>         <dbl>
1 General Mills     6
2 Kellogg's         5

01. Your Turn!

01:30

Task

Using group_by() and summarise(), calculate average danceability by long_short.


On average, are Taylor Swift’s longer or shorter songs more “danceable”?

Hint

# average energy, by album
tswift |> 
  group_by(album) |> 
  summarise(avg_energy = mean(energy))

02.
How do I aggregate without collapsing?

How do I aggregate w/o collapsing?

  • Sometimes we need to aggregate, but don’t want to lose the granularity of our data
  • Use group_by() and mutate()
  • Performs the calculation by group, but keeps the same number of observations

How do I aggregate w/o collapsing?

Adds a new variable to tswift

# total duration, by album
tswift <- tswift |> 
  group_by(album) |> 
  mutate(total_duration = sum(duration_ms))
# A tibble: 582 × 4
# Groups:   album [29]
   name                                   album       duration_ms total_duration
   <chr>                                  <chr>             <dbl>          <dbl>
 1 Welcome To New York                    1989             212600        2927864
 2 Blank Space                            1989             231826        2927864
 3 Style                                  1989             231000        2927864
 4 Out Of The Woods                       1989             235800        2927864
 5 All You Had To Do Was Stay             1989             193293        2927864
 6 Shake It Off                           1989             219200        2927864
 7 I Wish You Would                       1989             207440        2927864
 8 Bad Blood                              1989             211933        2927864
 9 Wildest Dreams                         1989             220440        2927864
10 How You Get The Girl                   1989             247533        2927864
11 This Love                              1989             250093        2927864
12 I Know Places                          1989             195706        2927864
13 Clean                                  1989             271000        2927864
14 Welcome To New York                    1989 (Delu…      212600        4125648
15 Blank Space                            1989 (Delu…      231826        4125648
16 Style                                  1989 (Delu…      231000        4125648
17 Out Of The Woods                       1989 (Delu…      235800        4125648
18 All You Had To Do Was Stay             1989 (Delu…      193293        4125648
19 Shake It Off                           1989 (Delu…      219200        4125648
20 I Wish You Would                       1989 (Delu…      207440        4125648
21 Bad Blood                              1989 (Delu…      211933        4125648
22 Wildest Dreams                         1989 (Delu…      220440        4125648
23 How You Get The Girl                   1989 (Delu…      247533        4125648
24 This Love                              1989 (Delu…      250093        4125648
25 I Know Places                          1989 (Delu…      195706        4125648
26 Clean                                  1989 (Delu…      271000        4125648
27 Wonderland                             1989 (Delu…      245560        4125648
28 You Are In Love                        1989 (Delu…      267106        4125648
29 New Romantics                          1989 (Delu…      230466        4125648
30 I Know Places - Voice Memo             1989 (Delu…      216333        4125648
31 I Wish You Would - Voice Memo          1989 (Delu…      107133        4125648
32 Blank Space - Voice Memo               1989 (Delu…      131186        4125648
33 Welcome To New York (Taylor's Version) 1989 (Tayl…      212600        4678327
34 Blank Space (Taylor's Version)         1989 (Tayl…      231833        4678327
35 Style (Taylor's Version)               1989 (Tayl…      231000        4678327
# ℹ 547 more rows

How do I aggregate w/o collapsing?

Adds two new variables to tswift

# each song's share of total album
tswift <- tswift |> 
  group_by(album) |> 
  mutate(total_duration = sum(duration_ms),
         percent_duration = duration_ms / total_duration)
# A tibble: 582 × 5
# Groups:   album [29]
   name                        album duration_ms total_duration percent_duration
   <chr>                       <chr>       <dbl>          <dbl>            <dbl>
 1 Welcome To New York         1989       212600        2927864           0.0726
 2 Blank Space                 1989       231826        2927864           0.0792
 3 Style                       1989       231000        2927864           0.0789
 4 Out Of The Woods            1989       235800        2927864           0.0805
 5 All You Had To Do Was Stay  1989       193293        2927864           0.0660
 6 Shake It Off                1989       219200        2927864           0.0749
 7 I Wish You Would            1989       207440        2927864           0.0709
 8 Bad Blood                   1989       211933        2927864           0.0724
 9 Wildest Dreams              1989       220440        2927864           0.0753
10 How You Get The Girl        1989       247533        2927864           0.0845
11 This Love                   1989       250093        2927864           0.0854
12 I Know Places               1989       195706        2927864           0.0668
13 Clean                       1989       271000        2927864           0.0926
14 Welcome To New York         1989…      212600        4125648           0.0515
15 Blank Space                 1989…      231826        4125648           0.0562
16 Style                       1989…      231000        4125648           0.0560
17 Out Of The Woods            1989…      235800        4125648           0.0572
18 All You Had To Do Was Stay  1989…      193293        4125648           0.0469
19 Shake It Off                1989…      219200        4125648           0.0531
20 I Wish You Would            1989…      207440        4125648           0.0503
21 Bad Blood                   1989…      211933        4125648           0.0514
22 Wildest Dreams              1989…      220440        4125648           0.0534
23 How You Get The Girl        1989…      247533        4125648           0.0600
24 This Love                   1989…      250093        4125648           0.0606
25 I Know Places               1989…      195706        4125648           0.0474
26 Clean                       1989…      271000        4125648           0.0657
27 Wonderland                  1989…      245560        4125648           0.0595
28 You Are In Love             1989…      267106        4125648           0.0647
29 New Romantics               1989…      230466        4125648           0.0559
30 I Know Places - Voice Memo  1989…      216333        4125648           0.0524
31 I Wish You Would - Voice M… 1989…      107133        4125648           0.0260
32 Blank Space - Voice Memo    1989…      131186        4125648           0.0318
33 Welcome To New York (Taylo… 1989…      212600        4678327           0.0454
34 Blank Space (Taylor's Vers… 1989…      231833        4678327           0.0496
35 Style (Taylor's Version)    1989…      231000        4678327           0.0494
# ℹ 547 more rows

02. Your Turn!

02:30

Task

Alter the tswift data frame to add a variable that calculates average acousticness by album (without collapsing).

Bonus: Can you determine if the song “Cruel Summer” is more or less acoustic than the Lover album average?

Hint

# average duration, by album
tswift <- tswift |> 
  group_by(album) |> 
  mutate(avg_duration = mean(duration_ms))

03.
How do I tidy data?

How do I tidy data?

  • Same data can be stored in multiple structures/shapes
  • {dplyr} and {gpplot2} are designed to work with tidy data
  • What is tidy data?
    • Variables are columns
    • Observations are rows
    • Values are cells

R for Data Science (2e)

How do I tidy data?

Is this data frame tidy?

# A tibble: 5 × 4
  region           cheese   milk lactose
  <chr>             <dbl>  <dbl>   <dbl>
1 Mexico           110819 224890   22054
2 South Korea       43628   1945    8957
3 Japan             37428   1099   27981
4 Australia and NZ  22705   2045   37925
5 Canada            13343   1825    4184

What about now?

# A tibble: 15 × 3
   region           category metric_tons
   <chr>            <chr>          <dbl>
 1 Mexico           cheese        110819
 2 Mexico           milk          224890
 3 Mexico           lactose        22054
 4 South Korea      cheese         43628
 5 South Korea      milk            1945
 6 South Korea      lactose         8957
 7 Japan            cheese         37428
 8 Japan            milk            1099
 9 Japan            lactose        27981
10 Australia and NZ cheese         22705
11 Australia and NZ milk            2045
12 Australia and NZ lactose        37925
13 Canada           cheese         13343
14 Canada           milk            1825
15 Canada           lactose         4184

How do I tidy data?

dairy_long <- dairy_wide |>
  pivot_longer(cols = c(cheese, milk, lactose),
               names_to = "category",
               values_to = "metric_tons")

dairy_wide

# A tibble: 5 × 4
  region           cheese   milk lactose
  <chr>             <dbl>  <dbl>   <dbl>
1 Mexico           110819 224890   22054
2 South Korea       43628   1945    8957
3 Japan             37428   1099   27981
4 Australia and NZ  22705   2045   37925
5 Canada            13343   1825    4184

dairy_long

# A tibble: 15 × 3
   region           category metric_tons
   <chr>            <chr>          <dbl>
 1 Mexico           cheese        110819
 2 Mexico           milk          224890
 3 Mexico           lactose        22054
 4 South Korea      cheese         43628
 5 South Korea      milk            1945
 6 South Korea      lactose         8957
 7 Japan            cheese         37428
 8 Japan            milk            1099
 9 Japan            lactose        27981
10 Australia and NZ cheese         22705
11 Australia and NZ milk            2045
12 Australia and NZ lactose        37925
13 Canada           cheese         13343
14 Canada           milk            1825
15 Canada           lactose         4184

How do I tidy data?

dairy_long <- dairy_wide |>
  pivot_longer(cols = 2:4,
               names_to = "category",
               values_to = "metric_tons")

dairy_wide

# A tibble: 5 × 4
  region           cheese   milk lactose
  <chr>             <dbl>  <dbl>   <dbl>
1 Mexico           110819 224890   22054
2 South Korea       43628   1945    8957
3 Japan             37428   1099   27981
4 Australia and NZ  22705   2045   37925
5 Canada            13343   1825    4184

dairy_long

# A tibble: 15 × 3
   region           category metric_tons
   <chr>            <chr>          <dbl>
 1 Mexico           cheese        110819
 2 Mexico           milk          224890
 3 Mexico           lactose        22054
 4 South Korea      cheese         43628
 5 South Korea      milk            1945
 6 South Korea      lactose         8957
 7 Japan            cheese         37428
 8 Japan            milk            1099
 9 Japan            lactose        27981
10 Australia and NZ cheese         22705
11 Australia and NZ milk            2045
12 Australia and NZ lactose        37925
13 Canada           cheese         13343
14 Canada           milk            1825
15 Canada           lactose         4184

How do I tidy data?

dairy_long <- dairy_wide |>
  pivot_longer(cols = -region,
               names_to = "category",
               values_to = "metric_tons")

dairy_wide

# A tibble: 5 × 4
  region           cheese   milk lactose
  <chr>             <dbl>  <dbl>   <dbl>
1 Mexico           110819 224890   22054
2 South Korea       43628   1945    8957
3 Japan             37428   1099   27981
4 Australia and NZ  22705   2045   37925
5 Canada            13343   1825    4184

dairy_long

# A tibble: 15 × 3
   region           category metric_tons
   <chr>            <chr>          <dbl>
 1 Mexico           cheese        110819
 2 Mexico           milk          224890
 3 Mexico           lactose        22054
 4 South Korea      cheese         43628
 5 South Korea      milk            1945
 6 South Korea      lactose         8957
 7 Japan            cheese         37428
 8 Japan            milk            1099
 9 Japan            lactose        27981
10 Australia and NZ cheese         22705
11 Australia and NZ milk            2045
12 Australia and NZ lactose        37925
13 Canada           cheese         13343
14 Canada           milk            1825
15 Canada           lactose         4184

03. Your Turn!

03:00

Task

Pivot the nat_parks data frame longer so that year and visitors each make a column.

Hint: Pivot the year columns only. To specify them, you can use either of these structures
cols = start:stop
cols = -c(column1, column2)

Hint

# pivot dairy data
dairy_long <- dairy_wide |>
  pivot_longer(cols = 2:4,
               names_to = "category",
               values_to = "metric_tons")
# A tibble: 5 × 4
  region           cheese   milk lactose
  <chr>             <dbl>  <dbl>   <dbl>
1 Mexico           110819 224890   22054
2 South Korea       43628   1945    8957
3 Japan             37428   1099   27981
4 Australia and NZ  22705   2045   37925
5 Canada            13343   1825    4184

Resources for further learning