Learn R:

Data Wrangling and Tidying

McCall Pitcher
Center for Data and Visualization Sciences

January 23, 2026

Questions we will answer today

00. Quick review

01. How do I aggregate by collapsing?

02. How do I aggregate without collapsing?

03. How do I tidy data?

duke.is/LearnR-2-Spring2026

00.
Quick review

Quick review

Load {tidyverse} at the top of your coding notebook

# load tidyverse
library(tidyverse)

Import data using read_csv()

# load data
bowie <- read_csv("data/david_bowie_spotify.csv")
nat_parks <- read_csv("data/nat_parks_visitors.csv")

Quick review

Four key {dplyr} verbs

Task	{dplyr} verb
Subset rows	`filter()`
Subset columns	`select()`
Sort	`arrange()`
Create a new variable	`mutate()`

“Pipe” your data frame to these verbs using |> or %>%

Quick review

bowie_length <- bowie |> 
  select(track, album, duration_sec) |> 
  mutate(duration_min = duration_sec / 60)
  
bowie_length

# A tibble: 276 × 4
   track                 album       duration_sec duration_min
   <chr>                 <chr>              <dbl>        <dbl>
 1 Come And Buy My Toys  David Bowie          129         2.15
 2 When I Live My Dream  David Bowie          204         3.4 
 3 Love You Till Tuesday David Bowie          194         3.23
 4 Uncle Arthur          David Bowie          131         2.18
 5 Rubber Band           David Bowie          139         2.32
 6 Sell Me A Coat        David Bowie          182         3.03
 7 There Is A Happy Land David Bowie          197         3.28
 8 She's Got Medals      David Bowie          146         2.43
 9 We Are Hungry Men     David Bowie          180         3   
10 Maid Of Bond Street   David Bowie          105         1.75
# ℹ 266 more rows

Introducing `if_else()`

Syntax:

if_else(condition, true, false)

In action:

bowie <- bowie |> 
  mutate(energy_level = if_else(energy > .5, "high", "low"))

# A tibble: 276 × 3
   track                                energy energy_level
   <chr>                                 <dbl> <chr>       
 1 Come And Buy My Toys                  0.183 low         
 2 When I Live My Dream                  0.19  low         
 3 Love You Till Tuesday                 0.346 low         
 4 Uncle Arthur                          0.337 low         
 5 Rubber Band                           0.312 low         
 6 Sell Me A Coat                        0.256 low         
 7 There Is A Happy Land                 0.338 low         
 8 She's Got Medals                      0.417 low         
 9 We Are Hungry Men                     0.365 low         
10 Maid Of Bond Street                   0.326 low         
11 Please Mr. Gravedigger                0.523 high        
12 Join The Gang                         0.481 low         
13 Space Oddity                          0.403 low         
14 Letter to Hermione                    0.264 low         
15 God Knows I'm Good                    0.436 low         
16 Unwashed and Somewhat Slightly Dazed  0.755 high        
17 Cygnet Committee                      0.537 high        
18 Memory of a Free Festival             0.396 low         
19 Janine                                0.664 high        
20 An Occasional Dream                   0.341 low         
21 Wild Eyed Boy from Freecloud          0.3   low         
22 The Man Who Sold the World            0.387 low         
23 The Width of a Circle                 0.43  low         
24 All the Madmen                        0.424 low         
25 Black Country Rock                    0.588 high        
26 After All                             0.154 low         
27 Running Gun Blues                     0.404 low         
28 The Supermen                          0.594 high        
29 Saviour Machine                       0.58  high        
30 She Shook Me Cold                     0.531 high        
31 Life on Mars?                         0.384 low         
32 Changes                               0.308 low         
33 Oh! You Pretty Things                 0.277 low         
34 Queen Bitch                           0.637 high        
35 Kooks                                 0.278 low         
# ℹ 241 more rows

00. Your Turn!

02:00

Task

Using an if_else() statement, create a new variable in bowie that indicates if a song is “long” or “short”. Name the variable long_short.

Songs are considered “long” if duration_sec is greater than 250.

Hint

bowie <- bowie |> 
  mutate(energy_level = if_else(energy > .5,
                               "high",
                               "low"))

01.
How do I aggregate by collapsing?

How do I aggregate by collapsing?

Aggregation is gathering data to a higher level

Summarizes your data

What levels might we want to aggregate to in David Bowie’s data?

by album

by year

by time_signature

by key

by long_short

How do I aggregate by collapsing?

Counting observations by group

# by album
bowie |> 
  count(album)

# A tibble: 26 × 2
   album                                                      n
   <chr>                                                  <int>
 1 "\"Heroes\""                                              10
 2 "1. Outside (The Nathan Adler Diaries: A Hyper Cycle)"    19
 3 "Aladdin Sane"                                            10
 4 "Black Tie White Noise"                                   12
 5 "Blackstar"                                                7
 6 "Buddha of Suburbia"                                      10
 7 "David Bowie"                                             12
 8 "David Bowie (aka Space Oddity)"                           9
 9 "Diamond Dogs"                                            11
10 "Earthling"                                                9
# ℹ 16 more rows

How do I aggregate by collapsing?

Counting observations by group

# by long_short
bowie |> 
  count(long_short)

# A tibble: 2 × 2
  long_short     n
  <chr>      <int>
1 long         142
2 short        134

How do I aggregate by collapsing?

Other calculations?
Use group_by() and summarise() / summarize()

# average energy, by album
bowie |> 
  group_by(album) |> 
  summarise(avg_energy = mean(energy))

# A tibble: 26 × 2
   album                                                  avg_energy
   <chr>                                                       <dbl>
 1 "\"Heroes\""                                                0.672
 2 "1. Outside (The Nathan Adler Diaries: A Hyper Cycle)"      0.654
 3 "Aladdin Sane"                                              0.710
 4 "Black Tie White Noise"                                     0.804
 5 "Blackstar"                                                 0.753
 6 "Buddha of Suburbia"                                        0.667
 7 "David Bowie"                                               0.340
 8 "David Bowie (aka Space Oddity)"                            0.455
 9 "Diamond Dogs"                                              0.527
10 "Earthling"                                                 0.835
# ℹ 16 more rows

How do I aggregate by collapsing?

You can add on a count using n()

# average energy, by album
bowie |> 
  group_by(album) |> 
  summarise(avg_energy = mean(energy),
            n = n())

# A tibble: 26 × 3
   album                                                  avg_energy     n
   <chr>                                                       <dbl> <int>
 1 "\"Heroes\""                                                0.672    10
 2 "1. Outside (The Nathan Adler Diaries: A Hyper Cycle)"      0.654    19
 3 "Aladdin Sane"                                              0.710    10
 4 "Black Tie White Noise"                                     0.804    12
 5 "Blackstar"                                                 0.753     7
 6 "Buddha of Suburbia"                                        0.667    10
 7 "David Bowie"                                               0.340    12
 8 "David Bowie (aka Space Oddity)"                            0.455     9
 9 "Diamond Dogs"                                              0.527    11
10 "Earthling"                                                 0.835     9
# ℹ 16 more rows

How do I aggregate by collapsing?

Other calculations too

# average energy, by album
bowie |> 
  group_by(album) |> 
  summarise(avg_energy = mean(energy),
            med_pop = median(popularity),
            n = n())

# A tibble: 26 × 4
   album                                                avg_energy med_pop     n
   <chr>                                                     <dbl>   <dbl> <int>
 1 "\"Heroes\""                                              0.672    35      10
 2 "1. Outside (The Nathan Adler Diaries: A Hyper Cycl…      0.654     9      19
 3 "Aladdin Sane"                                            0.710    43.5    10
 4 "Black Tie White Noise"                                   0.804    21.5    12
 5 "Blackstar"                                               0.753    41       7
 6 "Buddha of Suburbia"                                      0.667    17.5    10
 7 "David Bowie"                                             0.340    25      12
 8 "David Bowie (aka Space Oddity)"                          0.455    34       9
 9 "Diamond Dogs"                                            0.527    37      11
10 "Earthling"                                               0.835    16       9
# ℹ 16 more rows

How do I aggregate by collapsing?

Grouping by two variables

# average energy, by album
bowie |> 
  group_by(album, long_short) |> 
  summarise(avg_energy = mean(energy),
            med_pop = median(popularity),
            n = n())

# A tibble: 47 × 5
# Groups:   album [26]
   album                                     long_short avg_energy med_pop     n
   <chr>                                     <chr>           <dbl>   <dbl> <int>
 1 "\"Heroes\""                              long            0.430    34       3
 2 "\"Heroes\""                              short           0.775    35       7
 3 "1. Outside (The Nathan Adler Diaries: A… long            0.755     9      11
 4 "1. Outside (The Nathan Adler Diaries: A… short           0.516     6.5     8
 5 "Aladdin Sane"                            long            0.731    41       5
 6 "Aladdin Sane"                            short           0.690    47       5
 7 "Black Tie White Noise"                   long            0.823    22      11
 8 "Black Tie White Noise"                   short           0.592    21       1
 9 "Blackstar"                               long            0.753    41       7
10 "Buddha of Suburbia"                      long            0.667    17.5    10
# ℹ 37 more rows

How do I aggregate by collapsing?

Important: If the column you are basing your calculation on includes NA, your result will be NA

cereal_data

# A tibble: 5 × 3
  cereal         brand         bowls
  <chr>          <chr>         <dbl>
1 Cheerios       General Mills     3
2 Lucky Charms   General Mills     3
3 Frosted Flakes Kellogg's        NA
4 Fruit Loops    Kellogg's         4
5 Cocoa Krispies Kellogg's         1

# total bowls by brand
cereal_data |> 
 group_by(brand) |> 
 summarise(total = sum(bowls))

# A tibble: 2 × 2
  brand         total
  <chr>         <dbl>
1 General Mills     6
2 Kellogg's        NA

How do I aggregate by collapsing?

Important: If the column you are basing your calculation on includes NA, your result will be NA

cereal_data

# A tibble: 5 × 3
  cereal         brand         bowls
  <chr>          <chr>         <dbl>
1 Cheerios       General Mills     3
2 Lucky Charms   General Mills     3
3 Frosted Flakes Kellogg's        NA
4 Fruit Loops    Kellogg's         4
5 Cocoa Krispies Kellogg's         1

# total bowls by brand, excluding NA
cereal_data |> 
 group_by(brand) |> 
 summarise(total = sum(bowls, 
                       na.rm = TRUE))

# A tibble: 2 × 2
  brand         total
  <chr>         <dbl>
1 General Mills     6
2 Kellogg's         5

01. Your Turn!

01:30

Task

Using group_by() and summarise(), calculate average danceability by long_short.

On average, are David Bowie’s longer or shorter songs more “danceable”?

Hint

# average energy, by album
bowie |> 
  group_by(album) |> 
  summarise(avg_energy = mean(energy))

02.
How do I aggregate without collapsing?

How do I aggregate w/o collapsing?

Sometimes we need to aggregate, but don’t want to lose the granularity of our data

Use group_by() and mutate()

Performs the calculation by group, but keeps the same number of observations

How do I aggregate w/o collapsing?

Adds a new variable to bowie

# total duration, by album
bowie <- bowie |> 
  group_by(album) |> 
  mutate(total_duration = sum(duration_sec))

# A tibble: 276 × 4
# Groups:   album [26]
   track                                album        duration_sec total_duration
   <chr>                                <chr>               <dbl>          <dbl>
 1 Come And Buy My Toys                 David Bowie           129           1904
 2 When I Live My Dream                 David Bowie           204           1904
 3 Love You Till Tuesday                David Bowie           194           1904
 4 Uncle Arthur                         David Bowie           131           1904
 5 Rubber Band                          David Bowie           139           1904
 6 Sell Me A Coat                       David Bowie           182           1904
 7 There Is A Happy Land                David Bowie           197           1904
 8 She's Got Medals                     David Bowie           146           1904
 9 We Are Hungry Men                    David Bowie           180           1904
10 Maid Of Bond Street                  David Bowie           105           1904
11 Please Mr. Gravedigger               David Bowie           156           1904
12 Join The Gang                        David Bowie           141           1904
13 Space Oddity                         David Bowie…          318           2775
14 Letter to Hermione                   David Bowie…          156           2775
15 God Knows I'm Good                   David Bowie…          202           2775
16 Unwashed and Somewhat Slightly Dazed David Bowie…          412           2775
17 Cygnet Committee                     David Bowie…          579           2775
18 Memory of a Free Festival            David Bowie…          430           2775
19 Janine                               David Bowie…          205           2775
20 An Occasional Dream                  David Bowie…          180           2775
21 Wild Eyed Boy from Freecloud         David Bowie…          293           2775
22 The Man Who Sold the World           The Man Who…          241           2475
23 The Width of a Circle                The Man Who…          492           2475
24 All the Madmen                       The Man Who…          343           2475
25 Black Country Rock                   The Man Who…          216           2475
26 After All                            The Man Who…          240           2475
27 Running Gun Blues                    The Man Who…          195           2475
28 The Supermen                         The Man Who…          221           2475
29 Saviour Machine                      The Man Who…          270           2475
30 She Shook Me Cold                    The Man Who…          257           2475
31 Life on Mars?                        Hunky Dory            235           2505
32 Changes                              Hunky Dory            217           2505
33 Oh! You Pretty Things                Hunky Dory            193           2505
34 Queen Bitch                          Hunky Dory            200           2505
35 Kooks                                Hunky Dory            173           2505
# ℹ 241 more rows

How do I aggregate w/o collapsing?

Adds two new variables to bowie

# each song's share of total album
bowie <- bowie |> 
  group_by(album) |> 
  mutate(total_duration = sum(duration_sec),
         percent_duration = duration_sec / total_duration)

# A tibble: 276 × 5
# Groups:   album [26]
   track                      album duration_sec total_duration percent_duration
   <chr>                      <chr>        <dbl>          <dbl>            <dbl>
 1 Come And Buy My Toys       Davi…          129           1904           0.0678
 2 When I Live My Dream       Davi…          204           1904           0.107 
 3 Love You Till Tuesday      Davi…          194           1904           0.102 
 4 Uncle Arthur               Davi…          131           1904           0.0688
 5 Rubber Band                Davi…          139           1904           0.0730
 6 Sell Me A Coat             Davi…          182           1904           0.0956
 7 There Is A Happy Land      Davi…          197           1904           0.103 
 8 She's Got Medals           Davi…          146           1904           0.0767
 9 We Are Hungry Men          Davi…          180           1904           0.0945
10 Maid Of Bond Street        Davi…          105           1904           0.0551
11 Please Mr. Gravedigger     Davi…          156           1904           0.0819
12 Join The Gang              Davi…          141           1904           0.0741
13 Space Oddity               Davi…          318           2775           0.115 
14 Letter to Hermione         Davi…          156           2775           0.0562
15 God Knows I'm Good         Davi…          202           2775           0.0728
16 Unwashed and Somewhat Sli… Davi…          412           2775           0.148 
17 Cygnet Committee           Davi…          579           2775           0.209 
18 Memory of a Free Festival  Davi…          430           2775           0.155 
19 Janine                     Davi…          205           2775           0.0739
20 An Occasional Dream        Davi…          180           2775           0.0649
21 Wild Eyed Boy from Freecl… Davi…          293           2775           0.106 
22 The Man Who Sold the World The …          241           2475           0.0974
23 The Width of a Circle      The …          492           2475           0.199 
24 All the Madmen             The …          343           2475           0.139 
25 Black Country Rock         The …          216           2475           0.0873
26 After All                  The …          240           2475           0.0970
27 Running Gun Blues          The …          195           2475           0.0788
28 The Supermen               The …          221           2475           0.0893
29 Saviour Machine            The …          270           2475           0.109 
30 She Shook Me Cold          The …          257           2475           0.104 
31 Life on Mars?              Hunk…          235           2505           0.0938
32 Changes                    Hunk…          217           2505           0.0866
33 Oh! You Pretty Things      Hunk…          193           2505           0.0770
34 Queen Bitch                Hunk…          200           2505           0.0798
35 Kooks                      Hunk…          173           2505           0.0691
# ℹ 241 more rows

02. Your Turn!

02:30

Task

Alter the bowie data frame to add a variable that calculates average tempo by album (without collapsing).

Bonus: Can you determine if the tempo of the song “Changes” is higher or lower than the “Hunky Dory” album average?

Hint

# average duration, by album
bowie <- bowie |> 
  group_by(album) |> 
  mutate(avg_duration = mean(duration_sec))

03.
How do I tidy data?

How do I tidy data?

Same data can be stored in multiple structures/shapes

{dplyr} and {gpplot2} are designed to work with tidy data

What is tidy data?
- Variables are columns
- Observations are rows
- Values are cells

How do I tidy data?

Is this data frame tidy?

# A tibble: 5 × 4
  region           cheese   milk lactose
  <chr>             <dbl>  <dbl>   <dbl>
1 Mexico           110819 224890   22054
2 South Korea       43628   1945    8957
3 Japan             37428   1099   27981
4 Australia and NZ  22705   2045   37925
5 Canada            13343   1825    4184

What about now?

# A tibble: 15 × 3
   region           category metric_tons
   <chr>            <chr>          <dbl>
 1 Mexico           cheese        110819
 2 Mexico           milk          224890
 3 Mexico           lactose        22054
 4 South Korea      cheese         43628
 5 South Korea      milk            1945
 6 South Korea      lactose         8957
 7 Japan            cheese         37428
 8 Japan            milk            1099
 9 Japan            lactose        27981
10 Australia and NZ cheese         22705
11 Australia and NZ milk            2045
12 Australia and NZ lactose        37925
13 Canada           cheese         13343
14 Canada           milk            1825
15 Canada           lactose         4184

How do I tidy data?

dairy_long <- dairy_wide |>
  pivot_longer(cols = c(cheese, milk, lactose),
               names_to = "category",
               values_to = "metric_tons")

dairy_wide

# A tibble: 5 × 4
  region           cheese   milk lactose
  <chr>             <dbl>  <dbl>   <dbl>
1 Mexico           110819 224890   22054
2 South Korea       43628   1945    8957
3 Japan             37428   1099   27981
4 Australia and NZ  22705   2045   37925
5 Canada            13343   1825    4184

dairy_long

# A tibble: 15 × 3
   region           category metric_tons
   <chr>            <chr>          <dbl>
 1 Mexico           cheese        110819
 2 Mexico           milk          224890
 3 Mexico           lactose        22054
 4 South Korea      cheese         43628
 5 South Korea      milk            1945
 6 South Korea      lactose         8957
 7 Japan            cheese         37428
 8 Japan            milk            1099
 9 Japan            lactose        27981
10 Australia and NZ cheese         22705
11 Australia and NZ milk            2045
12 Australia and NZ lactose        37925
13 Canada           cheese         13343
14 Canada           milk            1825
15 Canada           lactose         4184

How do I tidy data?

dairy_long <- dairy_wide |>
  pivot_longer(cols = 2:4,
               names_to = "category",
               values_to = "metric_tons")

dairy_wide

# A tibble: 5 × 4
  region           cheese   milk lactose
  <chr>             <dbl>  <dbl>   <dbl>
1 Mexico           110819 224890   22054
2 South Korea       43628   1945    8957
3 Japan             37428   1099   27981
4 Australia and NZ  22705   2045   37925
5 Canada            13343   1825    4184

dairy_long

# A tibble: 15 × 3
   region           category metric_tons
   <chr>            <chr>          <dbl>
 1 Mexico           cheese        110819
 2 Mexico           milk          224890
 3 Mexico           lactose        22054
 4 South Korea      cheese         43628
 5 South Korea      milk            1945
 6 South Korea      lactose         8957
 7 Japan            cheese         37428
 8 Japan            milk            1099
 9 Japan            lactose        27981
10 Australia and NZ cheese         22705
11 Australia and NZ milk            2045
12 Australia and NZ lactose        37925
13 Canada           cheese         13343
14 Canada           milk            1825
15 Canada           lactose         4184

How do I tidy data?

dairy_long <- dairy_wide |>
  pivot_longer(cols = -region,
               names_to = "category",
               values_to = "metric_tons")

dairy_wide

# A tibble: 5 × 4
  region           cheese   milk lactose
  <chr>             <dbl>  <dbl>   <dbl>
1 Mexico           110819 224890   22054
2 South Korea       43628   1945    8957
3 Japan             37428   1099   27981
4 Australia and NZ  22705   2045   37925
5 Canada            13343   1825    4184

dairy_long

# A tibble: 15 × 3
   region           category metric_tons
   <chr>            <chr>          <dbl>
 1 Mexico           cheese        110819
 2 Mexico           milk          224890
 3 Mexico           lactose        22054
 4 South Korea      cheese         43628
 5 South Korea      milk            1945
 6 South Korea      lactose         8957
 7 Japan            cheese         37428
 8 Japan            milk            1099
 9 Japan            lactose        27981
10 Australia and NZ cheese         22705
11 Australia and NZ milk            2045
12 Australia and NZ lactose        37925
13 Canada           cheese         13343
14 Canada           milk            1825
15 Canada           lactose         4184

03. Your Turn!

03:00

Task

Pivot the nat_parks data frame longer so that year and visitors each make a column.

Hint: Pivot the year columns only. To specify them, you can use either of these structures
cols = start:stop
cols = -c(column1, column2)

Hint

# pivot dairy data
dairy_long <- dairy_wide |>
  pivot_longer(cols = 2:4,
               names_to = "category",
               values_to = "metric_tons")

# A tibble: 5 × 4
  region           cheese   milk lactose
  <chr>             <dbl>  <dbl>   <dbl>
1 Mexico           110819 224890   22054
2 South Korea       43628   1945    8957
3 Japan             37428   1099   27981
4 Australia and NZ  22705   2045   37925
5 Canada            13343   1825    4184

Resources for further learning

Try the extended exercises!
Come next week!
Email askdata@duke.edu
Cheatsheets
- Data transformation with dplyr
- Data tidying with tidyr
R for Data Science (2e)

Learn R:

Questions we will answer today

duke.is/LearnR-2-Spring2026

00. Quick review

Quick review

Quick review

Quick review

Introducing if_else()

00. Your Turn!

01. How do I aggregate by collapsing?

How do I aggregate by collapsing?

How do I aggregate by collapsing?

How do I aggregate by collapsing?

How do I aggregate by collapsing?

How do I aggregate by collapsing?

How do I aggregate by collapsing?

How do I aggregate by collapsing?

How do I aggregate by collapsing?

How do I aggregate by collapsing?

01. Your Turn!

02. How do I aggregate without collapsing?

How do I aggregate w/o collapsing?

How do I aggregate w/o collapsing?

How do I aggregate w/o collapsing?

02. Your Turn!

03. How do I tidy data?

How do I tidy data?

How do I tidy data?

How do I tidy data?

How do I tidy data?

How do I tidy data?

03. Your Turn!

Resources for further learning

00.
Quick review

Introducing `if_else()`

01.
How do I aggregate by collapsing?

02.
How do I aggregate without collapsing?

03.
How do I tidy data?