
tidyr 0.5.0

Written by Hadley Wickham

2016-06-13

Please note that the information presented in this post reflects the package as it stood when initially released, and may now be outdated. For the most up-to-date information, kindly refer to https://tidyr.tidyverse.org/.

I’m pleased to announce tidyr 0.5.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette. Install it with:

install.packages("tidyr")

This release has three useful new features:

separate_rows() takes values that contain multiple items separated by a delimiter and splits them into multiple rows. Thanks to Aaron Wolen for the contribution!

df <- data_frame(x = 1:2, y = c("a,b", "d,e,f"))
df %>%
  separate_rows(y, sep = ",")
#> Source: local data frame [5 x 2]
#>
#>       x     y
#>   <int> <chr>
#> 1     1     a
#> 2     1     b
#> 3     2     d
#> 4     2     e
#> 5     2     f

Compare with separate() which separates into (named) columns:

df %>%
  separate(y, c("y1", "y2", "y3"), sep = ",", fill = "right")
#> Source: local data frame [2 x 4]
#>
#>       x    y1    y2    y3
#> * <int> <chr> <chr> <chr>
#> 1     1     a     b  <NA>
#> 2     2     d     e     f
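
To make the row-splitting idea concrete, here is a rough base R sketch of the same reshaping. It is only an illustration for a single delimited column, not how tidyr implements separate_rows(); the names pieces and long are made up for this example.

```r
# Base R illustration only; not tidyr's implementation.
df <- data.frame(x = 1:2, y = c("a,b", "d,e,f"), stringsAsFactors = FALSE)

# Split each value of y on the literal delimiter
pieces <- strsplit(df$y, ",", fixed = TRUE)

# Repeat each x once per piece, then flatten the pieces into one column
long <- data.frame(
  x = rep(df$x, lengths(pieces)),
  y = unlist(pieces),
  stringsAsFactors = FALSE
)
long
```

The key step is rep() with a vector of counts, which keeps every piece aligned with the row it came from.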

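spread() gains a sep argument. Setting this will name the new columns as “key<sep>value”, that is, the name of the key column, then the separator, then the key's value. This is useful when you’re spreading based on a numeric column, where the bare values make poor column names: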

df <- data_frame(
  x = c(1, 2, 1),
  key = c(1, 1, 2),
  val = c("a", "b", "c")
)
df %>% spread(key, val)
#> Source: local data frame [2 x 3]
#>
#>       x     1     2
#> * <dbl> <chr> <chr>
#> 1     1     a     c
#> 2     2     b  <NA>
df %>% spread(key, val, sep = "_")
#> Source: local data frame [2 x 3]
#>
#>       x key_1 key_2
#> * <dbl> <chr> <chr>
#> 1     1     a     c
#> 2     2     b  <NA>
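
The naming pattern in the second result can be sketched in base R as below. This is an assumption-laden illustration of the pattern seen in the output above, not tidyr's internal code; key_name, key_values and new_names are hypothetical names.

```r
# Illustration of the column-naming pattern only; not tidyr's internals.
key_name <- "key"          # name of the column being spread
key_values <- c(1, 1, 2)   # values found in that column
sep <- "_"

# One new column per distinct key value, named <key name><sep><value>
new_names <- paste(key_name, sort(unique(key_values)), sep = sep)
new_names
```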

unnest() gains a .sep argument. This is useful if you have multiple columns of data frames that have the same variable names:

df <- data_frame(
  x = 1:2,
  y1 = list(
    data_frame(y = 1),
    data_frame(y = 2)
  ),
  y2 = list(
    data_frame(y = "a"),
    data_frame(y = "b")
  )
)
df %>% unnest()
#> Source: local data frame [2 x 3]
#>
#>       x     y     y
#>   <int> <dbl> <chr>
#> 1     1     1     a
#> 2     2     2     b
df %>% unnest(.sep = "_")
#> Source: local data frame [2 x 3]
#>
#>       x  y1_y  y2_y
#>   <int> <dbl> <chr>
#> 1     1     1     a
#> 2     2     2     b

It also gains a .id argument, which adds a column that makes the names of the list explicit:

df <- data_frame(
  x = 1:2,
  y = list(
    a = 1:3,
    b = 3:1
  )
)
df %>% unnest()
#> Source: local data frame [6 x 2]
#>
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     1     2
#> 3     1     3
#> 4     2     3
#> 5     2     2
#> 6     2     1
df %>% unnest(.id = "id")
#> Source: local data frame [6 x 3]
#>
#>       x     y    id
#>   <int> <int> <chr>
#> 1     1     1     a
#> 2     1     2     a
#> 3     1     3     a
#> 4     2     3     b
#> 5     2     2     b
#> 6     2     1     b
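
As a rough sketch of where the id column comes from, the same values can be built in base R by repeating each list name once per element. This is an illustration under the assumption of a named list of atomic vectors, not tidyr's implementation; id, values and flat are hypothetical names.

```r
# Base R illustration only; not tidyr's implementation.
y <- list(a = 1:3, b = 3:1)

# Repeat each list name once for every element it contains
id <- rep(names(y), lengths(y))

# Flatten the list, dropping the names unlist() would otherwise generate
values <- unlist(y, use.names = FALSE)

flat <- data.frame(y = values, id = id, stringsAsFactors = FALSE)
flat
```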

tidyr 0.5.0 also includes a bumper crop of bug fixes, including fixes for spread() and gather() in the presence of list-columns. Please see the release notes for a complete list of changes.

Hadley Wickham

Chief Scientist, Posit

Hadley is Chief Scientist at Posit PBC, winner of the 2019 COPSS award, and a member of the R Foundation. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. His work includes packages for data science (like the tidyverse, which includes ggplot2, dplyr, and tidyr) and principled software development (e.g. roxygen2, testthat, and pkgdown). He is also a writer, educator, and speaker promoting the use of R for data science.
