Categorical data, called “factor” data in R, presents unique challenges in data wrangling. R users often look down at tools like Excel for automatically coercing variables to incorrect datatypes, but factor data in R can produce very similar issues. The stringsAsFactors=HELLNO movement and standard Tidyverse defaults have moved us away from the use of factors, but they are sometimes still necessary for analysis. This talk will outline common problems arising from categorical variable transformations in R, and show strategies to avoid them, using both base R and the Tidyverse (particularly, dplyr and forcats functions).

View Materials

(related paper from the DSS collection)

Subscribe to more inspiring open-source data science content.

We love to celebrate and help people do great data science. By subscribing, you'll get alerted whenever we publish something new.