Last year, more than 2,500 immigrant children were separated from their families while in government custody. Information about their status is scattered across several government agencies, and throughout the national class-action lawsuit “Ms. L v. ICE,” the ACLU Analytics team has been using R to join, deduplicate, validate, and analyze that information. Using specifics from this case, this talk will address common challenges that arise from human-generated data in spreadsheets. With generalizable examples, I will discuss data tidying, standardization, deduplication, and validation using the tidyverse, janitor, assertthat, and other packages. Finally, I will share best practices for requesting useful data from non-quantitative subject matter experts.
