Last year, more than 2,500 immigrant children were separated from their families while in government custody. Information about their status is scattered across several government agencies, and throughout the national class-action lawsuit “Ms. L vs ICE,” the Analytics team of the ACLU has been using R to join, deduplicate, validate, and analyze that information. Drawing on specifics of this case, this talk will address common challenges that arise from human-generated data in spreadsheets. With generalizable examples, I will discuss data tidying, standardization, deduplication, and validation using the Tidyverse, janitor, assertthat, and other packages. Finally, I will share best practices for requesting useful data from non-quantitative subject matter experts.
I am a Data Scientist at the ACLU, where I use code and statistics to support civil rights litigation and advocacy. Previously, I worked in public health and disease research, most recently as a Research Scientist with the EcoHealth Alliance. I earned my Master’s degree in epidemiology from the London School of Hygiene and Tropical Medicine and swam for Tennessee’s Lady Vols as an undergrad.