Like it or not, SQL is the closest thing we have to a universal language for working with structured data. Celebrating its 45th birthday in 2019, SQL today integrates with thousands of applications and has millions of users worldwide. Data analysts who use SQL represent a large audience of potential R users motivated to expand their data science skills. But learning R can be frustrating for SQL users. One major frustration is the inability to query R data frames directly with SQL SELECT statements. Eager to use R for tasks that are difficult or impossible with SQL (like data visualization and machine learning), these users are dismayed to find that they must first learn an unfamiliar syntax for data manipulation. The popularity of the sqldf package (which automatically exports an R data frame into an embedded database, then runs a SQL query on it) attests to this frustration. But now there is a way to query an R data frame directly, without moving the data out of R. In this talk, I introduce tidyquery, a new R package that runs SQL queries directly on R data frames. tidyquery is powered by dplyr and by queryparser, a new pure-R, no-dependency SQL query parser.
Ian Cook is a data science and machine learning educator at Cloudera and the creator of the Coursera course "Analyzing Big Data with SQL". He has authored and contributed to several R packages and has worked in data scientist roles at TIBCO Software and Advanced Micro Devices. Ian is a cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area. He received a master's degree in statistics from Lehigh University.