Pins in Databricks
If you haven’t heard of pins, it’s an amazing tool available for R and Python that lets you publish data, models, and other objects for easy sharing across projects and with your colleagues. You “pin” an object to a board, like Dropbox, Posit Connect, Amazon S3, or others, to track changes, re-run analyses on historical data, and undo mistakes. Read more about pins at https://pins.rstudio.com.
At a recent Data Science Hangout with Chris Engelhardt, we heard great feedback from pins users and excitement about the possibility of using Databricks as a board. Databricks Unity Catalog offers a governance layer to manage access to data assets. R users could connect to their Unity Catalog using packages such as odbc or sparklyr, then prepare, analyze, and model their data. The resulting outputs could be saved as a pin, stored within Databricks in its secure and efficient data environment, and made accessible for versioning, collaboration, reporting, and further analysis without having to move or resave the data elsewhere.
Since that conversation, we have merged support for board_databricks() into the pins R package! Now, you can access and store pins directly in Databricks Volumes from your R script.
According to Databricks + pins user jmbarbone, some key advantages of this setup include:
- Your storage location is closer to where your data is processed, so you’re not transferring files out of Databricks
- Organizations can control access to pinned objects using Databricks’ data governance features
- Databricks Volumes can handle unstructured data, which we detail below
- Databricks Volumes can hold multiple files and folders in a directory-like structure
Let’s explore how this integration opens new possibilities for Databricks users.
Storing non-traditional data objects on Databricks
Databricks users work with pins to save various types of objects as files. This includes the common use cases of pins, such as saving CSVs or Parquet files, but also less traditional data objects such as binary model files (saved as .rds) or SPSS files (saved as .sav).
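To illustrate, a fitted model can be pinned as a binary .rds file just like tabular data. Here is a minimal sketch; the Volume path and pin name are placeholders, and you would substitute a path you have access to:

```r
library(pins)

# Connect to a Databricks Volume (placeholder path)
board <- board_databricks("/Volumes/my_catalog/my_schema/my_volume")

# Fit a small model and pin the binary object, stored as .rds
model <- lm(mpg ~ wt + hp, data = mtcars)

board |>
  pin_write(model, name = "mtcars_model", type = "rds")
```

Colleagues with access to the same Volume can then retrieve the model with `pin_read(board, "mtcars_model")` without re-fitting it.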
Example Use Case: Saving and Accessing SPSS Files
SPSS .sav files (readable in R via the haven package) often contain metadata, such as value labels stored as ordered factors, that is crucial for understanding the data. Traditional storage options, such as SQL tables, would strip this information away. Pins, however, preserve it: the metadata remains intact when the file is retrieved.
For example, a user can run the following script to write the General Social Survey 2012 results to their Volume:
library(pins)
library(haven)
# Connect to an existing Databricks Volume
# DATABRICKS_HOST is configured in .Renviron
board <- board_databricks("/Volumes/workshops/information-schema/volumes")
# Read the GSS data
gss2012 <- read_sav("https://study.sagepub.com/sites/default/files/gss2012.sav")
# Process the GSS data here
# Write the GSS data to a pin
board |>
  pin_write(gss2012)

They can then use pin_read() to pull the pin into another project or script, or share it with a colleague, while retaining the metadata.
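Reading the pin back, in the same or a different project, might look like the following sketch; the Volume path mirrors the write example above, and the pin name defaults to the object name used at write time:

```r
library(pins)

# Connect to the same Databricks Volume used for writing
board <- board_databricks("/Volumes/workshops/information-schema/volumes")

# Retrieve the pinned SPSS data; haven's labelled metadata comes back intact
gss2012 <- board |>
  pin_read("gss2012")
```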
For more information about using this new functionality, see: https://pins.rstudio.com/reference/board_databricks.html
Retaining user-level governance on Databricks
Authentication to Databricks in pins works with tokens, with three different options:
- DATABRICKS_TOKEN environment variable
- CONNECT_DATABRICKS_TOKEN environment variable
- OAuth Databricks token via the RStudio API
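For the environment variable options, the credentials are typically set in your .Renviron file so they are available to every R session without appearing in your scripts. The values below are placeholders:

```
# .Renviron (placeholder values)
DATABRICKS_HOST=https://example-workspace.cloud.databricks.com
DATABRICKS_TOKEN=dapi...
```

After editing .Renviron, restart your R session so the new values are picked up.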
When publishing pins to the Databricks Volume board, you retain your user-level governance, maintaining the granular permissions and access controls offered by Databricks and ensuring that only authorized individuals can view or modify data stored in pins. This differs from pinning objects to other boards, such as Amazon S3, where access is instead governed by the permissions configured on those particular platforms.
Learn more about using Databricks with Posit Projects
We’re so excited to bring the flexibility of pins to the governance of Databricks. Ready to get started? Install the latest version of the pins R package and try out board_databricks() today!
Do you want to learn more about using Posit tools together with Databricks? Book a call with our team.