
Pysparklyr for interacting with Spark & Databricks Connect

Written by Posit Team
2023-12-20
The Connections pane in RStudio showing the Unity Catalog tables from Databricks

We are thrilled to announce that the new version of pysparklyr is on CRAN!

Pysparklyr is the new extension to sparklyr that allows you to interact with Spark & Databricks Connect. We discussed sparklyr updates in our recent blog post and provided detailed instructions for working with remote clusters in the package documentation.

The new version of pysparklyr has several big user-facing updates that make working with Databricks and R together even easier.

To install the latest version of pysparklyr, run the code below:

install.packages("pysparklyr")

Let’s dive in!

Installation prompt

Previously, first-time users needed to run install_databricks() before connecting. With the new version of pysparklyr, this separate step is no longer needed. If no existing environment is found, spark_connect() automatically prompts you to install one, which makes the setup instructions much easier to follow.

library(sparklyr)

sc <- spark_connect(
  cluster_id = "Enter your cluster ID here",
  method = "databricks_connect"
)
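
Outside of Posit Workbench, the connection also needs your workspace URL and an access token. The sketch below assumes these are supplied through the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, and checks for them before connecting. The helper names missing_databricks_vars() and connect_to_databricks() are hypothetical and only for illustration; they are not part of sparklyr or pysparklyr.

```r
# Sketch only: check Databricks credentials before calling spark_connect().
# Assumption: credentials come from the DATABRICKS_HOST and DATABRICKS_TOKEN
# environment variables.

# Hypothetical helper: return the names of required variables that are
# unset or empty.
missing_databricks_vars <- function(vars = c("DATABRICKS_HOST", "DATABRICKS_TOKEN")) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

# Hypothetical wrapper: stop early with a readable message, otherwise connect.
connect_to_databricks <- function(cluster_id) {
  missing <- missing_databricks_vars()
  if (length(missing) > 0) {
    stop(
      "Set these environment variables first: ",
      paste(missing, collapse = ", ")
    )
  }
  sparklyr::spark_connect(
    cluster_id = cluster_id,
    method = "databricks_connect"
  )
}
```

With both variables set, sc <- connect_to_databricks("Enter your cluster ID here") should behave like the spark_connect() call above, and spark_disconnect(sc) closes the session when you are done.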

Read more in the documentation: First Time Connecting

RStudio Snippet for Databricks Connections

In the new Posit Workbench update, Databricks users can now easily start their clusters right from Posit Workbench. If you’re using RStudio on Posit Workbench, there’s a new Databricks pane that helps you manage your Databricks Spark clusters, as well as connections to clusters via sparklyr. Click on the Databricks pane, and you’ll see a list of your compute clusters, their status, and more details.

Once you’ve started a cluster, you can connect to it by clicking on the Connection icon. This opens a dialogue box with all the necessary info to make the connection, including an R code snippet. Posit Workbench takes care of your Databricks credentials, so you don’t have to worry about inputting keys or tokens into the snippet.

The Databricks pane in RStudio highlighted in the top right, followed by a zoom in on the Databricks pane, and then a zoom in on a cluster, which opens a dialogue box with the cluster ID, Python environment, host URL, and the snippet for connecting.

Machine learning features

The new version adds support for Logistic Regression models, the Standard Scaler and Max Abs Scaler transformers, and ML Pipelines. Other useful additions include a feature in Connect that lets you hover over a model parameter to see a pop-up description.
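
As a rough illustration of the model support, the sketch below fits a logistic regression on a small table copied to the cluster. It assumes an open connection sc created with spark_connect() as shown earlier, and it assumes that the usual sparklyr formula interface is available over Databricks Connect. The function name fit_am_model() and the choice of mtcars columns are hypothetical, used only for this example.

```r
# Sketch only: fit a logistic regression over an existing connection `sc`.
# Assumption: the sparklyr formula interface works with this connection type.

# Hypothetical helper: copy mtcars to the cluster and model the binary
# column `am` from `mpg` and `wt`.
fit_am_model <- function(sc) {
  tbl_mtcars <- sparklyr::copy_to(sc, mtcars, overwrite = TRUE)
  sparklyr::ml_logistic_regression(tbl_mtcars, am ~ mpg + wt)
}
```

Calling model <- fit_am_model(sc) and then printing model in the RStudio console should show the model type and its parameters.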

The RStudio Console showing the results of an ML Connect model with the type of model, Logistic Regression, and the different parameters. Hovering over one of the parameters describes that parameter, for example, predictionCol shows prediction column name.

Read more in the documentation: First Time Connecting

Updated sparklyr cheatsheet

The sparklyr cheatsheet, revised in December 2023, includes various updates. You can use it to find the latest functions for your work!

Data Science in Spark with sparklyr cheat sheet front page.

Check it out here: sparklyr cheatsheet

Update pysparklyr to interact with Spark & Databricks Connect

We hope that you enjoy these new additions to pysparklyr. For more information: