Announcing secure AI-Assisted data science in R with Posit and Snowflake Cortex

2024-03-03

Note

This blog post was originally posted on Medium.

The promise of AI in data science workflows, now made seamless and secure

By now, you probably know that AI is changing software development, making developers faster and more efficient. Developers can use large language models (LLMs) to suggest, complete, and improve their code as they write it. You may have tried to do something like this for your R-based data science workflows, using ChatGPT or another LLM alongside your data science code. If so, you probably noticed a few shortcomings:

Context switching fatigue
Security concerns
Missing context: the LLM doesn’t know about your project, so it fills its suggestions with dummy variables and pseudocode that you have to strip out and replace.

Is it possible to have a seamless and secure AI assistant for R-based data science? One that knows that you are trying to analyze data, not develop software, and tailors its suggestions accordingly? Yes, it is, with the Posit Native App on Snowpark Container Services (SPCS), combined with the gander and chores R packages and Snowflake Cortex. These things are “Better Together,” providing a frictionless workflow for data science, with robust security and context-aware AI assistance.

Why data scientists need a “Better Together” approach

So far, there has been no AI tool dedicated to R-based data science workflows. To take full advantage of AI to write code, R users have had to combine disparate tools, which creates some big disadvantages:

Context Switching Fatigue: It is frustrating to move between different environments and tools to leverage LLMs. The constant copy-and-pasting requires us to be detail-oriented and fastidious in the wrong places, disrupting our workflow and sapping productivity. We want code suggestions to appear directly within our source code and to accept them with the press of a key.
Security and Governance Headaches: LLM providers can use the information we provide to train their models, so that information can reappear in surprising places. That is a security nightmare. It is also challenging to know whether we are using an LLM in a way that complies with corporate policy.
AI Knowledge Limitations: Does your AI know the column names of your data frame? The filenames within your project? Generic AI tools usually lack the context needed to provide accurate suggestions within a data science project. Instead, they provide plausible but error-prone suggestions: code that won’t run until you spend time adjusting it.
Boilerplate Burden: Chatbots cannot do repetitive tasks that appear throughout your project, like writing documentation. To do things like this, you need to run your AI inside of your project, where it can be context aware and adjust your code at the source.

You can avoid these limitations and do powerful AI-assisted data science by combining the Posit Native App with Snowflake Cortex.

Posit + Snowflake Cortex — A Seamless Solution

This winning combination has three ingredients:

The Posit Native App on Snowpark Container Services (SPCS) is a unified environment for data science, directly within Snowflake. To open the app, go to Data Products > Apps in your Snowflake account sidebar. Then click on the Posit Native App. If the app is not there, ask your account administrator to install it. The Posit Native App gets you up and running immediately within SPCS as a data scientist, launching your favorite data science IDEs, such as RStudio Pro, VS Code, JupyterLab, or Jupyter Notebook. The Posit Native App is ready to use “out of the box,” meaning faster time to value, and it is managed software, meaning fewer ops headaches.
Snowflake Cortex gives you instant access to industry-leading LLMs trained by researchers at companies like Anthropic, Mistral, Reka, Meta, and Google. Because these LLMs are fully hosted and managed by Snowflake, using them requires minimal setup. Your data stays within Snowflake, giving you the performance, scalability, and governance you expect.
gander and chores are two R packages that use LLMs, like those provided by Snowflake Cortex, to assist data science workflows performed in the RStudio IDE. You can use gander and chores in the local RStudio Desktop IDE or in the RStudio Pro IDE provided by the Posit Native App.

gander is a chat and code completion experience, like Copilot, that knows how to query your R environment to write data science code that is ready to use.
chores provides a library of ergonomic LLM assistants that help you quickly complete repetitive, hard-to-automate tasks.

Using gander and chores with the Posit Native App in SPCS is a natural choice because:

You can start using AI immediately with no friction. Both gander and chores are pre-packaged with the Posit Native App, and pre-configured to use Snowflake Cortex from within SPCS.
You stay secure and compliant because data never leaves your Snowflake ecosystem. Since gander and chores use Snowflake Cortex to access an LLM, each interaction follows Snowflake’s strict governance guardrails — ensuring no external training or unauthorized usage of data.
You reduce the friction of context switching. Both gander and chores can seamlessly interact with Cortex models through the press of a keyboard shortcut in RStudio/Positron. Moreover, because they make enterprise-scale LLM capabilities available within your R IDE, you do not need to switch between two contexts to use them.
You receive better, more targeted suggestions from your AI, which reduces the time spent on iteration and refinement. gander automatically includes relevant project context — such as code snippets and environmental details — in prompts to the LLM, leading to more accurate and useful AI-generated outputs. As a concrete example, gander provides the LLM the name and columns of each data frame, so the LLM can suggest code that works with your data, even when you haven’t yet explicitly referenced the names in your session.

Together, the Posit Native App, gander and chores, and Snowflake Cortex create a frictionless, secure, and context-aware AI-assisted development workflow in R, all within the Snowflake data governance ecosystem. And since gander and chores create code as output, results remain consistent, auditable, and reproducible across teams.

Let’s see it in action

To see how easy it is to use these tools together, let’s use gander to generate code that creates a data visualization.

First, we open Posit Native Workbench and launch an RStudio Pro session. As we do, we click the middle box on the launch screen to provide our Snowflake credentials.

Then we begin a data science project. We load gander and the tidyverse, which come pre-installed with the Posit Native App. Then we tell gander which LLM to use.

Many LLMs are available through Snowflake Cortex. To tell gander to use a specific LLM, like Claude 3.5 Sonnet, set the .gander_chat option.

library(tidyverse)
library(gander)

options(.gander_chat = ellmer::chat_snowflake(model = "claude-3-5-sonnet"))

Next, we use DBI and odbc to connect to a Snowflake warehouse, and dplyr to access a table in the HEALTHCARE catalog. This is a streamlined process because we are inside the Posit Native App. We don’t need to pass along Snowflake credentials; the Native App provides them for us. The Posit Native App also provides these credentials automatically when gander accesses Snowflake Cortex, making it as easy to use Cortex LLMs as it is to access Snowflake data in the Native App!

con <- DBI::dbConnect(
  odbc::snowflake(),
  warehouse = "DEFAULT_WH"
)

heart_data <- tbl(con, dbplyr::in_catalog("HEALTHCARE", "PUBLIC", "HEART_DATA"))

heart_data |> 
  head()

It is time to write code to make a visualization. We move our cursor to where we want to insert code, and open gander.

The gander package adds a chat interface to the RStudio IDE as an add-in. We can access this add-in by opening the add-ins menu. We can also create a keyboard shortcut to launch the add-in (explained below).

We can now tell gander to “make a plot of age and serum sodium using the heart data.” gander writes the code and inserts it into our file.

gander receives a snapshot of our environment with every request and recognizes that we are talking about the heart_data dataset. It also looks up the column names age and serum_sodium and uses them within our code. gander even realizes that heart_data is a connection to a table (and not the table itself), and adds a collect() command to import the data so it can be plotted. This is a phenomenal level of context awareness, and it surpasses what is available in AI assistants like Copilot, which are not built for data-science-specific use cases.
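The generated code itself is not reproduced here, so the following is only a minimal sketch of the pattern described above: collect the lazy table, then plot the two columns. It assumes the dbplyr and RSQLite packages are installed and uses a small in-memory table with made-up values as a hypothetical stand-in for the Snowflake table.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the Snowflake table: a lazy table backed by an
# in-memory SQLite database. The values are invented for illustration.
heart_data <- dbplyr::memdb_frame(
  age = c(45, 52, 60, 63, 70, 75),
  serum_sodium = c(140, 138, 136, 137, 134, 132)
)

# heart_data is a reference to a database table, so collect() is used to
# run the query and bring the rows into R as a local tibble.
heart_local <- heart_data |>
  select(age, serum_sodium) |>
  collect()

# Scatter plot of the two columns named in the request.
heart_plot <- ggplot(heart_local, aes(x = age, y = serum_sodium)) +
  geom_point()

heart_plot
```

Selecting only the needed columns before collect() keeps the amount of data pulled out of the database small, which matters more for a real warehouse table than for this toy example.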

This is great. Now we want to iterate.

We highlight our plot code and ask gander to update it to include a title and annotation line.

Behind the scenes

gander works by collecting our prompt, augmenting it, and sending the result to Snowflake Cortex. You can inspect what gander sends with gander_peek(), making gander a flexible and transparent AI experience.

gander_peek()

The output of gander_peek() contains:

The number of tokens used
The system prompt gander sends to the LLM
The user prompt gander sends to the LLM, which includes your prompt, descriptions of your environment and the contents of your source file
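As a rough sketch of how the pieces listed above might be pulled apart programmatically: this assumes that gander_peek() returns the ellmer chat object from the most recent request (or NULL if there has been none), and that the object exposes ellmer's get_system_prompt() and get_turns() methods. It only does something useful after gander has handled at least one request in the current session.

```r
library(gander)

# Assumption: gander_peek() returns the ellmer chat object from the most
# recent gander request, or NULL when no request has been made yet.
last_chat <- gander_peek()

if (!is.null(last_chat)) {
  # The system prompt that was sent to the model.
  cat(last_chat$get_system_prompt(), "\n")

  # The conversation turns: the augmented user prompt and the model's reply.
  print(last_chat$get_turns())
}
```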

Customize behavior

By default, gander adds the following system prompt to your query:

You are a helpful but terse R data scientist. Respond only with valid R code: no exposition, no backticks. Always provide a minimal solution and refrain from unnecessary additions. Use tidyverse style and, when relevant, tidyverse packages. For example, when asked to plot something, use ggplot2, or when asked to transform data, using dplyr and/or tidyr unless explicitly instructed otherwise.

You can replace the final part of this prompt (everything from “Use tidyverse style” onward, shown in green in the original post) with the .gander_style option. For example, running the code below

options(.gander_style = "Use base R.")

changes the system prompt to:

You are a helpful but terse R data scientist. Respond only with valid R code: no exposition, no backticks. Always provide a minimal solution and refrain from unnecessary additions. Use base R.
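To make the substitution concrete, here is a small illustration in plain R. The helper build_system_prompt() is hypothetical and is not part of gander; it only mimics the behavior described above, where a fixed opening is followed by either the default style text or the value of the .gander_style option.

```r
# Hypothetical helper for illustration only; not gander's implementation.
build_system_prompt <- function(style = getOption(".gander_style")) {
  # Fixed opening of the system prompt quoted above.
  base <- paste(
    "You are a helpful but terse R data scientist.",
    "Respond only with valid R code: no exposition, no backticks.",
    "Always provide a minimal solution and refrain from unnecessary additions."
  )

  # Default style text, used when the option is not set.
  default_style <- paste(
    "Use tidyverse style and, when relevant, tidyverse packages.",
    "For example, when asked to plot something, use ggplot2, or when asked",
    "to transform data, using dplyr and/or tidyr unless explicitly",
    "instructed otherwise."
  )

  if (is.null(style)) {
    style <- default_style
  }

  paste(base, style)
}

# Set the option, build the prompt, then restore the previous setting.
old <- options(.gander_style = "Use base R.")
prompt <- build_system_prompt()
cat(prompt, "\n")
options(old)
```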

Choose a model

Many LLMs are available through Snowflake Cortex. To tell gander to use a specific LLM, like Claude 3.5 Sonnet, set the .gander_chat option.

options(.gander_chat = ellmer::chat_snowflake(model = "claude-3-5-sonnet"))

Create a keyboard shortcut

To create a keyboard shortcut that opens the gander chat interface, go to Tools > Modify Keyboard Shortcuts and assign a custom shortcut to the gander add-in.

Now we can launch the chat interface by pressing Ctrl + Cmd + g. This shortcut will be available to us in every RStudio Pro session that we launch with the Posit Native App.

Recap: The Sum is Greater Than Its Parts

The combined power of Posit, gander, and Snowflake Cortex delivers a superior AI-assisted data science experience, with:

Immediate access to data science AI tools within Snowflake.
Enterprise-grade security and governance.
An unparalleled, frictionless AI workflow for data science.
Truly context-aware and intelligent AI assistance.

Accelerate development across your R-based workflows with this “better together” combo.

Go here to request a personalized demo of Posit Native App on Snowflake and see the ‘Better Together’ story in action!

Want to try this locally?

gander and chores also work in your local RStudio Desktop IDE. Follow the installation instructions here.