Commercial enterprise offerings

Which AI model writes the best R code?

Written by Sara Altman

Written by Simon Couch

2023-01-13

An illustration shows a friendly cartoon bear dressed as a pediatrician, wearing a teal scrub top and cap, and holding a stethoscope. The bear is inside a hexagon-shaped emblem with a red dashed outline. Behind the bear are red lines representing a heartbeat. Below the bear, the word "vitals" is written in black lowercase letters on a white banner. The entire emblem is set against a teal background with subtle, light gray, vertical lines in the lower portion, resembling a sound wave or data visualization.

Compare model performance in Shiny app

In a series of past blog posts, we evaluated how well various models generate R code. To do so, we used the vitals package, a framework for LLM evaluation. vitals contains functions for measuring the effectiveness of an LLM, as well as are, a dataset of challenging R coding problems and their solutions. We evaluated model performance on this set of coding problems.

Since we started this series, there has been a proliferation of new models. To help keep this evaluation up-to-date, we created a Shiny app to visualize results. Going forward, we’ll add new results to this app. The app lets you select and compare any of the models we’ve evaluated on both performance and cost. You can see the app here.

We’ll summarize current results in this post, but check out the app to create your own custom comparisons.

Current recommendation: Claude Sonnet 4.5, Claude Opus 4.5, or OpenAI GPT-5

For R coding tasks, we recommend using Claude Sonnet 4.5, Claude Opus 4.5, or OpenAI GPT-5. Gemini 3 Pro, Google Gemini’s latest model, also scored well. Notably, Claude Sonnet 4.5, released in September, shows marked improvements over version 4.

Take token usage into account

A token is the fundamental unit of data that an LLM can process (for text processing, a token is approximately a word). Different models use vastly different amounts of tokens. As a result, a model that is inexpensive on a per-token basis can, in practice, cost much more if it produces longer outputs.

The above plot shows the actual costs incurred during the evaluation so it takes token count and per token cost into account.

Key insights

Claude Opus 4.5, Claude Sonnet 4.5, and OpenAI GPT-5 are the current best performers on the set of R coding tasks.

These models represent some of the latest model releases from the major AI labs.

Claude Opus 4.5 (released November 2025) and Claude Sonnet 4.5 (released September 2025) show significant improvements over their predecessors. However, OpenAI’s most recent release GPT-5.1 (November 2025) performed worse than GPT-5.¹
Gemini 3, Google Gemini’s latest model, also performed well. It is slightly more expensive than our top three recommendations.
Claude models show consistent improvement across versions. Claude Sonnet 4.5 notably outperforms version 4 (released May 2025), and Opus 4.5 is cheaper and performed better than Opus 4.1.

Pricing

LLM pricing is typically provided per million tokens. In our evaluation process, each model used between 26,070 and 31,400 input tokens and between 26,450 and 198,800 output tokens.

Model	Input (per 1M tokens)	Output (per 1M tokens)	Input Tokens Used	Output Tokens Used	Total Cost	% Correct
Model Pricing and Performance Details
Sorted by percent correct (descending)
Claude Opus 4.5	$5.00	$25.00	31,401	69,699	$1.90	70.1%
Claude Sonnet 4.5	$3.00	$15.00	31,314	61,936	$1.02	66.7%
GPT-5	$1.25	$10.00	26,067	167,873	$1.71	66.7%
Claude Opus 4.1	$15.00	$75.00	31,314	57,695	$4.80	64.4%
Gemini 3	$2.00	$12.00	29,610	198,754	$2.44	64.4%
Claude Sonnet 4	$3.00	$15.00	31,314	61,704	$1.02	55.2%
GPT-5.1	$1.25	$10.00	26,067	26,453	$0.30	48.3%

Methodology

We used ellmer to create connections to the various models and vitals to evaluate model performance on R code generation tasks.
We tested each model on a shared benchmark: the are dataset (“An R Eval”). are contains a collection of difficult R coding problems and a column, target, with information about the target solution.
Using vitals, we had each model solve each problem in are. Then, we scored their solutions using a scoring model (Claude 3.7 Sonnet). Each solution received either an Incorrect, Partially Correct, or Correct score.

You can see all the code used to evaluate the models here. If you’d like to see a more in-depth analysis, check out Simon Couch’s series of blog posts, which this post is based on, including Claude 4 and R Coding.

Footnotes

GPT-5.1 Codex, which is not shown here but can be seen in the app, performed substantially better than GPT-5.1.↩︎

Sara Altman

Sara is a Data Science Educator on the Developer Relations team at Posit.

Simon Couch

Software Engineer at Posit, PBC

Simon Couch is a member of the AI Core Team at Posit, working at the intersection of R and LLMs. He’s authored several packages that help R users get more out of LLMs, from package-based assistants to tools for evaluation to implementations of emerging technologies like the Model Context Protocol. Drawing on his background in statistics, Simon worked on the tidymodels framework for machine learning in R for a number of years before transitioning to working on LLMs.

Which AI model writes the best R code?

Compare model performance in Shiny app

Current recommendation: Claude Sonnet 4.5, Claude Opus 4.5, or OpenAI GPT-5

Key insights

Pricing

Methodology

Footnotes

Sara Altman

Simon Couch

Related Content

Snowflake Native Apps vs. Connected Apps for financial services

We prove model governance by live evidence, not paperwork

How BFSI IT leaders unify R and Python across Databricks, Snowflake, and the big three clouds