Which AI model writes the best R code?

Written by Sara Altman and Simon Couch
2023-01-13
[Image: hex logo for the vitals package, showing a cartoon bear dressed as a pediatrician above the word "vitals".]

LLMs can now help you write R code. There are many available models, so which one should you pick?

We looked at a handful of models and evaluated how well they each generate R code. To do so, we used the vitals package, a framework for LLM evaluation. vitals contains functions for measuring the effectiveness of an LLM, as well as a dataset of challenging R coding problems and their solutions. We evaluated model performance on this set of coding problems.

Current recommendation: OpenAI o4-mini or Claude Sonnet 4

For R coding tasks, we recommend using OpenAI’s o4-mini or Anthropic’s Claude Sonnet 4. OpenAI’s o3 performed the best on this evaluation but is also ten times more expensive than o4-mini and around three times more expensive than Sonnet 4.

Reasoning vs. non-reasoning models

Thinking or reasoning models are LLMs that attempt to solve tasks through structured, step-by-step processing rather than just pattern-matching.

Most of the models we looked at here are reasoning models, or are capable of reasoning. The only models not designed for reasoning are GPT-4.1 and Claude Sonnet 4 with thinking disabled.

Many R programmers seem to prefer Claude Sonnet and it remains a good solution for R code generation, even though o3 and o4-mini performed slightly better in this evaluation.

Take token usage into account

A token is the fundamental unit of data that an LLM can process (for text processing, a token is approximately a word). Reasoning models, including o4-mini, often generate significantly more output tokens than non-reasoning models. So while o4-mini is inexpensive per token, its actual cost can be higher than expected. In our evaluation, however, o4-mini was still tied for the least expensive model overall, despite using more output tokens than any model except o3 (another reasoning model).

If you have ideas for how we could better visualize or communicate model cost, we would like to hear your suggestions.
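
To make the trade-off between per-token price and token volume concrete, here is a rough sketch in Python (chosen only for illustration; it is not part of the evaluation code). The output prices for o4-mini and GPT-4.1 come from the pricing table in this post. The token counts are made up for the example and are not measurements from the evaluation.

```python
# Output prices in USD per 1 million tokens, taken from the pricing table in this post.
OUTPUT_PRICE_PER_MILLION = {"o4-mini": 4.40, "GPT-4.1": 8.00}


def output_cost(model: str, output_tokens: int) -> float:
    """Return the USD cost of generating `output_tokens` tokens with `model`."""
    return output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION[model]


# Hypothetical scenario: the non-reasoning model answers in 50,000 tokens,
# while the reasoning model spends three times as many tokens on the same tasks.
gpt41_cost = output_cost("GPT-4.1", 50_000)
o4_mini_cost = output_cost("o4-mini", 150_000)

# Ratio of output tokens at which o4-mini's output cost matches GPT-4.1's.
break_even_ratio = (
    OUTPUT_PRICE_PER_MILLION["GPT-4.1"] / OUTPUT_PRICE_PER_MILLION["o4-mini"]
)

print(f"GPT-4.1: ${gpt41_cost:.2f}")
print(f"o4-mini: ${o4_mini_cost:.2f}")
print(f"break-even output-token ratio: {break_even_ratio:.2f}")
```

In this made-up scenario, o4-mini costs more for output despite its lower per-token price. Counting output tokens only, it stays cheaper than GPT-4.1 as long as it generates less than about 1.8 times as many tokens.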

Key insights

  • OpenAI’s o3 and o4-mini and Anthropic’s Claude Sonnet 4 are the current best performers on the set of R coding tasks.

    OpenAI’s o3 and o4-mini (April 2025) and Anthropic’s Claude Sonnet 4 (May 2025) are the newest models we evaluated. Anthropic also released Claude Opus 4, which we did not evaluate, alongside Sonnet 4.

  • Claude Sonnet 4 performed similarly regardless of whether thinking was enabled.

  • o3 and o4-mini performed much better than the previous generation of reasoning models, o1 and o3-mini, which were released in December 2024 and January 2025, respectively.

Pricing

LLM pricing is typically provided per million tokens. Note that in our analysis, o3 and o4-mini performed similarly for R code generation, but o3 is about ten times more expensive. OpenAI uses the “mini” suffix for models that are smaller, faster, and cheaper than the other models.

Price per 1 million tokens (USD)

Name               Input     Output
o3                 $10.00    $40.00
o4-mini            $1.10     $4.40
Claude Sonnet 4    $3.00     $15.00
GPT-4.1            $2.00     $8.00
o1                 $15.00    $60.00
o3-mini            $1.10     $4.40

In our evaluation process, each model used between 29,600 and 125,300 input tokens and between 46,570 and 146,800 output tokens. The entire analysis cost around $10.
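
As a rough illustration of how per-million prices turn into a bill, the Python sketch below applies the prices from the table above to one hypothetical workload of 100,000 input and 100,000 output tokens. That workload falls inside the ranges reported above but is not any model's actual usage. The function name run_cost is made up for this example.

```python
# Prices in USD per 1 million tokens (input, output), copied from the pricing table above.
PRICES = {
    "o3": (10.00, 40.00),
    "o4-mini": (1.10, 4.40),
    "Claude Sonnet 4": (3.00, 15.00),
    "GPT-4.1": (2.00, 8.00),
    "o1": (15.00, 60.00),
    "o3-mini": (1.10, 4.40),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one run given its input and output token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical usage, identical for every model so that only price differs.
INPUT_TOKENS = 100_000
OUTPUT_TOKENS = 100_000

costs = {model: run_cost(model, INPUT_TOKENS, OUTPUT_TOKENS) for model in PRICES}
baseline = costs["o4-mini"]

# Print models from cheapest to most expensive, with cost relative to o4-mini.
for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model:<16} ${cost:5.2f}  ({cost / baseline:.1f}x o4-mini)")
```

With identical token counts, o3 comes out at roughly nine times the cost of o4-mini, in line with the "about ten times" figure. Real bills differ because each model uses a different number of tokens on the same problems.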

Methodology

  • We used ellmer to create connections to the various models and vitals to evaluate model performance on R code generation tasks.
  • We tested each model on a shared benchmark: the are dataset (“An R Eval”). are contains a collection of difficult R coding problems and a column, target, with information about the target solution.
  • Using vitals, we had each model solve each problem in are. Then, we scored their solutions using a scoring model (Claude 3.7 Sonnet). Each solution received either an Incorrect, Partially Correct, or Correct score.
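
Once every solution has a grade, comparing models comes down to tallying the three categories. The sketch below shows that last step in Python, purely for illustration. It is not the vitals API, the function name summarize is made up, and the grades are hypothetical rather than results from this evaluation.

```python
from collections import Counter

# The three grades used in this post, ordered from worst to best.
GRADES = ["Incorrect", "Partially Correct", "Correct"]


def summarize(grades: list[str]) -> dict[str, float]:
    """Return the percentage of solutions that received each grade."""
    if not grades:
        raise ValueError("no grades to summarize")
    unknown = set(grades) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grades: {sorted(unknown)}")
    counts = Counter(grades)
    return {grade: 100 * counts[grade] / len(grades) for grade in GRADES}


# Hypothetical grades for one model on ten problems (not real results).
example_grades = ["Correct"] * 6 + ["Partially Correct"] * 3 + ["Incorrect"]
summary = summarize(example_grades)

for grade, percent in summary.items():
    print(f"{grade:<18} {percent:5.1f}%")
```

Reporting all three percentages, rather than a single accuracy number, keeps partially correct solutions visible instead of folding them into either extreme.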

You can see all the code used to evaluate the models here. If you’d like to see a more in-depth analysis, check out Simon Couch’s series of blog posts, which this post is based on, including Evaluating o3 and o4-mini on R coding performance.

Sara Altman

Sara is a Data Science Educator on the Developer Relations team at Posit.

Simon Couch

Software Engineer at Posit, PBC
Simon Couch is a member of the AI Core Team at Posit, working at the intersection of R and LLMs. He’s authored several packages that help R users get more out of LLMs, from package-based assistants to tools for evaluation to implementations of emerging technologies like the Model Context Protocol. Drawing on his background in statistics, Simon worked on the tidymodels framework for machine learning in R for a number of years before transitioning to working on LLMs.