Generate data with an LLM and ellmer
Note
Check out the Posit AI Connect Cloud portfolio for more AI examples and join our email list for updates on Posit + AI tools.
Generating realistic datasets is useful in various settings, from education to software testing. It enables controlled data that precisely fits one’s needs.
In this blog post, we’ll use the ellmer package to generate datasets with an LLM. ellmer simplifies the process of working with large language models (LLMs) from R.
Among other tools, we’ll create an R function that generates data with an LLM.
Specifically, we will:
- Use ellmer to create a command-line tool that generates data from a given description.
- Use ellmer and Shiny to create a Shiny app that generates data, visualizes it, and makes it available for download.
Tip
To learn more about ellmer, check out our recent blog post.
Why simulate data with an LLM?
There are many tasks you might want to simulate realistic-looking data for, including:
Package or app testing. Synthetic data that matches certain criteria is helpful for testing when developing packages or apps.
Teaching. Demonstrate a particular skill or assess students on a particular concept by creating data that precisely meets your teaching needs.
Prototyping a report. Design a report, dashboard, or app before the data is available by creating data in the same format as the real data.
Clinical trial simulations. Simulate data to explore potential outcomes, estimate statistical power, or plan analyses before conducting a clinical trial.
Synthetic data for model training or validation. Use synthetic data to overcome the limitations of real datasets in model development, protecting user privacy, simulating edge-case scenarios, and augmenting existing data. Synthetic data is especially valuable when real data is scarce.
Popular LLMs are proficient and fast at simulating realistic-looking data. Using an LLM can substantially speed up the process of creating a dataset.
You might have used ChatGPT or another LLM chat interface to simulate data. Some model providers even let you download the data directly as a CSV. However, this workflow requires manual interaction: using the chat interface, downloading the file, and loading it into your development environment.
To generate datasets programmatically, you’ll want to interact directly with the LLM’s API. That’s where ellmer comes in. We’ll use ellmer to interact with the LLM and generate data programmatically and entirely from R.
First, let’s make our command-line tool.
Command-line chat
Let’s build a small command-line tool in R that simulates data using the ellmer package.
Setup
First, we need to install the necessary packages and set up API keys. You’ll need:
- ellmer, which simplifies the process of interacting with LLMs from R and
- usethis, which we’ll use to set up the API keys.
install.packages(c("ellmer", "usethis"))Next, add your API key(s) to your .Renviron file. You can open your .Renviron for editing with usethis::edit_r_environ().
You can choose the LLM that you want to work with. ellmer includes functions for working with OpenAI, Anthropic, Gemini, and other providers. You can see a full list of supported providers here.
Add your desired API key(s) to your .Renviron, for example:
OPENAI_API_KEY=my-api-key-openai-uejkK92
ANTHROPIC_API_KEY=api-key-anthropic-nxue0
GOOGLE_API_KEY=api-key-google-palw2nInitialize a chat with ellmer
Now, we’ll create a chat with ellmer.
The chat_*() functions, like chat_openai() and chat_claude(), create a Chat object, essentially a record of the conversation between the user and the chatbot. Choose your chat_*() function based on your desired LLM provider (see the full list here).
library(ellmer)
chat <- chat_openai(
model = "gpt-4o",
system_prompt =
"Generate tabular data based on the user's request.
Limit the data to 10 rows unless the user specifically requests more."
)LLM providers typically offer multiple models. Specify your desired model with the model argument.
Because we want to use this chat to generate data, we’ll also specify a system prompt instructing the LLM to generate tabular data based on the criteria given by the user.
Tip
It’s best practice to write system prompts in a markdown file. Later, we’ll write a longer system prompt and do just that, but for now, we’ll just specify it directly to the system_prompt argument. For more information about prompt design, see https://ellmer.tidyverse.org/articles/prompt-design.html.
Start the chat
Next, we start the chat by telling the LLM to ask the user about the data they want to generate. Then, we use live_console() to continue the chat in the console.
library(ellmer)
chat <- chat_openai(
model = "gpt-4o",
system_prompt =
"Generate tabular data based on the user's request. Limit the data to 10 rows unless the user specifically requests more."
)
chat$chat("Ask what kind of data the user wants to generate.", echo = TRUE)
live_console(chat, quiet = TRUE) chat() is a method of a Chat object that sends input to the model. Here, our Chat object is called chat, so we call chat$chat(), and pass in a string representing the first message in the conversation. Now, when we start the console conversation with live_console(), the model will start by asking the user what kind of data they want.
Parse the data
We’ve gotten the LLM to simulate data for us, but it’s not really in a convenient format. If we wanted to pull this data into R, we would need to parse the string output from the LLM.
We can improve our process by asking the model to format the data like a CSV. Then, we can read the string of data into R with read_csv().
First, we need to update the system prompt to instruct the LLM to generate the data strings in a consistent CSV format.
System prompt:
Generate tabular data based on the user's request. Limit the data to 10 rows unless the user specifically requests more.
Generate the data in a format that is readable by the R function `readr::read_csv()`. For example:
TransactionID, Make, Model, Year, Price, Salesperson
001, Toyota, Camry, 2020, 24000, John Smith
002, Honda, Accord, 2019, 22000, Jane Doe
003, Ford, Explorer, 2021, 32000, Emily Johnson
004, Chevrolet, Malibu, 2020, 23000, Michael Brown
005, Nissan, Altima, 2021, 25000, Sarah Davis
006, Hyundai, Elantra, 2019, 19000, David Wilson
007, BMW, X3, 2020, 42000, Mary White
008, Audi, A4, 2021, 38000, Robert Martinez
009, Subaru, Outback, 2020, 27000, James Lee
010, Kia, Soul, 2021, 21000, Linda Thompson
Do not include any text formatting in the data (e.g., no backticks).
Do not include any other information with the dataset. With this prompt, the model should generate data in a format that’s easy for us to read into R. Instead of relying on user input in the command line, let’s turn this into a function. This will allow us to programmatically generate data (e.g., in a Shiny app or script).
library(ellmer)
generate_data <- function(data_description) {
chat <- chat_openai(
model = "gpt-4o",
system_prompt = readr::read_lines("prompt.md")
)
csv_string <- chat$chat(data_description, echo = FALSE)
readr::read_csv(csv_string, show_col_types = FALSE)
}Now, we can call generate_data(), passing in a description of the data we want to generate.
generate_data("data with two columns, x and y, where x and y are negatively correlated")# A tibble: 10 × 2
x y
<dbl> <dbl>
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
Tool calling
One issue with our function is that it lacks a correction mechanism if the LLM accidentally generates data that doesn’t match our criteria. read_csv() will fail, and then we’ll need to run the function again (or adjust our prompt) until we’re successful.
We can use tool calling to ensure that the LLM generates correctly formatted data and can automatically retry if an error occurs. Generally, you use tool calling to extend the functionality of an LLM. For example, LLMs generally don’t know the time at your location, but we can write an R function that returns that information. Then, we tell the LLM to call that R function and use the result in some way, essentially granting the LLM new skills.
In our example, the LLM can generate data as text but can’t actually read that data into R. We’ll create a tool that reads the generated string of data into R. ellmer will handle retrying if the tool call fails because the LLM generates a string that read_csv() can’t parse.
For more information about tool calling, see Tool Calling.
Define a tool function
First, we need to write a function that takes a CSV-formatted string and reads it into a data frame. We’ll also need to document the function with roxygen2 comments. These comments help the LLM understand how to use our function, just like they would help a human use your function.
#' Reads a CSV string into R using `readr::read_csv()`
#'
#' @param csv_string A single string representing literal data in CSV format, able to be read by `readr::read_csv()`.
#' @return A tibble.
read_csv_string <- function(csv_string) {
read_csv(csv_string, show_col_types = FALSE)
}Register the tool
Next, we need to register our tool. Registering a tool tells the model about the tool and how to use it. With ellmer, you register a tool using the Chat method register_tool() and the function tool().
You don’t need to write the tool registering code yourself. Instead, you can use ellmer::create_tool_def() to generate the register_tool() call. For example:
create_tool_def(read_csv_string)```r
tool(
read_csv_string,
"Reads a CSV string into R using `readr::read_csv()`.",
csv_string = type_string(
"A single string representing literal data in CSV format, able to be read by `readr::read_csv()`."
)
)
```
Now, we can copy and paste the above code into chat$register_tool(). To learn more about defining a tool with tool(), see https://ellmer.tidyverse.org/reference/tool.html.
chat$register_tool(
tool(
read_csv_string,
"Parses a string containing CSV formatted data and reads it into a data frame.",
csv_string = type_string(
"CSV string containing the data to be read."
)
)
)Here’s the full script:
library(ellmer)
library(readr)
#' Reads a CSV string into R using `readr::read_csv()`
#'
#' @param csv_string A single string representing literal data in CSV format, able to be read by `readr::read_csv()`.
#' @return A tibble.
read_csv_string <- function(csv_string) {
read_csv(csv_string, show_col_types = FALSE)
}
chat <- chat_openai(
model = "gpt-4o",
system_prompt = read_lines("prompt-csv-string-tool.md")
)
chat$register_tool(tool(
read_csv_string,
"Parses a string containing CSV formatted data and reads it into a data frame.",
csv_string = type_string(
"CSV string containing the data to be read."
)
))
chat$chat("tax data by state, with three columns", echo = FALSE) Update the system prompt
The final step is to update our prompt to instruct the LLM when to use the tool. Because our tool is pretty simple, we could probably get away without doing this step, but it’s good practice and will make it more likely that the model behaves as expected.
System prompt:
Generate tabular data based on the user's request. Limit the data to 10 rows unless the user specifically requests more.
Create a string of data in a format that is readable by the R function `readr::read_csv()`. For example:
TransactionID, Make, Model, Year, Price, Salesperson
001, Toyota, Camry, 2020, 24000, John Smith
002, Honda, Accord, 2019, 22000, Jane Doe
003, Ford, Explorer, 2021, 32000, Emily Johnson
004, Chevrolet, Malibu, 2020, 23000, Michael Brown
005, Nissan, Altima, 2021, 25000, Sarah Davis
006, Hyundai, Elantra, 2019, 19000, David Wilson
007, BMW, X3, 2020, 42000, Mary White
008, Audi, A4, 2021, 38000, Robert Martinez
009, Subaru, Outback, 2020, 27000, James Lee
010, Kia, Soul, 2021, 21000, Linda Thompson
Do not include any text formatting in the data (e.g., no backticks). Do not include any other information with the dataset.
Then, pass that string of data to the function `read_csv_string()`. This function will use `readr::read_csv()` to read the string into R. The function returns a tibble.
Treat a tibble response from `read_csv_string()` as a success. If the function throws an error, treat that as a failure and re-call the function with a different string of data. Alone, however, our script isn’t very helpful. The LLM generates data, but it doesn’t get saved anywhere. chat$chat() returns as a string, so if we ask the LLM to return our table we’re going to run into the same issue as before: we need a way to read it into R.
To allow the data to persist, we could write it to a file or database. If we’re using our tool in a Shiny app, we can update a reactive value to the generated data. Let’s see how that would work.
Generate data from a Shiny app
Tip
Experiment with the deployed app here.
To embed our data generation functionality in a Shiny app, we will:
- Create an input to capture the user’s dataset description, and then tell the model about that input.
- Edit our tool function to assign the newly created dataset to a reactive value. This will allow us to use the newly created dataset in various places in the Shiny app.
- Update our prompt.
First, let’s manage the user input. In the app UI, we’ll need some kind of text input and a button that triggers the data generation.
# Define the app UI
ui <- page_sidebar(
sidebar = sidebar(
textAreaInput("data_description", "Describe the data you want to generate."),
actionButton("btn_generate", "Generate data")
)
# Rest of the UI
)Now, in the server function we’ll create a Chat and then send the user’s dataset description to the chatbot when the user clicks the button.
# Server function
server <- function(input, output, session) {
chat <- chat_openai(
model = "gpt-4o",
system_prompt = read_lines("prompt-app.md")
)
# When btn_generate is clicked, send data_description to the LLM
observeEvent(input$btn_generate, {
chat$chat(input$data_description)
})
# Rest of the server function
}
shinyApp(ui, server)Next, we need to add our tool-calling functionality. Just like before, we need to define the tool function and then register the tool. This will look similar to how we defined a tool earlier, with one change: we’ll update a reactive value to the created dataset.
# Server function
server <- function(input, output, session) {
# Create a reactive value for the data
data_rv <- reactiveVal()
# Define tools -------
#' Reads a CSV string into R using `readr::read_csv()`
#'
#' @param csv_string A single string representing literal data in CSV format, able to be read by `readr::read_csv()`.
#' @return A tibble.
read_csv_string <- function(csv_string) {
df <- read_csv(csv_string, show_col_types = FALSE)
# Update reactive value
data_rv(df)
}
chat <- chat_openai(
model = "gpt-4o",
system_prompt = read_lines("prompt-app.md")
)
# Register tools -------
chat$register_tool(tool(
read_csv_string,
"Parses a string containing CSV formatted data and reads it into a data frame.",
csv_string = type_string(
"CSV string containing the data to be read."
)
))
# Rest of the server function
# ...
}Now, we’ll be able to use the generated dataset in the rest of our app. We’ll add functionality to download the data as a CSV, view the data in a table, and visualize the data in a plot.
We’ll need to make a few changes to the prompt, including telling the app how to use the plotting tool. You can see the full prompt here.
Other strategies
We relied on the LLM’s native data generation capabilities, asking the model to create a string of data. This works pretty well for the small examples shown here, but you may want more control over the data, care about reproducibility, or have specific requirements for the data that the LLM doesn’t always get right.
Code generation
Another option is to ask the LLM to generate R code that creates data, which provides a lot of flexibility but may take careful prompt design. Alternatively, you can have the LLM generate arguments to an R function (e.g., arguments to rnorm() or to functions from the charlatan package) to maintain greater control over the data structure.
Below is a function that evaluates code passed as a string. You can see the full example here.
#' Evaluates R code that simulates data.
#'
#' @param code A single string containing valid R code that generates data.
#' @return A tibble.
create_data <- function(code) {
tryCatch(
{
df <- eval(parse(text = code))
},
error = function(err) {
stop(err)
}
)
stopifnot(is.data.frame(df))
write_csv(df, "data-from-code.csv")
}If you want to store the data in a database, you could also ask the LLM to write SQL code to generate the data and write to a database.
con <- dbConnect(duckdb(), dbdir = ":memory:", read_only = FALSE)
#' Executes a SQL query to create a new table with simulated data
#'
#' @param query A DuckDB SQL query; must be a CREATE statement.
#' @returns A string representing the query
create_data <- function(query) {
tryCatch(
{
dbExecute(con, query)
},
error = function(err) {
stop(err)
}
)
query
}
chat <- chat_openai(
model = "gpt-4o",
system_prompt = read_lines(here::here("prompts/prompt-sql.md"))
)
chat$register_tool(tool(
create_data,
"Executes a DuckDB SQL query that creates a new table with simulated data.",
query = type_string(
"A string contained valid DuckDB SQL that generates data. The query must be a CREATE statement."
)
))See the full example and prompt.
Caution
Letting an LLM generate and execute code provides flexibility, but can be risky. Be careful to clearly specify what code you want and don’t want the model to run. Set guardrails in your prompts, check the code before executing, and put additional safeguards in place whenever possible.
Structured data
If you know the column names and types of data you want to generate, you could also use structured data. The structured data feature allows you to define the elements of data you want to extract from text or images. However, this only works if you know the number of columns and the types that you want to extract.
chat <- chat_openai(
model = "gpt-4o",
system_prompt = "Given a description, generate structured data."
)
response <-
chat$chat(
"data with 2 columns, x and y. x should have a normal distribution and y should contain random strings.",
echo = FALSE
)
df <-
chat$extract_data(
response,
type = type_array(
items = type_object(
x = type_number(),
y = type_string()
)
)
)Recap
Data generation is useful in many contexts and is a great, relatively straightforward, use of an LLM API. In this blog post, you saw how to create tools with ellmer to generate data.
See all the code and prompts used in the examples here.
In these examples, we limit the size of the generated datasets to minimize API usage costs. However, for your own work, you can adjust the prompts to generate larger datasets as needed.
Note
For more AI examples, check out the Posit AI Connect Cloud portfolio.