When using an LLM to extract data from text or images, you can ask
the chatbot to format it in JSON or any other format that you like. This
works well most of the time, but there’s no guarantee that you’ll get
the exact format you want. In particular, if you’re trying to get JSON,
you’ll find that it’s typically wrapped in a ```json code fence, and
you’ll occasionally get text that isn’t valid JSON. To avoid these
problems, you can use a recent LLM feature: structured
data (aka structured output). With structured data, you supply
the type specification that defines the object structure you want and
the LLM ensures that’s what you’ll get back.
To extract structured data, you call the $chat_structured()
method instead of $chat(). You’ll also need to define a
type specification that describes the structure of the data that you
want (more on that shortly). Here’s a simple example that extracts two
specific values from a string:
chat <- chat_openai()
#> Using model = "gpt-4.1".
chat$chat_structured(
  "My name is Susan and I'm 13 years old",
  type = type_object(
    name = type_string(),
    age = type_number()
  )
)
#> $name
#> [1] "Susan"
#>
#> $age
#> [1] 13
The same basic idea works with images too:
chat$chat_structured(
  content_image_url("https://www.r-project.org/Rlogo.png"),
  type = type_object(
    primary_shape = type_string(),
    primary_colour = type_string()
  )
)
#> $primary_shape
#> [1] "a large letter 'R' with an oval/ring shape behind it"
#>
#> $primary_colour
#> [1] "blue and gray"
If you need to extract data from multiple prompts, you can use the
same techniques with parallel_chat_structured()
. It takes
the same arguments as $chat_structured()
with two
exceptions: it needs a chat
object since it’s a standalone
function, not a method, and it can take a vector of prompts.
prompts <- list(
  "I go by Alex. 42 years on this planet and counting.",
  "Pleased to meet you! I'm Jamal, age 27.",
  "They call me Li Wei. Nineteen years young.",
  "Fatima here. Just celebrated my 35th birthday last week.",
  "The name's Robert - 51 years old and proud of it.",
  "Kwame here - just hit the big 5-0 this year."
)
parallel_chat_structured(
  chat,
  prompts,
  type = type_object(
    name = type_string(),
    age = type_number()
  )
)
#> [working] (0 + 0) -> 4 -> 2 | ■■■■■■■■■■■ 33%
#> [working] (0 + 0) -> 0 -> 6 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100%
#> name age
#> 1 Alex 42
#> 2 Jamal 27
#> 3 Li Wei 19
#> 4 Fatima 35
#> 5 Robert 51
#> 6 Kwame 50
To define your desired type specification (also known as a
schema), you use the type_()
functions.
(You might already be familiar with these if you’ve done any function
calling, as discussed in vignette("function-calling")
). The
type functions can be divided into three main groups:
Scalars represent five types of single values:
type_boolean(), type_integer(), type_number(),
type_string(), and type_enum(), which represent a single
logical, integer, double, string, and factor value respectively.
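For example, type_enum() constrains the value to a fixed set of options; this sketch mirrors the usage shown later in this article (the option names here are illustrative):

```r
# A value constrained to one of a fixed set of options,
# analogous to a single factor value in R
type_priority <- type_enum("The task priority", c("low", "medium", "high"))
```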
Arrays represent any number of values of the
same type. They are created with type_array()
. You must
always supply the items argument, which specifies the type
for each individual element. Arrays of scalars are very similar to R’s
atomic vectors:
type_logical_vector <- type_array(items = type_boolean())
type_integer_vector <- type_array(items = type_integer())
type_double_vector <- type_array(items = type_number())
type_character_vector <- type_array(items = type_string())
You can also have arrays of arrays and arrays of objects, which more closely resemble lists with well-defined structures.
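For instance, here are two sketches built with the same constructors (the names are illustrative):

```r
# An array of arrays of numbers, like a list of numeric vectors
type_number_lists <- type_array(items = type_array(items = type_number()))

# An array of objects, like a list of named lists with a fixed shape
type_people <- type_array(
  items = type_object(
    name = type_string(),
    age = type_integer()
  )
)
```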
Objects represent a collection of named values.
They are created with type_object()
. Objects can contain
any number of scalars, arrays, and other objects. They are similar to
named lists in R.
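A sketch of an object that mixes all three kinds of components (the field names are illustrative):

```r
# An object containing scalars, an array, and a nested object
type_contact <- type_object(
  name = type_string(),
  hobbies = type_array(items = type_string()),
  address = type_object(
    city = type_string(),
    country = type_string()
  )
)
```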
Using these type specifications ensures that the LLM will return
JSON. But ellmer goes one step further to convert the results to the
closest R analog. Currently, this converts arrays of boolean, integers,
numbers, and strings into logical, integer, numeric, and character
vectors. Arrays of objects are converted into data frames. You can
opt out of this and get plain lists by setting
convert = FALSE
in $chat_structured()
.
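For example, reusing the chat object from above (the prompt is illustrative), this returns a plain list rather than converting the arrays to R vectors:

```r
# With convert = FALSE, arrays come back as plain lists
# instead of logical/integer/numeric/character vectors
chat$chat_structured(
  "Alice, Bob, and Carol are 25, 31, and 47 years old",
  type = type_object(
    names = type_array(items = type_string()),
    ages = type_array(items = type_integer())
  ),
  convert = FALSE
)
```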
In addition to defining types, you need to provide the LLM with some
information about what you actually want. This is the purpose of the
first argument, description
, which is a string that
describes the data that you want. This is a good place to ask nicely for
other attributes you’d like the value to have (e.g. minimum or maximum
values, date formats, …). There’s no guarantee that these requests will
be honoured, but the LLM will usually make a best effort to do so.
type_person <- type_object(
  "A person",
  name = type_string("Name"),
  age = type_integer("Age, in years."),
  hobbies = type_array(
    "List of hobbies. Should be exclusive and brief.",
    items = type_string()
  )
)
Now we’ll dive into some examples before coming back to talk more about the details of data types.
The following examples, which are closely inspired by the Claude documentation, hint at some of the ways you can use structured data extraction.
text <- readLines(system.file("examples/third-party-testing.txt", package = "ellmer"))
# url <- "https://www.anthropic.com/news/third-party-testing"
# html <- rvest::read_html(url)
# text <- rvest::html_text2(rvest::html_element(html, "article"))
type_summary <- type_object(
  "Summary of the article.",
  author = type_string("Name of the article author"),
  topics = type_array(
    'Array of topics, e.g. ["tech", "politics"]. Should be as specific as possible, and can overlap.',
    type_string()
  ),
  summary = type_string("Summary of the article. One or two paragraphs max"),
  coherence = type_integer("Coherence of the article's key points, 0-100 (inclusive)"),
  persuasion = type_number("Article's persuasion score, 0.0-1.0 (inclusive)")
)
chat <- chat_openai()
#> Using model = "gpt-4.1".
data <- chat$chat_structured(text, type = type_summary)
cat(data$summary)
#> This article advocates for the implementation of robust, third-party testing regimes as a foundation of effective AI policy. The central premise is that frontier AI systems, such as large-scale generative models, present unique safety and misuse risks that are not sufficiently addressed by current sector-specific regulations or by company self-governance. As AI models become more capable and ubiquitous, a trustworthy, externally administered system for evaluating their behavior and limiting risks (including election integrity, discrimination, national security, and more) is critical for avoiding societal harm and knee-jerk regulatory responses.
#>
#> The article outlines the key components of an effective testing regime — trusted third parties, well-defined tests, appropriate scoping to avoid stifling innovation, and international coordination. It also addresses challenges concerning open-source models and regulatory capture, advocating for precautions to balance innovation with societal safety. Anthropic, while outlining its own Responsible Scaling Policy, positions itself as a supporter of developing this testing ecosystem through industry, academic, and governmental collaborations. The ultimate policy goal is a minimal, practical, and broadly accepted oversight structure that protects society while keeping the AI field open and competitive.
str(data)
#> List of 5
#> $ author : chr "Anthropic Policy Team (assumed from context)"
#> $ topics : chr [1:10] "AI policy" "third-party testing" "AI safety" "regulation" ...
#> $ summary : chr "This article advocates for the implementation of robust, third-party testing regimes as a foundation of effecti"| __truncated__
#> $ coherence : int 95
#> $ persuasion: num 0.87
text <- "
John works at Google in New York. He met with Sarah, the CEO of
Acme Inc., last week in San Francisco.
"
type_named_entity <- type_object(
  name = type_string("The extracted entity name."),
  type = type_enum("The entity type", c("person", "location", "organization")),
  context = type_string("The context in which the entity appears in the text.")
)
type_named_entities <- type_array(items = type_named_entity)
chat <- chat_openai()
#> Using model = "gpt-4.1".
chat$chat_structured(text, type = type_named_entities)
#> name type context
#> 1 John person An individual who works at Google in New York.
#> 2 Google organization An organization where John works.
#> 3 New York location The location where John works at Google.
#> 4 Sarah person The CEO of Acme Inc. who met with John.
#> 5 Acme Inc. organization The organization for which Sarah is the CEO.
#> 6 San Francisco location The location where John met with Sarah last week.
text <- "
The product was okay, but the customer service was terrible. I probably
won't buy from them again.
"
type_sentiment <- type_object(
  "Extract the sentiment scores of a given text. Sentiment scores should sum to 1.",
  positive_score = type_number("Positive sentiment score, ranging from 0.0 to 1.0."),
  negative_score = type_number("Negative sentiment score, ranging from 0.0 to 1.0."),
  neutral_score = type_number("Neutral sentiment score, ranging from 0.0 to 1.0.")
)
chat <- chat_openai()
#> Using model = "gpt-4.1".
str(chat$chat_structured(text, type = type_sentiment))
#> List of 3
#> $ positive_score: num 0.1
#> $ negative_score: num 0.7
#> $ neutral_score : num 0.2
Note that while we’ve asked nicely for the scores to sum to 1, which they do in this example (at least when I ran the code), this is not guaranteed.
text <- "The new quantum computing breakthrough could revolutionize the tech industry."
type_classification <- type_array(
  "Array of classification results. The scores should sum to 1.",
  type_object(
    name = type_enum(
      "The category name",
      values = c(
        "Politics",
        "Sports",
        "Technology",
        "Entertainment",
        "Business",
        "Other"
      )
    ),
    score = type_number(
      "The classification score for the category, ranging from 0.0 to 1.0."
    )
  )
)
chat <- chat_openai()
#> Using model = "gpt-4.1".
data <- chat$chat_structured(text, type = type_classification)
data
#> name score
#> 1 Technology 0.95
#> 2 Business 0.05
type_characteristics <- type_object(
  "All characteristics",
  .additional_properties = TRUE
)
prompt <- "
Given a description of a character, your task is to extract all the characteristics of that character.
<description>
The man is tall, with a beard and a scar on his left cheek. He has a deep voice and wears a black leather jacket.
</description>
"
chat <- chat_anthropic()
#> Using model = "claude-3-7-sonnet-latest".
str(chat$chat_structured(prompt, type = type_characteristics))
#> list()
This example only works with Claude, not GPT or Gemini, because only Claude supports adding additional, arbitrary properties.
The final example comes from Dan Nguyen (you can see other interesting applications at that link). The goal is to extract structured data from this screenshot:
Even without any descriptions, ChatGPT does pretty well:
type_asset <- type_object(
  assert_name = type_string(),
  owner = type_string(),
  location = type_string(),
  asset_value_low = type_integer(),
  asset_value_high = type_integer(),
  income_type = type_string(),
  income_low = type_integer(),
  income_high = type_integer(),
  tx_gt_1000 = type_boolean()
)
type_assets <- type_array(items = type_asset)
chat <- chat_openai()
#> Using model = "gpt-4.1".
image <- content_image_file("congressional-assets.png")
data <- chat$chat_structured(image, type = type_assets)
data
#> assert_name owner
#> 1 11 Zinfandel Lane - Home & Vineyard [RP] JT
#> 2 25 Point Lobos - Commercial Property [RP] SP
#> location asset_value_low asset_value_high
#> 1 St. Helena/Napa, CA, US 5000001 25000000
#> 2 San Francisco/San Francisco, CA, US 5000001 25000000
#> income_type income_low income_high tx_gt_1000
#> 1 Grape Sales 100001 1000000 FALSE
#> 2 Rent 100001 1000000 FALSE
str(data)
#> 'data.frame': 2 obs. of 9 variables:
#>  $ assert_name     : chr "11 Zinfandel Lane - Home & Vineyard [RP]" "25 Point Lobos - Commercial Property [RP]"
#>  $ owner           : chr "JT" "SP"
#>  $ location        : chr "St. Helena/Napa, CA, US" "San Francisco/San Francisco, CA, US"
#>  $ asset_value_low : int 5000001 5000001
#>  $ asset_value_high: int 25000000 25000000
#>  $ income_type     : chr "Grape Sales" "Rent"
#>  $ income_low      : int 100001 100001
#>  $ income_high     : int 1000000 1000000
#>  $ tx_gt_1000      : logi FALSE FALSE
Now that you’ve seen a few examples, it’s time to get into more specifics about data type declarations.
By default, all components of an object are required. If you want to
make some optional, set required = FALSE
. This is a good
idea if you don’t think your text will always contain the required
fields as LLMs may hallucinate data in order to fulfill your spec.
For example, here the LLM makes up a value for the date field even though there isn’t one in the text:
type_article <- type_object(
  "Information about an article written in markdown",
  title = type_string("Article title"),
  author = type_string("Name of the author"),
  date = type_string("Date written in YYYY-MM-DD format.")
)
prompt <- "
Extract data from the following text:
<text>
# Structured Data
By Hadley Wickham
When using an LLM to extract data from text or images, you can ask the chatbot to nicely format it, in JSON or any other format that you like.
</text>
"
chat <- chat_openai()
#> Using model = "gpt-4.1".
chat$chat_structured(prompt, type = type_article)
#> $title
#> [1] "Structured Data"
#>
#> $author
#> [1] "Hadley Wickham"
#>
#> $date
#> [1] ""
Note that I’ve used a more explicit prompt here. For this example, I found that this generated better results, and that it’s a useful place to put additional instructions.
If I let the LLM know that the fields are all optional, it’ll return
NULL
for the missing fields:
type_article <- type_object(
  "Information about an article written in markdown",
  title = type_string("Article title", required = FALSE),
  author = type_string("Name of the author", required = FALSE),
  date = type_string("Date written in YYYY-MM-DD format.", required = FALSE)
)
chat$chat_structured(prompt, type = type_article)
#> $title
#> [1] "Structured Data"
#>
#> $author
#> [1] "Hadley Wickham"
#>
#> $date
#> NULL
If you want to define a data-frame-like object, you might be tempted to create a definition similar to what R uses: an object (i.e., a named list) containing multiple vectors (i.e., arrays):
type_my_df <- type_object(
  name = type_array(items = type_string()),
  age = type_array(items = type_integer()),
  height = type_array(items = type_number()),
  weight = type_array(items = type_number())
)
This, however, is not quite right because there’s no way to specify that each array should have the same length. Instead, you’ll need to turn the data structure “inside out” and create an array of objects:
type_my_df <- type_array(
  items = type_object(
    name = type_string(),
    age = type_integer(),
    height = type_number(),
    weight = type_number()
  )
)
If you’re familiar with the terms row-oriented and column-oriented data frames, this is the same idea. Since most languages don’t possess vectorisation like R, row-oriented structures tend to be much more common in the wild.
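Concretely, the row-oriented JSON that type_my_df asks the LLM to produce looks something like this (values are illustrative):

```json
[
  {"name": "Ann", "age": 34, "height": 1.70, "weight": 60},
  {"name": "Bo", "age": 28, "height": 1.82, "weight": 75}
]
```

ellmer then converts this array of objects into a data frame with one row per object.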
| provider  | model             | input | output | price |
|-----------|-------------------|------:|-------:|------:|
| OpenAI    | gpt-4.1           |  9325 |   1702 | $0.03 |
| OpenAI    | gpt-4.1-nano      |   483 |    100 | $0.00 |
| Anthropic | claude-3-7-sonnet |  1268 |   1871 | $0.03 |