32 commits
2ae1a63
initial eval docs
c-ehrlich Nov 11, 2025
a082b90
add note about instrumentation fn
c-ehrlich Nov 11, 2025
7df0bdb
Stylistic fixes
manototh Nov 11, 2025
0254557
Quick fixes
manototh Nov 13, 2025
686a53e
Merge branch 'main' into evals-1
manototh Nov 13, 2025
7b8bd25
Fixes
manototh Nov 13, 2025
2251591
Add keywords
manototh Nov 14, 2025
2c662b2
Restructure Measure page
manototh Nov 17, 2025
95d4c5c
Implement review
manototh Nov 17, 2025
55e6bf4
Refactor
manototh Nov 17, 2025
3e3050c
Update measure.mdx
manototh Nov 17, 2025
89ce5ca
Update measure.mdx
manototh Nov 18, 2025
ad26f30
docs: concepts and definitions
dominicchapman Nov 18, 2025
d6a1130
docs: update overview
dominicchapman Nov 18, 2025
55703d9
docs: new evaluate section
dominicchapman Nov 18, 2025
c6d33c1
docs: create, evaluate/overview, remove measure from docs.json
dominicchapman Nov 18, 2025
9a26814
docs: revise iterate
dominicchapman Nov 18, 2025
528cf1f
docs: refinement
dominicchapman Nov 18, 2025
aad93a6
Update ai-engineering/concepts.mdx
dominicchapman Nov 20, 2025
f62b30d
docs: explain benefits of pickFlags
dominicchapman Nov 20, 2025
58becf6
docs: less focus on temperature
dominicchapman Nov 20, 2025
687548f
docs: remove duplicated content
dominicchapman Nov 20, 2025
cd7856e
docs: remove 'reference' from concepts
dominicchapman Nov 20, 2025
42890d6
docs: add model example to enum
dominicchapman Nov 20, 2025
1b6c8e7
docs: remove watch mode
dominicchapman Nov 20, 2025
3b8e48f
docs: remove marketing fluff
dominicchapman Nov 20, 2025
e6e5c6c
docs: evaluator > evaluation
dominicchapman Nov 20, 2025
93bb44b
docs: default flags to production config
dominicchapman Nov 20, 2025
4ff0249
docs: update concepts for completeness
dominicchapman Nov 20, 2025
aad74a6
docs: run-id feedback
dominicchapman Nov 20, 2025
6875722
Merge branch 'main' into dominic/evals-plus-wider-edits-v2
dominicchapman Nov 20, 2025
fd2ac49
update `createAppScope` import
c-ehrlich Nov 25, 2025
53 changes: 33 additions & 20 deletions ai-engineering/concepts.mdx
@@ -1,7 +1,7 @@
---
title: "Concepts"
description: "Learn about the core concepts in AI engineering: Capabilities, Collections, Graders, Evals, and more."
keywords: ["ai engineering", "AI engineering", "concepts", "capability", "grader", "eval"]
description: "Learn about the core concepts in AI engineering: Capabilities, Collections, Evals, Scorers, Annotations, and User Feedback."
keywords: ["ai engineering", "AI engineering", "concepts", "capability", "collection", "eval", "scorer", "annotations", "feedback", "flags"]
---

import { definitions } from '/snippets/definitions.mdx'
@@ -17,10 +17,10 @@ The concepts in AI engineering are best understood within the context of the dev
Development starts by defining a task and prototyping a <Tooltip tip={definitions.Capability}>capability</Tooltip> with a prompt to solve it.
</Step>
<Step title="Evaluate with ground truth">
The prototype is then tested against a <Tooltip tip={definitions.Collection}>collection</Tooltip> of reference examples (so called <Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>) to measure its quality and effectiveness using <Tooltip tip={definitions.Grader}>graders</Tooltip>. This process is known as an <Tooltip tip={definitions.Eval}>eval</Tooltip>.
The prototype is then tested against a <Tooltip tip={definitions.Collection}>collection</Tooltip> of reference examples (so-called "<Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>") to measure its quality and effectiveness using <Tooltip tip={definitions.Scorer}>scorers</Tooltip>. This process is known as an <Tooltip tip={definitions.Eval}>eval</Tooltip>.
</Step>
<Step title="Observe in production">
Once a capability meets quality benchmarks, it’s deployed. In production, graders can be applied to live traffic (<Tooltip tip={definitions.OnlineEval}>online evals</Tooltip>) to monitor performance and cost in real-time.
Once a capability meets quality benchmarks, it’s deployed. In production, scorers can be applied to live traffic (<Tooltip tip={definitions.OnlineEval}>online evals</Tooltip>) to monitor performance and cost in real-time.
</Step>
<Step title="Iterate with new insights">
Insights from production monitoring reveal edge cases and opportunities for improvement. These new examples are used to refine the capability, expand the ground truth collection, and begin the cycle anew.
@@ -33,40 +33,53 @@ The concepts in AI engineering are best understood within the context of the dev

A generative AI capability is a system that uses large language models to perform a specific task by transforming inputs into desired outputs.

Capabilities exist on a spectrum of complexity. They can be a simple, single-step function (for example, classifying a support ticket’s intent) or evolve into a sophisticated, multi-step agent that uses reasoning and tools to achieve a goal (for example, orchestrating a complete customer support resolution).
Capabilities exist on a spectrum of complexity, ranging from simple to sophisticated architectures:

- **Single-turn model interactions**: A single prompt and response, such as classifying a support ticket’s intent or summarizing a document.
- **Workflows**: Multi-step processes where each step’s output feeds into the next, such as research → analysis → report generation.
- **Single-agent**: An agent that can reason and make decisions to accomplish a goal, such as a customer support agent that can search documentation, check order status, and draft responses.
- **Multi-agent**: Multiple specialized agents collaborating on a complex problem, such as tackling software engineering through architectural planning, coding, testing, and review.

### Collection

A collection is a curated set of reference records used for development, testing, and evaluation of a capability. Collections serve as the test cases for prompt engineering.

### Record

Records are the individual input-output pairs within a collection. Each record consists of an input and its corresponding expected output (ground truth).

### Reference
### Collection record

A reference is a historical example of a task completed successfully, serving as a benchmark for AI performance. References provide the input-output pairs that demonstrate the expected behavior and quality standards.
Collection records are the individual input-output pairs within a collection. Each record consists of an input and its corresponding expected output (ground truth).

### Ground truth

Ground truth is the validated, expert-approved correct output for a given input. It represents the gold standard that the AI capability should aspire to match.

### Annotation
### Scorer

A scorer is a function that evaluates a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score.
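
As a rough sketch (illustrative only, not Axiom's scorer API — the function name and argument shapes are assumptions), a scorer can be as small as a function that compares a capability's output to the expected value and returns a score:

```ts
// Hypothetical exact-match scorer for a ticket-classification capability.
type Classification = { category: string };

function exactMatchScorer(output: Classification, expected: Classification) {
  return {
    name: 'exact-match',
    // 1 if the predicted category matches ground truth, otherwise 0
    score: output.category === expected.category ? 1 : 0,
  };
}
```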

### Evaluation or "eval"

Annotations are expert-provided labels, corrections, or outputs added to records to establish or refine ground truth.
An evaluation, or eval, is the process of testing a capability against a collection of ground truth data using one or more scorers. An eval runs the capability on every record in the collection and reports metrics like accuracy, pass-rate, and cost. Evals are typically run before deployment to benchmark performance.
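
Conceptually, an eval is a loop over the collection that applies a scorer to each result and aggregates the scores. The sketch below is a simplified illustration, not Axiom's eval runner; all names and shapes are assumptions.

```ts
// Hypothetical eval loop over a collection of records.
type EvalRecord<I, O> = { input: I; expected: O };

async function runEval<I, O>(
  capability: (input: I) => Promise<O>,
  scorer: (output: O, expected: O) => number,
  records: EvalRecord<I, O>[],
) {
  let total = 0;
  for (const record of records) {
    const output = await capability(record.input);
    total += scorer(output, record.expected); // e.g. 1 for pass, 0 for fail
  }
  return { passRate: total / records.length };
}
```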

### Grader
### Flag

A grader is a function that scores a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score or judgment. Graders are the reusable, atomic scoring logic used in all forms of evaluation.
A flag is a configuration parameter that controls how your AI capability behaves. Flags let you parameterize aspects like model choice, tool availability, prompting strategies, or retrieval approaches. By defining flags, you can run experiments to compare different configurations and systematically determine which approach performs best.
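
As a loose illustration (plain TypeScript, not Axiom's flag API), flags are simply named parameters that your capability reads instead of hard-coding:

```ts
// Hypothetical flag values for one configuration of a capability.
// An experiment would run the same eval with a different set of values.
const flags = {
  model: 'gpt-4o-mini',      // which model to call
  useRetrieval: false,       // whether to fetch context before answering
  promptVariant: 'concise',  // which system-prompt strategy to use
} as const;
```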

### Evaluator (eval)
### Experiment

An experiment is an evaluation run with a specific set of flag values. By running multiple experiments with different flag configurations, you can compare performance across different models, prompts, or strategies to find the optimal setup for your capability.

### Online evaluation

An online evaluation is the process of applying a scorer to a capability’s live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement.

### Annotation

An evaluator, or eval, is the process of testing a capability against a collection of ground truth data using one or more graders. An eval runs the capability on every record in the collection and reports metrics like accuracy, pass-rate, and cost. Evals are typically run before deployment to benchmark performance.
Annotations are expert-provided observations, labels, or corrections added to production traces or evaluation results. Domain experts review AI capability runs to document what went wrong or what should have happened differently, and to categorize failure modes. These annotations help identify patterns in capability failures, validate scorer accuracy, and create new test cases for collections.

### Online eval
### User feedback

An online eval is the process of applying a grader to a capability’s live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement.
User feedback is direct signal from end users about AI capability performance, typically collected through ratings (thumbs up/down, stars) or text comments. Feedback events are associated with traces to provide context about both system behavior and user perception. Aggregated feedback reveals quality trends, helps prioritize improvements, and surfaces issues that might not appear in evaluations.
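
As a hypothetical example of the signal involved (not Axiom's feedback API — the field names are assumptions), a feedback event pairs a rating and comment with the trace it refers to:

```ts
// Illustrative shape of a user feedback event tied to a trace.
const feedbackEvent = {
  traceId: 'trace_abc123',         // links the feedback to a capability run
  rating: 'thumbs_down' as const,  // thumbs up/down, stars, etc.
  comment: 'The answer missed the main action item.',
  receivedAt: new Date().toISOString(),
};
```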

## What’s next?

Now that you understand the core concepts, see them in action in the AI engineering [workflow](/ai-engineering/quickstart).
Now that you understand the core concepts, get started with the [Quickstart](/ai-engineering/quickstart) or dive into [Evaluate](/ai-engineering/evaluate/overview) to learn about systematic testing.
210 changes: 118 additions & 92 deletions ai-engineering/create.mdx
@@ -1,133 +1,159 @@
---
title: "Create"
description: "Learn how to create and define AI capabilities using structured prompts and typed arguments with Axiom."
keywords: ["ai engineering", "AI engineering", "create", "prompt", "template", "schema"]
description: "Build AI capabilities using any framework, with best support for TypeScript-based tools."
keywords: ["ai engineering", "create", "prompt", "capability", "vercel ai sdk"]
---

import { Badge } from "/snippets/badge.jsx"
import { definitions } from '/snippets/definitions.mdx'

The **Create** stage is about defining a new AI <Tooltip tip={definitions.Capability}>capability</Tooltip> as a structured, version-able asset in your codebase. The goal is to move away from scattered, hard-coded string prompts and toward a more disciplined and organized approach to prompt engineering.
Building an AI <Tooltip tip={definitions.Capability}>capability</Tooltip> starts with prototyping. You can use whichever framework you prefer. Axiom is focused on helping you evaluate and observe your capabilities rather than prescribing how to build them.

TypeScript-based frameworks like Vercel’s [AI SDK](https://sdk.vercel.ai) do integrate most seamlessly with Axiom’s tooling today, but that’s likely to evolve over time.

## Build your capability

Define your capability using your framework of choice. Here’s an example using Vercel's [AI SDK](https://ai-sdk.dev/), which includes [many examples](https://sdk.vercel.ai/examples) covering different capability design patterns. Popular alternatives like [Mastra](https://mastra.ai) also exist.

```ts src/lib/capabilities/classify-ticket.ts expandable
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { z } from 'zod';

export async function classifyTicket(input: {
  subject?: string;
  content: string;
}) {
  const result = await generateObject({
    model: wrapAISDKModel(openai('gpt-4o-mini')),
    messages: [
      {
        role: 'system',
        content: 'Classify support tickets as: question, bug_report, or feature_request.',
      },
      {
        role: 'user',
        content: input.subject
          ? `Subject: ${input.subject}\n\n${input.content}`
          : input.content,
      },
    ],
    schema: z.object({
      category: z.enum(['question', 'bug_report', 'feature_request']),
      confidence: z.number().min(0).max(1),
    }),
  });

### Defining a capability as a prompt object
  return result.object;
}
```

In Axiom AI engineering, every capability is represented by a `Prompt` object. This object serves as the single source of truth for the capability’s logic, including its messages, metadata, and the schema for its arguments.
The `wrapAISDKModel` function instruments your model calls for Axiom’s observability features. Learn more in the [Observe](/ai-engineering/observe) section.

For now, these `Prompt` objects can be defined and managed as TypeScript files within your own project repository.
## Gather reference examples

A typical `Prompt` object looks like this:
As you prototype, collect examples of inputs and their correct outputs.

```ts
const referenceExamples = [
  {
    input: {
      subject: 'How do I reset my password?',
      content: 'I forgot my password and need help.'
    },
    expected: { category: 'question' },
  },
  {
    input: {
      subject: 'App crashes on startup',
      content: 'The app immediately crashes when I open it.'
    },
    expected: { category: 'bug_report' },
  },
];
```

```ts /src/prompts/email-summarizer.prompt.ts
These become your ground truth for evaluation. Learn more in the [Evaluate](/ai-engineering/evaluate/overview) section.

## Structured prompt management

<Note>
The features below are experimental. Axiom’s current focus is on the evaluation and observability stages of the AI engineering workflow.
</Note>

For teams wanting more structure around prompt definitions, Axiom’s SDK includes experimental utilities for managing prompts as versioned objects.

### Define prompts as objects

Represent capabilities as structured `Prompt` objects:

```ts src/prompts/ticket-classifier.prompt.ts
import {
experimental_Type,
type experimental_Prompt
} from 'axiom/ai';

export const emailSummarizerPrompt = {
name: "Email Summarizer",
slug: "email-summarizer",
export const ticketClassifierPrompt = {
name: "Ticket Classifier",
slug: "ticket-classifier",
version: "1.0.0",
model: "gpt-4o",
model: "gpt-4o-mini",
messages: [
{
role: "system",
content:
`Summarize emails concisely, highlighting action items.
The user is named {{ username }}.`,
content: "Classify support tickets as: {{ categories }}",
},
{
role: "user",
content: "Please summarize this email: {{ email_content }}",
content: "{{ ticket_content }}",
},
],
arguments: {
username: experimental_Type.String(),
email_content: experimental_Type.String(),
categories: experimental_Type.String(),
ticket_content: experimental_Type.String(),
},
} satisfies experimental_Prompt;
```

### Strongly-typed arguments with `Template`

To ensure that prompts are used correctly, the Axiom’s AI SDK includes a `Template` type system (exported as `Type`) for defining the schema of a prompt’s `arguments`. This provides type safety, autocompletion, and a clear, self-documenting definition of what data the prompt expects.

The `arguments` object uses `Template` helpers to define the shape of the context:

```typescript /src/prompts/report-generator.prompt.ts
import {
experimental_Type,
type experimental_Prompt
} from 'axiom/ai';

export const reportGeneratorPrompt = {
// ... other properties
arguments: {
company: experimental_Type.Object({
name: experimental_Type.String(),
isActive: experimental_Type.Boolean(),
departments: experimental_Type.Array(
experimental_Type.Object({
name: experimental_Type.String(),
budget: experimental_Type.Number(),
})
),
}),
priority: experimental_Type.Union([
experimental_Type.Literal("high"),
experimental_Type.Literal("medium"),
experimental_Type.Literal("low"),
]),
},
} satisfies experimental_Prompt;
### Type-safe arguments

The `experimental_Type` system provides type safety for prompt arguments:

```ts
arguments: {
  user: experimental_Type.Object({
    name: experimental_Type.String(),
    preferences: experimental_Type.Array(experimental_Type.String()),
  }),
  priority: experimental_Type.Union([
    experimental_Type.Literal("high"),
    experimental_Type.Literal("medium"),
    experimental_Type.Literal("low"),
  ]),
}
```

You can even infer the exact TypeScript type for a prompt’s context using the `InferContext` utility.

### Prototyping and local testing
### Local testing

Before using a prompt in your application, you can test it locally using the `parse` function. This function takes a `Prompt` object and a `context` object, rendering the templated messages to verify the output. This is a quick way to ensure your templating logic is correct.
Test prompts locally before using them:

```typescript
```ts
import { experimental_parse } from 'axiom/ai';
import {
reportGeneratorPrompt
} from './prompts/report-generator.prompt';

const context = {
company: {
name: 'Axiom',
isActive: true,
departments: [
{ name: 'Engineering', budget: 500000 },
{ name: 'Marketing', budget: 150000 },
],
},
priority: 'high' as const,
};

// Render the prompt with the given context
const parsedPrompt = await experimental_parse(
reportGeneratorPrompt, { context }
);

console.log(parsedPrompt.messages);
// [
// {
// role: 'system',
// content: 'Generate a report for Axiom.\nCompany Status: Active...'
// }
// ]
```

### Managing prompts with Axiom
const parsed = await experimental_parse(ticketClassifierPrompt, {
  context: {
    categories: 'question, bug_report, feature_request',
    ticket_content: 'How do I reset my password?',
  },
});

To enable more advanced workflows and collaboration, Axiom is building tools to manage your prompt assets centrally.
console.log(parsed.messages);
```

* <Badge>Coming soon</Badge> The `axiom` CLI will allow you to `push`, `pull`, and `list` prompt versions directly from your terminal, synchronizing your local files with the Axiom platform.
* <Badge>Coming soon</Badge> The SDK will include methods like `axiom.prompts.create()` and `axiom.prompts.load()` for programmatic access to your managed prompts. This will be the foundation for A/B testing, version comparison, and deploying new prompts without changing your application code.
These utilities help organize prompts in your codebase. Centralized prompt management and versioning features may be added in future releases.

### Whats next?
## What's next?

Now that you’ve created and structured your capability, the next step is to measure its quality against a set of known good examples.
Once you have a working capability and reference examples, systematically evaluate its performance.

Learn more about this step of the AI engineering workflow in the [Measure](/ai-engineering/measure) docs.
To learn how to set up and run evaluations, see [Evaluate](/ai-engineering/evaluate/overview).