2 changes: 1 addition & 1 deletion ai-engineering/create.mdx
@@ -9,7 +9,7 @@ import { definitions } from '/snippets/definitions.mdx'

The **Create** stage is about defining a new AI <Tooltip tip={definitions.Capability}>capability</Tooltip> as a structured, version-able asset in your codebase. The goal is to move away from scattered, hard-coded string prompts and toward a more disciplined and organized approach to prompt engineering.

### Defining a capability as a prompt object
### Define a capability as a prompt object

In Axiom AI engineering, every capability is represented by a `Prompt` object. This object serves as the single source of truth for the capability’s logic, including its messages, metadata, and the schema for its arguments.

235 changes: 192 additions & 43 deletions ai-engineering/measure.mdx
@@ -1,85 +1,234 @@
---
title: "Measure"
description: "Learn how to measure the quality of your AI capabilities by running evaluations against ground truth data."
keywords: ["ai engineering", "AI engineering", "measure", "evals", "evaluation", "scoring", "graders"]
keywords: ["ai engineering", "AI engineering", "measure", "evals", "evaluation", "scoring", "scorers", "graders", "scores"]
---

import { Badge } from "/snippets/badge.jsx"
import { definitions } from '/snippets/definitions.mdx'
import { definitions } from "/snippets/definitions.mdx"

<Warning>
The evaluation framework described here is in active development. Axiom is working with design partners to shape what’s built. [Contact Axiom](https://www.axiom.co/contact) to get early access and join a focused group of teams shaping these tools.
</Warning>

The **Measure** stage is where you quantify the quality and effectiveness of your AI <Tooltip tip={definitions.Capability}>capability</Tooltip>. Instead of relying on anecdotal checks, this stage uses a systematic process called an <Tooltip tip={definitions.Eval}>eval</Tooltip> to score your capability’s performance against a known set of correct examples (<Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>). This provides a data-driven benchmark to ensure a capability is ready for production and to track its quality over time.

## The `Eval` function
Evaluations (evals) are systematic tests that measure how well your AI features perform. Instead of manually testing AI outputs, evals automatically run your AI code against test datasets and score the results using custom metrics. This lets you catch regressions, compare different approaches, and confidently improve your AI features over time.

<Badge>Coming soon</Badge> The primary tool for the Measure stage is the `Eval` function, which will be available in the `axiom/ai` package. It provides a simple, declarative way to define a test suite for your capability directly in your codebase.
## Prerequisites

An `Eval` is structured around a few key parameters:
Follow the [Quickstart](/ai-engineering/quickstart):
- To run evals within an existing AI app, complete the instrumentation setup described in the Quickstart.
- To run evals without an existing AI app, skip the part of the Quickstart about instrumenting your app.

* `data`: An async function that returns your `collection` of `{ input, expected }` pairs, which serve as your ground truth.
* `task`: The function that executes your AI capability, taking an `input` and producing an `output`.
* `scorers`: An array of `grader` functions that score the `output` against the `expected` value.
* `threshold`: A score between 0 and 1 that determines the pass/fail condition for the evaluation.
## Write an evaluation function

Here is an example of a complete evaluation suite:
The `Eval` function provides a simple, declarative way to define a test suite for your capability directly in your codebase.

```ts /evals/text-match.eval.ts
import { Levenshtein } from 'autoevals';
import { Eval } from 'axiom/ai/evals';
The key parameters of the `Eval` function:
> **Contributor comment:** note to self: need to better document configFlags

Eval('text-match-eval', {
// 1. Your ground truth dataset
data: async () => {
return [
- `data`: An async function that returns your collection of `{ input, expected }` pairs, which serve as your ground truth.
- `task`: The function that executes your AI capability, taking an `input` and producing an `output`.
- `scorers`: An array of scorer functions that score the `output` against the `expected` value.
- `metadata`: Optional metadata for the evaluation, such as a description.

The example below creates an evaluation for a support ticket classification system in the file `/src/evals/ticket-classification.eval.ts`.

```ts /src/evals/ticket-classification.eval.ts expandable
import { experimental_Eval as Eval, Scorer } from 'axiom/ai/evals';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { flag, pickFlags } from '../lib/app-scope';
import { z } from 'zod';

// The function you want to evaluate
async function classifyTicket({ subject, content }: { subject?: string; content: string }) {
const model = flag('ticketClassification.model');

const result = await generateObject({
model: wrapAISDKModel(openai(model)),
messages: [
{
input: 'test',
expected: 'hi, test!',
role: 'system',
content: `You are a customer support engineer classifying tickets as: spam, question, feature_request, or bug_report.
If spam, return a polite auto-close message. Otherwise, say a team member will respond shortly.`,
},
{
input: 'foobar',
expected: 'hello, foobar!',
role: 'user',
content: subject ? `Subject: ${subject}\n\n${content}` : content,
},
],
schema: z.object({
category: z.enum(['spam', 'question', 'feature_request', 'bug_report']),
response: z.string()
}),
});

return result.object;
}

// Custom exact-match scorer that returns score and metadata
const ExactMatchScorer = Scorer(
'Exact-Match',
({ output, expected }: { output: { response: string }; expected: { response: string } }) => {
const normalizedOutput = output.response.trim().toLowerCase();
const normalizedExpected = expected.response.trim().toLowerCase();

return {
score: normalizedOutput === normalizedExpected,
metadata: {
details: 'A scorer that checks for exact match',
},
};
});
}
);

// Custom spam classification scorer
const SpamClassificationScorer = Scorer(
"Spam-Classification",
({ output, expected }: {
output: { category: string };
expected: { category: string };
}) => {
const isSpam = (item: { category: string }) => item.category === "spam";
return isSpam(output) === isSpam(expected) ? 1 : 0;
}
);

// Define the evaluation
Eval('spam-classification', {
// Specify which flags this eval uses
configFlags: pickFlags('ticketClassification'),
> **Contributor comment:** This is only defined / explained further down the page. I understand why, and don't really have a better solution, but still feels weird.

// Test data with input/expected pairs
data: () => [
{
input: {
subject: "Congratulations! You've Been Selected for an Exclusive Reward",
content: 'Claim your $500 gift card now by clicking this link!',
},
expected: {
category: 'spam',
response: "We're sorry, but your message has been automatically closed.",
},
},
{
input: {
subject: 'FREE CA$H',
content: 'BUY NOW ON WWW.BEST-DEALS.COM!',
},
expected: {
category: 'spam',
response: "We're sorry, but your message has been automatically closed.",
},
];
},
],

// The task to run for each test case
task: async ({ input }) => {
return await classifyTicket(input);
},

// 2. The task that runs your capability
task: async (input: string) => {
return `hi, ${input}!`;

// Scorers to measure performance
scorers: [SpamClassificationScorer, ExactMatchScorer],

// Optional metadata
metadata: {
description: 'Classify support tickets as spam or not spam',
},
});
```

## Set up flags

// 3. The scorers that grade the output
scorers: [Levenshtein],
Create the file `src/lib/app-scope.ts`:

// 4. The pass/fail threshold for the scores
threshold: 1,
```ts /src/lib/app-scope.ts
import { createAppScope } from 'axiom/ai/evals';
import { z } from 'zod';

export const flagSchema = z.object({
ticketClassification: z.object({
model: z.string().default('gpt-4o-mini'),
}),
});

const { flag, pickFlags } = createAppScope({ flagSchema });

export { flag, pickFlags };
```

## Grading with scorers
## Run evaluations

<Badge>Coming soon</Badge> A <Tooltip tip={definitions.Grader}>grader</Tooltip> is a function that scores a capability’s output. Axiom will provide a library of built-in scorers for common tasks (e.g., checking for semantic similarity, factual correctness, or JSON validity). You can also provide your own custom functions to measure domain-specific logic. Each scorer receives the `input`, the generated `output`, and the `expected` value, and must return a score.
To run your evaluation suites from your terminal, [install the Axiom CLI](/reference/cli) and use the following commands.

## Running evaluations
| Description | Command |
| ----------- | ------- |
| Run all evals | `axiom eval` |
| Run specific eval file | `axiom eval src/evals/ticket-classification.eval.ts` |
| Run evals matching a glob pattern | `axiom eval "**/*spam*.eval.ts"` |
| Run eval by name | `axiom eval "spam-classification"` |
| List available evals without running | `axiom eval --list` |

<Badge>Coming soon</Badge> You will run your evaluation suites from your terminal using the `axiom` CLI.
## Analyze results in the Console

```bash
axiom run evals/text-match.eval.ts
```
When you run an eval, the Axiom AI SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. Axiom enriches the traces with `eval.*` attributes, allowing you to analyze results in depth in the Axiom Console.

The results of an eval include:
- Pass/fail status for each test case
- Scores from each scorer
- Comparison to baseline (if available)
- Links to view detailed traces in Axiom

The Console features leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.

## Additional configuration options

### Custom scorers

A scorer is a function that scores a capability’s output. Scorers receive the `input`, the generated `output`, and the `expected` value, and return a score.

The example above uses two custom scorers. Scorers can return metadata alongside the score.

This command will execute the specified test file using `vitest` in the background. Note that `vitest` will be a peer dependency for this functionality.
You can use the [`autoevals` library](https://github.com/braintrustdata/autoevals) instead of custom scorers. `autoevals` provides prebuilt scorers for common tasks like semantic similarity, factual correctness, and text matching.
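
For example, a prebuilt `Levenshtein` scorer can go straight into the `scorers` array. The following is a minimal sketch adapted from an earlier version of this page's text-matching example; the file name and eval name are illustrative, and it assumes `autoevals` scorers can be passed to `Eval` directly:

```ts /src/evals/text-match.eval.ts
import { experimental_Eval as Eval } from 'axiom/ai/evals';
import { Levenshtein } from 'autoevals';

Eval('text-match-eval', {
  // Ground truth pairs of input and expected output
  data: () => [
    { input: 'test', expected: 'hi, test!' },
    { input: 'foobar', expected: 'hello, foobar!' },
  ],
  // A trivial stand-in task; replace with the capability you want to evaluate
  task: async ({ input }: { input: string }) => `hi, ${input}!`,
  // Levenshtein scores how close the output string is to the expected string
  scorers: [Levenshtein],
});
```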

## Analyzing results in the console
### Run experiments

<Badge>Coming soon</Badge> When you run an <Tooltip tip={definitions.Eval}>eval</Tooltip>, the Axiom SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. These traces are enriched with `eval.*` attributes, allowing you to deeply analyze results in the Axiom Console.
Flags let you parameterize your AI behavior (like model choice or prompting strategies) and run experiments with different configurations. They’re type-safe via Zod schemas, and you can override them at runtime.

The Console will feature leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.
The example above uses the `ticketClassification` flag to test different language models. Flags have a default value that you can override at runtime in one of the following ways:

- Override flags directly when you run the eval:

```bash
axiom eval --flag.ticketClassification.model=gpt-4o
```

- Alternatively, specify the flag overrides in a JSON file.

```json experiment.json
{
"ticketClassification": {
"model": "gpt-4o"
}
}
```

Then specify the JSON file as the value of the `--flags-config` option when you run the eval:

```bash
axiom eval --flags-config=experiment.json
```

## What’s next?

Once your capability meets your quality benchmarks in the Measure stage, it’s ready to be deployed. The next step is to monitor its performance with real-world traffic.
A capability is ready to deploy when it meets your quality benchmarks. After deployment, possible next steps include:

- **Baseline comparisons**: Run evals multiple times to track regression over time.
- **Experiment with flags**: Test different models or strategies using flag overrides.
- **Advanced scorers**: Build custom scorers for domain-specific metrics.
- **CI/CD integration**: Add `axiom eval` to your CI pipeline to catch regressions.
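
As a rough sketch of the CI/CD integration item above, a pipeline step can simply run the eval suites; this assumes the Axiom CLI is installed and authenticated in the CI environment, and that a failing run surfaces as a non-zero exit code (check the [CLI reference](/reference/cli) for details):

```bash
# Run every eval file in the repository as part of the build
axiom eval "**/*.eval.ts"
```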

Learn more about this step of the AI engineering workflow in the [Observe](/ai-engineering/observe) docs.
The next step is to monitor your capability’s performance with real-world traffic. To learn more about this step of the AI engineering workflow, see [Observe](/ai-engineering/observe).
4 changes: 2 additions & 2 deletions ai-engineering/observe/manual-instrumentation.mdx
@@ -188,7 +188,7 @@ Example of a properly structured chat completion trace:
```typescript TypeScript expandable
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-ai-app');
const tracer = trace.getTracer('my-app');

// Create a span for the AI operation
return tracer.startActiveSpan('chat gpt-4', {
@@ -233,7 +233,7 @@ from opentelemetry import trace
from opentelemetry.trace import SpanKind
import json

tracer = trace.get_tracer("my-ai-app")
tracer = trace.get_tracer("my-app")

# Create a span for the AI operation
with tracer.start_as_current_span("chat gpt-4", kind=SpanKind.CLIENT) as span: