Add evals to AI engineering #469
Closed
Commits (12):
- 2ae1a63 initial eval docs (c-ehrlich)
- a082b90 add note about instrumentation fn (c-ehrlich)
- 7df0bdb Stylistic fixes (manototh)
- 0254557 Quick fixes (manototh)
- 686a53e Merge branch 'main' into evals-1 (manototh)
- 7b8bd25 Fixes (manototh)
- 2251591 Add keywords (manototh)
- 2c662b2 Restructure Measure page (manototh)
- 95d4c5c Implement review (manototh)
- 55e6bf4 Refactor (manototh)
- 3e3050c Update measure.mdx (manototh)
- 89ce5ca Update measure.mdx (manototh)
Updated file: `measure.mdx`

---
title: "Measure"
description: "Learn how to measure the quality of your AI capabilities by running evaluations against ground truth data."
keywords: ["ai engineering", "AI engineering", "measure", "evals", "evaluation", "scoring", "scorers", "graders", "scores"]
---

import { Badge } from "/snippets/badge.jsx"
import { definitions } from "/snippets/definitions.mdx"

<Warning>
The evaluation framework described here is in active development. Axiom is working with design partners to shape what’s built. [Contact Axiom](https://www.axiom.co/contact) to get early access and join a focused group of teams shaping these tools.
</Warning>

The **Measure** stage is where you quantify the quality and effectiveness of your AI <Tooltip tip={definitions.Capability}>capability</Tooltip>. Instead of relying on anecdotal checks, this stage uses a systematic process called an <Tooltip tip={definitions.Eval}>eval</Tooltip> to score your capability’s performance against a known set of correct examples (<Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>). This provides a data-driven benchmark to ensure a capability is ready for production and to track its quality over time.

Evaluations (evals) are systematic tests that measure how well your AI features perform. Instead of manually testing AI outputs, evals automatically run your AI code against test datasets and score the results using custom metrics. This lets you catch regressions, compare different approaches, and confidently improve your AI features over time.

## Prerequisites

Follow the [Quickstart](/ai-engineering/quickstart):

- To run evals within the context of an existing AI app, complete the instrumentation setup described there.
- To run evals without an existing AI app, skip the instrumentation step.

## Write the evaluation function

The `Eval` function provides a simple, declarative way to define a test suite for your capability directly in your codebase.

The key parameters of the `Eval` function:

- `data`: An async function that returns your collection of `{ input, expected }` pairs, which serve as your ground truth.
- `task`: The function that executes your AI capability, taking an `input` and producing an `output`.
- `scorers`: An array of scorer functions that score the `output` against the `expected` value.
- `metadata`: Optional metadata for the evaluation, such as a description.
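
Taken together, a bare-bones `Eval` definition has roughly the following shape. This is a minimal sketch: the eval name, data, and task are placeholders, and the import mirrors the one used in the full example below.

```ts
// Minimal sketch of an Eval definition (placeholder names and data)
import { experimental_Eval as Eval } from 'axiom/ai/evals';

Eval('my-capability-eval', {
  // Ground truth: { input, expected } pairs
  data: async () => [
    { input: 'example input', expected: 'expected output' },
  ],
  // Runs your capability for each test case
  task: async ({ input }) => {
    return `echo: ${input}`;
  },
  // Scorer functions that compare output to expected (see the example below)
  scorers: [],
  // Optional metadata
  metadata: { description: 'A minimal example eval' },
});
```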

The example below creates an evaluation for a support ticket classification system in the file `/src/evals/ticket-classification.eval.ts`.

```ts /src/evals/ticket-classification.eval.ts expandable
import { experimental_Eval as Eval, Scorer } from 'axiom/ai/evals';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { flag, pickFlags } from '../lib/app-scope';
import { z } from 'zod';

// The function you want to evaluate
async function classifyTicket({ subject, content }: { subject?: string; content: string }) {
  const model = flag('ticketClassification.model');

  const result = await generateObject({
    model: wrapAISDKModel(openai(model)),
    messages: [
      {
        role: 'system',
        content: `You are a customer support engineer classifying tickets as: spam, question, feature_request, or bug_report.
If spam, return a polite auto-close message. Otherwise, say a team member will respond shortly.`,
      },
      {
        role: 'user',
        content: subject ? `Subject: ${subject}\n\n${content}` : content,
      },
    ],
    schema: z.object({
      category: z.enum(['spam', 'question', 'feature_request', 'bug_report']),
      response: z.string(),
    }),
  });

  return result.object;
}

// Custom exact-match scorer that returns score and metadata
const ExactMatchScorer = Scorer(
  'Exact-Match',
  ({ output, expected }: { output: { response: string }; expected: { response: string } }) => {
    const normalizedOutput = output.response.trim().toLowerCase();
    const normalizedExpected = expected.response.trim().toLowerCase();

    return {
      score: normalizedOutput === normalizedExpected,
      metadata: {
        details: 'A scorer that checks for exact match',
      },
    };
  }
);

// Custom spam classification scorer
const SpamClassificationScorer = Scorer(
  "Spam-Classification",
  ({ output, expected }: {
    output: { category: string };
    expected: { category: string };
  }) => {
    const isSpam = (item: { category: string }) => item.category === "spam";
    return isSpam(output) === isSpam(expected) ? 1 : 0;
  }
);

// Define the evaluation
Eval('spam-classification', {
  // Specify which flags this eval uses
  configFlags: pickFlags('ticketClassification'),

  // Test data with input/expected pairs
  data: () => [
    {
      input: {
        subject: "Congratulations! You've Been Selected for an Exclusive Reward",
        content: 'Claim your $500 gift card now by clicking this link!',
      },
      expected: {
        category: 'spam',
        response: "We're sorry, but your message has been automatically closed.",
      },
    },
    {
      input: {
        subject: 'FREE CA$H',
        content: 'BUY NOW ON WWW.BEST-DEALS.COM!',
      },
      expected: {
        category: 'spam',
        response: "We're sorry, but your message has been automatically closed.",
      },
    },
  ],

  // The task to run for each test case
  task: async ({ input }) => {
    return await classifyTicket(input);
  },

  // Scorers to measure performance
  scorers: [SpamClassificationScorer, ExactMatchScorer],

  // Optional metadata
  metadata: {
    description: 'Classify support tickets as spam or not spam',
  },
});
```

> **Contributor comment** on `configFlags: pickFlags('ticketClassification')`: This is only defined / explained further down the page. I understand why, and don't really have a better solution, but still feels weird.

## Set up flags

Create the file `src/lib/app-scope.ts`:

```ts /src/lib/app-scope.ts
import { createAppScope } from 'axiom/ai/evals';
import { z } from 'zod';

export const flagSchema = z.object({
  ticketClassification: z.object({
    model: z.string().default('gpt-4o-mini'),
  }),
});

const { flag, pickFlags } = createAppScope({ flagSchema });

export { flag, pickFlags };
```

## Run evaluations

To run your evaluation suites from your terminal, [install the Axiom CLI](/reference/cli) and use the following commands.

| Description | Command |
| ----------- | ------- |
| Run all evals | `axiom eval` |
| Run a specific eval file | `axiom eval src/evals/ticket-classification.eval.ts` |
| Run evals matching a glob pattern | `axiom eval "**/*spam*.eval.ts"` |
| Run an eval by name | `axiom eval "spam-classification"` |
| List available evals without running them | `axiom eval --list` |

## Analyze results in the Console

When you run an eval, the Axiom AI SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. Axiom enriches the traces with `eval.*` attributes, allowing you to deeply analyze results in the Axiom Console.

The results of an eval run include:

- Pass/fail status for each test case
- Scores from each scorer
- Comparison to a baseline (if available)
- Links to view detailed traces in Axiom

The Console features leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.

## Additional configuration options

### Custom scorers

A scorer is a function that scores a capability’s output. Scorers receive the `input`, the generated `output`, and the `expected` value, and return a score.

The example above uses two custom scorers. Scorers can return metadata alongside the score.

You can use the [`autoevals` library](https://github.com/braintrustdata/autoevals) instead of custom scorers. `autoevals` provides prebuilt scorers for common tasks like semantic similarity, factual correctness, and text matching.
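
For example, a minimal sketch of an eval that grades string similarity with the prebuilt `Levenshtein` scorer from `autoevals`. The eval name, data, and task are placeholders, and it assumes the `autoevals` package is installed:

```ts
// Using a prebuilt autoevals scorer instead of a custom one (sketch)
import { Levenshtein } from 'autoevals';
import { experimental_Eval as Eval } from 'axiom/ai/evals';

Eval('text-match', {
  // Ground truth pairs of input and expected output text
  data: () => [{ input: 'test', expected: 'hi, test!' }],
  // A trivial task that produces a greeting from the input
  task: async ({ input }) => `hi, ${input}!`,
  // Levenshtein scores the string similarity between output and expected
  scorers: [Levenshtein],
});
```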

### Run experiments

Flags let you parameterize your AI behavior (like model choice or prompting strategies) and run experiments with different configurations. They’re type-safe via Zod schemas, and you can override them at runtime.

The example above uses the `ticketClassification` flag to test different language models. Flags have a default value that you can override at runtime in one of the following ways:

- Override flags directly when you run the eval:

```bash
axiom eval --flag.ticketClassification.model=gpt-4o
```

- Alternatively, specify the flag overrides in a JSON file:

```json experiment.json
{
  "ticketClassification": {
    "model": "gpt-4o"
  }
}
```

Then pass the file with the `--flags-config` option when you run the eval:

```bash
axiom eval --flags-config=experiment.json
```

## What’s next?

A capability is ready to be deployed when it meets your quality benchmarks. After deployment, the next steps can include the following:

- **Baseline comparisons**: Run evals multiple times to track regressions over time.
- **Experiment with flags**: Test different models or strategies using flag overrides.
- **Advanced scorers**: Build custom scorers for domain-specific metrics.
- **CI/CD integration**: Add `axiom eval` to your CI pipeline to catch regressions.

To monitor your capability’s performance with real-world traffic, continue to the next stage of the AI engineering workflow: see [Observe](/ai-engineering/observe).
> **Comment:** note to self: need to better document `configFlags`