32 commits
2ae1a63
initial eval docs
c-ehrlich Nov 11, 2025
a082b90
add note about instrumentation fn
c-ehrlich Nov 11, 2025
7df0bdb
Stylistic fixes
manototh Nov 11, 2025
0254557
Quick fixes
manototh Nov 13, 2025
686a53e
Merge branch 'main' into evals-1
manototh Nov 13, 2025
7b8bd25
Fixes
manototh Nov 13, 2025
2251591
Add keywords
manototh Nov 14, 2025
2c662b2
Restructure Measure page
manototh Nov 17, 2025
95d4c5c
Implement review
manototh Nov 17, 2025
55e6bf4
Refactor
manototh Nov 17, 2025
3e3050c
Update measure.mdx
manototh Nov 17, 2025
89ce5ca
Update measure.mdx
manototh Nov 18, 2025
ad26f30
docs: concepts and definitions
dominicchapman Nov 18, 2025
d6a1130
docs: update overview
dominicchapman Nov 18, 2025
55703d9
docs: new evaluate section
dominicchapman Nov 18, 2025
c6d33c1
docs: create, evaluate/overview, remove measure from docs.json
dominicchapman Nov 18, 2025
9a26814
docs: revise iterate
dominicchapman Nov 18, 2025
528cf1f
docs: refinement
dominicchapman Nov 18, 2025
aad93a6
Update ai-engineering/concepts.mdx
dominicchapman Nov 20, 2025
f62b30d
docs: explain benefits of pickFlags
dominicchapman Nov 20, 2025
58becf6
docs: less focus on temperature
dominicchapman Nov 20, 2025
687548f
docs: remove duplicated content
dominicchapman Nov 20, 2025
cd7856e
docs: remove 'reference' from concepts
dominicchapman Nov 20, 2025
42890d6
docs: add model example to enum
dominicchapman Nov 20, 2025
1b6c8e7
docs: remove watch mode
dominicchapman Nov 20, 2025
3b8e48f
docs: remove marketing fluff
dominicchapman Nov 20, 2025
e6e5c6c
docs: evaluator > evaluation
dominicchapman Nov 20, 2025
93bb44b
docs: default flags to production config
dominicchapman Nov 20, 2025
4ff0249
docs: update concepts for completeness
dominicchapman Nov 20, 2025
aad74a6
docs: run-id feedback
dominicchapman Nov 20, 2025
6875722
Merge branch 'main' into dominic/evals-plus-wider-edits-v2
dominicchapman Nov 20, 2025
fd2ac49
update `createAppScope` import
c-ehrlich Nov 25, 2025
53 changes: 33 additions & 20 deletions ai-engineering/concepts.mdx
@@ -1,7 +1,7 @@
---
title: "Concepts"
description: "Learn about the core concepts in AI engineering: Capabilities, Collections, Graders, Evals, and more."
keywords: ["ai engineering", "AI engineering", "concepts", "capability", "grader", "eval"]
description: "Learn about the core concepts in AI engineering: Capabilities, Collections, Evals, Scorers, Annotations, and User Feedback."
keywords: ["ai engineering", "AI engineering", "concepts", "capability", "collection", "eval", "scorer", "annotations", "feedback", "flags"]
---

import { definitions } from '/snippets/definitions.mdx'
@@ -17,10 +17,10 @@ The concepts in AI engineering are best understood within the context of the dev
Development starts by defining a task and prototyping a <Tooltip tip={definitions.Capability}>capability</Tooltip> with a prompt to solve it.
</Step>
<Step title="Evaluate with ground truth">
The prototype is then tested against a <Tooltip tip={definitions.Collection}>collection</Tooltip> of reference examples (so called <Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>) to measure its quality and effectiveness using <Tooltip tip={definitions.Grader}>graders</Tooltip>. This process is known as an <Tooltip tip={definitions.Eval}>eval</Tooltip>.
The prototype is then tested against a <Tooltip tip={definitions.Collection}>collection</Tooltip> of reference examples (so-called "<Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>") to measure its quality and effectiveness using <Tooltip tip={definitions.Scorer}>scorers</Tooltip>. This process is known as an <Tooltip tip={definitions.Eval}>eval</Tooltip>.
</Step>
<Step title="Observe in production">
Once a capability meets quality benchmarks, it’s deployed. In production, graders can be applied to live traffic (<Tooltip tip={definitions.OnlineEval}>online evals</Tooltip>) to monitor performance and cost in real-time.
Once a capability meets quality benchmarks, it’s deployed. In production, scorers can be applied to live traffic (<Tooltip tip={definitions.OnlineEval}>online evals</Tooltip>) to monitor performance and cost in real-time.
</Step>
<Step title="Iterate with new insights">
Insights from production monitoring reveal edge cases and opportunities for improvement. These new examples are used to refine the capability, expand the ground truth collection, and begin the cycle anew.
@@ -33,40 +33,53 @@ The concepts in AI engineering are best understood within the context of the dev

A generative AI capability is a system that uses large language models to perform a specific task by transforming inputs into desired outputs.

Capabilities exist on a spectrum of complexity. They can be a simple, single-step function (for example, classifying a support ticket’s intent) or evolve into a sophisticated, multi-step agent that uses reasoning and tools to achieve a goal (for example, orchestrating a complete customer support resolution).
Capabilities exist on a spectrum of complexity, ranging from simple to sophisticated architectures:

- **Single-turn model interactions**: A single prompt and response, such as classifying a support ticket’s intent or summarizing a document.
- **Workflows**: Multi-step processes where each step’s output feeds into the next, such as research → analysis → report generation.
- **Single-agent**: An agent that can reason and make decisions to accomplish a goal, such as a customer support agent that can search documentation, check order status, and draft responses.
- **Multi-agent**: Multiple specialized agents collaborating on a complex problem, such as tackling software engineering through architectural planning, coding, testing, and review.

### Collection

A collection is a curated set of reference records used for development, testing, and evaluation of a capability. Collections serve as the test cases for prompt engineering.

### Record

Records are the individual input-output pairs within a collection. Each record consists of an input and its corresponding expected output (ground truth).

### Reference
### Collection record

A reference is a historical example of a task completed successfully, serving as a benchmark for AI performance. References provide the input-output pairs that demonstrate the expected behavior and quality standards.
Collection records are the individual input-output pairs within a collection. Each record consists of an input and its corresponding expected output (ground truth).

### Ground truth

Ground truth is the validated, expert-approved correct output for a given input. It represents the gold standard that the AI capability should aspire to match.

### Annotation
### Scorer

A scorer is a function that evaluates a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score.
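
As a rough sketch (illustrative only, not Axiom's scorer API — the function name and argument shapes are assumptions), a scorer can be as small as a function that compares a capability's output to the expected value and returns a score:

```ts
// Hypothetical exact-match scorer for a ticket-classification capability.
type Classification = { category: string };

function exactMatchScorer(output: Classification, expected: Classification) {
  return {
    name: 'exact-match',
    // 1 if the predicted category matches ground truth, otherwise 0
    score: output.category === expected.category ? 1 : 0,
  };
}
```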

### Evaluation or "eval"

Annotations are expert-provided labels, corrections, or outputs added to records to establish or refine ground truth.
An evaluation, or eval, is the process of testing a capability against a collection of ground truth data using one or more scorers. An eval runs the capability on every record in the collection and reports metrics like accuracy, pass-rate, and cost. Evals are typically run before deployment to benchmark performance.
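
Conceptually, an eval is a loop over the collection that applies a scorer to each result and aggregates the scores. The sketch below is a simplified illustration, not Axiom's eval runner; all names and shapes are assumptions.

```ts
// Hypothetical eval loop over a collection of records.
type EvalRecord<I, O> = { input: I; expected: O };

async function runEval<I, O>(
  capability: (input: I) => Promise<O>,
  scorer: (output: O, expected: O) => number,
  records: EvalRecord<I, O>[],
) {
  let total = 0;
  for (const record of records) {
    const output = await capability(record.input);
    total += scorer(output, record.expected); // e.g. 1 for pass, 0 for fail
  }
  return { passRate: total / records.length };
}
```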

### Grader
### Flag

A grader is a function that scores a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score or judgment. Graders are the reusable, atomic scoring logic used in all forms of evaluation.
A flag is a configuration parameter that controls how your AI capability behaves. Flags let you parameterize aspects like model choice, tool availability, prompting strategies, or retrieval approaches. By defining flags, you can run experiments to compare different configurations and systematically determine which approach performs best.
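
As a loose illustration (plain TypeScript, not Axiom's flag API), flags are simply named parameters that your capability reads instead of hard-coding:

```ts
// Hypothetical flag values for one configuration of a capability.
// An experiment would run the same eval with a different set of values.
const flags = {
  model: 'gpt-4o-mini',      // which model to call
  useRetrieval: false,       // whether to fetch context before answering
  promptVariant: 'concise',  // which system-prompt strategy to use
} as const;
```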

### Evaluator (eval)
### Experiment

An experiment is an evaluation run with a specific set of flag values. By running multiple experiments with different flag configurations, you can compare performance across different models, prompts, or strategies to find the optimal setup for your capability.

### Online evaluation

An online evaluation is the process of applying a scorer to a capability’s live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement.

### Annotation

An evaluator, or eval, is the process of testing a capability against a collection of ground truth data using one or more graders. An eval runs the capability on every record in the collection and reports metrics like accuracy, pass-rate, and cost. Evals are typically run before deployment to benchmark performance.
Annotations are expert-provided observations, labels, or corrections added to production traces or evaluation results. Domain experts review AI capability runs to document what went wrong or what should have happened differently, and to categorize failure modes. These annotations help identify patterns in capability failures, validate scorer accuracy, and create new test cases for collections.

### Online eval
### User feedback

An online eval is the process of applying a grader to a capability’s live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement.
User feedback is direct signal from end users about AI capability performance, typically collected through ratings (thumbs up/down, stars) or text comments. Feedback events are associated with traces to provide context about both system behavior and user perception. Aggregated feedback reveals quality trends, helps prioritize improvements, and surfaces issues that might not appear in evaluations.
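
As a hypothetical example of the signal involved (not Axiom's feedback API — the field names are assumptions), a feedback event pairs a rating and comment with the trace it refers to:

```ts
// Illustrative shape of a user feedback event tied to a trace.
const feedbackEvent = {
  traceId: 'trace_abc123',         // links the feedback to a capability run
  rating: 'thumbs_down' as const,  // thumbs up/down, stars, etc.
  comment: 'The answer missed the main action item.',
  receivedAt: new Date().toISOString(),
};
```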

## What’s next?

Now that you understand the core concepts, see them in action in the AI engineering [workflow](/ai-engineering/quickstart).
Now that you understand the core concepts, get started with the [Quickstart](/ai-engineering/quickstart) or dive into [Evaluate](/ai-engineering/evaluate/overview) to learn about systematic testing.
210 changes: 118 additions & 92 deletions ai-engineering/create.mdx
@@ -1,133 +1,159 @@
---
title: "Create"
description: "Learn how to create and define AI capabilities using structured prompts and typed arguments with Axiom."
keywords: ["ai engineering", "AI engineering", "create", "prompt", "template", "schema"]
description: "Build AI capabilities using any framework, with best support for TypeScript-based tools."
keywords: ["ai engineering", "create", "prompt", "capability", "vercel ai sdk"]
---

import { Badge } from "/snippets/badge.jsx"
import { definitions } from '/snippets/definitions.mdx'

The **Create** stage is about defining a new AI <Tooltip tip={definitions.Capability}>capability</Tooltip> as a structured, version-able asset in your codebase. The goal is to move away from scattered, hard-coded string prompts and toward a more disciplined and organized approach to prompt engineering.
Building an AI <Tooltip tip={definitions.Capability}>capability</Tooltip> starts with prototyping. You can use whichever framework you prefer. Axiom is focused on helping you evaluate and observe your capabilities rather than prescribing how to build them.

TypeScript-based frameworks like Vercel’s [AI SDK](https://sdk.vercel.ai) do integrate most seamlessly with Axiom’s tooling today, but that’s likely to evolve over time.

## Build your capability

Define your capability using your framework of choice. Here’s an example using Vercel's [AI SDK](https://ai-sdk.dev/), which includes [many examples](https://sdk.vercel.ai/examples) covering different capability design patterns. Popular alternatives like [Mastra](https://mastra.ai) also exist.

```ts src/lib/capabilities/classify-ticket.ts expandable
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { z } from 'zod';

export async function classifyTicket(input: {
  subject?: string;
  content: string;
}) {
  const result = await generateObject({
    model: wrapAISDKModel(openai('gpt-4o-mini')),
    messages: [
      {
        role: 'system',
        content: 'Classify support tickets as: question, bug_report, or feature_request.',
      },
      {
        role: 'user',
        content: input.subject
          ? `Subject: ${input.subject}\n\n${input.content}`
          : input.content,
      },
    ],
    schema: z.object({
      category: z.enum(['question', 'bug_report', 'feature_request']),
      confidence: z.number().min(0).max(1),
    }),
  });

### Defining a capability as a prompt object
  return result.object;
}
```

In Axiom AI engineering, every capability is represented by a `Prompt` object. This object serves as the single source of truth for the capability’s logic, including its messages, metadata, and the schema for its arguments.
The `wrapAISDKModel` function instruments your model calls for Axiom’s observability features. Learn more in the [Observe](/ai-engineering/observe) section.

For now, these `Prompt` objects can be defined and managed as TypeScript files within your own project repository.
## Gather reference examples

A typical `Prompt` object looks like this:
As you prototype, collect examples of inputs and their correct outputs.

```ts
const referenceExamples = [
  {
    input: {
      subject: 'How do I reset my password?',
      content: 'I forgot my password and need help.'
    },
    expected: { category: 'question' },
  },
  {
    input: {
      subject: 'App crashes on startup',
      content: 'The app immediately crashes when I open it.'
    },
    expected: { category: 'bug_report' },
  },
];
```

```ts /src/prompts/email-summarizer.prompt.ts
These become your ground truth for evaluation. Learn more in the [Evaluate](/ai-engineering/evaluate/overview) section.

## Structured prompt management

<Note>
The features below are experimental. Axiom’s current focus is on the evaluation and observability stages of the AI engineering workflow.
</Note>

For teams wanting more structure around prompt definitions, Axiom’s SDK includes experimental utilities for managing prompts as versioned objects.

### Define prompts as objects

Represent capabilities as structured `Prompt` objects:

```ts src/prompts/ticket-classifier.prompt.ts
import {
experimental_Type,
type experimental_Prompt
} from 'axiom/ai';

export const emailSummarizerPrompt = {
name: "Email Summarizer",
slug: "email-summarizer",
export const ticketClassifierPrompt = {
name: "Ticket Classifier",
slug: "ticket-classifier",
version: "1.0.0",
model: "gpt-4o",
model: "gpt-4o-mini",
messages: [
{
role: "system",
content:
`Summarize emails concisely, highlighting action items.
The user is named {{ username }}.`,
content: "Classify support tickets as: {{ categories }}",
},
{
role: "user",
content: "Please summarize this email: {{ email_content }}",
content: "{{ ticket_content }}",
},
],
arguments: {
username: experimental_Type.String(),
email_content: experimental_Type.String(),
categories: experimental_Type.String(),
ticket_content: experimental_Type.String(),
},
} satisfies experimental_Prompt;
```

### Strongly-typed arguments with `Template`

To ensure that prompts are used correctly, the Axiom’s AI SDK includes a `Template` type system (exported as `Type`) for defining the schema of a prompt’s `arguments`. This provides type safety, autocompletion, and a clear, self-documenting definition of what data the prompt expects.

The `arguments` object uses `Template` helpers to define the shape of the context:

```typescript /src/prompts/report-generator.prompt.ts
import {
experimental_Type,
type experimental_Prompt
} from 'axiom/ai';

export const reportGeneratorPrompt = {
// ... other properties
arguments: {
company: experimental_Type.Object({
name: experimental_Type.String(),
isActive: experimental_Type.Boolean(),
departments: experimental_Type.Array(
experimental_Type.Object({
name: experimental_Type.String(),
budget: experimental_Type.Number(),
})
),
}),
priority: experimental_Type.Union([
experimental_Type.Literal("high"),
experimental_Type.Literal("medium"),
experimental_Type.Literal("low"),
]),
},
} satisfies experimental_Prompt;
### Type-safe arguments

The `experimental_Type` system provides type safety for prompt arguments:

```ts
arguments: {
  user: experimental_Type.Object({
    name: experimental_Type.String(),
    preferences: experimental_Type.Array(experimental_Type.String()),
  }),
  priority: experimental_Type.Union([
    experimental_Type.Literal("high"),
    experimental_Type.Literal("medium"),
    experimental_Type.Literal("low"),
  ]),
}
```

You can even infer the exact TypeScript type for a prompt’s context using the `InferContext` utility.

### Prototyping and local testing
### Local testing

Before using a prompt in your application, you can test it locally using the `parse` function. This function takes a `Prompt` object and a `context` object, rendering the templated messages to verify the output. This is a quick way to ensure your templating logic is correct.
Test prompts locally before using them:

```typescript
```ts
import { experimental_parse } from 'axiom/ai';
import {
reportGeneratorPrompt
} from './prompts/report-generator.prompt';

const context = {
company: {
name: 'Axiom',
isActive: true,
departments: [
{ name: 'Engineering', budget: 500000 },
{ name: 'Marketing', budget: 150000 },
],
},
priority: 'high' as const,
};

// Render the prompt with the given context
const parsedPrompt = await experimental_parse(
reportGeneratorPrompt, { context }
);

console.log(parsedPrompt.messages);
// [
// {
// role: 'system',
// content: 'Generate a report for Axiom.\nCompany Status: Active...'
// }
// ]
```

### Managing prompts with Axiom
const parsed = await experimental_parse(ticketClassifierPrompt, {
  context: {
    categories: 'question, bug_report, feature_request',
    ticket_content: 'How do I reset my password?',
  },
});

To enable more advanced workflows and collaboration, Axiom is building tools to manage your prompt assets centrally.
console.log(parsed.messages);
```

* <Badge>Coming soon</Badge> The `axiom` CLI will allow you to `push`, `pull`, and `list` prompt versions directly from your terminal, synchronizing your local files with the Axiom platform.
* <Badge>Coming soon</Badge> The SDK will include methods like `axiom.prompts.create()` and `axiom.prompts.load()` for programmatic access to your managed prompts. This will be the foundation for A/B testing, version comparison, and deploying new prompts without changing your application code.
These utilities help organize prompts in your codebase. Centralized prompt management and versioning features may be added in future releases.

### Whats next?
## What's next?

Now that you’ve created and structured your capability, the next step is to measure its quality against a set of known good examples.
Once you have a working capability and reference examples, systematically evaluate its performance.

Learn more about this step of the AI engineering workflow in the [Measure](/ai-engineering/measure) docs.
To learn how to set up and run evaluations, see [Evaluate](/ai-engineering/evaluate/overview).