Add evals to AI engineering #469
Closed
Commits (12):
- 2ae1a63 initial eval docs (c-ehrlich)
- a082b90 add note about instrumentation fn (c-ehrlich)
- 7df0bdb Stylistic fixes (manototh)
- 0254557 Quick fixes (manototh)
- 686a53e Merge branch 'main' into evals-1 (manototh)
- 7b8bd25 Fixes (manototh)
- 2251591 Add keywords (manototh)
- 2c662b2 Restructure Measure page (manototh)
- 95d4c5c Implement review (manototh)
- 55e6bf4 Refactor (manototh)
- 3e3050c Update measure.mdx (manototh)
- 89ce5ca Update measure.mdx (manototh)
Updated file: `measure.mdx`

---
title: "Measure"
description: "Learn how to measure the quality of your AI capabilities by running evaluations against ground truth data."
keywords: ["ai engineering", "AI engineering", "measure", "evals", "evaluation", "scoring", "scorers", "graders", "scores"]
---

import { Badge } from "/snippets/badge.jsx"
import { definitions } from "/snippets/definitions.mdx"

<Warning>
The evaluation framework described here is in active development. Axiom is working with design partners to shape what’s built. [Contact Axiom](https://www.axiom.co/contact) to get early access and join a focused group of teams shaping these tools.
</Warning>

The **Measure** stage is where you quantify the quality and effectiveness of your AI <Tooltip tip={definitions.Capability}>capability</Tooltip>. Instead of relying on anecdotal checks, this stage uses a systematic process called an <Tooltip tip={definitions.Eval}>eval</Tooltip> to score your capability’s performance against a known set of correct examples (<Tooltip tip={definitions.GroundTruth}>ground truth</Tooltip>). This provides a data-driven benchmark to ensure a capability is ready for production and to track its quality over time.

Evaluations (evals) are systematic tests that measure how well your AI features perform. Instead of manually testing AI outputs, evals automatically run your AI code against test datasets and score the results using custom metrics. This lets you catch regressions, compare different approaches, and confidently improve your AI features over time.

## Prerequisites

Follow the [Quickstart](/ai-engineering/quickstart):

- To run evals within the context of an existing AI app, complete the instrumentation setup described there.
- To run evals without an existing AI app, skip the instrumentation step.

## Write the evaluation function

The `Eval` function provides a simple, declarative way to define a test suite for your capability directly in your codebase.

The key parameters of the `Eval` function:

- `data`: An async function that returns your collection of `{ input, expected }` pairs, which serve as your ground truth.
- `task`: The function that executes your AI capability, taking an `input` and producing an `output`.
- `scorers`: An array of scorer functions that score the `output` against the `expected` value.
- `metadata`: Optional metadata for the evaluation, such as a description.
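
Taken together, a bare-bones `Eval` definition has roughly the following shape. This is a minimal sketch: the eval name, data, and task are placeholders, and the import mirrors the one used in the full example below.

```ts
// Minimal sketch of an Eval definition (placeholder names and data)
import { experimental_Eval as Eval } from 'axiom/ai/evals';

Eval('my-capability-eval', {
  // Ground truth: { input, expected } pairs
  data: async () => [
    { input: 'example input', expected: 'expected output' },
  ],
  // Runs your capability for each test case
  task: async ({ input }) => {
    return `echo: ${input}`;
  },
  // Scorer functions that compare output to expected (see the example below)
  scorers: [],
  // Optional metadata
  metadata: { description: 'A minimal example eval' },
});
```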

The example below creates an evaluation for a support ticket classification system in the file `/src/evals/ticket-classification.eval.ts`.

```ts /src/evals/ticket-classification.eval.ts expandable
import { experimental_Eval as Eval, Scorer } from 'axiom/ai/evals';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { wrapAISDKModel } from 'axiom/ai';
import { flag, pickFlags } from '../lib/app-scope';
import { z } from 'zod';

// The function you want to evaluate
async function classifyTicket({ subject, content }: { subject?: string; content: string }) {
  const model = flag('ticketClassification.model');

  const result = await generateObject({
    model: wrapAISDKModel(openai(model)),
    messages: [
      {
        role: 'system',
        content: `You are a customer support engineer classifying tickets as: spam, question, feature_request, or bug_report.
If spam, return a polite auto-close message. Otherwise, say a team member will respond shortly.`,
      },
      {
        role: 'user',
        content: subject ? `Subject: ${subject}\n\n${content}` : content,
      },
    ],
    schema: z.object({
      category: z.enum(['spam', 'question', 'feature_request', 'bug_report']),
      response: z.string(),
    }),
  });

  return result.object;
}

// Custom exact-match scorer that returns score and metadata
const ExactMatchScorer = Scorer(
  'Exact-Match',
  ({ output, expected }: { output: { response: string }; expected: { response: string } }) => {
    const normalizedOutput = output.response.trim().toLowerCase();
    const normalizedExpected = expected.response.trim().toLowerCase();

    return {
      score: normalizedOutput === normalizedExpected,
      metadata: {
        details: 'A scorer that checks for exact match',
      },
    };
  }
);

// Custom spam classification scorer
const SpamClassificationScorer = Scorer(
  "Spam-Classification",
  ({ output, expected }: {
    output: { category: string };
    expected: { category: string };
  }) => {
    const isSpam = (item: { category: string }) => item.category === "spam";
    return isSpam(output) === isSpam(expected) ? 1 : 0;
  }
);

// Define the evaluation
Eval('spam-classification', {
  // Specify which flags this eval uses
  configFlags: pickFlags('ticketClassification'),

  // Test data with input/expected pairs
  data: () => [
    {
      input: {
        subject: "Congratulations! You've Been Selected for an Exclusive Reward",
        content: 'Claim your $500 gift card now by clicking this link!',
      },
      expected: {
        category: 'spam',
        response: "We're sorry, but your message has been automatically closed.",
      },
    },
    {
      input: {
        subject: 'FREE CA$H',
        content: 'BUY NOW ON WWW.BEST-DEALS.COM!',
      },
      expected: {
        category: 'spam',
        response: "We're sorry, but your message has been automatically closed.",
      },
    },
  ],

  // The task to run for each test case
  task: async ({ input }) => {
    return await classifyTicket(input);
  },

  // Scorers to measure performance
  scorers: [SpamClassificationScorer, ExactMatchScorer],

  // Optional metadata
  metadata: {
    description: 'Classify support tickets as spam or not spam',
  },
});
```

> **Contributor comment** on `configFlags: pickFlags('ticketClassification')`: This is only defined / explained further down the page. I understand why, and don't really have a better solution, but still feels weird.

## Set up flags

Create the file `src/lib/app-scope.ts`:

```ts /src/lib/app-scope.ts
import { createAppScope } from 'axiom/ai/evals';
import { z } from 'zod';

export const flagSchema = z.object({
  ticketClassification: z.object({
    model: z.string().default('gpt-4o-mini'),
  }),
});

const { flag, pickFlags } = createAppScope({ flagSchema });

export { flag, pickFlags };
```

## Run evaluations

To run your evaluation suites from your terminal, [install the Axiom CLI](/reference/cli) and use the following commands.

| Description | Command |
| ----------- | ------- |
| Run all evals | `axiom eval` |
| Run a specific eval file | `axiom eval src/evals/ticket-classification.eval.ts` |
| Run evals matching a glob pattern | `axiom eval "**/*spam*.eval.ts"` |
| Run an eval by name | `axiom eval "spam-classification"` |
| List available evals without running them | `axiom eval --list` |

## Analyze results in the Console

When you run an eval, the Axiom AI SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. Axiom enriches the traces with `eval.*` attributes, allowing you to deeply analyze results in the Axiom Console.

The results of an eval run include:

- Pass/fail status for each test case
- Scores from each scorer
- Comparison to a baseline (if available)
- Links to view detailed traces in Axiom

The Console features leaderboards and comparison views to track score progression across different versions of a capability, helping you verify that your changes are leading to measurable improvements.

## Additional configuration options

### Custom scorers

A scorer is a function that scores a capability’s output. Scorers receive the `input`, the generated `output`, and the `expected` value, and return a score.

The example above uses two custom scorers. Scorers can return metadata alongside the score.

You can use the [`autoevals` library](https://github.com/braintrustdata/autoevals) instead of custom scorers. `autoevals` provides prebuilt scorers for common tasks like semantic similarity, factual correctness, and text matching.
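
For example, a minimal sketch of an eval that grades string similarity with the prebuilt `Levenshtein` scorer from `autoevals`. The eval name, data, and task are placeholders, and it assumes the `autoevals` package is installed:

```ts
// Using a prebuilt autoevals scorer instead of a custom one (sketch)
import { Levenshtein } from 'autoevals';
import { experimental_Eval as Eval } from 'axiom/ai/evals';

Eval('text-match', {
  // Ground truth pairs of input and expected output text
  data: () => [{ input: 'test', expected: 'hi, test!' }],
  // A trivial task that produces a greeting from the input
  task: async ({ input }) => `hi, ${input}!`,
  // Levenshtein scores the string similarity between output and expected
  scorers: [Levenshtein],
});
```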

### Run experiments

Flags let you parameterize your AI behavior (like model choice or prompting strategies) and run experiments with different configurations. They’re type-safe via Zod schemas, and you can override them at runtime.

The example above uses the `ticketClassification` flag to test different language models. Flags have a default value that you can override at runtime in one of the following ways:

- Override flags directly when you run the eval:

```bash
axiom eval --flag.ticketClassification.model=gpt-4o
```

- Alternatively, specify the flag overrides in a JSON file:

```json experiment.json
{
  "ticketClassification": {
    "model": "gpt-4o"
  }
}
```

Then pass the file with the `--flags-config` option when you run the eval:

```bash
axiom eval --flags-config=experiment.json
```

## What’s next?

A capability is ready to be deployed when it meets your quality benchmarks. After deployment, the next steps can include the following:

- **Baseline comparisons**: Run evals multiple times to track regressions over time.
- **Experiment with flags**: Test different models or strategies using flag overrides.
- **Advanced scorers**: Build custom scorers for domain-specific metrics.
- **CI/CD integration**: Add `axiom eval` to your CI pipeline to catch regressions.

To monitor your capability’s performance with real-world traffic, continue to the next stage of the AI engineering workflow: see [Observe](/ai-engineering/observe).
> **Comment:** note to self: need to better document `configFlags`