feat(openai): instrument openai responses prompts #15159
Conversation
…gration
This update introduces the ability to capture prompt metadata (id, version, variables) for reusable prompts in the OpenAI integration. The changes include enhancements to the `openai_set_meta_tags_from_response` function to validate and store prompt data, as well as new tests to ensure correct functionality. A new YAML cassette for testing responses with prompt tracking has also been added.
…uctions
This update introduces a new function, `_extract_chat_template_from_instructions`, which extracts chat templates from OpenAI response instructions by replacing variable values with placeholders. Additionally, the `openai_set_meta_tags_from_response` function has been modified to use this new functionality, ensuring that chat templates are included in the prompt data. Tests have been added to verify the correct extraction and formatting of chat templates, including variable placeholders.
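As a rough illustration of the metadata being captured, here is a hedged sketch. The helper name and field handling are assumptions for illustration, not the actual `openai_set_meta_tags_from_response` code:

```python
def extract_prompt_metadata(prompt):
    """Collect reusable-prompt metadata (id, version, variables) if it looks valid.

    Illustrative only: assumes ``prompt`` is the dict-like reusable-prompt
    payload, e.g. ``{"id": ..., "version": ..., "variables": {...}}``.
    """
    if not isinstance(prompt, dict) or not prompt.get("id"):
        return {}
    metadata = {"id": prompt["id"]}
    if prompt.get("version") is not None:
        metadata["version"] = prompt["version"]
    variables = prompt.get("variables")
    if isinstance(variables, dict):
        # Stringify values so they are safe to store alongside span metadata.
        metadata["variables"] = {name: str(value) for name, value in variables.items()}
    return metadata
```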
Bootstrap import analysis
Comparison of import times between this PR and base.

Summary
The average import time from this PR is: 208 ± 3 ms.
The average import time from base is: 215 ± 3 ms.
The import time difference between this PR and base is: -7.1 ± 0.1 ms.

Import time breakdown
The following import paths have shrunk:
Performance SLOs
Comparing candidate alex/MLOB-4411_instrument-openai-responses-prompts (9f6d0df) with baseline main (6aa3d1a).

🟡 Near SLO Breach (4 suites)
- 🟡 djangosimple (30/30 scenarios passing): execution times are roughly 5-19% under their SLOs and essentially unchanged vs baseline; memory stays within SLO in every scenario, but several sit within 2% of their limits, with memory up ~4.4-5.0% vs baseline.
- 🟡 errortrackingdjangosimple (6/6 passing): times are ~15-17% under SLO and within ~1.2% of baseline; memory is within ~1.2% of the 66.500MB SLO, up ~5% vs baseline.
- 🟡 flasksimple (18/18 passing): times are ~2-9% under SLO and roughly at baseline; memory is ~1-15% under SLO, up ~4.6-4.9% vs baseline.
- 🟡 telemetryaddmetric (30/30 passing): times range from ~2% to ~89% under SLO; memory is ~5-7% under SLO, up ~4.5-5.2% vs baseline.
brettlangdon
left a comment
release note lgtm
Kyle-Verhoog
left a comment
excellent work @PROFeNoM! 👏
The implementation is simple, with good test coverage and manual validation. One thing I noticed in your staging link is that we're not getting the prompt section rendering on the Overview page, although I do see the prompt tags coming in 🤔.
Hey @Kyle-Verhoog, I get the following when clicking the staging link: Is the feature flag enabled on your end? The FF setup is pretty new to me, so perhaps there is a way I'm unaware of to automatically make it available from the link 🤔

PR Description
Description
Adds prompt tracking for OpenAI reusable prompts.
The problem: OpenAI returns rendered prompts (with variables filled in), but prompt tracking needs templates with placeholders like `{{variable_name}}`.
The solution: Reverse templating - reconstruct the template by replacing variable values with placeholders.
How it works: the prompt's variables are sorted by the length of their rendered values (longest first), and each occurrence of a rendered value in the response's instructions text is replaced with its `{{variable_name}}` placeholder (see the sketch below).
Why longest values first?
Overlapping values need careful handling: if one variable's rendered value is a substring of another's, replacing the shorter value first would break the longer value's match and leave its placeholder unfilled, so longer values are substituted before their substrings.
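To make the reverse-templating step concrete, here is a minimal sketch in the spirit of the description above. It is not the actual `_extract_chat_template_from_instructions` implementation; the function name, signature, and variable handling below are assumptions for illustration only.

```python
def reconstruct_chat_template(instructions, variables):
    """Rebuild a prompt template from rendered instructions (illustrative sketch).

    Each variable's rendered value is replaced with a ``{{name}}`` placeholder.
    Values are processed longest-first so that a value containing another
    variable's value is substituted before its substring.
    """
    template = instructions
    # Sort by rendered-value length, longest first; ties keep their original order.
    ordered = sorted(variables.items(), key=lambda item: len(str(item[1])), reverse=True)
    for name, value in ordered:
        value = str(value)
        if value:  # skip empty values; replacing "" would insert placeholders between every character
            template = template.replace(value, "{{" + name + "}}")
    return template
```

With an overlapping value, the longest-first ordering keeps every placeholder intact:

```python
>>> reconstruct_chat_template(
...     "Hello Alice! Alice, your plan is Pro.",
...     {"name": "Alice", "greeting": "Hello Alice", "plan": "Pro"},
... )
'{{greeting}}! {{name}}, your plan is {{plan}}.'
```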
The implementation uses a simple `.replace()` loop with longest-first sorting. Benchmarks show this is faster than regex for typical prompts with <50 variables.

Testing
- `test_response_with_prompt_tracking()` verifying prompt metadata, chat_template extraction, and placeholder replacement.
- Tests for `_extract_chat_template_from_instructions()` covering edge cases (overlaps, special chars, large patterns, etc.); one such check is sketched below.
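A hypothetical edge-case test in the same spirit, written against the illustrative `reconstruct_chat_template` sketch above rather than the real helper:

```python
def test_special_characters_are_replaced_literally():
    # Values containing regex metacharacters are replaced as plain text,
    # which str.replace() guarantees by construction.
    instructions = "Discount: 50% off (today only!). Code: SAVE.*"
    variables = {"offer": "50% off (today only!)", "code": "SAVE.*"}
    assert reconstruct_chat_template(instructions, variables) == (
        "Discount: {{offer}}. Code: {{code}}"
    )
```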
Risks
Making this perfect is likely impossible since we're reverse-engineering the template from rendered output. The approach works well for typical real-world usage, but it has known limitations.
For instance, when two variables have the same value, only one placeholder will be used:
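An illustration of this limitation, again using the hypothetical sketch above (the variable names are made up):

```python
>>> reconstruct_chat_template(
...     "Ship to Paris. Bill to Paris.",
...     {"shipping_city": "Paris", "billing_city": "Paris"},
... )
'Ship to {{shipping_city}}. Bill to {{shipping_city}}.'
```

Both occurrences collapse to the first variable's placeholder, and `{{billing_city}}` never appears in the reconstructed template.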
Additional Notes
OpenAI doesn't expose templates via its API, so we reconstruct them. If OpenAI adds template retrieval later, or the backend supports template-less prompts, we can remove this logic.