Skip to content

Conversation

ehhuang
Copy link
Contributor

@ehhuang ehhuang commented Oct 8, 2025

What does this PR do?

As discussed on discord, we do not need to reinvent the wheel for telemetry. Instead we'll lean into the canonical OTEL stack. Logs/traces/metrics will still be sent via OTEL - they just won't be stored on, queried through Stack.

This is the first of many PRs to remove telemetry API from Stack.

  1. removed webmethod decorators to remove from API spec
  2. removed tests as @iamemilio is adding them on otel directly.

Test Plan

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 8, 2025
@leseb
Copy link
Collaborator

leseb commented Oct 9, 2025

@ehhuang, rationale for this? I knew this was floating around. I remember seeing a comment from @ashwinb somewhere about this. Thanks!

@iamemilio
Copy link
Contributor

We are moving away from having our own APIs for serving telemetry, and our own logic for storing it. We want to defer that responsibility to purpose built services for handling telemetry data in production environments like jaeger, prometheus, datadog, etc. This is a best practice and a design pattern a user would recognize and expect.

@ashwinb
Copy link
Contributor

ashwinb commented Oct 9, 2025

@ehhuang, rationale for this? I knew this was floating around. I remember seeing a comment from @ashwinb somewhere about this. Thanks!

@leseb here's a thread in Discord: https://discord.com/channels/1257833999603335178/1422318785175748709

what is your opinion?

I feel like a semantic opinionated API like "get me token usage" could be useful in simplifying some of the metrics experience for users (i.e., the Stack would require in its configuration the correct sinks for OTEL and appropriate providers to query for metrics) but there's also the potential danger of needing to cover a lot of territory just for this because people's logging and metrics collection setup can be very very different.

@iamemilio
Copy link
Contributor

iamemilio commented Oct 9, 2025

Instead of building our own syntax, and APIs, what we should do is set a clear and consistent standard for how we name, label and capture telemetry data. Things like metrics can be queried in services like prometheus, which already have their own query language and APIs, and if you want to simplify the user experience or show people a reference architecture, I would extend what we have in the telemetry scripts to include a Grafana dashboard that shows off some of the metrics and trace data we capture. This lets us still have a "simple" experience, while still conforming to standards that our customers expect and use.

Copy link
Contributor

@cdoern cdoern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also docs/docs/building_applications/telemetry.mdx that might need updating or deleting.


@runtime_checkable
class Telemetry(Protocol):
@webmethod(route="/telemetry/events", method="POST", level=LLAMA_STACK_API_V1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably follow: https://llamastack.github.io/docs/concepts/apis/api_leveling#approval-and-announcement-process-for-breaking-changes in this case since we are flat out deleting an API, some sort of announcement on discord and in the upcoming release notes would be great

@ehhuang ehhuang changed the title chore!: remove ALL telemetry APIs chore(api)!: BREAKING CHANGE: remove ALL telemetry APIs Oct 10, 2025
@leseb
Copy link
Collaborator

leseb commented Oct 14, 2025

@ehhuang, rationale for this? I knew this was floating around. I remember seeing a comment from @ashwinb somewhere about this. Thanks!

@leseb here's a thread in Discord: https://discord.com/channels/1257833999603335178/1422318785175748709

what is your opinion?

I feel like a semantic opinionated API like "get me token usage" could be useful in simplifying some of the metrics experience for users (i.e., the Stack would require in its configuration the correct sinks for OTEL and appropriate providers to query for metrics) but there's also the potential danger of needing to cover a lot of territory just for this because people's logging and metrics collection setup can be very very different.

Thanks for the additional context. That sounds good to me for now. Having a solid internal telemetry implementation is more important than the API surface at this point. The API is something we can revisit later.

REQUIRED_SCOPE = "telemetry.read"


@json_schema_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to remove from Stainless Spec as well

# What does this PR do?


## Test Plan
@ehhuang ehhuang merged commit 866c13c into main Oct 14, 2025
26 of 43 checks passed
@ehhuang ehhuang deleted the pr3740 branch October 14, 2025 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants