chore(mlobs): add initial ray tracing hook (#14135)
## Overview
This PR adds a tracing startup hook for the Ray ML Framework as
described in
[MLOB-3238](https://datadoghq.atlassian.net/browse/MLOB-3238), which
lays the groundwork for the [Ray](https://github.com/ray-project/ray)
distributed AI training observability MVP.
After discussing with @brettlangdon and @mabdinur, we are favoring this
approach of passing a "dummy" tracing hook to Ray to trigger a
`ddtrace.auto` import over our previous approach in [feat(aiobs): add
tracing hook for ray ml
framework](#14038) of
implementing a full tracing hook. However, we still set up a trace
filter here to add and modify tags on incoming spans.
The tracing hook can be passed to the `ray start` command. For example:
```
ray start --head --tracing-startup-hook=ddtrace.contrib.internal.ray.tracer:setup_tracing
```
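For reference, here is a minimal sketch of what a dummy startup hook plus a span-tagging filter could look like. This is illustrative only and is not the code added in this PR: the filter class, the example tag, and the registration call are assumptions, and the exact ddtrace import paths and `tracer.configure` signature vary between ddtrace versions.
```python
# Illustrative sketch only -- not the implementation in this PR.
import ddtrace.auto  # noqa: F401  # importing this enables ddtrace auto-instrumentation

from ddtrace import tracer
from ddtrace.filters import TraceFilter  # newer ddtrace versions expose this as ddtrace.trace.TraceFilter


class RayTagFilter(TraceFilter):
    """Hypothetical filter that adds/modifies tags on incoming Ray spans."""

    def process_trace(self, trace):
        for span in trace:
            # Example tag only; the tags set by the actual PR may differ.
            span.set_tag_str("component", "ray")
        return trace


def setup_tracing(*args, **kwargs):
    # Ray invokes this entry point when started with
    # --tracing-startup-hook=ddtrace.contrib.internal.ray.tracer:setup_tracing.
    # Filter registration is an assumption; older ddtrace versions used
    # tracer.configure(settings={"FILTERS": [...]}) instead of trace_processors.
    tracer.configure(trace_processors=[RayTagFilter()])
```
The appeal of this approach is that importing `ddtrace.auto` takes care of tracer setup and export, so the hook itself stays minimal.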
## Motivation
As described in [Distributed AI Observability
Proposal](https://docs.google.com/document/d/1yWTlN3fpwywEuyeec0tNU8NxIRvPenybR6NlMlw3Zs0/edit?usp=sharing)
we would like to provide comprehensive real-time monitoring and root
cause analysis (RCA) for distributed training workloads to our
customers. Specifically, for the 2025 Q3 MVP we want to be able to
collect and report training jobs along with the traces, logs, and
metrics associated with each job, and to report each job's status,
including whether it succeeded or failed. This PR provides the initial
functionality for exporting traces with `ddtrace.auto` via a dummy Ray
startup tracing hook. We will explore extending this approach to cover
the remaining scope of the MVP backend.
## Test plan
Build and install ddtrace with these changes. Before this PR lands and
rolls out in a new ddtrace release, you can do this with the following
steps.
Modify your Datadog Agent settings to [point to
staging](https://docs.google.com/document/d/1u5lhTMRi_gOFdT7ofzl4TRXA7mY3E5Ej6mZU67Xn7oI/edit?usp=sharing),
substituting your own Staging API key for `<staging_api_key>`, but keep
your existing `run_path` line and everything below it if you have one.
Then download the `hello.py` script linked in
[MLOB-2922](https://datadoghq.atlassian.net/browse/MLOB-2922) and run:
```
# Create an isolated Python 3.12 environment
pyenv install 3.12.8
pyenv virtualenv 3.12.8 ray-test
pyenv activate ray-test

# Install Ray Train, PyTorch, and the OpenTelemetry SDK used by Ray's tracing hooks
pip install -U "ray[train]" torch torchvision
pip install opentelemetry-sdk

# Install ddtrace from this PR's branch
git clone -b imran-hendley/add-ray-autotracing --single-branch https://github.com/DataDog/dd-trace-py.git
pip install ./dd-trace-py

# Set the service name and enable OpenTelemetry support in ddtrace
export DD_SERVICE="my-ray-job"
export DD_TRACE_OTEL_ENABLED="true"

# Start a Ray head node with the ddtrace tracing hook and run the toy training job
ray start --head --tracing-startup-hook=ddtrace.contrib.internal.ray.tracer:setup_tracing
python hello.py
```
Set the time window to "Live Past 15 minutes", search for
`service:my-ray-job` in the Staging APM > Traces > Explorer search box
at https://dd.datad0g.com/apm/traces, and observe the list of traces
captured from the toy training job kicked off by `hello.py`.
<img width="1572" height="714" alt="Screenshot 2025-07-25 at 11 54
05 AM"
src="https://github.com/user-attachments/assets/6d953bd6-2a1c-42e5-8057-9a9f4aba027f"
/>
To test the fallback job name, run:
```
# Stop the running Ray node and clear the service name
ray stop
unset DD_SERVICE

# Restart Ray with the tracing hook and rerun the toy training job
ray start --head --tracing-startup-hook=ddtrace.contrib.internal.ray.tracer:setup_tracing
python hello.py
```
Then set the time window to "Live Past 15 minutes" and search for
`service:unspecified-ray-job` in the Staging traces explorer search box.
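For context, a tiny sketch of how the fallback name relates to `DD_SERVICE` (illustrative only; the real logic lives in the hook/filter code added by this PR):
```python
import os

# Illustrative only: when DD_SERVICE is unset, Ray spans fall back to a
# default job name, which is what the search above matches.
service_name = os.environ.get("DD_SERVICE") or "unspecified-ray-job"
print(service_name)  # prints "unspecified-ray-job" when DD_SERVICE is not set
```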
<img width="1577" height="1004" alt="Screenshot 2025-07-25 at 12 00
19 PM"
src="https://github.com/user-attachments/assets/3fd00e19-92ca-4f89-bac9-0cdbcb393187"
/>
## Risk assessment
This is a low-risk change because we are adding a feature which must be
invoked manually at this point and does not run by default.
## Release Notes
This change lays the groundwork for an unreleased feature and does not
affect public-facing APIs, so no release notes are needed.
## Documentation
- [Distributed AI Observability
Proposal](https://docs.google.com/document/d/1yWTlN3fpwywEuyeec0tNU8NxIRvPenybR6NlMlw3Zs0/edit?usp=sharing)
- [[RFC] Distributed AI
Observability](https://docs.google.com/document/d/1AGR2KQLaFgVbuOi4Ymdt7FJinmljc-G_tfYP89ihuPo/edit?usp=sharing)
## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))
## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)