Commit 0272b58

chore(mlobs): add initial ray tracing hook (#14135)
## Overview

This PR adds a tracing startup hook for the Ray ML framework as described in [MLOB-3238](https://datadoghq.atlassian.net/browse/MLOB-3238), which lays the groundwork for the [Ray](https://github.com/ray-project/ray) distributed AI training observability MVP.

After discussing with @brettlangdon and @mabdinur, we favor passing a "dummy" tracing hook to Ray to trigger a `ddtrace.auto` import over our previous approach in [feat(aiobs): add tracing hook for ray ml framework](#14038), which implemented an actual tracing hook. However, we still set up a filter here to add and modify tags on incoming spans.

The tracing hook can be passed to the `ray start` command via `--tracing-startup-hook`. For example:

`ray start --head --tracing-startup-hook=ddtrace.contrib.internal.ray.tracer:setup_tracing`

## Motivation

As described in the [Distributed AI Observability Proposal](https://docs.google.com/document/d/1yWTlN3fpwywEuyeec0tNU8NxIRvPenybR6NlMlw3Zs0/edit?usp=sharing), we would like to provide comprehensive real-time monitoring and root cause analysis (RCA) for distributed training workloads to our customers. Specifically, for the 2025 Q3 MVP we would like to collect and report training jobs along with the traces, logs, and metrics associated with each job, and we would like to report the status of each job, including whether it succeeded or failed.

This PR provides initial functionality for exporting traces with `ddtrace.auto` via a dummy Ray startup tracing hook. We will explore expanding this approach further to cover the remaining scope for the MVP backend.

## Test plan

Build and install ddtrace with these changes. Until this PR lands and rolls out in a new release of ddtrace, this can be done with the following steps:

Modify your Datadog Agent settings to [point to staging](https://docs.google.com/document/d/1u5lhTMRi_gOFdT7ofzl4TRXA7mY3E5Ej6mZU67Xn7oI/edit?usp=sharing), adding your own staging API key in place of `<staging_api_key>`, but leave your existing `run_path` line and everything below it at the bottom if you have it.

Download the hello.py script linked in [MLOB-2922](https://datadoghq.atlassian.net/browse/MLOB-2922).

```
pyenv install 3.12.8
pyenv virtualenv 3.12.8 ray-test
pyenv activate ray-test
pip install -U "ray[train]" torch torchvision
pip install opentelemetry-sdk
git clone -b imran-hendley/add-ray-autotracing --single-branch https://github.com/DataDog/dd-trace-py.git
pip install ./dd-trace-py
export DD_SERVICE="my-ray-job"
export DD_TRACE_OTEL_ENABLED="true"
ray start --head --tracing-startup-hook=ddtrace.contrib.internal.ray.tracer:setup_tracing
python hello.py
```

Set the time window to "Live Past 15 minutes" and search for `service:my-ray-job` in the Staging APM > Traces > Explorer search box at https://dd.datad0g.com/apm/traces, then observe the list of traces captured from the toy training job kicked off in hello.py.

<img width="1572" height="714" alt="Screenshot 2025-07-25 at 11 54 05 AM" src="https://github.com/user-attachments/assets/6d953bd6-2a1c-42e5-8057-9a9f4aba027f" />

To test the fallback job name, run:

```
ray stop
unset DD_SERVICE
ray start --head --tracing-startup-hook=ddtrace.contrib.internal.ray.tracer:setup_tracing
python hello.py
```

Then set the time window to "Live Past 15 minutes" and search for `service:unspecified-ray-job` in the Staging traces explorer search box.

<img width="1577" height="1004" alt="Screenshot 2025-07-25 at 12 00 19 PM" src="https://github.com/user-attachments/assets/3fd00e19-92ca-4f89-bac9-0cdbcb393187" />

## Risk assessment

This is a low-risk change because it adds a feature that must be invoked manually at this point and does not run by default.

## Release Notes

This change lays the groundwork for an unreleased feature and does not affect public-facing APIs, so no release notes are needed.

## Documentation

[Distributed AI Observability Proposal](https://docs.google.com/document/d/1yWTlN3fpwywEuyeec0tNU8NxIRvPenybR6NlMlw3Zs0/edit?usp=sharing)

[[RFC] Distributed AI Observability](https://docs.google.com/document/d/1AGR2KQLaFgVbuOi4Ymdt7FJinmljc-G_tfYP89ihuPo/edit?usp=sharing)

## Checklist

- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [x] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

[MLOB-3238]: https://datadoghq.atlassian.net/browse/MLOB-3238?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[MLOB-2922]: https://datadoghq.atlassian.net/browse/MLOB-2922?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
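
The idea in the overview (a hook that does no tracing work itself, plus a filter that rewrites tags on spans) can be sketched in plain Python. This is an illustration only, not the code added by this PR: `FakeSpan`, `RayTagFilter`, `resolve_service`, the `component` tag, and the `env` parameter are all hypothetical names introduced here. The only details taken from the description are the `setup_tracing` hook name, the `DD_SERVICE` variable, and the `unspecified-ray-job` fallback seen in the test plan.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, List, Mapping, Optional

# Fallback service name observed in the test plan when DD_SERVICE is unset.
FALLBACK_SERVICE = "unspecified-ray-job"


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a tracer span; not a ddtrace class."""

    name: str
    service: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)


def resolve_service(env: Mapping[str, str]) -> str:
    """Return DD_SERVICE if it is set and non-empty, else the fallback name."""
    return env.get("DD_SERVICE") or FALLBACK_SERVICE


class RayTagFilter:
    """Illustrative filter: sets the service and adds one tag on every span."""

    def __init__(self, service: str) -> None:
        self._service = service

    def process_trace(self, trace: List[FakeSpan]) -> List[FakeSpan]:
        for span in trace:
            span.service = self._service
            span.tags["component"] = "ray"  # hypothetical tag name
        return trace


def setup_tracing(env: Mapping[str, str] = os.environ) -> RayTagFilter:
    """Dummy hook sketch.

    Assumption: the real hook would trigger `import ddtrace.auto` here and
    register its filter with the tracer; this sketch only builds the filter.
    """
    return RayTagFilter(resolve_service(env))
```

Calling `setup_tracing({})` yields a filter that stamps spans with `unspecified-ray-job`, while passing `{"DD_SERVICE": "my-ray-job"}` stamps them with `my-ray-job`, mirroring the two searches in the test plan.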
1 parent 7e51f47 commit 0272b58

19 files changed: +475 -0 lines changed

.github/CODEOWNERS

Lines changed: 2 additions & 0 deletions
@@ -153,6 +153,7 @@ ddtrace/contrib/internal/crewai @DataDog/ml-observ
 ddtrace/contrib/internal/openai_agents @DataDog/ml-observability
 ddtrace/contrib/internal/litellm @DataDog/ml-observability
 ddtrace/contrib/internal/pydantic_ai @DataDog/ml-observability
+ddtrace/contrib/internal/ray @DataDog/ml-observability
 ddtrace/contrib/internal/mcp @DataDog/ml-observability
 tests/llmobs @DataDog/ml-observability
 tests/contrib/openai @DataDog/ml-observability
@@ -172,6 +173,7 @@ tests/contrib/crewai @DataDog/ml-observ
 tests/contrib/openai_agents @DataDog/ml-observability
 tests/contrib/litellm @DataDog/ml-observability
 tests/contrib/pydantic_ai @DataDog/ml-observability
+tests/contrib/ray @DataDog/ml-observability
 tests/contrib/mcp @DataDog/ml-observability
 .gitlab/tests/llmobs.yml @DataDog/ml-observability
 # MLObs snapshot tests

.riot/requirements/101a8e4.txt

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+#
+# This file is autogenerated by pip-compile with Python 3.12
+# by the following command:
+#
+# pip-compile --allow-unsafe --no-annotate .riot/requirements/101a8e4.in
+#
+attrs==25.3.0
+certifi==2025.7.14
+charset-normalizer==3.4.2
+click==8.2.1
+coverage[toml]==7.10.1
+filelock==3.18.0
+hypothesis==6.45.0
+idna==3.10
+iniconfig==2.1.0
+jsonschema==4.25.0
+jsonschema-specifications==2025.4.1
+mock==5.2.0
+msgpack==1.1.1
+opentracing==2.4.0
+packaging==25.0
+pluggy==1.6.0
+protobuf==6.31.1
+pygments==2.19.2
+pytest==8.4.1
+pytest-asyncio==1.1.0
+pytest-cov==6.2.1
+pytest-mock==3.14.1
+pyyaml==6.0.2
+ray==2.48.0
+referencing==0.36.2
+requests==2.32.4
+rpds-py==0.26.0
+sortedcontainers==2.4.0
+typing-extensions==4.14.1
+urllib3==2.5.0

.riot/requirements/106610a.txt

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+#
+# This file is autogenerated by pip-compile with Python 3.9
+# by the following command:
+#
+# pip-compile --allow-unsafe --no-annotate .riot/requirements/106610a.in
+#
+attrs==25.3.0
+backports-asyncio-runner==1.2.0
+certifi==2025.7.14
+charset-normalizer==3.4.2
+click==8.1.8
+coverage[toml]==7.10.1
+exceptiongroup==1.3.0
+filelock==3.18.0
+hypothesis==6.45.0
+idna==3.10
+iniconfig==2.1.0
+jsonschema==4.25.0
+jsonschema-specifications==2025.4.1
+mock==5.2.0
+msgpack==1.1.1
+opentracing==2.4.0
+packaging==25.0
+pluggy==1.6.0
+protobuf==6.31.1
+pygments==2.19.2
+pytest==8.4.1
+pytest-asyncio==1.1.0
+pytest-cov==6.2.1
+pytest-mock==3.14.1
+pyyaml==6.0.2
+ray==2.48.0
+referencing==0.36.2
+requests==2.32.4
+rpds-py==0.26.0
+sortedcontainers==2.4.0
+tomli==2.2.1
+typing-extensions==4.14.1
+urllib3==2.5.0

.riot/requirements/14b9fa5.txt

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+#
+# This file is autogenerated by pip-compile with Python 3.12
+# by the following command:
+#
+# pip-compile --allow-unsafe --no-annotate .riot/requirements/14b9fa5.in
+#
+attrs==25.3.0
+certifi==2025.7.14
+charset-normalizer==3.4.2
+click==8.2.1
+coverage[toml]==7.10.1
+filelock==3.18.0
+hypothesis==6.45.0
+idna==3.10
+iniconfig==2.1.0
+jsonschema==4.25.0
+jsonschema-specifications==2025.4.1
+mock==5.2.0
+msgpack==1.1.1
+opentracing==2.4.0
+packaging==25.0
+pluggy==1.6.0
+protobuf==6.31.1
+pygments==2.19.2
+pytest==8.4.1
+pytest-asyncio==1.1.0
+pytest-cov==6.2.1
+pytest-mock==3.14.1
+pyyaml==6.0.2
+ray==2.48.0
+referencing==0.36.2
+requests==2.32.4
+rpds-py==0.26.0
+sortedcontainers==2.4.0
+typing-extensions==4.14.1
+urllib3==2.5.0

.riot/requirements/14de711.txt

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+#
+# This file is autogenerated by pip-compile with Python 3.10
+# by the following command:
+#
+# pip-compile --allow-unsafe --no-annotate .riot/requirements/14de711.in
+#
+attrs==25.3.0
+backports-asyncio-runner==1.2.0
+certifi==2025.7.14
+charset-normalizer==3.4.2
+click==8.2.1
+coverage[toml]==7.10.1
+exceptiongroup==1.3.0
+filelock==3.18.0
+hypothesis==6.45.0
+idna==3.10
+iniconfig==2.1.0
+jsonschema==4.25.0
+jsonschema-specifications==2025.4.1
+mock==5.2.0
+msgpack==1.1.1
+opentracing==2.4.0
+packaging==25.0
+pluggy==1.6.0
+protobuf==6.31.1
+pygments==2.19.2
+pytest==8.4.1
+pytest-asyncio==1.1.0
+pytest-cov==6.2.1
+pytest-mock==3.14.1
+pyyaml==6.0.2
+ray==2.48.0
+referencing==0.36.2
+requests==2.32.4
+rpds-py==0.26.0
+sortedcontainers==2.4.0
+tomli==2.2.1
+typing-extensions==4.14.1
+urllib3==2.5.0

.riot/requirements/1871782.txt

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+#
+# This file is autogenerated by pip-compile with Python 3.9
+# by the following command:
+#
+# pip-compile --allow-unsafe --no-annotate .riot/requirements/1871782.in
+#
+attrs==25.3.0
+backports-asyncio-runner==1.2.0
+certifi==2025.7.14
+charset-normalizer==3.4.2
+click==8.1.8
+coverage[toml]==7.10.1
+exceptiongroup==1.3.0
+filelock==3.18.0
+hypothesis==6.45.0
+idna==3.10
+iniconfig==2.1.0
+jsonschema==4.25.0
+jsonschema-specifications==2025.4.1
+mock==5.2.0
+msgpack==1.1.1
+opentracing==2.4.0
+packaging==25.0
+pluggy==1.6.0
+protobuf==6.31.1
+pygments==2.19.2
+pytest==8.4.1
+pytest-asyncio==1.1.0
+pytest-cov==6.2.1
+pytest-mock==3.14.1
+pyyaml==6.0.2
+ray==2.48.0
+referencing==0.36.2
+requests==2.32.4
+rpds-py==0.26.0
+sortedcontainers==2.4.0
+tomli==2.2.1
+typing-extensions==4.14.1
+urllib3==2.5.0

.riot/requirements/1d453cc.txt

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+#
+# This file is autogenerated by pip-compile with Python 3.13
+# by the following command:
+#
+# pip-compile --allow-unsafe --no-annotate .riot/requirements/1d453cc.in
+#
+attrs==25.3.0
+certifi==2025.7.14
+charset-normalizer==3.4.2
+click==8.2.1
+coverage[toml]==7.10.1
+filelock==3.18.0
+hypothesis==6.45.0
+idna==3.10
+iniconfig==2.1.0
+jsonschema==4.25.0
+jsonschema-specifications==2025.4.1
+mock==5.2.0
+msgpack==1.1.1
+opentracing==2.4.0
+packaging==25.0
+pluggy==1.6.0
+protobuf==6.31.1
+pygments==2.19.2
+pytest==8.4.1
+pytest-asyncio==1.1.0
+pytest-cov==6.2.1
+pytest-mock==3.14.1
+pyyaml==6.0.2
+ray==2.48.0
+referencing==0.36.2
+requests==2.32.4
+rpds-py==0.26.0
+sortedcontainers==2.4.0
+urllib3==2.5.0

.riot/requirements/1d63829.txt

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+#
+# This file is autogenerated by pip-compile with Python 3.10
+# by the following command:
+#
+# pip-compile --allow-unsafe --no-annotate .riot/requirements/1d63829.in
+#
+attrs==25.3.0
+backports-asyncio-runner==1.2.0
+certifi==2025.7.14
+charset-normalizer==3.4.2
+click==8.2.1
+coverage[toml]==7.10.1
+exceptiongroup==1.3.0
+filelock==3.18.0
+hypothesis==6.45.0
+idna==3.10
+iniconfig==2.1.0
+jsonschema==4.25.0
+jsonschema-specifications==2025.4.1
+mock==5.2.0
+msgpack==1.1.1
+opentracing==2.4.0
+packaging==25.0
+pluggy==1.6.0
+protobuf==6.31.1
+pygments==2.19.2
+pytest==8.4.1
+pytest-asyncio==1.1.0
+pytest-cov==6.2.1
+pytest-mock==3.14.1
+pyyaml==6.0.2
+ray==2.48.0
+referencing==0.36.2
+requests==2.32.4
+rpds-py==0.26.0
+sortedcontainers==2.4.0
+tomli==2.2.1
+typing-extensions==4.14.1
+urllib3==2.5.0

.riot/requirements/1dc06a0.txt

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+#
+# This file is autogenerated by pip-compile with Python 3.11
+# by the following command:
+#
+# pip-compile --allow-unsafe --no-annotate .riot/requirements/1dc06a0.in
+#
+attrs==25.3.0
+certifi==2025.7.14
+charset-normalizer==3.4.2
+click==8.2.1
+coverage[toml]==7.10.1
+filelock==3.18.0
+hypothesis==6.45.0
+idna==3.10
+iniconfig==2.1.0
+jsonschema==4.25.0
+jsonschema-specifications==2025.4.1
+mock==5.2.0
+msgpack==1.1.1
+opentracing==2.4.0
+packaging==25.0
+pluggy==1.6.0
+protobuf==6.31.1
+pygments==2.19.2
+pytest==8.4.1
+pytest-asyncio==1.1.0
+pytest-cov==6.2.1
+pytest-mock==3.14.1
+pyyaml==6.0.2
+ray==2.48.0
+referencing==0.36.2
+requests==2.32.4
+rpds-py==0.26.0
+sortedcontainers==2.4.0
+typing-extensions==4.14.1
+urllib3==2.5.0

.riot/requirements/1e480a4.txt

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+#
+# This file is autogenerated by pip-compile with Python 3.13
+# by the following command:
+#
+# pip-compile --allow-unsafe --no-annotate .riot/requirements/1e480a4.in
+#
+attrs==25.3.0
+certifi==2025.7.14
+charset-normalizer==3.4.2
+click==8.2.1
+coverage[toml]==7.10.1
+filelock==3.18.0
+hypothesis==6.45.0
+idna==3.10
+iniconfig==2.1.0
+jsonschema==4.25.0
+jsonschema-specifications==2025.4.1
+mock==5.2.0
+msgpack==1.1.1
+opentracing==2.4.0
+packaging==25.0
+pluggy==1.6.0
+protobuf==6.31.1
+pygments==2.19.2
+pytest==8.4.1
+pytest-asyncio==1.1.0
+pytest-cov==6.2.1
+pytest-mock==3.14.1
+pyyaml==6.0.2
+ray==2.48.0
+referencing==0.36.2
+requests==2.32.4
+rpds-py==0.26.0
+sortedcontainers==2.4.0
+urllib3==2.5.0
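
The lockfiles above are nearly identical across Python versions. The 3.9 and 3.10 files add `backports-asyncio-runner`, `exceptiongroup`, and `tomli`, the 3.9 files pin `click==8.1.8` instead of `8.2.1`, and the 3.13 files drop `typing-extensions`. A reviewer who wants to confirm this without reading each file could use something like the sketch below. `parse_pins` and `diff_pins` are hypothetical helpers written for this note, and the two sample strings are abbreviated excerpts of `106610a.txt` (Python 3.9) and `1d453cc.txt` (Python 3.13), not the full files.

```python
from typing import Dict, List, Tuple


def parse_pins(text: str) -> Dict[str, str]:
    """Map package name to pinned version, skipping comments and blank lines."""
    pins: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


def diff_pins(
    a: Dict[str, str], b: Dict[str, str]
) -> Tuple[List[str], List[str], Dict[str, Tuple[str, str]]]:
    """Return (names only in a, names only in b, pins whose version differs)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = {n: (a[n], b[n]) for n in sorted(set(a) & set(b)) if a[n] != b[n]}
    return only_a, only_b, changed


# Abbreviated excerpts of two of the lockfiles in this commit.
PY39 = """\
# pip-compile header lines are ignored
click==8.1.8
exceptiongroup==1.3.0
ray==2.48.0
tomli==2.2.1
typing-extensions==4.14.1
"""
PY313 = """\
click==8.2.1
ray==2.48.0
"""

if __name__ == "__main__":
    only_39, only_313, changed = diff_pins(parse_pins(PY39), parse_pins(PY313))
    print("only in 3.9:", only_39)
    print("only in 3.13:", only_313)
    print("changed:", changed)
```

On the full files, the same comparison would also surface `backports-asyncio-runner`, which is omitted from the excerpt for brevity.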
