236 changes: 236 additions & 0 deletions deps/NNNN-container-strategy.md
@@ -0,0 +1,236 @@
# Container Strategy

**Status**: Draft

**Authors**: [nv-tusharma]

**Category**: Architecture

**Replaces**: N/A

**Replaced By**: N/A

**Sponsor**: saturley-hall, nv-anants

**Required Reviewers**: nnshah1, saturley-hall, nv-anants, nvda-mesharma, mc-nv, dmitry-tokarev-nv, pvijayakrish

**Review Date**: 2025-06-03

**Pull Request**: TBD

**Implementation PR / Tracking Issue**: TBD

# Summary

This document outlines a container strategy for Dynamo that improves the developer experience by organizing Dockerfiles to maximize coverage and reuse. Its primary goal is to define a clear and maintainable structure for our Dockerfiles: specifically, to determine how many Dockerfiles we need and to clarify the relationships between base, runtime, development, and CI images. Each environment's image should build upon the previous one (as a superset), maximizing environment consistency and coverage during daily development and testing.

To achieve this goal, this document proposes the following optimizations to the current build process:
- Restructuring the build process around a build-base container that contains all build dependencies, so that each backend can use it to build its final binary.
- Defining a structure/template for all Dockerfiles to follow, with a specific role and use case for each stage, to ensure consistent and reproducible builds across backends.

# Motivation

Dynamo is primarily built from a collection of Dockerfiles hosted in the /containers directory of the [Dynamo repository](https://github.com/NVIDIA/Dynamo). The Dockerfiles are split by backend (vLLM, SGLang, TRT-LLM), and each contains multiple stages (base, devel, ci, runtime) that serve different purposes. Each stage essentially provides a Dynamo build together with the specific backend and NIXL, the high-throughput, low-latency point-to-point communication library that Dynamo uses to accelerate inference.
This approach has several drawbacks:
1. Inefficient Build Times: Components such as Dynamo, NIXL, and the selected backend are rebuilt multiple times across stages instead of leveraging a layered, superset structure. For instance, Dynamo is installed three separate times in Dockerfile.vllm: once each in the base, ci_minimum, and runtime stages.
2. Poor Developer Experience: The lack of clear organization among Dockerfiles makes it difficult for developers to identify which build suits their needs. As a result, the devel build is often used by default, regardless of the use case.
3. Flaky Builds: The large number of layers, combined with steps repeated across stages, causes builds to fail intermittently.
4. Lack of Standardization Across Dockerfiles: There is currently no single, stand-alone Dockerfile that builds Dynamo, NIXL, and Dynamo's dependencies, which leads to duplicated or missing code across Dockerfiles. Optimizations applied to one backend's Dockerfile are not immediately available to the others.

As Dynamo scales to support multiple LLM backends and we work toward providing pre-built Docker containers for external use, we need to define a structure for our Dockerfiles that improves container usability.


## Goals

* Remove duplicate code in the current Dockerfile implementations and define a single build-base image containing all the dependencies needed to build Dynamo and NIXL.

This build-base image should serve as the single base container for backend-specific images. By leveraging it, we can reduce redundant code across Dockerfiles and establish a single source of truth for all Dynamo builds.

* Define the relationships between base, runtime, development, and CI images for each Dockerfile and provide a structure/template for Dockerfiles to follow.

* Reduce build flakiness by pinning/fixing dependencies installed from package managers in the base image and by squashing/reducing layers as necessary.

Pinning/fixing dependencies keeps the build environment consistent, reducing "it works on my machine" and "this worked yesterday" problems.

* Outline possible further improvements, including external caching and multi-context Docker builds, to reduce build times.
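
As a rough illustration of the last goal, the sketch below combines a named build context with a BuildKit cache mount. It is only a sketch under stated assumptions: the image reference, the `backend-src` context name, the `/opt/backend-src` path, and the idea that the backend is a pip-installable Python project are all placeholders, not part of this proposal.

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical sketch; every name and path below is a placeholder.
ARG BUILD_BASE_IMAGE=registry.example.com/dynamo-build-base:replace-with-pinned-tag

FROM ${BUILD_BASE_IMAGE} AS devel

# "backend-src" is a named build context supplied at build time, for example:
#   docker buildx build --build-context backend-src=<path-or-git-url> ...
# It lets the backend sources live outside the main build context.
COPY --from=backend-src . /opt/backend-src

# BuildKit cache mount: pip's download cache persists across builds on the
# same builder, so unchanged dependencies are not downloaded again.
# Assumes the build runs as root and that python3 and pip exist in the base image.
RUN --mount=type=cache,target=/root/.cache/pip \
    python3 -m pip install /opt/backend-src
```

A cache mount only helps on a single builder. Sharing cache between CI runners would instead rely on an external cache, for instance the `--cache-to type=registry,ref=...` and `--cache-from type=registry,ref=...` options of `docker buildx build`. Which of these fits Dynamo's CI is left open here.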

### Non Goals

- Slim backend-specific runtime containers to use for performance testing.
- Unified build environment


## Requirements

### REQ 1 Backend Integration with Base Container
The build-base container must be designed such that backend-specific Dockerfiles can integrate with it with minimal changes to their existing build process. This includes:
- Clear documentation on how to use the base container
- Standardized environment variables and paths

### REQ 2 Layered Container Structure
Dockerfiles must follow a layered, superset structure to optimize build efficiency:
- Each stage should build upon the previous stage
- Artifacts should be built only once and reused across stages
- Clear separation between build-time and runtime dependencies
- Minimal layer count to reduce build complexity

### REQ 3 Stage Purpose Definition
Each build stage must have a clearly defined purpose and scope:
- Base: Common build dependencies and tools
- Development: Additional debugging and development tools
- Runtime: Minimal production deployment requirements
- CI: Testing tools and validation requirements



# Proposal

In order to address the requirements, we propose the following changes to the Dynamo build process:

## Build-Base Container

The build-base container will be a pre-built container that the backends use to build their final container images. It will contain all the dependencies needed to build Dynamo, each either pinned to a specific version or fixed to a particular commit SHA to promote reproducibility. It will also include a NIXL build and NATS and etcd installations, since these are common to all backends. We will create a new Dockerfile for this container in the /containers directory and publish the image through our CI registry so that developers can use it for local development.
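
A minimal sketch of what such a build-base Dockerfile could look like follows. It is illustrative only: the base image reference, the build argument names, the `DYNAMO_HOME` and `NIXL_PREFIX` variables, the label keys, and the package list are all assumptions made for this example, and it assumes a Debian/Ubuntu-based CUDA image.

```dockerfile
# Hypothetical build-base Dockerfile; all names, paths, and defaults are placeholders.
ARG BASE_IMAGE=registry.example.com/cuda-devel
ARG BASE_IMAGE_TAG=replace-with-pinned-tag

FROM ${BASE_IMAGE}:${BASE_IMAGE_TAG} AS build-base

# Sources are pinned by commit SHA or exact version and passed in as build args,
# so the same inputs produce the same image.
ARG NIXL_REPO
ARG NIXL_COMMIT
ARG NATS_VERSION
ARG ETCD_VERSION

# Standardized locations that backend Dockerfiles can rely on (names are assumptions).
ENV DYNAMO_HOME=/opt/dynamo \
    NIXL_PREFIX=/opt/nixl

# Record the pins in image metadata for traceability.
LABEL com.example.nixl.commit="${NIXL_COMMIT}" \
      com.example.nats.version="${NATS_VERSION}" \
      com.example.etcd.version="${ETCD_VERSION}"

# Install OS-level build tools in a single layer and clean the apt cache.
# In a real file each package would carry an exact version (package=version).
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        git \
    && rm -rf /var/lib/apt/lists/*

# Fetch NIXL at the pinned commit.
RUN git clone "${NIXL_REPO}" /tmp/nixl-src \
    && git -C /tmp/nixl-src checkout "${NIXL_COMMIT}"

# The NIXL build and the NATS/etcd installs would follow here; their exact
# commands are not specified in this proposal, so they are left out.
```

Passing the pins as build arguments keeps them in one place (for example, a CI configuration), though hard-coding them in the Dockerfile would serve reproducibility equally well.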

## Build stage use cases and relationships (base, runtime, devel, ci_minimum)

Each backend-specific Dockerfile should follow the same format: it is divided into multiple stages, each of which either uses the previous stage as its base image or inherits artifacts from it. The following stages should be defined in each backend-specific Dockerfile:

| Stage | Targeted User | Base Image | Functionality |
|---------|------------------------------------------|-------------------------|------------------------------------------------------------------------------------------------------------------|
| Devel | Developers | Dynamo build-base image | Builds the targeted backend and Dynamo; includes development tools for debugging and continuous development. |
| Runtime | Customers/Production | CUDA base runtime image | Minimal image with only the dependencies required to deploy and run Dynamo; intended for production deployments. |
| CI | Internal CI pipelines/local CI debugging | Runtime image | Adds CI-specific tools, QA test scripts, internal models, and other dependencies needed for automated testing. |
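
The stage relationships in the table could be expressed as a multi-stage Dockerfile along the following lines. This is a sketch rather than the proposed template itself: the image references, the `containers/build_backend.sh` script, the `/opt/dynamo/dist` artifact path, and the `tests/requirements.txt` file are hypothetical, and the runtime base image is assumed to ship `python3` with `pip`.

```dockerfile
# Hypothetical backend Dockerfile; image names, scripts, and paths are placeholders.
ARG BUILD_BASE_IMAGE=registry.example.com/dynamo-build-base:replace-with-pinned-tag
ARG RUNTIME_BASE_IMAGE=registry.example.com/cuda-runtime:replace-with-pinned-tag

# Devel: starts from the build-base image and builds the backend and Dynamo once.
FROM ${BUILD_BASE_IMAGE} AS devel
WORKDIR /workspace
COPY . /workspace
# Hypothetical build script that writes wheels to /opt/dynamo/dist.
RUN ./containers/build_backend.sh --output /opt/dynamo/dist

# Runtime: starts from a slim CUDA runtime image and copies in only the built
# artifacts, so compilers and development tools stay out of production.
FROM ${RUNTIME_BASE_IMAGE} AS runtime
COPY --from=devel /opt/dynamo/dist /opt/dynamo/dist
RUN python3 -m pip install --no-cache-dir /opt/dynamo/dist/*.whl

# CI: a superset of runtime that adds test tooling and test scripts.
FROM runtime AS ci
COPY tests/ /workspace/tests/
RUN python3 -m pip install --no-cache-dir -r /workspace/tests/requirements.txt
```

With this layout a single file serves all three audiences: `docker build --target devel .`, `--target runtime`, or `--target ci` selects the stage, and the artifacts built in `devel` are reused rather than rebuilt.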


# Implementation Details

## Container Build Flow

<img src="container_strategy_proposal.png" width="600" alt="Container Strategy Diagram" style="object-fit: contain;">

The diagram above illustrates the proposed container strategy showing the relationships between:
- Build Base Container with common dependencies
- Backend-specific development containers
- Runtime containers
- CI containers

This layered approach ensures consistent builds, reduces duplication, and improves maintainability across all backend implementations.


## Deferred to Implementation

TBD

# Implementation Phases

## Phase 1 Build Base Container Development

**Release Target**: TBD

**Effort Estimate**: \<estimate of time and number of engineers to complete the phase\>

**Work Item(s):** \<one or more links to github issues\>

**Supported API / Behavior:**

* \<name and concise description of the API / behavior\>

**Not Supported:**

* \<name and concise description of the API / behavior\>

## Phase 2 Restructure backend Dockerfiles to follow proposed structure

**Release Target**: Date

**Effort Estimate**: \<estimate of time and number of engineers to complete the phase\>

**Work Item(s):** \<one or more links to github issues\>

**Supported API / Behavior:**

* \<name and concise description of the API / behavior\>

**Not Supported:**

* \<name and concise description of the API / behavior\>

# Related Proposals

**\[Optional \- if not applicable omit\]**

* File

* File

* File

* File

* File

# Alternate Solutions

**\[Required, if not applicable write N/A\]**

List out solutions that were considered but ultimately rejected. Consider free form \- but a possible format shown below.

## Alt \<\#\> \<Title\>

**Pros:**

\<bulleted list or pros describing the positive aspects of this solution\>

**Cons:**

\<bulleted list or pros describing the negative aspects of this solution\>

**Reason Rejected:**

\<bulleted list or pros describing why this option was not used\>

**Notes:**

\<optional: additional comments about this solution\>

# Background

**\[Optional \- if not applicable omit\]**

Add additional context and references as needed to help reviewers and authors understand the context of the problem and solution being proposed.

## References

**\[Optional \- if not applicable omit\]**

Add additional references as needed to help reviewers and authors understand the context of the problem and solution being proposed.

* \<hyper-linked title of an external reference resource\>

## Terminology & Definitions

**\[Optional \- if not applicable omit\]**

List out additional terms / definitions (lexicon). Try to keep definitions as concise as possible and use links to external resources when additional information would be useful to the reader.

Keep the list of terms sorted alphabetically to ease looking up definitions by readers.

| \<Term\> | \<Definition\> |
| :---- | :---- |
| **\<Term\>** | \<Definition\> |

## Acronyms & Abbreviations

**\[Optional \- if not applicable omit\]**

Provide a list of frequently used acronyms and abbreviations which are uncommon or unlikely to be known by the reader. Do not include acronyms or abbreviations which the reader is likely to be familiar with.

Keep the list of acronyms and abbreviations sorted alphabetically to ease looking up definitions by readers.

Do not include the full definition in the expanded meaning of an abbreviation or acronym. If the reader needs the definition, please include it in the [Terminology & Definitions](#terminology--definitions) section.

**\<Acronym/Abbreviation\>:** \<Expanded Meaning\>

Binary file added deps/container_strategy_proposal.png