@nv-tusharma nv-tusharma commented Jun 5, 2025

This DEP introduces a container build process optimization strategy for Dynamo to enhance developer experience and build efficiency by reorganizing Dockerfiles. We are introducing a new Dynamo base container that contains all of the build logic for producing Dynamo + NIXL wheels, along with a runnable base container (with Dynamo + NIXL pre-installed) that can be used for CI. We also define a structure for the backend-specific Dockerfiles and introduce other build improvements such as external caching.

@nv-tusharma nv-tusharma requested review from nv-anants and a team June 5, 2025 20:11
### Non Goals

- Slim backend-specific runtime containers to use for performance testing.
- Unified build environment

Do we want to add something about api-store and the other k8s containers here?

I'm not able to find this; has it been updated?

@nv-tusharma nv-tusharma changed the title feat: container strategy proposal feat: container build process improvements Jul 15, 2025
@nv-tusharma nv-tusharma changed the title feat: container build process improvements feat: Container build process improvements to reduce build times and improve Dockerfile organization Jul 15, 2025

@biswapanda biswapanda left a comment

LGTM! 🚀

@nv-tusharma nv-tusharma requested review from nnshah1 and removed request for nnshah1 July 21, 2025 23:27

## Container Build Flow

```mermaid
%% diagram omitted in this excerpt
```

We should also consider a path for doing editable builds in the framework runtime containers. After the build, there should be a way to mount a local repo checkout into the container and do a pip editable install to replace the wheel. Technically nothing changes in the build itself, but it would be nice for the DEP to describe how to do it.
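
For illustration only, an editable-install flow might look roughly like the sketch below; the image tag, mount path, and package name are hypothetical placeholders:

```bash
# Start a framework runtime container with the local checkout mounted (image tag is hypothetical)
docker run --gpus all -it \
  -v "$PWD":/workspace/dynamo \
  dynamo-vllm-runtime:latest bash

# Inside the container: swap the pre-installed wheel for an editable install
pip uninstall -y ai-dynamo            # package name assumed; use the actual wheel name
pip install -e /workspace/dynamo
```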

@nvda-mesharma

Whenever we introduce a new container, customers have to do security scans, QA, etc. Do we want to reconsider the strategy of having backend-specific containers?
LLM NIM has been doing this for a year now: one container with TRT-LLM, vLLM, and SGLang. Shall we also create a container with all backends included?


**Not Supported:**


Can we also add something about future caching we could leverage to speed things up, like sccache for Rust builds? Not urgent, but it would be good to list out some future optimizations.
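
As one possible future optimization (not something the DEP commits to), Rust builds could be routed through sccache; the bucket name and install method below are placeholders:

```dockerfile
# Sketch only: cache Rust compilation results with sccache (S3 bucket name is a placeholder)
RUN cargo install sccache --locked
ENV RUSTC_WRAPPER=sccache \
    SCCACHE_BUCKET=dynamo-build-cache
RUN cargo build --release && sccache --show-stats
```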

@nv-tusharma nv-tusharma commented Jul 22, 2025

> Whenever we introduce a new container, customers have to do security scans, QA, etc. Do we want to reconsider the strategy of having backend-specific containers?
> LLM NIM has been doing this for a year now: one container with TRT-LLM, vLLM, and SGLang. Shall we also create a container with all backends included?

@nvda-mesharma It makes sense to have a single container to reduce complexity, but there are a few things to consider with this approach:

- Is it feasible for a user to want to deploy multiple inference engines at once with Dynamo?
- Container size: if a user is only interested in one particular backend, we should remove the dependencies associated with the other backends. We would need to provide tools for users to create backend-specific containers for their deployment.
- Backends are expected to differ in performance, which could be a result of their dependencies.
- Build times: if we have to build each backend from source in one container, our build time would increase (since stages can't be parallelized without Docker BuildKit).

I think providing a Dynamo base image with minimal dependencies is sufficient for now but I'll add this suggestion in the Alternate Solutions section.

@whoisj whoisj commented Jul 23, 2025

@nv-tusharma please add an appropriate PR description.

@@ -0,0 +1,338 @@
# Dynamo Container Build Process Improvements

Do we require a copyright header for DEPs written by NVIDIA employees?


## Terminology & Definitions

| Term | Definition |

Please try to keep the markdown readable as plain text.

| Term               | Definition                                                                                     |
| :----------------- | :--------------------------------------------------------------------------------------------- |
| **Base Container** | Pre-built container with Dynamo and NIXL that serves as foundation for backend-specific builds |
| **Backend Build**  | Container stage that builds backend-specific code and dependencies                             |
| **sccache**        | Compiler cache tool that speeds up recompilation by caching previous compilation results       |
| **CI Stage**       | Container stage with testing tools and validation requirements                                 |
| **Manylinux**      | PEP 513 standard for Linux wheel distribution compatibility                                    |
| **NIXL**           | High-throughput, low-latency point-to-point communication library for accelerating inference   |
| **Runtime Stage**  | Minimal container stage with only production deployment requirements                           |


This document proposes solutions to the build process challenges, aiming to improve overall container build efficiency and developer experience.

## Goals

What is the plan between GitHub and GitLab?

How do we make sure we share a cache, and that it is clear where builds should occur? I think we should do all building in GitHub, as it relates to CI/CD.

GitLab, in my opinion, should only be for benchmarking and should just reference containers that we build in GitHub.

@nvda-mesharma, @nv-tusharma - is it feasible to build all containers on GitHub runners and then pull them into GitLab? That might be better discussed in the test strategy than the container strategy, though.

Let's scope this to how things are built and run.

@nv-tusharma

Need to add some information about potentially publishing a base container which has NIXL + UCX pinned that we can build Dynamo on top of. Since NIXL is pretty much locked and does not change often, and similarly for other packages, we could have a separate workflow that publishes a public image we can build on top of.
Not sure if we can do that on GitHub or NGC, but something like:
Base image -- CUDA + NIXL + uv + cargo + ... We use this to build Dynamo and other things
Dynamo image -- base image + Dynamo Rust/Python builds
Framework image -- Dynamo image + framework builds
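
A rough sketch of that layering, shown here as chained build stages for brevity (base image, version, and stage names are hypothetical; in practice each tier could be a separately published image):

```dockerfile
# Sketch of the proposed layering; image name and version are placeholders
FROM nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base
# ... install pinned NIXL, UCX, uv, cargo here; rebuilt only when the pins change ...

FROM base AS dynamo
# ... build the Dynamo Rust/Python wheels on top of the pinned base ...

FROM dynamo AS framework
# ... add the backend-specific engine (vLLM, TRT-LLM, or SGLang) ...
```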

@whoisj whoisj left a comment

Overall this looks great. Added a comment about the readability of the Mermaid graph when using specified colors.

E[Dynamo Base Container]:::gray
F[Build Backend from source]
G[Backend Build Image]:::gray
J[Cuda Runtime<br/>nvcr.io/nvidia/cuda.XX.YY-runtime]:::gray

Selecting a color is always risky due to the user's local experience.

Example:

[attached screenshot]

Changing this or not is your call. I just wanted to call this out and provide the information.


The base container will be a pre-built container that the backends use to build the final container image. This build-base container will contain a Dynamo build for all backends to use in their framework-specific builds. The base image will leverage a manylinux base image to enable support for multiple distributions (Ubuntu 22.04, 24.04, etc.). The container will also include a NIXL build, since this is common across all backends. This will be a new Dockerfile in the /containers directory. Multi-arch support is also a P0. In addition, the base container can be used as a drop-in replacement for the current devel container used in public CI, which would significantly reduce public CI build times and enable faster validation of changes.
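
As an illustration of the multi-arch requirement, the base image could be built and pushed for both architectures with buildx; the Dockerfile path, registry, and tag are placeholders:

```bash
# Build and push the base image for x86_64 and arm64 (paths and names are placeholders)
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -f containers/Dockerfile.base \
  -t <registry>/dynamo-base:latest \
  --push .
```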

## Use cases of the build stages and the relationships between them (base, runtime, devel, ci_minimum)

To match the latest changes, can we add the dev stage here?

@whoisj whoisj commented Sep 3, 2025

LGTM, I'm happy with what we have here.


### Implementation Considerations
- **Cache Invalidation**: Implement smart cache invalidation based on dependency changes and version updates
- **Monitoring**: Add build time metrics to measure cache effectiveness and identify optimization opportunities

Suggested change (replace "Add" with "Collect"):

- **Monitoring**: Collect build time metrics to measure cache effectiveness and identify optimization opportunities

### Implementation Considerations
- **Cache Invalidation**: Implement smart cache invalidation based on dependency changes and version updates
- **Monitoring**: Add build time metrics to measure cache effectiveness and identify optimization opportunities
- **CI Integration**: Configure CI/CD pipelines to properly utilize remote caching storage.
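
If this refers to the Docker layer cache, one concrete option is BuildKit's registry cache backend shared across CI runs; the cache reference and Dockerfile path below are placeholders:

```bash
# Reuse a shared BuildKit layer cache between CI runs (cache ref is a placeholder)
docker buildx build \
  --cache-from type=registry,ref=<registry>/dynamo-build-cache \
  --cache-to type=registry,ref=<registry>/dynamo-build-cache,mode=max \
  -f containers/Dockerfile .
```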

The document's context includes both Docker layer caching and compiler caching; can we be more specific about this task?


## Deferred to Implementation

TBD

I'm confused by this section... it may be redundant.


Pinning/fixing dependencies will ensure a unified build environment, reducing "it works on my machine" and "this worked yesterday" problems.

* Minimize effort for providing multi-arch support across various backends for Dynamo by leveraging manylinux to build for multiple distributions

Is this effort elaborated somewhere?


* Minimize effort for providing multi-arch support across various backends for Dynamo by leveraging manylinux to build for multiple distributions

* Implement remote compiler caching to dramatically reduce build times across development and CI/CD environments

This one could be plural; if we use different build tools, they may use different cache endpoints.


### REQ \<\#1\> \<Backend Integration with Base Container\>
The build-base container must be designed such that backend-specific Dockerfiles can integrate with it with minimal changes to their existing build process. This includes:
- Multi-arch support is a P0. The base container should be able to support both x86_64 and arm64 builds.

Struggling to understand this point; may I know what kind of support we are looking for?


### REQ \<\#2\> \<Layered Container Structure\>
Dockerfiles must follow a layered, super-set structure to optimize build efficiency:
- Each stage should build upon the previous stage or use artifacts from the previous stage

Doesn't Docker build already handle this? If the cache is available and concurrency is possible, won't Docker try to utilize it regardless of stage order?

Dockerfiles must follow a layered, super-set structure to optimize build efficiency:
- Each stage should build upon the previous stage or use artifacts from the previous stage
- Artifacts should be built only once and reused across stages
- Clear separation between build-time and runtime dependencies
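
A rough multi-stage sketch of the structure described in the list above; stage names, paths, the build script, and the runtime base image are hypothetical:

```dockerfile
# Build stage: compile artifacts once, with build-time dependencies only (names/paths are placeholders)
FROM dynamo-base:latest AS build
COPY . /workspace
RUN /workspace/container/build_wheels.sh   # hypothetical helper producing /workspace/dist/*.whl

# Runtime stage: copy the prebuilt artifacts instead of rebuilding them
FROM nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS runtime
COPY --from=build /workspace/dist/*.whl /tmp/wheels/
RUN pip install /tmp/wheels/*.whl && rm -rf /tmp/wheels   # assumes Python/pip are present in the runtime image
```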

That's a packaging issue IMO; have we had a conversation with the developers about it?
