feat: Container build process improvements to reduce build times and improve Dockerfile organization #8
base: main
Conversation
deps/NNNN-container-strategy.md
Outdated
### Non Goals

- Slim backend-specific runtime containers to use for performance testing.
- Unified build environment
Do we want to add something about api-store and the other k8s containers here?
Not able to find this. Is this updated?
…nd specific build
…with discussions around multi-arch support
…process into separate DEPs
LGTM! 🚀
## Container Build Flow

```mermaid
```
We should also consider a path for doing editable builds in framework runtime containers. After the build, there should be a way to mount a local repo checkout in the container and do a pip editable install to replace the wheel. Technically nothing needs to change in the build itself, but it would be nice for the DEP to describe how to do it.
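For illustration, a minimal sketch of what that editable-install flow could look like; the image name, mount path, and entrypoint are placeholders, not anything defined by the DEP:

```bash
# Hypothetical sketch: mount a local repo checkout into a framework runtime
# container and replace the pre-installed wheel with an editable install.
# Image name and mount path are placeholders, not defined by the DEP.
docker run --rm -it \
  -v "$(pwd)":/workspace/dynamo \
  -w /workspace/dynamo \
  dynamo-runtime:latest \
  bash -c "pip install -e ."
```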
Whenever we introduce a new container, customers have to do security scans, QA, etc. Do we want to reconsider the strategy of having backend-specific containers?
**Not Supported:**
Can we also add something about future caching we can leverage to speed things up, like sccache for Rust builds? Not urgent, but it would be good to list out some future optimizations.
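As one possible future optimization, here is a hedged sketch of wiring sccache into a Rust build stage; the bucket name, region, and the choice of cache backend are assumptions, and credentials handling is omitted:

```dockerfile
# Hypothetical sketch of caching Rust compilation with sccache; the S3 bucket,
# region, and cache backend are placeholders, and credentials (e.g. via build
# secrets) are intentionally omitted.
FROM rust:1.78 AS rust-build
RUN cargo install sccache
ENV RUSTC_WRAPPER=sccache \
    SCCACHE_BUCKET=dynamo-build-cache \
    SCCACHE_REGION=us-east-1
COPY . /workspace
WORKDIR /workspace
# cargo invocations are now transparently cached by sccache
RUN cargo build --release && sccache --show-stats
```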
@nvda-mesharma It makes sense to have a single container to reduce complexity but there are a few things to consider with this approach:
I think providing a Dynamo base image with minimal dependencies is sufficient for now but I'll add this suggestion in the Alternate Solutions section.
@nv-tusharma please add an appropriate PR description.
@@ -0,0 +1,338 @@
# Dynamo Container Build Process Improvements
Do we require a copyright header for DEPs written by NVIDIA employees?
deps/NNNN-container-strategy.md
Outdated
## Terminology & Definitions

| Term | Definition |
Please try to keep the markdown readable as plain text.
| Term | Definition |
| :----------------- | :--------------------------------------------------------------------------------------------- |
| **Base Container** | Pre-built container with Dynamo and NIXL that serves as foundation for backend-specific builds |
| **Backend Build** | Container stage that builds backend-specific code and dependencies |
| **sccache** | Compiler cache tool that speeds up recompilation by caching previous compilation results |
| **CI Stage** | Container stage with testing tools and validation requirements |
| **Manylinux** | PEP 513 standard for Linux wheel distribution compatibility |
| **NIXL** | High-throughput, low-latency point-to-point communication library for accelerating inference |
| **Runtime Stage** | Minimal container stage with only production deployment requirements |
This document proposes solutions to the build process challenges, aiming to improve overall container build efficiency and developer experience.

## Goals
What is the plan between GitHub and GitLab?
How do we make sure we share a cache, and that it is clear where builds should occur? I think we should do all building in GitHub as it relates to CI/CD.
GitLab, in my opinion, should only be used for benchmarking and should just reference containers that we build from GitHub.
@nvda-mesharma, @nv-tusharma - is it feasible to build all containers on GitHub runners and then pull them into GitLab? That might be better discussed in the test strategy than the container strategy, though.
Let's scope this to how things are built and run.
Need to add some information about potentially publishing a base container with NIXL + UCX pinned that we can build Dynamo on top of. Since NIXL is pretty much locked and does not change often, and similarly for other packages, we could have a separate workflow that publishes a public image we can build on top of.
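A rough sketch of what such a pinned base image could look like, built from a separate, rarely-triggered workflow; the base image tag, pinned versions, and exact build commands are placeholders:

```dockerfile
# Hypothetical sketch of a pinned NIXL + UCX base image; base image tag,
# versions, and exact build commands are placeholders.
ARG CUDA_IMAGE=nvcr.io/nvidia/cuda:XX.YY-devel-ubuntu22.04
FROM ${CUDA_IMAGE} AS nixl_ucx_base

RUN apt-get update && apt-get install -y --no-install-recommends \
        git build-essential autoconf automake libtool && \
    rm -rf /var/lib/apt/lists/*

# Pin the rarely-changing dependencies so the image only needs republishing
# when these versions are bumped.
ARG UCX_VERSION=v1.17.0
ARG NIXL_VERSION=x.y.z

RUN git clone --depth 1 --branch ${UCX_VERSION} https://github.com/openucx/ucx.git /opt/ucx && \
    cd /opt/ucx && ./autogen.sh && \
    ./configure --prefix=/usr/local/ucx && \
    make -j"$(nproc)" && make install

# NIXL build steps, pinned to ${NIXL_VERSION}, would follow here; Dynamo images
# could then start FROM this published image and skip these slow layers entirely.
```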
Overall this looks great. Added a comment about the readability of the Mermaid graph when using specified colors.
deps/NNNN-container-strategy.md
Outdated
E[Dynamo Base Container]:::gray
F[Build Backend from source]
G[Backend Build Image]:::gray
J[Cuda Runtime<br/>nvcr.io/nvidia/cuda.XX.YY-runtime]:::gray
Changing this or not is your call. I just wanted to call this out and provide the information.
The base container will be a pre-built container that the backends use to build the final container image. This build base container will contain a Dynamo build for all backends to use in their framework-specific builds. The base image will leverage a manylinux base image to enable support for multiple distributions (U22, U24, etc.). The container will also include a NIXL build, since this is common across all backends. This will be a new Dockerfile in the /containers directory. Multi-arch support is also a P0. Additionally, the base container can be used as a drop-in replacement for the current devel container used in public CI. This would significantly reduce public CI build times and enable faster validation of changes.
## Use cases of build stages and the relationships between stages (base, runtime, devel, ci_minimum)
To match the latest changes, can we add the dev stage here?
LGTM, I'm happy with what we have here.
### Implementation Considerations
- **Cache Invalidation**: Implement smart cache invalidation based on dependency changes and version updates
- **Monitoring**: Add build time metrics to measure cache effectiveness and identify optimization opportunities
Suggested change:
- **Monitoring**: Add build time metrics to measure cache effectiveness and identify optimization opportunities
+ **Monitoring**: Collect build time metrics to measure cache effectiveness and identify optimization opportunities
### Implementation Considerations
- **Cache Invalidation**: Implement smart cache invalidation based on dependency changes and version updates
- **Monitoring**: Add build time metrics to measure cache effectiveness and identify optimization opportunities
- **CI Integration**: Configure CI/CD pipelines to properly utilize remote caching storage.
The context of the document includes both the Docker cache and the compiler cache; can we be more specific about this task?
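For the Docker-layer side, one concrete option, shown only as a sketch, is BuildKit's registry cache; the registry, image names, stage name, and Dockerfile path below are placeholders:

```bash
# Hypothetical sketch of remote Docker layer caching in CI using BuildKit's
# registry cache; registry, image names, target stage, and Dockerfile path
# are placeholders.
docker buildx build \
  --cache-from type=registry,ref=registry.example.com/dynamo/cache:base \
  --cache-to type=registry,ref=registry.example.com/dynamo/cache:base,mode=max \
  --target dynamo_base \
  -t registry.example.com/dynamo/base:latest \
  -f container/Dockerfile .
```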
## Deferred to Implementation

TBD
Confused by this section... it may be redundant.
Pinning/fixing dependencies will ensure a unified build environment, reducing "it works on my machine" or "this worked yesterday" problems.

* Minimize effort for providing multi-arch support across various backends for Dynamo by leveraging manylinux to build for multiple distributions
Is this effort elaborated somewhere?
* Minimize effort for providing multi-arch support across various backends for Dynamo by leveraging manylinux to build for multiple distributions

* Implement remote compiler caching to dramatically reduce build times across development and CI/CD environments
This one could be plural, especially since different build tools may use different cache endpoints.
### REQ \<\#1\> \<Backend Integration with Base Container\>
The build-base container must be designed such that backend-specific Dockerfiles can integrate with it with minimal changes to their existing build process. This includes:
- Multi-arch support is a P0. The base container should be able to support both x86_64 and arm64 builds.
Struggling to understand this point. May I know what kind of support we are looking for?
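To make the x86_64/arm64 requirement concrete, here is a hedged sketch using docker buildx multi-platform builds; the builder name, image name, target stage, and Dockerfile path are placeholders:

```bash
# Hypothetical sketch of building the base container for both architectures;
# builder name, image name, target stage, and Dockerfile path are placeholders.
docker buildx create --name dynamo-builder --use
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --target dynamo_base \
  -t registry.example.com/dynamo/base:latest \
  --push \
  -f container/Dockerfile .
```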
### REQ \<\#2\> \<Layered Container Structure\>
Dockerfiles must follow a layered, super-set structure to optimize build efficiency:
- Each stage should build upon the previous stage or use artifacts from the previous stage
Doesn't Docker build already handle this? If the cache is available and concurrency is possible, won't Docker try to utilize it regardless of stage ordering?
Dockerfiles must follow a layered, super-set structure to optimize build efficiency:
- Each stage should build upon the previous stage or use artifacts from the previous stage
- Artifacts should be built only once and reused across stages
- Clear separation between build-time and runtime dependencies
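For illustration, a minimal sketch of the layered stage structure described in this excerpt; the stage names, base images, and artifact paths are assumptions, not the DEP's final layout:

```dockerfile
# Hypothetical sketch of layered, super-set stages; stage names, base images,
# and artifact paths are placeholders.
FROM dynamo-base:latest AS build
COPY . /workspace
WORKDIR /workspace
# Build-time dependencies and compilation live only in this stage.
RUN pip wheel --wheel-dir /workspace/dist .

FROM nvcr.io/nvidia/cuda:XX.YY-runtime AS runtime
# Only the built artifacts are copied forward; build tools stay behind.
COPY --from=build /workspace/dist /tmp/dist
RUN pip install /tmp/dist/*.whl && rm -rf /tmp/dist

FROM runtime AS ci_minimum
# The CI stage is a super-set of runtime: add test tooling on top.
RUN pip install pytest
```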
That's a packaging issue IMO; have we had a conversation with the developers about it?
This DEP introduces a container build process optimization strategy for Dynamo to enhance developer experience and build efficiency by reorganizing Dockerfiles. We are introducing a new Dynamo base container, which contains all the build logic to build Dynamo + NIXL wheels, along with an executable base container (with Dynamo + NIXL pre-installed) that can be used for CI. We also define a structure for the backend-specific Dockerfiles and introduce other build improvements such as external caching.