-
Notifications
You must be signed in to change notification settings - Fork 148
Add ccache support
#415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add ccache support
#415
Conversation
d833f1d to
15613eb
Compare
jondea
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very cool to have, but I think it's a bit more complex than it needs to be.
|
Maybe I've misunderstood something, but why can't we load and save on every run? ccache will enforce the limit and shouldn't delete things it doesn't use |
The cache is immutable, so we can't overwrite it. We can only create a new cache entry. So we need to regularly generate a new cache entry which we can do based on the current week/year. |
15613eb to
9a3eebb
Compare
| # container (and thus compilers) that we use to build the torch wheel. As such, you may not want | ||
| # to populate the global ccache cache with them. However, if you wish to do so, simply set | ||
| # CCACHE_HOST_DIR to that directory. | ||
| USE_CCACHE=${USE_CCACHE:-1} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused why we need an env var and a flag. Also, --enable-ccache never has any effect.
For single source of truth, can we just always test ! [[ "$*" == *--disable-ccache* ]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point! Admittedly I'd started with just the environment variable and added the flag as it was more user-friendly but I think you're right. It's best to remove the environment variable and just use the --disable-ccache flag (since the default is to use ccache). I'll sort it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| version=$(cat $PYTORCH_HOST_DIR/version.txt| tr -d "[:space:]" ) | ||
| OVERRIDE_PACKAGE_VERSION="${version%??}.dev${build_date}${TORCH_RELEASE_ID:+"+$TORCH_RELEASE_ID"}" | ||
|
|
||
| if [ "${USE_CCACHE}" == "1" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider whether this should be done just in CI in the steps rather than inside build-wheel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. It's better placed there. I'll take care of it -- thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| ./get-source.sh | ||
| fi | ||
|
|
||
| # Use ccache either via env var or command line flag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we always just test ! [[ "$*" == *--disable-ccache* ]] then we don't need any of this brittle argument passing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. I'll sort it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| ccache_args=(-e USE_CCACHE="${USE_CCACHE}") | ||
| if [ "${USE_CCACHE}" == "1" ]; then | ||
| echo "Using ccache for build caching" | ||
| if [ -d "${CCACHE_HOST_DIR}" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have set -x, so I don't think this verbosity is necessary. Can we just do mkdir -p "${CCACHE_HOST_DIR}"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I'll change it, but it is a bit odd since we have
if [ -f "$TORCH_BUILD_CONTAINER_ID_FILE" ]; then
TORCH_BUILD_CONTAINER=$(cat $TORCH_BUILD_CONTAINER_ID_FILE)
echo "Found an existing torch build container id: $TORCH_BUILD_CONTAINER"
else
TORCH_BUILD_CONTAINER=""
echo "Did not find torch build container id in $(readlink -f $TORCH_BUILD_CONTAINER_ID_FILE), we will create one later"
fia little further below, which has essentially the same verbosity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even more incentive for me to rip it out 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
jondea
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for all the comments, but I still think there are lines to be removed. I quite like bash, but I'm firmly of the opinion that every line of bash is a liability.
9a3eebb to
d3c6afe
Compare
Not a problem, they're all fair suggestions! I've made the relevant changes now |
jondea
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have one nit, but I've put you through the wringer enough already for what is a very welcome DevEx improvement.
| ccache_args+=(-e USE_CCACHE=0) | ||
| else | ||
| ccache_args+=(-e USE_CCACHE=1) | ||
| if [ ! -d "${CCACHE_HOST_DIR}" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mkdir -p gracefully handles the case where the directory already exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! TIL. I thought it did. Will update now. Well spotted!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
|
On a positive note, it seems the GitHub cache entry (~327MB) is much smaller than what I found locally (1.5GB): so two weeks of cache data is less than 1GB which is fantastic. |
|
That's odd, are we sure it's working? Is it decreasing build times? |
d3c6afe to
d34c703
Compare
I just checked my cache stats again; it appears to be just 445MB locally. I think the 1.5GB figure was from before I added the |
Signed-off-by: Puneet Matharu <[email protected]>
d34c703 to
40a3b9b
Compare
|
Below I've put some timings for the pipelines before and after this change. Pipeline build timeThe table below contains the (simplistic) time reported by GitHub for the "Build Tool-Solutions PyTorch" step of the CI.
However, a noticeable chunk of the time taken is spent cloning the sources and setting up the final docker container. A developer who can rebuild using pre-existing sources (as cloning also takes some time) should see an even smaller total build time. This is easier to see with the timings below.
|
| MACHINE | BEFORE | AFTER | SPEEDUP |
|---|---|---|---|
| c7g | 21m 27s | 7m 52s | ~2.73x |
| c8g | 16m 2s | 6m 23s | ~2.51x |
Raw log data
Machine = c7g
BEFORE:
2025-12-19T05:07:57.4912526Z ++ docker run -t -d -e MAX_JOBS=30 -e OPENBLAS_VERSION=v0.3.30 -e ACL_VERSION=v52.6.0 -e BINARY_ENV_FILE=/tmp/env -e BUILD_ENVIRONMENT=linux-aarch64-binary-manywheel -e DESIRED_CUDA=cpu -e DESIRED_PYTHON=3.10 -e GITHUB_ACTIONS=0 -e GPU_ARCH_TYPE=cpu-aarch64 -e PACKAGE_TYPE=manywheel -e PYTORCH_FINAL_PACKAGE_DIR=/artifacts -e PYTORCH_ROOT=/pytorch -e SKIP_ALL_TESTS=1 -e OPENSSL_ROOT_DIR=/opt/openssl -e CMAKE_INCLUDE_PATH=/opt/openssl/include -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/pytorch:/pytorch -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/results:/artifacts -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/utils:/utils -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/../utils:/common_utils -w / pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-d8be0384e085f551506bd739678109fa0f5ee7ac
...
2025-12-19T05:29:24.9949176Z + docker exec d433437da5aab098305648a5de26bed03095315ad3005529de6ae35f235a45af chown -R 1000:1000 /pytorch /artifacts= 21m 27s for the build.
AFTER:
2025-12-22T15:24:24.7723207Z ++ docker run -t -d -e MAX_JOBS=30 -e OPENBLAS_VERSION=v0.3.30 -e ACL_VERSION=v52.6.0 -e BINARY_ENV_FILE=/tmp/env -e BUILD_ENVIRONMENT=linux-aarch64-binary-manywheel -e DESIRED_CUDA=cpu -e DESIRED_PYTHON=3.10 -e GITHUB_ACTIONS=0 -e GPU_ARCH_TYPE=cpu-aarch64 -e PACKAGE_TYPE=manywheel -e PYTORCH_FINAL_PACKAGE_DIR=/artifacts -e PYTORCH_ROOT=/pytorch -e SKIP_ALL_TESTS=1 -e OPENSSL_ROOT_DIR=/opt/openssl -e CMAKE_INCLUDE_PATH=/opt/openssl/include -e USE_CCACHE=1 -e CCACHE_DIR=/.ccache -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/.ccache:/.ccache -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/pytorch:/pytorch -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/results:/artifacts -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/utils:/utils -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/../utils:/common_utils -w / pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-d8be0384e085f551506bd739678109fa0f5ee7ac
...
2025-12-22T15:32:16.3202058Z + docker exec b2829d0c023e43826532433b784955b4431b85d17ccb8fb08006f2fce09a3607 chown -R 1000:1000 /pytorch /artifacts /.ccache= 7m 52s for the build.
Machine = c8g
BEFORE:
2025-12-19T05:07:28.2441930Z ++ docker run -t -d -e MAX_JOBS=30 -e OPENBLAS_VERSION=v0.3.30 -e ACL_VERSION=v52.6.0 -e BINARY_ENV_FILE=/tmp/env -e BUILD_ENVIRONMENT=linux-aarch64-binary-manywheel -e DESIRED_CUDA=cpu -e DESIRED_PYTHON=3.10 -e GITHUB_ACTIONS=0 -e GPU_ARCH_TYPE=cpu-aarch64 -e PACKAGE_TYPE=manywheel -e PYTORCH_FINAL_PACKAGE_DIR=/artifacts -e PYTORCH_ROOT=/pytorch -e SKIP_ALL_TESTS=1 -e OPENSSL_ROOT_DIR=/opt/openssl -e CMAKE_INCLUDE_PATH=/opt/openssl/include -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/pytorch:/pytorch -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/results:/artifacts -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/utils:/utils -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/../utils:/common_utils -w / pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-d8be0384e085f551506bd739678109fa0f5ee7ac
...
2025-12-19T05:23:26.5323358Z + docker exec cdc076a8eb9feca25fc7a648bd70cbd8142f9d32b06629a9e1622912a99a6bc2 chown -R 1000:1000 /pytorch /artifacts= 16m 2s for the build.
AFTER:
2025-12-22T15:24:00.4364761Z ++ docker run -t -d -e MAX_JOBS=30 -e OPENBLAS_VERSION=v0.3.30 -e ACL_VERSION=v52.6.0 -e BINARY_ENV_FILE=/tmp/env -e BUILD_ENVIRONMENT=linux-aarch64-binary-manywheel -e DESIRED_CUDA=cpu -e DESIRED_PYTHON=3.10 -e GITHUB_ACTIONS=0 -e GPU_ARCH_TYPE=cpu-aarch64 -e PACKAGE_TYPE=manywheel -e PYTORCH_FINAL_PACKAGE_DIR=/artifacts -e PYTORCH_ROOT=/pytorch -e SKIP_ALL_TESTS=1 -e OPENSSL_ROOT_DIR=/opt/openssl -e CMAKE_INCLUDE_PATH=/opt/openssl/include -e USE_CCACHE=1 -e CCACHE_DIR=/.ccache -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/.ccache:/.ccache -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/pytorch:/pytorch -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/results:/artifacts -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/utils:/utils -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/../utils:/common_utils -w / pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-d8be0384e085f551506bd739678109fa0f5ee7ac
...
2025-12-22T15:30:23.1838663Z + docker exec 6770636835ac6138ff6ff96d6975d0f17d987551beafb68c3fc420cbf93ee197 chown -R 1000:1000 /pytorch /artifacts /.ccache= 6m 23s for the build.
Summary of changes
Update
pytorch.ymlto add ccache support which involves:.ccachedirectory:CCACHE_HOST_DIR: ${{ github.workspace }}/Tool-Solutions/ML-Frameworks/pytorch-aarch64/.ccacheactions/cache@v4mainbranch... (Now I say it out loud, this does sound better.)ccachecache size; I found I used 1.5G locally, using ~3-4G for the two weeks of data. Let me know if we want to reduce this to, say, 1G.Updating
build.shandbuild-wheel.shto support the use ofccachevia the--use-ccacheflag orUSE_CCACHEenvironment variable.Adds PyTorch patch Gate deletion of clean-up steps in
build_common.shpytorch/pytorch#170600 for incremental builds as well as fast fresh builds.Workflow
Review
Requesting review from: