
Conversation

@puneetmatharu (Contributor) commented Dec 19, 2025

Summary of changes

  1. Update pytorch.yml to add ccache support, which involves:

    • mounting a .ccache directory: CCACHE_HOST_DIR: ${{ github.workspace }}/Tool-Solutions/ML-Frameworks/pytorch-aarch64/.ccache
    • caching it with actions/cache@v4 (see the sketch after this list)
      • This bit is nuanced: with the standard pattern, GitHub does not update the cache after the build changes it; it only restores the cache when a match for the cache key is found. We can either save + restore manually, or use the simple single cache call and just rotate the key regularly. I opted for the latter, keeping up to two weeks of data at a time.
      • An alternative that's a little more complex but probably better is to manually save + restore a single cache and ONLY save when the self-tests run on the main branch... (Now I say it out loud, this does sound better.)
      • GitHub wipes unused cache data after 7 days.
      • I've opted for 2G as the max ccache cache size; I found I used ~1.5G locally, so roughly 3-4G for the two weeks of data. Let me know if we want to reduce this to, say, 1G.
  2. Update build.sh and build-wheel.sh to support the use of ccache via the --use-ccache flag or the USE_CCACHE environment variable.

  3. Add the PyTorch patch "Gate deletion of clean-up steps in build_common.sh" (pytorch/pytorch#170600) to enable incremental builds as well as fast fresh builds.
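
For reference, a minimal sketch of the rotating-key pattern from item 1, assuming actions/cache@v4; the step names and the matrix.machine variable are illustrative, not the exact contents of pytorch.yml:

- name: Compute weekly cache key suffix
  id: week
  run: echo "stamp=$(date +%G-%V)" >> "$GITHUB_OUTPUT"

- name: Cache the ccache directory
  uses: actions/cache@v4
  with:
    path: Tool-Solutions/ML-Frameworks/pytorch-aarch64/.ccache
    # The key rotates weekly, so a fresh (immutable) cache entry is saved
    # each week, while restore-keys lets the previous week's entry seed it.
    key: ccache-${{ matrix.machine }}-${{ steps.week.outputs.stamp }}
    restore-keys: |
      ccache-${{ matrix.machine }}-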

Workflow

# Build and fill cache
# Option 1: Env var
USE_CCACHE=1 ./build.sh
# Option 2: CLI flag
./build.sh --use-ccache

# ... wipe repos or make changes ...

# Rebuild using ccache; should be much faster now
USE_CCACHE=1 ./build.sh [--use-existing-sources]
./build.sh --use-ccache [--use-existing-sources]

Review

Requesting review from:

@jondea (Contributor) left a comment


This is very cool to have, but I think it's a bit more complex than it needs to be.

@jondea (Contributor) commented Dec 19, 2025

Maybe I've misunderstood something, but why can't we load and save on every run? ccache will enforce the limit and shouldn't delete things it doesn't use

@puneetmatharu (Contributor, Author) commented Dec 19, 2025

> Maybe I've misunderstood something, but why can't we load and save on every run? ccache will enforce the limit and shouldn't delete things it doesn't use

The cache is immutable, so we can't overwrite it; we can only create a new cache entry. So we need to regularly generate a new cache entry, which we can do based on the current week/year.
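
For example, the week/year component can be derived with date; a sketch matching the key shape visible in the CI log further down (e.g. ccache-c7g-2025-51):

# ISO year and week number, e.g. "2025-51"; the machine name is illustrative.
machine="c7g"
echo "ccache-${machine}-$(date +%G-%V)"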

# container (and thus compilers) that we use to build the torch wheel. As such, you may not want
# to populate the global ccache cache with them. However, if you wish to do so, simply set
# CCACHE_HOST_DIR to that directory.
USE_CCACHE=${USE_CCACHE:-1}
@jondea (Contributor) Dec 22, 2025


I'm confused why we need an env var and a flag. Also, --enable-ccache never has any effect.

For single source of truth, can we just always test ! [[ "$*" == *--disable-ccache* ]]
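
A sketch of what that single-source-of-truth check could look like (illustrative; the excerpt further down shows the shape the final code took):

# ccache is enabled unless --disable-ccache appears anywhere in the arguments.
if [[ "$*" == *--disable-ccache* ]]; then
    ccache_args+=(-e USE_CCACHE=0)
else
    ccache_args+=(-e USE_CCACHE=1)
fi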

@puneetmatharu (Contributor, Author) Dec 22, 2025


Great point! Admittedly, I'd started with just the environment variable and added the flag as it was more user-friendly, but I think you're right. It's best to remove the environment variable and just use the --disable-ccache flag (since the default is to use ccache). I'll sort it out.

@puneetmatharu (Contributor, Author): Done

version=$(cat $PYTORCH_HOST_DIR/version.txt| tr -d "[:space:]" )
OVERRIDE_PACKAGE_VERSION="${version%??}.dev${build_date}${TORCH_RELEASE_ID:+"+$TORCH_RELEASE_ID"}"

if [ "${USE_CCACHE}" == "1" ]; then
@jondea (Contributor):

Consider whether this should be done just in CI in the steps rather than inside build-wheel

@puneetmatharu (Contributor, Author):

True, it's better placed there. I'll take care of it, thanks!

@puneetmatharu (Contributor, Author): Done

    ./get-source.sh
fi

# Use ccache either via env var or command line flag
@jondea (Contributor):

If we always just test ! [[ "$*" == *--disable-ccache* ]] then we don't need any of this brittle argument passing

@puneetmatharu (Contributor, Author):

That's true. I'll sort it out.

@puneetmatharu (Contributor, Author): Done

ccache_args=(-e USE_CCACHE="${USE_CCACHE}")
if [ "${USE_CCACHE}" == "1" ]; then
    echo "Using ccache for build caching"
    if [ -d "${CCACHE_HOST_DIR}" ]; then
@jondea (Contributor):

We have set -x, so I don't think this verbosity is necessary. Can we just do mkdir -p "${CCACHE_HOST_DIR}"?

@puneetmatharu (Contributor, Author):

Okay. I'll change it, but it is a bit odd since we have

if [ -f "$TORCH_BUILD_CONTAINER_ID_FILE" ]; then
    TORCH_BUILD_CONTAINER=$(cat $TORCH_BUILD_CONTAINER_ID_FILE)
    echo "Found an existing torch build container id: $TORCH_BUILD_CONTAINER"
else
    TORCH_BUILD_CONTAINER=""
    echo "Did not find torch build container id in $(readlink -f $TORCH_BUILD_CONTAINER_ID_FILE), we will create one later"
fi

a little further below, which has essentially the same verbosity.

@jondea (Contributor):

Fair point!

@jondea (Contributor):

Even more incentive for me to rip it out 😄

@puneetmatharu (Contributor, Author): Done!

@jondea (Contributor) left a comment

Sorry for all the comments, but I still think there are lines to be removed. I quite like bash, but I'm firmly of the opinion that every line of bash is a liability.

@puneetmatharu (Contributor, Author):

> Sorry for all the comments, but I still think there are lines to be removed. I quite like bash, but I'm firmly of the opinion that every line of bash is a liability.

Not a problem, they're all fair suggestions! I've made the relevant changes now.

jondea previously approved these changes Dec 22, 2025

@jondea (Contributor) left a comment

I have one nit, but I've put you through the wringer enough already for what is a very welcome DevEx improvement.

    ccache_args+=(-e USE_CCACHE=0)
else
    ccache_args+=(-e USE_CCACHE=1)
    if [ ! -d "${CCACHE_HOST_DIR}" ]; then
@jondea (Contributor):

mkdir -p gracefully handles the case where the directory already exists

@puneetmatharu (Contributor, Author):

Ah! TIL. I thought it didn't. Will update now. Well spotted!

@puneetmatharu (Contributor, Author): Done!

@puneetmatharu (Contributor, Author):

On a positive note, it seems the GitHub cache entry (~327MB) is much smaller than what I found locally (1.5GB):

Run actions/cache@v5
Cache hit for restore-key: ccache-c7g-2025-51
Received 4194304 of 342481397 (1.2%), 4.0 MBs/sec
Received 96468992 of 342481397 (28.2%), 46.0 MBs/sec
Received 167772160 of 342481397 (49.0%), 53.3 MBs/sec
Received 251658240 of 342481397 (73.5%), 60.0 MBs/sec
Received 301989888 of 342481397 (88.2%), 57.6 MBs/sec
Received 342481397 of 342481397 (100.0%), 58.9 MBs/sec
Cache Size: ~327 MB (342481397 B)
/usr/bin/tar -xf /opt/actions-runner/_work/_temp/b68812ca-dc46-430e-9336-dd7a860ebb69/cache.tzst -P -C /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions --use-compress-program unzstd
Cache restored successfully
Cache restored from key: ccache-c7g-2025-51

so two weeks of cache data is less than 1GB, which is fantastic.

@jondea (Contributor) commented Dec 22, 2025

That's odd, are we sure it's working? Is it decreasing build times?

@puneetmatharu (Contributor, Author):

> That's odd, are we sure it's working? Is it decreasing build times?

I just checked my cache stats again; it appears to be just 445MB locally. I think the 1.5GB figure was from before I added the ccache cache compression. Let's see how quickly the build runs now...
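
For anyone checking the same thing locally, the relevant knobs are along these lines (real ccache options, assuming a reasonably recent ccache; the 2G cap matches the limit described above):

ccache --set-config=max_size=2G        # cap the cache size, as in this PR
ccache --set-config=compression=true   # compress cached objects on disk
ccache -s                              # print stats, including cache size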

Signed-off-by: Puneet Matharu <[email protected]>
@puneetmatharu (Contributor, Author) commented Dec 22, 2025

Below I've put some timings for the pipelines before and after this change.

Pipeline build time

The table below contains the (simplistic) time reported by GitHub for the "Build Tool-Solutions PyTorch" step of the CI.

Machine   Before    After    Speedup
c7g       26m 42s   13m 6s   ~2.04x
c8g       20m 44s   11m 2s   ~1.88x

However, a noticeable chunk of this time is spent cloning the sources and setting up the final Docker container. A developer who can rebuild using pre-existing sources (as cloning also takes time) should therefore see an even smaller total build time. This is easier to see in the timings below.

torch wheel build (only)

The table below contains the time for the torch wheel build step (only!), extracted from the GitHub logs by comparing the timestamps immediately before and after the build step. These values therefore exclude the time taken to clone sources, build the torchao wheel, and set up the final Docker container.

Machine   Before    After    Speedup
c7g       21m 27s   7m 52s   ~2.73x
c8g       16m 2s    6m 23s   ~2.51x

Raw log data

Machine = c7g

BEFORE:

2025-12-19T05:07:57.4912526Z ++ docker run -t -d -e MAX_JOBS=30 -e OPENBLAS_VERSION=v0.3.30 -e ACL_VERSION=v52.6.0 -e BINARY_ENV_FILE=/tmp/env -e BUILD_ENVIRONMENT=linux-aarch64-binary-manywheel -e DESIRED_CUDA=cpu -e DESIRED_PYTHON=3.10 -e GITHUB_ACTIONS=0 -e GPU_ARCH_TYPE=cpu-aarch64 -e PACKAGE_TYPE=manywheel -e PYTORCH_FINAL_PACKAGE_DIR=/artifacts -e PYTORCH_ROOT=/pytorch -e SKIP_ALL_TESTS=1 -e OPENSSL_ROOT_DIR=/opt/openssl -e CMAKE_INCLUDE_PATH=/opt/openssl/include -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/pytorch:/pytorch -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/results:/artifacts -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/utils:/utils -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/../utils:/common_utils -w / pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-d8be0384e085f551506bd739678109fa0f5ee7ac
...
2025-12-19T05:29:24.9949176Z + docker exec d433437da5aab098305648a5de26bed03095315ad3005529de6ae35f235a45af chown -R 1000:1000 /pytorch /artifacts

= 21m 27s for the build.

AFTER:

2025-12-22T15:24:24.7723207Z ++ docker run -t -d -e MAX_JOBS=30 -e OPENBLAS_VERSION=v0.3.30 -e ACL_VERSION=v52.6.0 -e BINARY_ENV_FILE=/tmp/env -e BUILD_ENVIRONMENT=linux-aarch64-binary-manywheel -e DESIRED_CUDA=cpu -e DESIRED_PYTHON=3.10 -e GITHUB_ACTIONS=0 -e GPU_ARCH_TYPE=cpu-aarch64 -e PACKAGE_TYPE=manywheel -e PYTORCH_FINAL_PACKAGE_DIR=/artifacts -e PYTORCH_ROOT=/pytorch -e SKIP_ALL_TESTS=1 -e OPENSSL_ROOT_DIR=/opt/openssl -e CMAKE_INCLUDE_PATH=/opt/openssl/include -e USE_CCACHE=1 -e CCACHE_DIR=/.ccache -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/.ccache:/.ccache -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/pytorch:/pytorch -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/results:/artifacts -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/utils:/utils -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/../utils:/common_utils -w / pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-d8be0384e085f551506bd739678109fa0f5ee7ac
...
2025-12-22T15:32:16.3202058Z + docker exec b2829d0c023e43826532433b784955b4431b85d17ccb8fb08006f2fce09a3607 chown -R 1000:1000 /pytorch /artifacts /.ccache

= 7m 52s for the build.

Machine = c8g

BEFORE:

2025-12-19T05:07:28.2441930Z ++ docker run -t -d -e MAX_JOBS=30 -e OPENBLAS_VERSION=v0.3.30 -e ACL_VERSION=v52.6.0 -e BINARY_ENV_FILE=/tmp/env -e BUILD_ENVIRONMENT=linux-aarch64-binary-manywheel -e DESIRED_CUDA=cpu -e DESIRED_PYTHON=3.10 -e GITHUB_ACTIONS=0 -e GPU_ARCH_TYPE=cpu-aarch64 -e PACKAGE_TYPE=manywheel -e PYTORCH_FINAL_PACKAGE_DIR=/artifacts -e PYTORCH_ROOT=/pytorch -e SKIP_ALL_TESTS=1 -e OPENSSL_ROOT_DIR=/opt/openssl -e CMAKE_INCLUDE_PATH=/opt/openssl/include -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/pytorch:/pytorch -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/results:/artifacts -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/utils:/utils -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/../utils:/common_utils -w / pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-d8be0384e085f551506bd739678109fa0f5ee7ac
...
2025-12-19T05:23:26.5323358Z + docker exec cdc076a8eb9feca25fc7a648bd70cbd8142f9d32b06629a9e1622912a99a6bc2 chown -R 1000:1000 /pytorch /artifacts

= 16m 2s for the build.

AFTER:

2025-12-22T15:24:00.4364761Z ++ docker run -t -d -e MAX_JOBS=30 -e OPENBLAS_VERSION=v0.3.30 -e ACL_VERSION=v52.6.0 -e BINARY_ENV_FILE=/tmp/env -e BUILD_ENVIRONMENT=linux-aarch64-binary-manywheel -e DESIRED_CUDA=cpu -e DESIRED_PYTHON=3.10 -e GITHUB_ACTIONS=0 -e GPU_ARCH_TYPE=cpu-aarch64 -e PACKAGE_TYPE=manywheel -e PYTORCH_FINAL_PACKAGE_DIR=/artifacts -e PYTORCH_ROOT=/pytorch -e SKIP_ALL_TESTS=1 -e OPENSSL_ROOT_DIR=/opt/openssl -e CMAKE_INCLUDE_PATH=/opt/openssl/include -e USE_CCACHE=1 -e CCACHE_DIR=/.ccache -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/.ccache:/.ccache -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/pytorch:/pytorch -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/results:/artifacts -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/utils:/utils -v /opt/actions-runner/_work/Tool-Solutions/Tool-Solutions/Tool-Solutions/ML-Frameworks/pytorch-aarch64/../utils:/common_utils -w / pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-d8be0384e085f551506bd739678109fa0f5ee7ac
...
2025-12-22T15:30:23.1838663Z + docker exec 6770636835ac6138ff6ff96d6975d0f17d987551beafb68c3fc420cbf93ee197 chown -R 1000:1000 /pytorch /artifacts /.ccache

= 6m 23s for the build.
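
As a sanity check, the quoted durations follow directly from the log timestamps; e.g. for the c7g BEFORE run (a sketch using GNU date):

start="2025-12-19T05:07:57Z"
end="2025-12-19T05:29:24Z"
delta=$(( $(date -ud "$end" +%s) - $(date -ud "$start" +%s) ))
printf '%dm %ds\n' $((delta / 60)) $((delta % 60))   # prints "21m 27s"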
