wstcliyu
Collaborator

Description

This PR enables TPU unit tests to also run with Pathways backend. Essentially, we will have two sets of tests - one with McJAX and one with Pathways.

  • This change is being made to ensure feature parity between Pathways and McJAX.
  • The tests run as part of a docker compose script that sets up the Pathways containers along with MaxText. (GitHub Actions didn't have enough support for deploying Pathways containers as "service containers".)
  • TPU integration tests may also be run with the Pathways backend in the future.

For more details, please read the doc on b/397475777. Note that extra self-hosted runners have been added so that tests can be executed in parallel and complete faster overall.
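
As a rough illustration of the docker compose setup described in the list above, a compose file could look something like the sketch below. This is not the script from this PR: the images, entrypoints, and flags are copied from the `services` section of the workflow snippet under review and from the local test command, while the file layout, the `maxtext` service name, the `depends_on` ordering, and the choice of proxy port are assumptions made for illustration.

```yaml
# Hypothetical docker-compose.yml sketch, not the file used by this PR.
# Images and flags come from the workflow snippet in this PR; the rest is assumed.
services:
  resource_manager:
    image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:latest
    entrypoint:
      - /usr/pathways/run/cloud_pathways_server_sanitized
      - --server_port=29001
      - --node_type=resource_manager
      - --instance_count=1
      - --instance_type=tpuv4:2x2x1
      - --gcs_scratch_location=gs://cloud-pathways-staging/tmp
    environment:
      HOST_ADDRESS: resource_manager
      TPU_SKIP_MDS_QUERY: "true"

  worker:
    image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:latest
    privileged: true  # mirrors --privileged in the workflow snippet
    entrypoint:
      - /usr/pathways/run/cloud_pathways_server_sanitized
      - --server_port=29005
      - --resource_manager_address=resource_manager:29001
      - --gcs_scratch_location=gs://cloud-pathways-staging/tmp
    depends_on:
      - resource_manager  # assumed start order

  proxy:
    image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server:latest
    entrypoint:
      - /usr/pathways/run/cloud_proxy_server_sanitized
      - --server_port=29000
      - --resource_manager_address=resource_manager:29001
      - --gcs_scratch_location=gs://cloud-pathways-staging/tmp
    depends_on:
      - resource_manager  # assumed start order

  # Hypothetical service name; image and command are taken from the local test command.
  maxtext:
    image: us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/maxtext_jax_stable:latest
    environment:
      JAX_PLATFORMS: proxy
      # Assumption: port chosen to match the proxy's --server_port above.
      # The workflow snippet in this PR uses grpc://proxy:29008 instead.
      JAX_BACKEND_TARGET: grpc://proxy:29000
    command:
      - bash
      - -c
      - "cd MaxText ; python3 -m pytest tests -m 'not gpu_only and not integration_test' -s"
    depends_on:
      - proxy
      - worker
```

Containers on the same compose network can reach each other by service name, so the sketch omits the host port mappings that the workflow snippet declares.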

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

  1. Changes tested locally using the command: bash docker_run_pathways_containers.sh maxtext_image=us-docker.pkg.dev/cloud-tpu-v2-images-dev/pathways/maxtext_jax_stable:latest command="cd MaxText ; python3 -m pytest tests -m 'not gpu_only and not integration_test' -s"

  2. Pathways flow tested on a GitHub workflow -
    Example runs -

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

Comment on lines 52 to 102
runs-on: ${{ inputs.cloud_runner != '' && inputs.cloud_runner || fromJson(format('["self-hosted", "{0}", "{1}"]', inputs.device_type, inputs.device_name)) }}
container:
  image: gcr.io/tpu-prod-env-multipod/maxtext_${{ github.run_id }}:${{ inputs.image_type != '' && inputs.image_type || inputs.device_type }}
  volumes:
    - /home/runner/actions-runner/_work/maxtext/maxtext:/deps
  env:
    XLA_PYTHON_CLIENT_MEM_FRACTION: ${{ inputs.xla_python_client_mem_fraction }}
    TF_FORCE_GPU_ALLOW_GROWTH: ${{ inputs.tf_force_gpu_allow_growth }}
    JAX_PLATFORMS: "proxy"
    JAX_BACKEND_TARGET: "grpc://proxy:29008"
  options: ${{ inputs.container_resource_option }}
steps:
  - uses: actions/checkout@v4
  - name: Run Tests
    run: |
      if [ "${{ inputs.is_scheduled_run }}" = "true" ]; then
        FINAL_PYTEST_MARKER="${{ inputs.pytest_marker }}"
      else
        FINAL_PYTEST_MARKER="${{ inputs.pytest_marker }} and not scheduled_only"
      fi
      python3 -m pip install -e . --no-dependencies &&
      python3 -m pytest -v -m "${FINAL_PYTEST_MARKER}" --durations=0

services:
  resource_manager:
    image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:latest
    ports:
      - "29001:29001"
      - "29002:29002"
    options:
      --entrypoint=[/usr/pathways/run/cloud_pathways_server_sanitized, --server_port=29001, --node_type=resource_manager, --instance_count=1, --instance_type=tpuv4:2x2x1, --gcs_scratch_location=gs://cloud-pathways-staging/tmp]
    env:
      HOST_ADDRESS: resource_manager
      TPU_SKIP_MDS_QUERY: true

  worker:
    image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:latest
    ports:
      - "29005:29005"
      - "29006:29006"
      - "8471:8471"
      - "8080:8080"
    options:
      --privileged
      --entrypoint=[/usr/pathways/run/cloud_pathways_server_sanitized, --server_port=29005, --resource_manager_address=resource_manager:29001, --gcs_scratch_location=gs://cloud-pathways-staging/tmp]

  proxy:
    image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server:latest
    ports:
      - "29000:29000"
    options:
      --entrypoint=[/usr/pathways/run/cloud_proxy_server_sanitized, --server_port=29000, --resource_manager_address=resource_manager:29001, --gcs_scratch_location=gs://cloud-pathways-staging/tmp]

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
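
A minimal sketch of the explicit permissions block the warning suggests, using the `{contents: read}` starting point it names. Placing it at the top level of the workflow file is an assumption; the same key can instead be set on an individual job, and the workflow may need additional scopes that are not visible in this snippet.

```yaml
# Restrict the GITHUB_TOKEN to read-only repository contents.
# Top-level placement applies to every job in the workflow (assumed here).
permissions:
  contents: read
```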
wstcliyu force-pushed the wstcliyu/pw-unit branch 6 times, most recently from 5e282c0 to f86f57d on September 22, 2025 23:45