akkart-aws
Collaborator

What?

Replace full EFA installer with minimal EFA installation and custom libfabric build from source. This change switches from using efa_installer.sh -y -g (which installs OpenMPI, NCCL, and other unwanted packages) to efa_installer.sh -y --minimal (which installs only RDMA core and EFA drivers) combined with a custom libfabric v2.3.0 build configured specifically for EFA, CUDA, and GDRCopy support.

Why?

The original EFA installer was installing unwanted components like OpenMPI and NCCL that were causing package conflicts and bloating the installation. The user specifically needed only EFA drivers and libfabric components without the additional packages that come with the full EFA installer.

How?

  1. Modified EFA installer usage: Changed from -y -g to -y --minimal flag to install only essential RDMA core and EFA driver components

  2. Added custom libfabric build: Downloads libfabric v2.3.0 from GitHub releases and builds with specific configure options:

    • --enable-efa for EFA provider support
    • --disable-verbs --disable-psm3 --disable-opx --disable-usnic --disable-rstream to disable unused providers
    • --with-cuda=/usr/local/cuda --enable-cuda-dlopen for CUDA support
    • --with-gdrcopy --enable-gdrcopy-dlopen for GDRCopy support
  3. Updated build system: Added LIBFABRIC_VERSION and LIBFABRIC_INSTALL_DIR parameters following the existing UCX pattern, updated environment variables and meson configuration to use custom libfabric paths

  4. Applied consistently: Updated .gitlab/build.sh, contrib/Dockerfile, and benchmark/nixlbench/contrib/Dockerfile with the same approach
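
The steps above can be sketched as a small dry-run script. This is only an illustration under stated assumptions, not the exact contents of .gitlab/build.sh or the Dockerfiles: the default value of LIBFABRIC_INSTALL_DIR and the helper name libfabric_configure_args are hypothetical, and the script prints the commands rather than running them. The flags themselves are the ones listed in this description.

```shell
#!/bin/sh
# Dry-run sketch of the minimal EFA install plus custom libfabric configure.
set -eu

# Parameters named in the PR; the install-dir default is an assumption.
LIBFABRIC_VERSION="${LIBFABRIC_VERSION:-2.3.0}"
LIBFABRIC_INSTALL_DIR="${LIBFABRIC_INSTALL_DIR:-/usr/local/libfabric}"

# Hypothetical helper: assemble the configure flags described above.
libfabric_configure_args() {
    args="--prefix=${LIBFABRIC_INSTALL_DIR} --enable-efa"
    # Disable the providers this build does not use.
    for provider in verbs psm3 opx usnic rstream; do
        args="${args} --disable-${provider}"
    done
    args="${args} --with-cuda=/usr/local/cuda --enable-cuda-dlopen"
    args="${args} --with-gdrcopy --enable-gdrcopy-dlopen"
    printf '%s\n' "${args}"
}

# Print the commands instead of executing them.
echo "./efa_installer.sh -y --minimal"
echo "# build libfabric v${LIBFABRIC_VERSION}"
echo "./configure $(libfabric_configure_args)"
```

Keeping the flag list in one function means the same set of options can be reused from each build entry point, which matches the intent of applying the change consistently.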

copy-pr-bot bot commented Sep 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

👋 Hi akkart-aws! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@mkhazraee
Contributor

/build

@mkhazraee
Contributor

/ok to test 904ab5a

yosefe previously approved these changes Sep 17, 2025

akkart-aws force-pushed the efa_installer_minimum branch 2 times, most recently from a0433b3 to c47556d (September 25, 2025 01:12)

@mkhazraee
Contributor

/ok to test c47556d

@mkhazraee
Contributor

/build

Use compile_args instead of include_directories when libfabric_path
is an absolute path to avoid meson build issues with external
library includes.
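
The rule in this commit message can be illustrated with a small shell sketch. It only mirrors the decision (absolute path versus relative path) and is not the actual meson code from the PR: the function name include_strategy and both example paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report which mechanism would carry the libfabric
# include directory, depending on whether the path is absolute.
include_strategy() {
    case "$1" in
        /*) echo "compile_args: -I$1/include" ;;        # absolute path: raw -I flag
        *)  echo "include_directories: $1/include" ;;   # relative path: meson include dir
    esac
}

include_strategy /opt/libfabric
include_strategy subprojects/libfabric
```
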
@mkhazraee
Contributor

/ok to test ed12136

@mkhazraee
Contributor

/build

Add hwloc and hwloc-devel packages to support hardware locality detection.
@mkhazraee
Contributor

/ok to test 7889d3f

@mkhazraee
Contributor

/build

@mkhazraee
Contributor

/build

@mkhazraee
Contributor

/ok to test 6b13de5

4 participants