Commit 765e627

verdimrc and Verdi March authored
Fix nccl-test container image fails with cuda driver mismatch (#314)
Update cuda compat package in the container image to fix error:

7: ip-10-1-113-84: Test CUDA failure common.cu:894 'system has unsupported display driver / cuda driver combination'
7: .. ip-10-1-113-84 pid 738939: Test failure common.cu:844

Co-authored-by: Verdi March <[email protected]>
1 parent 6caf490 commit 765e627

1 file changed, +5 -1 lines changed

micro-benchmarks/nccl-tests/nccl-tests.Dockerfile

Lines changed: 5 additions & 1 deletion
@@ -44,7 +44,8 @@ RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
     openssh-server \
     pkg-config \
     python3-distutils \
-    vim
+    vim \
+    && apt-get install -y --upgrade ${NV_CUDA_COMPAT_PACKAGE}
 
 RUN mkdir -p /var/run/sshd
 RUN sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
@@ -60,6 +61,9 @@ RUN curl https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py \
 
 #################################################
 ## Install NVIDIA GDRCopy
+##
+## NOTE: if `nccl-tests` or `/opt/gdrcopy/bin/sanity -v` crashes with incompatible version, ensure
+## that the cuda-compat-xx-x package is the latest.
 RUN git clone -b ${GDRCOPY_VERSION} https://github.com/NVIDIA/gdrcopy.git /tmp/gdrcopy \
     && cd /tmp/gdrcopy \
     && make prefix=/opt/gdrcopy install
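
The NOTE added in this diff mentions the cuda-compat-xx-x package without spelling out a version. As a rough sketch only, assuming the `cuda-compat-<major>-<minor>` naming pattern that the placeholder suggests, a small shell helper could derive the package name from a CUDA version string. The function name `cuda_compat_pkg` is hypothetical and not part of this commit, and the `CUDA_VERSION` variable in the commented usage line is assumed to be set by the base image:

```shell
#!/bin/sh
# Hypothetical helper (not part of this commit): build the cuda-compat
# package name from a CUDA version string such as "12.2" or "11.8.0".
# Assumes the cuda-compat-<major>-<minor> naming pattern and an input
# that contains at least "<major>.<minor>".
cuda_compat_pkg() {
    major=$(printf '%s' "$1" | cut -d. -f1)   # text before the first dot
    minor=$(printf '%s' "$1" | cut -d. -f2)   # text between first and second dot
    printf 'cuda-compat-%s-%s\n' "$major" "$minor"
}

# Possible use during an image build (left commented out so the sketch
# has no side effects; CUDA_VERSION is an assumption about the base image):
# apt-get update && apt-get install -y --only-upgrade "$(cuda_compat_pkg "$CUDA_VERSION")"
```

The commit itself relies on `${NV_CUDA_COMPAT_PACKAGE}` instead, which avoids having to compute the name at all; the helper is only useful where that variable is not available.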
