Need to catch when code attempts to run on a device ID that doesn't exist #2711

@ggorman

Description

There are cases where the runtime environment is malformed and Devito is asked to run CUDA code on a device ID that does not exist.

Example: with docker run you cannot combine the CLI arguments --env CUDA_VISIBLE_DEVICES and --gpus "device=${CUDA_VISIBLE_DEVICES:-all}". To see why, suppose the host has export CUDA_VISIBLE_DEVICES=1. Passing --env CUDA_VISIBLE_DEVICES to docker run means the container environment will contain CUDA_VISIBLE_DEVICES=1. However, docker run --gpus "device=${CUDA_VISIBLE_DEVICES:-all}" exposes only GPU 1 to the container and renumbers it as device 0. So when CUDA code runs inside the container, the runtime sees a single GPU with device ID 0, but CUDA_VISIBLE_DEVICES still points at device ID 1, which does not exist there, and you get an (uncaught) exception.
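To make the mismatch concrete, here is a minimal Python sketch of the masking step. It is only a model, not Devito or CUDA code: the function name `visible_after_masking` is hypothetical, and it assumes the documented CUDA behaviour that the CUDA_VISIBLE_DEVICES list is cut at the first invalid index. UUID-style entries are treated as invalid here purely to keep the sketch short.

```python
def visible_after_masking(cuda_visible_devices, n_container_gpus):
    """Model which container-local GPU indices CUDA would still see.

    cuda_visible_devices: value of the env var inside the container (or None).
    n_container_gpus: number of GPUs docker exposed, renumbered from 0.
    """
    if cuda_visible_devices is None:
        # Variable unset: every exposed GPU is visible.
        return list(range(n_container_gpus))
    visible = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        # Assumption: the list is truncated at the first invalid entry.
        if not token.isdigit() or int(token) >= n_container_gpus:
            break
        visible.append(int(token))
    return visible


# Host has CUDA_VISIBLE_DEVICES=1; the container sees one GPU, renumbered to 0.
print(visible_after_masking("1", 1))
```

Under this model the container ends up with no usable device at all, which would be consistent with the failure described above.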

This leads to a very opaque failure that is difficult for users to understand and debug, e.g.:

tests/test_gpu_openacc.py ...............
Error: Process completed with exit code 1.

Therefore, when running on GPUs, we need to sanity-check that we can execute on the target device and, if not, emit an informative message to the user. Alternatively, it may be that we are not checking the error code of a CUDA call that would already provide such a message.
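One possible shape for such a check, as a rough sketch only: the helper names `query_device_count` and `check_device` are hypothetical and not part of Devito. The sketch assumes the CUDA runtime can be loaded as `libcudart.so` through ctypes, and it checks the status returned by `cudaGetDeviceCount` rather than trusting the count blindly.

```python
import ctypes
import os


def query_device_count(libname="libcudart.so"):
    """Return the CUDA device count, or None if it cannot be queried."""
    try:
        cudart = ctypes.CDLL(libname)  # assumed library name
    except OSError:
        return None  # CUDA runtime library not found
    count = ctypes.c_int(0)
    status = cudart.cudaGetDeviceCount(ctypes.byref(count))
    # A non-zero cudaError_t means the call failed; do not trust `count`.
    return count.value if status == 0 else None


def check_device(device_id, device_count):
    """Raise an informative error if `device_id` cannot be used."""
    env = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
    if device_count is None:
        raise RuntimeError(
            "Could not query CUDA devices "
            f"(CUDA_VISIBLE_DEVICES={env}); is the runtime environment valid?"
        )
    if not 0 <= device_id < device_count:
        raise RuntimeError(
            f"Requested CUDA device {device_id}, but only {device_count} "
            f"device(s) are visible (CUDA_VISIBLE_DEVICES={env})."
        )
```

Calling `check_device(device_id, query_device_count())` before launching any GPU work would turn the opaque exit code into a message that names both the requested device and the environment variable.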
