Need to catch when code attempts to run on a device ID that doesn't exist #2711

There are cases where the runtime environment is malformed and Devito is asked to run CUDA code on a device ID that does not exist.

Example: with `docker run`, you cannot combine the CLI arguments `--env CUDA_VISIBLE_DEVICES` and `--gpus "device=${CUDA_VISIBLE_DEVICES:-all}"`. To see why, suppose the host has `export CUDA_VISIBLE_DEVICES=1`. Passing `--env CUDA_VISIBLE_DEVICES` to `docker run` copies `CUDA_VISIBLE_DEVICES=1` into the container environment. Passing `--gpus "device=${CUDA_VISIBLE_DEVICES:-all}"` exposes only GPU 1 to the container, but renumbers it as device 0. CUDA code inside the container therefore sees a single GPU with device ID 0 while `CUDA_VISIBLE_DEVICES` still selects device ID 1, which does not exist, and the result is an (uncaught) exception.
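To make the mismatch concrete, here is a small Python model of how I understand CUDA to interpret `CUDA_VISIBLE_DEVICES`. It is a simplified sketch, not Devito or CUDA code: the function name `visible_devices` is made up for illustration, it only handles integer indices (no UUIDs), and it assumes the documented rule that an invalid index hides itself and every entry after it.

```python
def visible_devices(env_value, ndevices):
    """Return the device indices left visible by a CUDA_VISIBLE_DEVICES value.

    Simplified model (assumption): integer indices only; the first invalid
    index hides itself and every entry that follows it.
    `ndevices` is the number of GPUs exposed to the process (e.g. by Docker).
    """
    if env_value is None:
        # Variable unset: every exposed device is visible.
        return list(range(ndevices))
    visible = []
    for token in env_value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= ndevices:
            break  # invalid index: stop here, nothing after it is visible
        visible.append(int(token))
    return visible


# Host with two GPUs and CUDA_VISIBLE_DEVICES=1: GPU 1 is selected.
print(visible_devices("1", 2))
# Container started with --gpus "device=1": one GPU, renumbered as 0.
# The inherited CUDA_VISIBLE_DEVICES=1 now points at nothing.
print(visible_devices("1", 1))
```

Under this model the container case ends up with no usable device at all, which matches the failure described above.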

This leads to a very opaque failure that is difficult for users to understand and debug, e.g.:
```
tests/test_gpu_openacc.py ...............
Error: Process completed with exit code 1.
```

Therefore, when running on GPUs, we need to sanity-check that we can execute on the target device and, if not, emit an informative message to the user. Alternatively, it might be that we are not checking the error code of a CUDA call that would provide the same message.
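A rough sketch of what such a sanity check could look like is below. Everything here is hypothetical and not existing Devito API: `DeviceNotFoundError`, `check_device_id` and `query_device_count` are names invented for this issue, and querying the count via `cudaGetDeviceCount` from `libcudart.so` through `ctypes` is just one possible way to do it (it assumes the CUDA runtime library is on the loader path).

```python
import ctypes
import os


class DeviceNotFoundError(RuntimeError):
    """Raised when the requested device ID is not available (hypothetical)."""


def check_device_id(device_id, ndevices, env=None):
    """Raise an informative error if `device_id` is not in [0, ndevices)."""
    env = os.environ if env is None else env
    if 0 <= device_id < ndevices:
        return
    msg = (f"Requested device ID {device_id}, but only {ndevices} "
           f"device(s) are visible to this process.")
    cvd = env.get("CUDA_VISIBLE_DEVICES")
    if cvd is not None:
        msg += (f" Note that CUDA_VISIBLE_DEVICES={cvd} is set; inside a "
                f"container, GPUs may have been renumbered starting from 0.")
    raise DeviceNotFoundError(msg)


def query_device_count(libname="libcudart.so"):
    """Ask the CUDA runtime how many devices are visible.

    Returns None if the runtime library cannot be loaded (assumption: the
    caller then skips the check), and 0 if the call reports an error.
    """
    try:
        cudart = ctypes.CDLL(libname)
    except OSError:
        return None
    count = ctypes.c_int(0)
    status = cudart.cudaGetDeviceCount(ctypes.byref(count))
    return count.value if status == 0 else 0  # 0 == cudaSuccess
```

The check itself is pure Python so it can be unit-tested without a GPU, for example `check_device_id(1, 1, env={"CUDA_VISIBLE_DEVICES": "1"})` raises with a message that names the environment variable. Where in Devito this would be called (e.g. before launching an operator on a device) is left open.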
