Description
In some cases the runtime environment is misconfigured and Devito is asked to run CUDA code on a device ID that does not exist.
Example: with docker run, you cannot pass both of the CLI arguments --env CUDA_VISIBLE_DEVICES and --gpus "device=${CUDA_VISIBLE_DEVICES:-all}". To see why, suppose the host has export CUDA_VISIBLE_DEVICES=1. Passing --env CUDA_VISIBLE_DEVICES to docker run means the container environment will contain CUDA_VISIBLE_DEVICES=1. However, passing --gpus "device=${CUDA_VISIBLE_DEVICES:-all}" makes the Docker runtime expose only GPU 1 and renumber it as device 0. CUDA code inside the container therefore sees a single GPU with device ID 0, while CUDA_VISIBLE_DEVICES still points at device ID 1, and the result is an (uncaught) exception.
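The mismatch can be illustrated with a small model of how CUDA_VISIBLE_DEVICES is interpreted. This is only a sketch: `visible_devices` is a hypothetical helper, not part of Devito or CUDA, and it assumes the mask contains integer indices only (UUID entries are not modelled). It follows the documented CUDA rule that entries are read in order and everything from the first invalid index onwards is ignored.

```python
def visible_devices(mask, n_exposed):
    """Model which exposed GPUs a CUDA process can see.

    mask      -- value of CUDA_VISIBLE_DEVICES, or None if unset
    n_exposed -- number of GPUs exposed to the process (e.g. by docker --gpus),
                 already renumbered as 0 .. n_exposed - 1
    """
    if mask is None:
        # No mask: every exposed device is visible.
        return list(range(n_exposed))
    visible = []
    for entry in mask.split(","):
        entry = entry.strip()
        if not entry:
            break
        index = int(entry)  # assumption: integer indices only
        if not 0 <= index < n_exposed:
            # First invalid index hides itself and everything after it.
            break
        visible.append(index)
    return visible


# Container started with --gpus "device=1": one GPU exposed, renumbered to 0.
print(visible_devices("1", 1))   # mask inherited from the host
print(visible_devices(None, 1))  # mask not forwarded into the container
```

Under this model, forwarding CUDA_VISIBLE_DEVICES=1 into a container that exposes a single GPU leaves no usable device, whereas omitting the variable leaves device 0 available.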
This leads to a very opaque failure that is difficult for users to understand and debug, e.g.:
tests/test_gpu_openacc.py ...............
Error: Process completed with exit code 1.
Therefore, when running on GPUs, we need to sanity check that we can execute on the target device and, if not, emit an informative user message. Alternatively, it might be that we are not checking the error code of a CUDA call that would provide the same message.
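A minimal sketch of what such a sanity check could look like follows. All names here (`check_device`, `DeviceUnavailableError`) are hypothetical and not existing Devito API. The sketch assumes the number of visible devices has already been obtained from the runtime by some other means (for instance a call such as `cudaGetDeviceCount`), and only shows how the failure could be turned into a readable message.

```python
import os


class DeviceUnavailableError(RuntimeError):
    """Hypothetical error type for a device ID that cannot be used."""


def check_device(device_id, device_count, env=None):
    """Raise an informative error if device_id is not usable.

    device_count is the number of devices the runtime reports as visible;
    obtaining it is outside the scope of this sketch.
    """
    env = os.environ if env is None else env
    if 0 <= device_id < device_count:
        return  # the target device exists, nothing to report

    message = (
        f"Requested device ID {device_id}, but the runtime reports "
        f"{device_count} visible device(s)."
    )
    mask = env.get("CUDA_VISIBLE_DEVICES")
    if mask is not None:
        # Point the user at the most likely cause described above.
        message += (
            f" CUDA_VISIBLE_DEVICES={mask!r} is set; if a container or job "
            "launcher has already restricted and renumbered the GPUs, this "
            "value may refer to a device that no longer exists."
        )
    raise DeviceUnavailableError(message)
```

For the Docker scenario above, calling `check_device(0, 0, env={"CUDA_VISIBLE_DEVICES": "1"})` would raise an error that names the environment variable, instead of the bare exit code 1.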