docker stack deploy with GPU, but can't find nvidia devices
description:
When I start the program with docker-compose up, the code works well! But when I start it with docker stack deploy -c docker-compose.yml test, it can't find any visible nvidia devices. My docker-compose.yml and the error logs are shown below.
I'm very confused: with the same configuration, launching with docker-compose up works, while launching with docker stack deploy -c docker-compose.yml test does not. Is GPU support in docker swarm still incomplete, or is there another way I haven't found?
environment configuration
docker version: 18.06.0-ce
NVIDIA Docker: 1.0.1
Ubuntu: 16.04
/etc/docker/daemon.json
Of course, I modified /etc/docker/daemon.json to change the default runtime, and restarted docker.
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
sudo systemctl daemon-reload
sudo systemctl start docker
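To confirm that the change actually took effect, it can help to check docker info on the host (a quick sanity check, not from the original post):

docker info | grep -i runtime
# expected to show something like:
#   Runtimes: nvidia runc
#   Default Runtime: nvidia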
docker-compose.yml configuration file
version: "3"

volumes:
  nvidia_driver_430.14:
    external: true

services:
  tts-server:
    build:
      context: ./
      dockerfile: ./docker/tts_server/Dockerfile
    deploy:
      replicas: 1
    image: tts-system/tts-server-gpu
    environment:
      NVIDIA_VISIBLE_DEVICES: 0
    devices:
      - /dev/nvidia0
      - /dev/nvidiactl
      - /dev/nvidia-uvm
    volumes:
      - ./models:/tts_system/models:ro
      - ./config:/tts_system/config:ro
      - nvidia_driver_430.14:/usr/local/nvidia:ro
    networks:
      - overlay
    ports:
      - "9091:9090"
error logs of the program
2019-07-02 07:50:24.805114: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499885000 Hz
2019-07-02 07:50:24.808418: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4112870 executing computations on platform Host. Devices:
2019-07-02 07:50:24.808457: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-07-02 07:50:24.811640: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-07-02 07:50:24.811684: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:155] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
E0702 07:50:24.811846 1 decoder.cc:80] Filed to create session: Invalid argument: 'visible_device_list' listed an invalid GPU id '0' but visible device count is -1
This problem has been bothering me for a long time. Thanks very much.
Solution 1:[1]
As per the issue https://github.com/docker/compose/issues/6691, there is no official support yet for --gpus or the nvidia runtime in docker-compose file version 3.
However, you can install nvidia-docker version 2 and add the configuration below to /etc/docker/daemon.json to make the nvidia devices visible inside your swarm service.
/etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
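For completeness, on Ubuntu this usually means installing the nvidia-docker2 package and restarting the daemon so the new default runtime is picked up (these commands are a standard nvidia-docker 2 setup, not part of the original answer, and assume the nvidia-docker apt repository is already configured):

sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker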
docker-compose.yaml
Add the environment key to your compose file in the format below.
...
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
...
NVIDIA_VISIBLE_DEVICES also accepts other values, for example:
NVIDIA_VISIBLE_DEVICES=all
Refer to https://github.com/NVIDIA/nvidia-container-runtime#nvidia_visible_devices
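For example, to expose every GPU on the node instead of only GPU 0, the same environment key can be set to all (a minimal variant of the snippet above):

    environment:
      - NVIDIA_VISIBLE_DEVICES=all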
The above-mentioned configuration works fine for me.
Solution 2:[2]
An official compose spec for v3 has been available since Nov 17, 2021 (PR here).
You can declare GPU resources like so. Sample compose files:
services:
  test:
    image: nvidia/cuda
    command: nvidia-smi
    runtime: nvidia
or
services:
  test:
    image: nvidia/cuda
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - 'driver': 'nvidia'
              'count': 1
              'capabilities': ['gpu', 'utility']
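Applied to the tts-server service from the question, the device-reservation form would look roughly like this (a sketch only: it reuses the asker's image and port mapping and assumes a docker version recent enough to understand device reservations):

services:
  tts-server:
    image: tts-system/tts-server-gpu
    ports:
      - "9091:9090"
    deploy:
      replicas: 1
      resources:
        reservations:
          devices:
            # request one nvidia GPU with compute and utility capabilities
            - driver: nvidia
              count: 1
              capabilities: [gpu, utility]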
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Suresh Kumar B
Solution 2 | user1847