docker stack deploy with GPU, but can't find NVIDIA devices

description:

When I start the program with docker-compose up, the code works well. But when I start it with docker stack deploy -c docker-compose.yml test, it can't find any visible NVIDIA devices. My docker-compose.yml and the error logs are shown below.

I am confused about why the same configuration behaves differently depending on how it is launched: docker-compose up works, but docker stack deploy -c docker-compose.yml test does not. Is Docker Swarm's GPU support still incomplete, or is there another way that I have not found?

environmental configuration

docker version: 18.06.0-ce
NVIDIA Docker: 1.0.1
Ubuntu: 16.04

/etc/docker/daemon.json

Of course, I modified /etc/docker/daemon.json to change the default runtime, and then restarted Docker.

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
sudo systemctl daemon-reload
sudo systemctl start docker
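
One way to confirm that the daemon actually picked up this configuration is to compare what the file requests with what the running daemon reports. The sketch below is only an illustration: the helper name configured_default_runtime is hypothetical, and the sed match assumes the key and its value sit on one line, as they do in the file above.

```shell
#!/bin/sh
# configured_default_runtime FILE
# Prints the "default-runtime" value found in a daemon.json-style file.
# Hypothetical helper: a plain sed match, not a real JSON parser.
configured_default_runtime() {
  sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# What the configuration file asks for.
if [ -r /etc/docker/daemon.json ]; then
  echo "configured: $(configured_default_runtime /etc/docker/daemon.json)"
fi

# What the running daemon reports. The two can differ if the daemon
# was not restarted after the file was edited.
if command -v docker >/dev/null 2>&1; then
  echo "running:    $(docker info --format '{{.DefaultRuntime}}' 2>/dev/null)"
fi
```

If the two values differ, note that systemctl start does nothing to a daemon that is already running, so sudo systemctl restart docker is likely what is needed for the change to take effect.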

docker-compose.yml configuration file

version: "3"

volumes:
  nvidia_driver_430.14:
    external: true

services:
  tts-server:
    build:
      context: ./
      dockerfile: ./docker/tts_server/Dockerfile
    deploy:
      replicas: 1
    image: tts-system/tts-server-gpu
    environment:
      NVIDIA_VISIBLE_DEVICES: 0
    devices:
      - /dev/nvidia0
      - /dev/nvidiactl
      - /dev/nvidia-uvm
    volumes:
      - ./models:/tts_system/models:ro
      - ./config:/tts_system/config:ro
      - nvidia_driver_430.14:/usr/local/nvidia:ro
    networks:
      - overlay
    ports:
      - "9091:9090"

error logs of program

2019-07-02 07:50:24.805114: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499885000 Hz
2019-07-02 07:50:24.808418: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4112870 executing computations on platform Host. Devices:
2019-07-02 07:50:24.808457: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-02 07:50:24.811640: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-07-02 07:50:24.811684: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:155] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
E0702 07:50:24.811846     1 decoder.cc:80] Filed to create session: Invalid argument: 'visible_device_list' listed an invalid GPU id '0' but visible device count is -1
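
The log says that /dev/nvidia0 does not exist inside the container. A quick way to confirm this independently of TensorFlow is to list which of the expected device nodes are missing. This is a minimal sketch: the function name missing_nvidia_devices is hypothetical, and the device list is simply the one from the devices: section of the compose file above.

```shell
#!/bin/sh
# missing_nvidia_devices [ROOT]
# Prints every expected NVIDIA device node that is absent under ROOT
# (ROOT defaults to the real filesystem root).
missing_nvidia_devices() {
  root="${1:-}"
  for dev in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm; do
    [ -e "${root}${dev}" ] || echo "$dev"
  done
}

# With no argument this checks the filesystem it runs on; no output
# means all three device nodes are present.
missing_nvidia_devices

# To run the check inside the swarm task's container, assuming the
# script is saved as check.sh and the image provides sh:
#   cid=$(docker ps -q --filter name=test_tts-server)
#   docker exec -i "$cid" sh < check.sh
```

If all three paths are printed for the stack container but none for the docker-compose up container, the devices were not mapped into the swarm task at all.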

This problem has been bothering me for a long time, thanks very much.



Solution 1:[1]

As per the issue,

https://github.com/docker/compose/issues/6691

There is no official support yet for --gpus or for a runtime reference to NVIDIA devices in docker-compose version 3.

But you can install nvidia-docker version 2 and put the configuration below in /etc/docker/daemon.json to make the NVIDIA devices visible inside your swarm service.

/etc/docker/daemon.json

{
   "default-runtime":"nvidia",
   "runtimes":{
      "nvidia":{
         "path":"nvidia-container-runtime",
         "runtimeArgs":[

         ]
      }
   }   
}

docker-compose.yaml

Add the environment key to the compose file in the format below.

...
environment:
    - NVIDIA_VISIBLE_DEVICES=0
...

NVIDIA_VISIBLE_DEVICES takes a comma-separated list of GPU indexes or UUIDs, or a keyword such as all (every GPU) or none, for example:

NVIDIA_VISIBLE_DEVICES=all

Refer: https://github.com/NVIDIA/nvidia-container-runtime#nvidia_visible_devices

The configuration above works fine for me.
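
Putting this together for the service in the question, the stack file might be reduced as sketched below. Docker's Compose file version 3 reference notes that docker stack deploy ignores the build and devices options, so the sketch drops both and relies on the nvidia default runtime plus NVIDIA_VISIBLE_DEVICES. The function name write_stack_file and the file name docker-stack.yml are hypothetical. The volumes and networks entries from the question are left out for brevity, which is an assumption about what the service needs, not a recommendation.

```shell
#!/bin/sh
# write_stack_file PATH
# Writes a trimmed-down version of the question's compose file to PATH.
# Hypothetical helper for illustration only.
write_stack_file() {
  cat > "$1" <<'EOF'
version: "3"

services:
  tts-server:
    image: tts-system/tts-server-gpu
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
    deploy:
      replicas: 1
    ports:
      - "9091:9090"
EOF
}

# Possible usage on a swarm node whose default runtime is nvidia:
#   # build is ignored by stack deploy, so build the image first
#   docker build -t tts-system/tts-server-gpu -f docker/tts_server/Dockerfile .
#   write_stack_file docker-stack.yml
#   docker stack deploy -c docker-stack.yml test
#   docker service logs test_tts-server
```

The service name in the last command follows the stack_service naming that docker stack deploy uses, here test_tts-server.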

Solution 2:[2]

An official compose spec for v3 has been available as of Nov 17, 2021. PR here.

You can declare GPU resources like so:

Sample compose files:

services:
  test:
    image: nvidia/cuda
    command: nvidia-smi
    runtime: nvidia

or

services:
  test:
    image: nvidia/cuda
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
          - 'driver': 'nvidia'
            'count': 1
            'capabilities': ['gpu', 'utility']

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Suresh Kumar B
Solution 2 user1847