Cannot install NVIDIA GPU driver 470.82.01 on Google Kubernetes Engine 1.21
I would like to run GPU nodes in a GKE cluster, which requires a driver installation DaemonSet. According to https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers, NVIDIA driver 470 is supported on the latest GKE version, 1.21.
The default DaemonSet installs driver version 450, and my node works fine with it:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
However, since I need 470, I also tried deploying the "latest" DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
Unfortunately, with this second manifest the GPU node never starts, because the DaemonSet pod fails and restarts continuously.
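A restarting installer pod can usually be inspected through the logs of its previous run. The following is a minimal sketch, not taken from the original post. It assumes that the installer DaemonSet runs in the kube-system namespace, that its pods carry the label k8s-app=nvidia-driver-installer, and that the installer runs in an init container named nvidia-driver-installer. These names are assumptions and should be checked against the manifest that was actually applied.

```shell
# Assumed values (not from the original post); override them via the
# environment if the applied manifest uses different names.
NS="${NS:-kube-system}"
SELECTOR="${SELECTOR:-k8s-app=nvidia-driver-installer}"
CONTAINER="${CONTAINER:-nvidia-driver-installer}"

# List the installer pods together with the nodes they are scheduled on.
list_installer_pods() {
  kubectl get pods -n "$NS" -l "$SELECTOR" -o wide
}

# Print the logs of the previous (failed) run of the installer container.
# $1 is the pod name taken from the listing above.
previous_installer_logs() {
  kubectl logs -n "$NS" "$1" -c "$CONTAINER" --previous
}
```

Typical use would be to call list_installer_pods, pick the pod on the GPU node, and pass its name to previous_installer_logs.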
Update:
I finally got some logs. We changed the DaemonSet config above into a Deployment and ran the driver installation script manually with /cos-gpu-installer install --version=latest
but it failed with the following output:
00:04.0 3D controller: NVIDIA Corporation Device 1db1 (rev a1)
I0512 14:08:30.070789 8742 installer.go:401] Getting the latest GPU driver version
I0512 14:08:30.071255 8742 utils.go:88] Downloading gpu_latest_version from https://storage.googleapis.com/cos-tools/16108.604.19/gpu_latest_version
I0512 14:08:30.173334 8742 install.go:132] Installing GPU driver version 470.82.01
I0512 14:08:30.173409 8742 cache.go:72] map[BUILD_ID:16108.604.19 DRIVER_VERSION:450.119.04]
I0512 14:08:30.173490 8742 installer.go:102] Configuring driver installation directories
I0512 14:08:30.467320 8742 signature.go:30] Downloading driver signature for version 470.82.01
I0512 14:08:30.467360 8742 utils.go:88] Downloading 470.82.01.signature.tar.gz from https://storage.googleapis.com/cos-tools/16108.604.19/extensions/gpu/470.82.01.signature.tar.gz
I0512 14:08:30.470927 8742 signature.go:37] Decompressing signature /build/sign-gpu-driver/470.82.01.signature.tar.gz
I0512 14:08:30.476162 8742 installer.go:92] Downloading GPU driver installer version 470.82.01
I0512 14:08:30.477134 8742 utils.go:88] Downloading GPU driver installer from https://storage.googleapis.com/nvidia-drivers-eu-public/nvidia-cos-project/89/tesla/470_00/470.82.01/NVIDIA-Linux-x86_64-470.82.01_89-16108-604-19.cos
I0512 14:08:32.361778 8742 utils.go:88] Downloading toolchain_env from https://storage.googleapis.com/cos-tools/16108.604.19/toolchain_env
I0512 14:08:32.371396 8742 cos.go:71] Installing the toolchain
I0512 14:08:32.371467 8742 cos.go:77] Found existing toolchain. Skipping download and installation
I0512 14:08:32.371506 8742 installer.go:288] Running GPU driver installer
I0512 14:08:40.924340 8742 installer.go:139] Extracting precompiled artifacts...
I0512 14:08:41.090412 8742 installer.go:166] Done extracting precompiled artifacts
I0512 14:08:41.090441 8742 installer.go:171] Linking drivers...
I0512 14:08:41.090512 8742 installer.go:192] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.4.170+/scripts/module-common.lds -r -o /tmp/extract/kernel/nvidia.ko /tmp/extract/kernel/precompiled/nv-linux.o /tmp/extract/kernel/nvidia/nv-kernel.o_binary]
I0512 14:08:41.505273 8742 installer.go:203] Running link command: [/build/cos-tools/bin/ld.lld -T /build/cos-tools/usr/src/linux-headers-5.4.170+/scripts/module-common.lds -r -o /tmp/extract/kernel/nvidia-modeset.ko /tmp/extract/kernel/precompiled/nv-modeset-linux.o /tmp/extract/kernel/nvidia-modeset/nv-modeset-kernel.o_binary]
I0512 14:08:41.519532 8742 installer.go:219] Done linking drivers
I0512 14:08:41.943926 8742 modules.go:69] Loading gpu-key to secondary system keyring
I0512 14:08:41.947421 8742 modules.go:81] Successfully load key gpu-key into secondary system keyring.
I0512 14:08:41.954888 8742 installer.go:265] Installing userspace libraries...
I0512 14:08:41.954921 8742 installer.go:277] Installer arguments:
[/tmp/extract/nvidia-installer --utility-prefix=/usr/local/nvidia --opengl-prefix=/usr/local/nvidia --x-prefix=/usr/local/nvidia --install-libglvnd --no-install-compat32-libs --log-file-name=/usr/local/nvidia/nvidia-installer.log --silent --accept-license --no-kernel-module]
E0512 14:08:41.962811 8742 utils.go:355]
E0512 14:08:41.963442 8742 utils.go:355] WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that an NVIDIA kernel module matching this driver version is installed separately.
E0512 14:08:41.963493 8742 utils.go:355]
E0512 14:08:41.963592 8742 utils.go:355]
E0512 14:08:41.963613 8742 utils.go:355] WARNING: nvidia-installer was forced to guess the X library path '/usr/local/nvidia/lib64' and X module path '/usr/local/nvidia/lib64/xorg/modules'; these paths were not queryable from the system. If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
E0512 14:08:41.963629 8742 utils.go:355]
I0512 14:08:46.400148 8742 installer.go:281] Done installing userspace libraries
I0512 14:08:46.400262 8742 cache.go:58] Updated cached version as
I0512 14:08:46.400285 8742 cache.go:60] BUILD_ID=16108.604.19
I0512 14:08:46.400292 8742 cache.go:60] DRIVER_VERSION=470.82.01
I0512 14:08:46.400329 8742 installer.go:45] Verifying GPU driver installation
E0512 14:08:46.432573 8742 install.go:276] failed to verify installation: failed to verify GPU driver installation: exit status 255
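Most of the E-prefixed lines in this output are either empty or nvidia-installer warnings, so the actual failure is easy to miss. As a small sketch, assuming the output has been saved to a file (the name installer.log is hypothetical), the error-severity lines that carry a message can be isolated like this:

```shell
# Keep only log lines at error severity (leading "E") that carry a message.
# Lines consisting of just the four header fields (severity+date, time,
# PID, file:line]) are dropped.
glog_errors() {
  awk '/^E[0-9]+ / && NF > 4'
}

# Example (hypothetical file name):
#   glog_errors < installer.log
```

Applied to the output above, this leaves the two nvidia-installer warnings and the final "failed to verify installation" line.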
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow