I hope this is fixed soon. In the meantime, here is the workaround that is working for me on kernel version 6.1.0-18-amd64, which at the time of writing is the current kernel in Debian 12 Bookworm. So, here is how I fixed it.

Remove the Latest Kernel Since the NVIDIA Driver Broke

Once this is fixed I will remove this section. But for now, this is how I got the latest Debian 12 to load the NVIDIA driver and work with Docker.

sudo apt purge 'nvidia*'
sudo apt remove linux-image-6.1.0-18-amd64 linux-headers-6.1.0-18-amd64
sudo apt install -y linux-image-6.1.0-17-amd64 linux-headers-6.1.0-17-amd64
sudo apt-mark hold linux-image-6.1.0-17-amd64 linux-headers-6.1.0-17-amd64

During the removal, a confirmation dialog will appear asking whether to abort removal of the running kernel.

You have to select No to continue the removal of the kernel. And yes, this is a big deal!!! I hope you are doing this on a new VM and have a snapshot of it prior to doing this!!! You have been warned!!!
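
Before rebooting, it is worth double-checking that the hold actually took; apt-mark can list held packages, and the two packages held above should show up:

apt-mark showhold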

Reboot After Hold Is Placed

Reboot only after the hold is placed, and not before, because most of these images want to upgrade to the latest kernel and may revert your changes after a reboot. Placing a hold on the kernel should force it to stay at 6.1.0-17 until the stupid NVIDIA bug is fixed. Afterwards, an unhold can be done. Note: I will update this in the future once the issue is addressed.
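
After rebooting, uname -r should report 6.1.0-17-amd64. And for later: once the bug is fixed, lifting the hold is just the reverse of placing it (same package names as above):

sudo apt-mark unhold linux-image-6.1.0-17-amd64 linux-headers-6.1.0-17-amd64
sudo apt update && sudo apt upgrade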

Blacklist Nouveau

Also, before rebooting, this is a good time to blacklist the nouveau driver, which may otherwise get in the way of installing the NVIDIA driver.

Run the commands below to blacklist it and rebuild the initramfs.

cat << 'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

sudo update-initramfs -u
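
After the reboot you can confirm nouveau is gone; if the blacklist worked, this prints nothing:

lsmod | grep nouveau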

Determine Needed Drivers

For NVIDIA there are a few types of drivers; see NvidiaGraphicsDrivers on the Debian wiki for more info. Also, install the tool below to detect the recommended driver.

First, add the needed repositories using apt-add-repository. Note, you can do this manually too; for brevity I added them this way. It is up to you how you add the non-default repositories required to install the NVIDIA driver.

sudo apt install -y software-properties-common
sudo apt-add-repository contrib non-free non-free-firmware

Apparently, there is a bug in apt-add-repository, so we have to do it the hard way!

sudo nano /etc/apt/sources.list.d/debian.sources

Before

Types: deb deb-src
URIs: mirror+file:///etc/apt/mirrors/debian.list
Suites: bookworm bookworm-updates bookworm-backports
Components: main

Types: deb deb-src
URIs: mirror+file:///etc/apt/mirrors/debian-security.list
Suites: bookworm-security
Components: main

After

Types: deb deb-src
URIs: mirror+file:///etc/apt/mirrors/debian.list
Suites: bookworm bookworm-updates bookworm-backports
Components: main contrib non-free non-free-firmware

Types: deb deb-src
URIs: mirror+file:///etc/apt/mirrors/debian-security.list
Suites: bookworm-security
Components: main contrib non-free non-free-firmware
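
If you would rather not edit the file by hand in nano, a sed one-liner like this should make the same change, assuming your Components lines exactly match the defaults shown above:

sudo sed -i 's/^Components: main$/Components: main contrib non-free non-free-firmware/' /etc/apt/sources.list.d/debian.sources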

Now we can run nvidia-detect

sudo apt update
sudo apt -y install nvidia-detect
nvidia-detect

It will output something like this

Detected NVIDIA GPUs:
01:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)

Checking card:  NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Your card is supported by the Tesla 470 drivers series.
It is recommended to install the
    nvidia-tesla-470-driver
package.

Now Install Drivers

Install Needed Packages

First, it is important to install the build tools and helpers that the driver's DKMS modules need:

sudo apt install -y build-essential gcc software-properties-common apt-transport-https dkms curl

I am using a Tesla K80 graphics card, so with Debian 12 I installed

sudo apt install -y firmware-misc-nonfree nvidia-tesla-470-driver

Most people will install this one instead

sudo apt install -y firmware-misc-nonfree nvidia-driver
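
Either way, once the install finishes, dkms status is a quick sanity check; it should list the nvidia module as installed for the running kernel (on my box that is the nvidia-tesla-470 module against 6.1.0-17-amd64):

dkms status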

Now that it is done, we can test to see if it is working on the VM before using Docker.

Test NVIDIA

Before doing this it is recommended to restart. Many times the driver is not loaded yet because another driver is in the way, so a restart clears that all out. Then you can run the below command.

nvidia-smi

It will output something like

debian@cuda:~$ nvidia-smi
Mon Feb 12 08:39:42 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:01:00.0 Off |                    0 |
| N/A   65C    P0    58W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
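
If nvidia-smi instead says it could not communicate with the NVIDIA driver, check whether the kernel module actually loaded; these are generic diagnostics:

lsmod | grep nvidia
sudo dmesg | grep -i nvidia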

Time to Install Docker and Test

I am not going to show how to install Docker here, as I have a separate guide for a Docker install. So, look at that and come back here if you want to test CUDA in Docker.

Install Needed Docker Tools

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo tee /etc/apt/keyrings/nvidia-docker.key
curl -s -L https://nvidia.github.io/nvidia-docker/debian11/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo sed -i -e "s/^deb/deb \[signed-by=\/etc\/apt\/keyrings\/nvidia-docker.key\]/g" /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt -y install nvidia-container-toolkit
sudo systemctl restart docker
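
If Docker still cannot see the GPU after the restart, recent nvidia-container-toolkit releases ship nvidia-ctk to (re)write the runtime configuration; run it and restart Docker again:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker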

Perform Docker Test

By executing the command below you can test CUDA in Docker

docker run --gpus all nvidia/cuda:12.1.1-runtime-ubuntu22.04 nvidia-smi

It should output something like this...

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Mon Feb 12 10:37:01 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:01:00.0 Off |                    0 |
| N/A   41C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Well, that is it; now come the uses, such as AI. That is the whole reason I did this: so I could build AI projects and test some ideas out.

Errors

Here are some of the errors that I am getting. I will update this once the error is corrected and it is no longer necessary to roll the kernel and headers back to load the NVIDIA drivers.

dpkg: error processing package nvidia-tesla-470-driver (--configure):
 dependency problems - leaving unconfigured
Processing triggers for libgdk-pixbuf-2.0-0:amd64 (2.42.10+dfsg-1+b1) ...
Processing triggers for libc-bin (2.36-9+deb12u4) ...
Processing triggers for initramfs-tools (0.142) ...
update-initramfs: Generating /boot/initrd.img-6.1.0-18-amd64
Processing triggers for update-glx (1.2.2) ...
Processing triggers for glx-alternative-nvidia (1.2.2) ...
update-alternatives: using /usr/lib/nvidia to provide /usr/lib/glx (glx) in auto mode
Processing triggers for glx-alternative-mesa (1.2.2) ...
Processing triggers for libc-bin (2.36-9+deb12u4) ...
Processing triggers for initramfs-tools (0.142) ...
update-initramfs: Generating /boot/initrd.img-6.1.0-18-amd64
Errors were encountered while processing:
 nvidia-tesla-470-kernel-dkms
 nvidia-tesla-470-driver
E: Sub-process /usr/bin/dpkg returned an error code (1)

These are the typical errors seen with the current Debian 12 kernel 6.1.0-18-amd64, from /var/lib/dkms/nvidia-tesla-470/470.223.02/build/make.log. In short, that kernel exports __rcu_read_lock and __rcu_read_unlock as GPL-only symbols, so the proprietary nvidia.ko fails modpost and the module never builds:
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol '__rcu_read_lock'
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol '__rcu_read_unlock'
make[3]: *** [/usr/src/linux-headers-6.1.0-18-common/scripts/Makefile.modpost:126: /var/lib/dkms/nvidia-tesla-470/470.223.02/build/Module.symvers] Error 1
make[2]: *** [/usr/src/linux-headers-6.1.0-18-common/Makefile:1991: modpost] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-6.1.0-18-amd64'
make[1]: *** [Makefile:250: __sub-make] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-6.1.0-18-common'
make: *** [Makefile:80: modules] Error 2
