OK, so yesterday I wrote a post that removes the broken 6.1.0-18 kernel and puts 6.1.0-17 back in its place. Today I propose another way to work around the NVIDIA bug: update the kernel to 6.5 from backports. I originally found a link that recommended this approach.
Install Linux 6.5 Kernel
sudo apt update
sudo apt install -t bookworm-backports -y linux-image-amd64 linux-headers-amd64
sudo apt upgrade
sudo apt purge -y linux-image-6.1.0-18-amd64
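Once the machine has been rebooted (the reboot comes at the end of the next step), it is worth confirming that it really is off 6.1.0-18 before going further. A minimal sketch, assuming the affected release string always starts with 6.1.0-18; the helper name is_broken_kernel is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the given kernel release
# string is the 6.1.0-18 build this post works around.
is_broken_kernel() {
    case "$1" in
        6.1.0-18-*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Check the kernel that is running right now.
running="$(uname -r)"
if is_broken_kernel "$running"; then
    echo "Still on $running, reboot into the backports kernel first"
else
    echo "Running $running, good to continue"
fi
```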
Blacklist Nouveau
Get Nouveau out of the way so that the NVIDIA driver can load without warnings or extra reboots.
cat << 'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
sudo reboot
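After the reboot you can check that the blacklist took effect. This is only a sketch: it assumes /proc/modules is readable and that the module name appears in the first column, as it does in lsmod output. The helper name nouveau_loaded is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads lsmod-style lines on stdin and succeeds
# only if a module named exactly "nouveau" is listed in column one.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

if [ -r /proc/modules ] && nouveau_loaded < /proc/modules; then
    echo "nouveau is still loaded"
else
    echo "nouveau is not loaded"
fi
```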
Install NVIDIA nvidia-detect
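Next, install nvidia-detect, which tells you the best graphics driver to use on your particular system. The contrib, non-free and non-free-firmware components have to be enabled first. The two commands below are the usual way to do that, but they did not work for me, so I had to do it manually.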
sudo apt install -y software-properties-common
sudo apt-add-repository contrib non-free non-free-firmware
Since the above does not work, I had to do it the hard way and edit the sources file by hand.
sudo nano /etc/apt/sources.list.d/debian.sources
Now paste in the content below. All I did was add contrib non-free non-free-firmware after main in both places where main appears in the file, then save the file.
Types: deb deb-src
URIs: mirror+file:///etc/apt/mirrors/debian.list
Suites: bookworm bookworm-updates bookworm-backports
Components: main contrib non-free non-free-firmware

Types: deb deb-src
URIs: mirror+file:///etc/apt/mirrors/debian-security.list
Suites: bookworm-security
Components: main contrib non-free non-free-firmware
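If you would rather not edit the file by hand, the same change can be scripted. This is a sketch, not a tested recipe: it assumes every Components line in your file currently reads exactly "Components: main", and the helper name add_nonfree_components is made up for this example. The demo runs on a scratch file, not on the live one.

```shell
#!/bin/sh
# Hypothetical helper: rewrites every line that is exactly
# "Components: main" in the given deb822 sources file.
# Lines that already list other components are left alone.
add_nonfree_components() {
    tmp="$(mktemp)" || return 1
    sed 's/^Components: main$/Components: main contrib non-free non-free-firmware/' "$1" > "$tmp" &&
        cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Try it on a scratch file first.
demo="$(mktemp)"
printf 'Types: deb deb-src\nSuites: bookworm\nComponents: main\n' > "$demo"
add_nonfree_components "$demo"
cat "$demo"
rm -f "$demo"
```

To apply it for real, run the script as root and pass /etc/apt/sources.list.d/debian.sources instead of the scratch file, after taking a backup copy.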
Now, install the detector package
sudo apt update
sudo apt -y install nvidia-detect
nvidia-detect
This will output something like the following. The output tells you which driver to use on your setup, so do not expect my exact output on your system.
debian@cuda4:~$ nvidia-detect
Detected NVIDIA GPUs:
01:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
Checking card: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Your card is supported by the Tesla 470 drivers series.
It is recommended to install the
nvidia-tesla-470-driver
package.
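If you want to pick the package name out of that output in a script, something like the following could work. It is a sketch that assumes nvidia-detect keeps the layout shown above, with the package name alone on the line after "It is recommended to install the". The helper name recommended_driver is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads nvidia-detect output on stdin and prints
# the first word of the line that follows the recommendation sentence.
recommended_driver() {
    awk '/It is recommended to install the/ { getline; print $1; exit }'
}

# Only call nvidia-detect if it is actually installed.
if command -v nvidia-detect >/dev/null 2>&1; then
    pkg="$(nvidia-detect | recommended_driver)"
    echo "Suggested package: ${pkg:-none found}"
fi
```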
Now Install Drivers
Finally, it is time to do what we came here to do: install the drivers.
sudo apt install -y build-essential gcc software-properties-common apt-transport-https dkms curl
Now install the driver package itself. The one below is just an example; use the output of nvidia-detect to pick the driver you actually need, and replace nvidia-tesla-470-driver in the command accordingly.
sudo apt install -y firmware-misc-nonfree nvidia-tesla-470-driver
Test NVIDIA on Host
nvidia-smi
It will output something like
debian@cuda:~$ nvidia-smi
Mon Feb 12 08:39:42 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:01:00.0 Off |                    0 |
| N/A   65C    P0    58W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
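The table is fine for eyeballing, but for a script the query mode of nvidia-smi is easier to work with. A small sketch, assuming the CSV output separates fields with a comma and a space; the helper name gpu_summary is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: reads "name, driver_version" CSV lines on stdin
# and prints one short summary line per GPU.
gpu_summary() {
    awk -F', ' '{ printf "%s (driver %s)\n", $1, $2 }'
}

# Only query the GPU if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | gpu_summary
fi
```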
Time to Install Docker - Optional, Skip if Docker Is Already Installed
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo groupadd -f docker # -f succeeds even if the group already exists, which it does on Debian after installing Docker.
sudo usermod -aG docker $USER
newgrp docker
Install Special Docker Components
These are needed to actually make the NVIDIA card available inside Docker containers.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo tee /etc/apt/keyrings/nvidia-docker.key
curl -s -L https://nvidia.github.io/nvidia-docker/debian11/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo sed -i -e "s/^deb/deb \[signed-by=\/etc\/apt\/keyrings\/nvidia-docker.key\]/g" /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt -y install nvidia-container-toolkit
sudo systemctl restart docker
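The sed line in that block is the hardest one to read, so here is what it does, shown on a made-up repository line rather than the real list file. This sketch uses | as the sed delimiter so the slashes in the key path need no escaping; the helper name add_signed_by and the example.invalid URL are only for illustration.

```shell
#!/bin/sh
# Same substitution as the sed command above, written with | as the
# delimiter: every line starting with "deb" gets a signed-by option.
add_signed_by() {
    sed 's|^deb|deb [signed-by=/etc/apt/keyrings/nvidia-docker.key]|'
}

# Made-up repository line, only for illustration.
echo 'deb https://example.invalid/repo /' | add_signed_by
# prints: deb [signed-by=/etc/apt/keyrings/nvidia-docker.key] https://example.invalid/repo /
```

The signed-by option tells apt to trust that repository only with the key saved in the first step, instead of adding the key system-wide.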
Perform Docker Test
Run the command below to test CUDA inside Docker.
docker run --gpus all nvidia/cuda:12.1.1-runtime-ubuntu22.04 nvidia-smi
Should output something like this...
==========
== CUDA ==
==========
CUDA Version 12.1.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
Mon Feb 12 10:37:01 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:01:00.0 Off |                    0 |
| N/A   41C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Well, that is it. Now comes the fun part: putting it to use, for things such as AI. That is the whole reason I did this, so I could build AI and test some ideas out.