OK, so yesterday I wrote a post, right here, that removes the broken 6.1.0-18 kernel and puts 6.1.0-17 back. Today I propose another way to work around the NVIDIA bug: update the kernel to 6.5 from backports. I originally found this link, which recommended that approach.

Install Linux 6.5 Kernel

sudo apt update && sudo apt upgrade

sudo apt install -t bookworm-backports -y linux-image-amd64 linux-headers-amd64

sudo apt purge -y linux-image-6.1.0-18-amd64
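
Once the machine has been rebooted (the reboot comes at the end of the next section), it may be worth confirming that it is really running 6.5 or newer before going further. A minimal sketch, assuming GNU sort with version sorting (-V) is available, as it is in Debian's coreutils. The helper name kernel_at_least is my own invention, not part of any package.

```shell
# Hypothetical helper: succeed if version $2 is at least version $1.
# sort -V orders version strings; -C only checks whether the input is
# already sorted, so "minimum, actual" in order means actual >= minimum.
kernel_at_least() {
    printf '%s\n%s\n' "$1" "$2" | sort -V -C
}

# Compare the running kernel against the 6.5 minimum.
if kernel_at_least 6.5 "$(uname -r)"; then
    echo "kernel $(uname -r) is 6.5 or newer"
else
    echo "kernel $(uname -r) is older than 6.5; reboot into the backports kernel"
fi
```

If the second message shows up after the reboot, the old kernel is probably still the boot default.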

Blacklist Nouveau

Blacklist the nouveau driver now so that the NVIDIA driver can load without warnings or extra reboots later.

cat << 'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

sudo update-initramfs -u
sudo reboot
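
After the reboot, a quick look at the loaded modules shows whether the blacklist took effect. A minimal sketch: the helper name nouveau_loaded is hypothetical, and it only looks for an exact "nouveau" entry in the first column of lsmod output.

```shell
# Hypothetical helper: read `lsmod` output on stdin and succeed only if a
# module named exactly "nouveau" appears in the first column.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

if lsmod 2>/dev/null | nouveau_loaded; then
    echo "nouveau is still loaded; check the blacklist file and the initramfs"
else
    echo "nouveau is not loaded"
fi
```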

Install NVIDIA nvidia-detect

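Next, install nvidia-detect, which reports the best graphics driver for your particular system. The contrib, non-free and non-free-firmware components have to be enabled first. The usual way is the two commands below, but they did not work for me, so I had to do it manually.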

sudo apt install -y software-properties-common
sudo apt-add-repository contrib non-free non-free-firmware

Since the above does not work, it has to be done the hard way: edit the sources file by hand.

sudo nano /etc/apt/sources.list.d/debian.sources
Now paste in the content below. All I did was add contrib non-free non-free-firmware in both places where main appears in the file, then save the file.
Types: deb deb-src
URIs: mirror+file:///etc/apt/mirrors/debian.list
Suites: bookworm bookworm-updates bookworm-backports
Components: main contrib non-free non-free-firmware

Types: deb deb-src
URIs: mirror+file:///etc/apt/mirrors/debian-security.list
Suites: bookworm-security
Components: main contrib non-free non-free-firmware
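
The same edit can also be scripted instead of done in nano. This is only a sketch under assumptions: the sed expression is mine, not from the original setup, and it only touches lines that read exactly "Components: main", so a file that already lists the extra components is left alone. It needs GNU sed for -i and -E.

```shell
# sed expression (my own): rewrite every line that is exactly
# "Components: main" so it also lists the three extra components.
ADD_COMPONENTS='s/^Components:[[:space:]]*main[[:space:]]*$/Components: main contrib non-free non-free-firmware/'

# Try it on a scratch file first so the real file is not touched.
demo=$(mktemp)
printf 'Types: deb deb-src\nSuites: bookworm\nComponents: main\n' > "$demo"
sed -i -E "$ADD_COMPONENTS" "$demo"
cat "$demo"
rm -f "$demo"

# When the result looks right, apply it to the real file (needs root):
# sudo sed -i -E "$ADD_COMPONENTS" /etc/apt/sources.list.d/debian.sources
```

Keeping a copy of the original file somewhere outside /etc/apt/sources.list.d before running the real command makes it easy to roll back.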



Now, install the detector package

sudo apt update
sudo apt -y install nvidia-detect
nvidia-detect

This will output something like the following. The output tells you which driver to use on your setup, so do not expect to see my exact output on your system.

debian@cuda4:~$ nvidia-detect
Detected NVIDIA GPUs:
01:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)

Checking card:  NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
Your card is supported by the Tesla 470 drivers series.
It is recommended to install the
    nvidia-tesla-470-driver
package.
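
If you want to script the next step, the package name can be pulled out of that output. A minimal sketch, assuming the output keeps the layout shown above, with the package name on the line right after "It is recommended to install the". The helper name recommended_driver is hypothetical, and other nvidia-detect versions may word things differently.

```shell
# Hypothetical helper: read nvidia-detect output on stdin and print the first
# word of the line that follows "It is recommended to install the".
recommended_driver() {
    awk '/It is recommended to install the/ { getline; print $1; exit }'
}

# Demo on the sample output from this post.
recommended_driver <<'EOF'
Your card is supported by the Tesla 470 drivers series.
It is recommended to install the
    nvidia-tesla-470-driver
package.
EOF

# On a real system:
# driver_pkg=$(nvidia-detect | recommended_driver)
```

The helper prints nothing when no recommendation line is found, so check that the result is not empty before passing it to apt.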

Now Install Drivers

Finally, it is time to do what we came here to do: install the drivers.

sudo apt install -y build-essential gcc software-properties-common apt-transport-https dkms curl

Install the driver package. The one below is just an example; use the output of nvidia-detect to pick the driver you actually need, and replace nvidia-tesla-470-driver accordingly.

sudo apt install -y firmware-misc-nonfree nvidia-tesla-470-driver

Test NVIDIA on Host

nvidia-smi

It will output something like this:

debian@cuda:~$ nvidia-smi
Mon Feb 12 08:39:42 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:01:00.0 Off |                    0 |
| N/A   65C    P0    58W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
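
For scripts, nvidia-smi also has a query mode that prints plain CSV instead of the table. A small sketch, assuming the fields come back separated by a comma and a space, as in "Tesla K80, 470.223.02". The helper name driver_major is hypothetical.

```shell
# Hypothetical helper: read "name, driver_version" CSV lines on stdin and
# print the major driver version of the first GPU listed.
driver_major() {
    awk -F', ' 'NR == 1 { split($2, v, "."); print v[1] }'
}

# Demo on a line shaped like the query output for the card above.
printf 'Tesla K80, 470.223.02\n' | driver_major

# On a machine with the driver loaded:
# nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | driver_major
```

For the K80 above this prints 470, which matches the driver series that nvidia-detect recommended.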

Time to Install Docker - Optional, Skip if Docker Is Already Installed

curl -fsSL https://get.docker.com -o get-docker.sh 
sudo sh get-docker.sh
sudo groupadd -f docker # -f: do nothing if the group already exists, which it normally does after the Docker install
sudo usermod -aG docker $USER
newgrp docker
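
Group changes only apply to new sessions, which is why newgrp is used above. A quick way to check whether the current shell already has the group, as a sketch with a hypothetical helper named in_group:

```shell
# Hypothetical helper: succeed if group $1 appears as a whole word in the
# space-separated group list $2.
in_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# id -nG prints the group names of the current process.
if in_group docker "$(id -nG)"; then
    echo "docker group is active in this shell"
else
    echo "docker group not active yet; log out and back in, or run: newgrp docker"
fi
```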

Install Special Docker Components

These are needed to actually expose the NVIDIA card to Docker containers.

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo tee /etc/apt/keyrings/nvidia-docker.key
curl -s -L https://nvidia.github.io/nvidia-docker/debian11/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo sed -i -e "s/^deb/deb \[signed-by=\/etc\/apt\/keyrings\/nvidia-docker.key\]/g" /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt -y install nvidia-container-toolkit
sudo systemctl restart docker
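
The sed line above is hard to read because of all the escaped slashes. What it does is insert a signed-by option after the leading "deb" of each repository line, so apt verifies that repository with the downloaded key. Here is a sketch of the same substitution with | as the delimiter, run against a made-up repository line. The helper name add_signed_by and the example.invalid URL are placeholders, not the real NVIDIA list content.

```shell
# Same substitution as the sed line above, with '|' as the delimiter so the
# slashes in the key path need no escaping. KEY is the path used in this post.
KEY=/etc/apt/keyrings/nvidia-docker.key
add_signed_by() {
    sed -e "s|^deb|deb [signed-by=$KEY]|"
}

# Demo on a made-up repository line (placeholder URL).
printf 'deb https://example.invalid/repo /\n' | add_signed_by
```

To see what the real file looks like after the edit, cat /etc/apt/sources.list.d/nvidia-docker.list.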

Perform Docker Test

Run the command below to test CUDA inside Docker.

docker run --gpus all nvidia/cuda:12.1.1-runtime-ubuntu22.04 nvidia-smi

Should output something like this...

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Mon Feb 12 10:37:01 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:01:00.0 Off |                    0 |
| N/A   41C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Well, that is it. Now comes putting it to use, for things such as AI. That is the whole reason I did this: so I could build AI and test some ideas out.
