Thanks everyone! The only distribution that worked with NVIDIA A100 on GCP!

So I wanted to try this big-ass AI transformer model (YaLM-100B).
I went to GCP and fired up their most expensive GPU machine (that's about $3/hour).

It was suggested that I use the “ready-to-use” Debian 10 based image, with everything preinstalled, “to save setup time”.
Unfortunately, I didn’t even check that the GPU was identified correctly, assuming that this would just work with the image.

It took some time to download the ~200 GB of data files. I fired up the supplied Docker container, and oh no!
It could not find the GPU! More time wasted trying to install the driver five different ways, and nothing worked…

I went to the NVIDIA website, and it seems that Debian 10 is not even supported anymore!
But there are instructions for Debian 11. Okay, I shut down my machine and fired up another one, with Debian 11 this time… Managed to configure the repository (it was not easy), and installed cuda and another 200 packages.

I rebooted, and oh no! Once again, not working. The detection tool specifically stated that this driver version does not support my GPU. Tried installing the driver manually, a few times and in different ways, with the same result.

Okay, I thought to myself, something is probably wrong with the compatibility between this GPU and the driver. It’s new and everything, so maybe it is not 100% supported.

Then I saw install instructions for Rocky Linux on the NVIDIA website:

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf clean all
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda

I was ready to give up, but I thought, let’s give it one more try with Rocky.
I ran the commands, rebooted the system, typed nvidia-smi, and whoa!

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    55W / 400W |      0MiB / 40960MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Exactly the same driver version as in all my other attempts, and Rocky was the only one where it just worked!
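
In hindsight, a quick check like the one below, run before the ~200 GB download, would have saved me hours. This is only a minimal sketch, not an official procedure: the function names check_gpu and summarize_gpu_line are made up for illustration, and it assumes that nvidia-smi's CSV query output separates fields with a comma followed by a space.

```shell
#!/bin/sh
# Hypothetical helper: turn one CSV line, as printed by
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
# into a short "key=value" summary. Assumes fields are separated by ", ".
summarize_gpu_line() {
    printf '%s\n' "$1" |
        awk -F', ' '{ printf "gpu=%s driver=%s memory=%s\n", $1, $2, $3 }'
}

# Hypothetical helper: fail early if the driver is missing or cannot
# see the GPU, otherwise print one summary line per detected GPU.
check_gpu() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: the driver is probably not installed" >&2
        return 1
    fi
    out=$(nvidia-smi --query-gpu=name,driver_version,memory.total \
        --format=csv,noheader) || {
        echo "nvidia-smi failed: the driver cannot talk to the GPU" >&2
        return 1
    }
    printf '%s\n' "$out" | while IFS= read -r line; do
        summarize_gpu_line "$line"
    done
}
```

Something like `check_gpu || echo "fix the driver first"` right after the first boot should then tell you whether it is worth starting the big download at all.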

So thanks to everyone who maintains and takes care of this amazing distro!
