Ansible vs. nvidia-detect

Hi,

I’m currently writing an Ansible role to automagically configure NVidia cards. Here are the first two tasks:

- name: Install utility to detect NVidia graphics cards
  ansible.builtin.dnf:
    name: nvidia-detect
    state: present

- name: Detect NVidia graphics cards
  ansible.builtin.command:
    cmd: nvidia-detect
  register: nvidia_card

I tried to run this on a sandbox machine with an NVidia video card, but running the playbook spews out the following error:

TASK [nvidia_driver : Detect NVidia graphics cards] **********************************************************
fatal: [sandbox.microlinux.lan]: FAILED! => {"changed": false, "cmd": ["nvidia-detect"], "delta": "0:00:00.013173", "end": "2024-05-01 12:59:21.576169", "msg": "non-zero return code", "rc": 8, "start": "2024-05-01 12:59:21.562996", "stderr": "", "stderr_lines": [], "stdout": "kmod-nvidia-470xx", "stdout_lines": ["kmod-nvidia-470xx"]}

But when I run the command directly on the machine, it works:

[root@sandbox:~] # nvidia-detect 
kmod-nvidia-470xx

I’m puzzled. Any ideas?

If you look at the output, you can see an rc value of 8. This means your command returned a non-zero exit code. Usually, when a command completes successfully, it returns zero. Since that is not the case here, add this underneath the register line:

  register: nvidia_card
  changed_when: nvidia_card.rc == 8

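and see if that helps. Note that changed_when only controls whether the task is reported as changed; it is failed_when that decides whether a return code counts as a failure. You may also have to adapt this for other return codes. For example, when I ran into this while writing playbooks that use yum, I had something like this: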

  register: yum_upgradeable
  changed_when: yum_upgradeable.rc == 100
  failed_when:
    - yum_upgradeable.rc != 0
    - yum_upgradeable.rc != 100
    - yum_upgradeable.rc != 126

So you may want to run the playbook on different systems and see which rc values come back, then adapt this accordingly. As a starting point, it would look like this:

  register: nvidia_card
  changed_when: nvidia_card.rc == 8
  failed_when:
    - nvidia_card.rc != 8

Of course, you can also run nvidia-detect manually and then immediately use:

echo $?

to see which code is returned on each of your systems, whether it is 0, 8 or something else. This can be quicker than running the playbook, and the code should normally match the one Ansible reports.
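
If checking every machine by hand is impractical, the same survey can be done through Ansible. The following is only a sketch: it assumes nvidia-detect is already installed on the targets, and it deliberately disables failure detection so that every host reports its code instead of aborting.

```yaml
- name: Run nvidia-detect without failing on any return code
  ansible.builtin.command:
    cmd: nvidia-detect
  register: nvidia_card
  # Detection changes nothing on the host.
  changed_when: false
  # Survey mode only: accept every return code for now.
  failed_when: false

- name: Report the return code and output per host
  ansible.builtin.debug:
    msg: "rc={{ nvidia_card.rc | default('n/a') }} stdout={{ nvidia_card.stdout | default('') }}"
```

Once the set of codes is known, failed_when: false should be replaced with a condition that lists only the codes you consider a success.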


Thanks very much for that detailed explanation. I gave this a spin on my workstation and my sandbox PC, and got a return code of 8 on both.

On the other hand, here’s what this looks like on my girlfriend’s PC:

[root@gustave:~] # nvidia-detect 
kmod-nvidia-390xx
[root@gustave:~] # echo $?
7

Looks like the guys who wrote this script can’t handle their return codes.

You can check whether Ansible also returns code 7 on the other computer and then add that to the list, e.g.:

  failed_when:
    - nvidia_card.rc != 7
    - nvidia_card.rc != 8

since it still reports a working supported graphics card, and thus the value is effectively a success. We are basically telling Ansible to treat 7 and 8 as a success, just like the code 0 that would normally be returned. It could well be that they return a different success code for each GPU. But who knows :)
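
Put together, the detection task could look roughly like the sketch below. It assumes 7 and 8 are the only "card found" codes, which is all this thread has observed so far; other GPUs may return other values. It also uses changed_when: false rather than the rc-based condition above, on the assumption that detection never modifies the host.

```yaml
- name: Detect NVidia graphics cards
  ansible.builtin.command:
    cmd: nvidia-detect
  register: nvidia_card
  # Detection only reads hardware information, so never report "changed".
  changed_when: false
  # Assumption: 7 and 8 are the only success codes seen so far.
  failed_when: nvidia_card.rc not in [7, 8]

- name: Show the suggested driver package
  ansible.builtin.debug:
    msg: "Suggested package: {{ nvidia_card.stdout }}"
```

The single "not in" test is equivalent to the list of "!=" conditions, since list entries under failed_when are combined with a logical AND.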


I experimented some more with it and then decided to go for the bone-headed KISS approach, which looks like this:

---  # tasks file for nvidia_driver
  
- name: Install the NVidia 550.xx driver
  ansible.builtin.dnf:
    name: nvidia-x11-drv
  when: nvidia_driver == "550xx"
  notify: Reboot

- name: Install the NVidia 470.xx driver
  ansible.builtin.dnf:
    name: nvidia-x11-drv-470xx
  when: nvidia_driver == "470xx"
  notify: Reboot

- name: Install the NVidia 390.xx driver
  ansible.builtin.dnf:
    name: nvidia-x11-drv-390xx
  when: nvidia_driver == "390xx"
  notify: Reboot

- name: Flush handlers
  meta: flush_handlers

...

If the target host has an NVidia card, then I just add the correct version in host_vars.

This also solves another problem. On some hosts, nvidia-detect returned the wrong driver version.
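
For illustration, the host_vars entry mentioned above might look like the sketch below. The file names follow the usual Ansible layout and are assumptions, as is the value chosen for the sandbox host. The empty role default is a hypothetical addition: without it, hosts that do not define nvidia_driver would hit an undefined-variable error in the when: conditions.

```yaml
# host_vars/sandbox.microlinux.lan.yml
# Assumption: file named after the inventory host; the value is illustrative.
nvidia_driver: "470xx"
---
# roles/nvidia_driver/defaults/main.yml
# Hypothetical default so that hosts without an NVidia card skip every task.
nvidia_driver: ""
```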

Cheers,

Niki

You can probably reduce that and make it a little better, something like this should work:

  vars:
    gpus:
      - type: 390xx
        driver: nvidia-x11-drv-390xx
      - type: 470xx
        driver: nvidia-x11-drv-470xx
      - type: 550xx
        driver: nvidia-x11-drv

  - name: Install the NVidia driver
    ansible.builtin.dnf:
      name: "{{ item.driver }}"
    when: nvidia_driver == item.type
    loop: "{{ gpus }}"
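
Another option, sketched here under the assumption that each host needs exactly one driver package, is to replace the loop with a dictionary lookup. The name nvidia_packages is hypothetical; the package names are the ones used earlier in this thread.

```yaml
# roles/nvidia_driver/vars/main.yml (hypothetical location)
nvidia_packages:
  "390xx": nvidia-x11-drv-390xx
  "470xx": nvidia-x11-drv-470xx
  "550xx": nvidia-x11-drv
---
# roles/nvidia_driver/tasks/main.yml
- name: Install the NVidia driver
  ansible.builtin.dnf:
    name: "{{ nvidia_packages[nvidia_driver] }}"
    state: present
  # Skip hosts where nvidia_driver is unset or not a known key.
  when: nvidia_driver | default('') in nvidia_packages
  notify: Reboot
```

This keeps a single task and a single notify, and a host with an unknown or missing value is skipped rather than failing.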