Crash recovery kernel arming (kdump, Rocky 9)

Hi guys ^^,
I’m glad to post here, because I’ve read this forum many times and it has helped me a lot (I’m fairly new to Rocky 9.3, but I spent years on CentOS 8.4).
This is the error I want to discuss with you; I’ll try to explain it as best I can.
I have a PCIe DAQ board that requires an OS reboot to power it on, so I use a “.service” unit to trigger that reboot.
The error never appears after a reboot, only after a power-on, and even then not every time.

I have read about it in two posts:
https://forums.rockylinux.org/t/crash-recovery-kernel-arming/11021
https://forums.rockylinux.org/t/how-do-i-remove-crashkernel-from-cmdline/13346

However, I don’t fully understand or know how to solve the problem. It seems to be related to kdump and Rocky 9; in the first post somebody mentions “Docs”. Where are these docs?
I have disabled kdump.service, but I saw the error again. Can I solve it by avoiding kdump entirely during the OS installation?
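
For anyone debugging the same thing, here is a minimal way to check whether the kernel actually reserved crash-kernel memory (these are standard kernel and kexec-tools interfaces on Rocky 9):

grep -o 'crashkernel=[^ ]*' /proc/cmdline   # boot parameter, if present
cat /sys/kernel/kexec_crash_size            # 0 means no memory was reserved
sudo kdumpctl status                        # reports whether kdump is operational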

Any advice and help is welcome :slight_smile:
Regards

I’m running into the same issue using the RockyLinux 9 AMI on AWS.

[rocky@i-0978b586c4c18d4f2 ~]$ sudo systemctl list-units --failed
  UNIT          LOAD   ACTIVE SUB    DESCRIPTION                 
● kdump.service loaded failed failed Crash recovery kernel arming

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
[rocky@i-0978b586c4c18d4f2 ~]$ sudo systemctl status kdump.service
× kdump.service - Crash recovery kernel arming
     Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Tue 2024-06-18 15:47:44 UTC; 2min 59s ago
   Main PID: 1001 (code=exited, status=1/FAILURE)
        CPU: 40ms

Jun 18 15:47:44 i-0978b586c4c18d4f2.eu-west-1.compute.internal systemd[1]: Starting Crash recovery kernel arming...
Jun 18 15:47:44 i-0978b586c4c18d4f2.eu-west-1.compute.internal kdumpctl[1007]: kdump: No memory reserved for crash kernel
Jun 18 15:47:44 i-0978b586c4c18d4f2.eu-west-1.compute.internal kdumpctl[1007]: kdump: Starting kdump: [FAILED]
Jun 18 15:47:44 i-0978b586c4c18d4f2.eu-west-1.compute.internal systemd[1]: kdump.service: Main process exited, code=exited, status=1/FAILURE
Jun 18 15:47:44 i-0978b586c4c18d4f2.eu-west-1.compute.internal systemd[1]: kdump.service: Failed with result 'exit-code'.
Jun 18 15:47:44 i-0978b586c4c18d4f2.eu-west-1.compute.internal systemd[1]: Failed to start Crash recovery kernel arming.
[rocky@i-0978b586c4c18d4f2 ~]$ 
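
The “No memory reserved for crash kernel” line suggests the image boots without a crashkernel= reservation while kdump.service is still enabled. A sketch of the two obvious fixes; the reset-crashkernel step reflects my understanding of kexec-tools on EL9, so verify it on your instance:

# Option A: disable kdump if you don't need crash dumps on this instance
sudo systemctl disable --now kdump.service

# Option B: restore the default crash-kernel reservation, then reboot
sudo kdumpctl reset-crashkernel --kernel=ALL
sudo reboot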

This is how I look up the AMI:

data "aws_ami" "rocky_linux_9" {
  most_recent = true
  filter {
    name   = "name"
    values = ["Rocky-9-*x86_64-*"]
  }
  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
  owners = ["679593333241"]
}
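
For reference, an equivalent AWS CLI query (assuming the aws CLI is configured for the same region) that shows which image the Terraform filter resolves to:

aws ec2 describe-images \
  --owners 679593333241 \
  --filters 'Name=name,Values=Rocky-9-*x86_64-*' 'Name=architecture,Values=x86_64' \
  --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
  --output text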

The actual AMI I get with this is

  • AMI ID: ami-0cb9745e56da171c2
  • AMI Name: Rocky-9-EC2-LVM-9.4-20240523.0.x86_64-prod-hyj6jp3bki4bm

I’m not sure how to proceed. Appreciate your help!

FWIW, downgrading to Rocky 9.3 seems to fix it.

 data "aws_ami" "rocky_linux_9" {
   most_recent = true
   filter {
     name   = "name"
-    values = ["Rocky-9-*x86_64-*"]
+    values = ["Rocky-9-*9.3-*x86_64-*"]
   }
   filter {
     name   = "architecture"
     values = ["x86_64"]
   }
   owners = ["679593333241"]
 }

Now I also realize why I had to adjust my cloud-init: the block device layout changed between Rocky 9.3 and 9.4 as well. Is this intentional? (In the diff, old is 9.3, new is 9.4.)

runcmd:
-  - [ growpart, /dev/nvme0n1, 5 ]
-  - [ pvresize, /dev/nvme0n1p5 ]
-  - [ lvresize, -l, +100%FREE, /dev/mapper/rocky-root ]
+  - [ growpart, /dev/nvme0n1, 4 ]
+  - [ pvresize, /dev/nvme0n1p4 ]
+  - [ lvresize, -l, +100%FREE, /dev/mapper/rocky-lvroot ]
   - [ xfs_growfs, / ]
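
One way to confirm the partition number and LV name on a given AMI before hard-coding them in cloud-init (standard util-linux and LVM tools):

lsblk -o NAME,TYPE,SIZE,MOUNTPOINTS   # partition numbers and LV layout
sudo pvs                              # which partition backs the volume group
sudo lvs                              # logical volume names (root vs. lvroot)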

I had tested disabling kdump; however, I hadn’t tested “erasing” it:
grubby --update-kernel=ALL --args="crashkernel=no"
grub2-mkconfig -o /boot/grub2/grub.cfg
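
To verify the argument change actually took effect, run this after the next reboot:

grubby --info=ALL | grep -i crashkernel   # per-entry boot arguments
cat /proc/cmdline                         # arguments the running kernel booted with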

The error changed, and I saw that it was related to my reboot.
I have a .service file that reboots the machine to work around a power problem on a DAQ PCIe board. It seems the reboot right after checking the board happens at a bad moment.
I’m now testing with a 5-10 s delay before the reboot command in the .service, and the failure seems to have disappeared :open_mouth:
Still testing.

It is not failing anymore.
My solution was to add a “sleep 5;” before the reboot command in the X.service unit I use to reboot the PC. Something about the timing of running reboot from inside a .service was failing.
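
For anyone hitting the same timing issue, a minimal sketch of the workaround as a oneshot unit; the unit name is a placeholder, and the logic that decides whether a reboot is needed (the DAQ board check) is omitted, since the original .service wasn’t posted:

[Unit]
Description=Delayed reboot to power-cycle the DAQ PCIe board (placeholder)

[Service]
Type=oneshot
# The sleep gives late-starting units (like kdump) time to settle
# before the reboot request; this is the "sleep 5;" workaround.
ExecStart=/bin/sh -c 'sleep 5; systemctl reboot'

Note there is deliberately no [Install] section here: the unit should only run from whatever trigger detects the power-on condition, otherwise it would reboot on every boot.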
