Weird display partial lockup

I’m running into a weird display driver lockup on one of my Rocky 9 boxes. The cursor still moves with the mouse and the keyboard still works, so I can switch to an alternate console, kill the Xorg process, and run startx again. After that it runs fine for anywhere from a few hours to a few days before the next lockup.

Research here and with Google seems to indicate that the Radeon and/or amdgpu driver may be the problem. Oddly (to me) both seem to be getting loaded:

[dave@bend ~]$ lsmod | egrep 'amdgpu|radeon'
amdgpu              11087872  0
iommu_v2               24576  1 amdgpu
drm_buddy              20480  1 amdgpu
gpu_sched              57344  1 amdgpu
radeon               2068480  16
drm_ttm_helper         16384  2 amdgpu,radeon
ttm                    98304  3 amdgpu,radeon,drm_ttm_helper
video                  73728  2 amdgpu,radeon
drm_display_helper    200704  2 amdgpu,radeon
drm_kms_helper        245760  3 drm_display_helper,amdgpu,radeon
drm                   704512  20 gpu_sched,drm_kms_helper,drm_display_helper,drm_buddy,amdgpu,radeon,drm_ttm_helper,ttm
i2c_algo_bit           16384  3 igb,amdgpu,radeon

Video card is:

04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Bonaire XTX [Radeon R7 260X/360] (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. AMD Radeon R7 260X
        Physical Slot: 2
        Flags: bus master, fast devsel, latency 0, IRQ 64
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at f0000000 (64-bit, prefetchable) [size=8M]
        I/O ports at 2000 [size=256]
        Memory at f0900000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: radeon
        Kernel modules: radeon, amdgpu

I can blacklist either driver, but I’m wondering 1) whether this is the right solution and 2) which one to blacklist.

My other Rocky 9 box with an NVIDIA graphics card does not have this problem.

Thanks!

I’ve had this problem on my AMD box since the upgrade to 9.3. Running the last kernel from 9.2 fixes it, but obviously you then miss security updates, so I’m now running kernel-lt from ELRepo, which is stable.

The fact that you can get to a console is helpful. You should be able to run top to see whether an application has become a zombie. I have a hard time using journalctl to look for errors, so I installed rsyslog to get the traditional logs in /var/log. When this desktop freeze happens there should be some output at the end of /var/log/messages or /var/log/gdm/gdm.log. I’m assuming you’re using the default desktop, GNOME. I use Mate, so I know nothing about Wayland.
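For anyone hunting for the same thing, here is a minimal sketch of what could be run from the alternate console right after a freeze. The `gpu_log_filter` helper name is made up for illustration, and the pattern only matches the driver names mentioned in this thread, so it may well miss other relevant messages.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only log lines that mention the GPU drivers
# discussed in this thread (case-insensitive). Reads stdin, writes stdout.
gpu_log_filter() {
    grep -iE 'amdgpu|radeon|drm'
}

# Possible usage right after a lockup (root may be required):
#   journalctl -k --since "1 hour ago" | gpu_log_filter   # kernel messages from the journal
#   dmesg -T | gpu_log_filter                             # kernel ring buffer, readable timestamps
#   tail -n 200 /var/log/messages | gpu_log_filter        # only if rsyslog is installed
#   ps -eo pid,stat,comm | awk '$2 ~ /^Z/'                # list zombie processes, if any
```

Limiting the journal to the last hour keeps the output small enough to read when the exact time of the hang is roughly known.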

It is the amdgpu driver in the kernel which locks up, as seen in dmesg output. It is not associated with any particular application, although running video applications like VLC or gmplayer almost always triggered the bug within a few seconds to minutes.

Running Mate here too so no gdm.log and haven’t seen any “smoking gun” in messages or dmesg. I usually only get the lockup after being away from the system for a while (hours) which makes it hard to find an obscure event in dmesg…especially when I don’t know exactly what I’m looking for.

I start my “real” systems in text mode and then do a startx. It’s a habit from many years of getting X working. I ran into a different weird problem getting VNC working on VMs due to gdm grabbing the bus. I’d just as soon not have GNOME on my systems.

Cheers

Running a video-intensive application fits with what I’m seeing. The system usually runs fine for several days if I don’t have any graphics-intensive program running, but locks up after a few hours or less if I do. I’ll try blacklisting the amdgpu module and see what happens. I tend to leave my system up so I can jump back into whatever I was doing, but having the display be unusable kind of shoots that down.

Update:

Blacklisted the amdgpu module using the method on the Red Hat website. Good so far. Will leave something graphical running and see what happens.

[root@bend ~]# lsmod | egrep 'amdgpu|radeon'
radeon               2068480  16
drm_ttm_helper         16384  1 radeon
ttm                    98304  2 radeon,drm_ttm_helper
video                  73728  1 radeon
drm_display_helper    200704  1 radeon
drm_kms_helper        245760  2 drm_display_helper,radeon
drm                   704512  17 drm_kms_helper,drm_display_helper,radeon,drm_ttm_helper,ttm
i2c_algo_bit           16384  2 igb,radeon
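For reference, a rough sketch of one common way to blacklist a module on an EL9 system. This is an assumption about the general approach, not necessarily the exact steps from the Red Hat article used above; the `blacklist_conf` helper and the file name `blacklist-amdgpu.conf` are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the modprobe.d lines that keep a module
# from loading, both via the blacklist and via an explicit modprobe.
blacklist_conf() {
    printf 'blacklist %s\n' "$1"
    printf 'install %s /bin/false\n' "$1"
}

# Applying it (as root); the file name is an arbitrary choice:
#   blacklist_conf amdgpu > /etc/modprobe.d/blacklist-amdgpu.conf
#   dracut --force                    # rebuild the initramfs for the running kernel
#   grubby --update-kernel=ALL \
#          --args="rd.driver.blacklist=amdgpu modprobe.blacklist=amdgpu"
#   reboot
#   lsmod | grep -E 'amdgpu|radeon'   # amdgpu should no longer be listed
```

The kernel arguments cover the early boot stage, where the initramfs could otherwise load the module before the modprobe.d file is read.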

Thanks!

Quick follow-up: better, but still not really fixed. Blacklisting the amdgpu module has reduced the frequency of the lockups, but I have had two since my previous post.

The lockups appear to be happening when the screensaver detects activity and unblanks the screen. Also, the lockups only appear to happen when the screensaver has put the display “to sleep” and the monitor has to get powered up; not when the screensaver is still showing my usual slide show.

I disabled screensavers and “idle blanking” a long time ago, probably because of the unsolved issue you are seeing. This isn’t a GPU-specific issue. I would see it happening on Intel and NVIDIA graphics as well.
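A minimal sketch of turning off blanking and monitor power saving at the X server level with xset. The `run` and `disable_blanking` names are made up for illustration, and a desktop screensaver or power manager may apply its own settings on top of these, so treat it as a starting point only.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command when DRY_RUN is set, else run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Disable X screen blanking and DPMS for the current X session.
disable_blanking() {
    run xset s off       # turn off the X screensaver
    run xset s noblank   # do not blank the video device
    run xset -dpms       # disable DPMS monitor power saving
}

# Usage inside the X session (for example from ~/.xinitrc before the
# window manager starts):
#   disable_blanking
#   xset q               # shows the resulting screensaver and DPMS state
```

These settings last only for the running X session, so they need to be reapplied after each startx.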

Screen blanking is an old habit from living in the world of CRTs for years. First time I’ve seen it cause something this severe and I usually leave systems up:
`[dave@fraud ~]$ uptime
08:56:47 up 177 days, 19:02, 5 users, load average: 0.00, 0.01, 0.05`
The connection to heavy graphics use could explain why I haven’t seen it before, since that’s not my typical system use.

Hopefully one last update. Same problem this morning but I tried something different instead of restarting Xorg from a virtual console. I cycled power on the monitor which shouldn’t help if the driver is wedged. Everything came back.

Guessing that cycling power caused the monitor to fully re-synch while just unblanking the screen (signal to monitor) was leaving the display hung on the driver side. Since the display is still updated until the unblank event, the driver is somehow going off the rails when it comes back up.
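If the monitor only needs a fresh signal, it might be possible to force that from the alternate console instead of reaching for the power button. The sketch below is untested against this particular hang; the `run` and `resync_display` names are made up for illustration, and `HDMI-0` is a placeholder for whatever output name `xrandr --query` reports.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command when DRY_RUN is set, else run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Turn one output off and back on so the monitor has to re-sync.
# $1 is the xrandr output name (placeholder in the usage below).
resync_display() {
    run xrandr --output "$1" --off
    run sleep 2
    run xrandr --output "$1" --auto
}

# Usage from a text console, as the user who ran startx:
#   export DISPLAY=:0          # assumes the X server is display :0
#   xrandr --query             # find the real output name
#   resync_display HDMI-0
```

If this brings the picture back, that would support the idea that the unblank signal, not the driver as a whole, is what goes wrong.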