Rocky v8.4 on Dell XPS 8940 is unstable!

I’ve installed Rocky v8.4 from the official DVD image on my brand-new Dell XPS 8940.

Even though it seems to install just fine, the system reboots itself after a period of idleness.

I’ve turned off the screen lock, the background is a plain color, and so far as I know there is nothing going on. The machine has a hard-wired LAN connection, so WiFi and Bluetooth are turned off.

It happens whether I leave the system idle with a user logged in or just leave it at the console login. I notice that the fans spin after about an hour.

The updater wanted to update Rocky and Firefox. Here are the results of uname -a:

Linux localhost.localdomain 4.18.0-305.10.2.el8_4.x86_64 #1 SMP Tue Jul 20 20:34:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

The system is stable while booted into Windows 10 Pro.

My next step is to install the standard CentOS 8 ISO and see how it behaves.

I expect a system to stay up and running for an arbitrary time so long as there is no activity.

Did you get the Linux system log from the unstable machine? You can only find out what happened on your PC by checking the system log. :sweat_smile:

Not this time. I did turn on persistent logging for journald/journalctl (I created /var/log/journal), but I didn’t get a chance to collect that log. I’ll do so the next time around, probably sometime tomorrow.

It doesn’t matter so much at the moment because I don’t have enough TL points here to attach any files :(.

If you can provide some things to look for, I can copy and paste excerpts.

I understand the challenges of moderating a site like this. I would welcome an opportunity to, for example, kick in a reasonable premium and get such privileges in return.

Hello @SomervilleTom

When that happens and the system is idle, does the screen go off or not?

If yes, then most probably it is an issue related to your video driver.

Typically, the system reboots and ends up at the login screen.

I’ll log in at the console as a user and I’ll see the regular GNOME desktop (I leave the background dark green so that it’s easy to see from across the room). I keep screen-lock turned off, and I’ve at least attempted to turn off all the things that attempt to hibernate and so on. So I expect the system to simply sit there with a green screen. That, by the way, is exactly what happens when I boot into Windows 10 Pro and log in. So I know this is not a hardware issue.

What typically happens is that after a time – 1 hour? 2-3 hours? I’m not sure – I’ll notice that the fans are running. I’ll look over at the machine, and the screen will be black. When I hit “Enter” from the keyboard, I’m back at the console login.

Perhaps there is a “feature” that logs out a user after some timeout. If so, I’ve never heard of it and I want to turn it off.

More likely is that the system has crashed, rebooted, and returned to the console login prompt.

Screen lock is about logging you out after a certain time or when your screen goes off.

Console means the black window, the terminal (cmd in Windows terms), but GNOME is the GUI, so which one do you log in to?

I never said the issue is with your hardware. I talked about an issue related to your video driver (so, in other words, a software issue), but I cannot be sure without checking the log.

I’m not sure what software you have, and it is fine that the fan runs when the CPU load increases; maybe there is a service that runs when your device is idle.

Go to Settings, then Power, then change the setting for Blank Screen.
That would keep the screen on.

Here are some excerpts from the persistent journal, collected from journalctl:

Jul 31 11:51:57 localhost.localdomain systemd[1]: Received SIGRTMIN+21 from PID 595 (plymouthd).
Jul 31 11:51:57 localhost.localdomain systemd[1]: Started Hold until boot process finishes up.
Jul 31 11:51:57 localhost.localdomain systemd[1]: Reached target Multi-User System.
Jul 31 11:51:57 localhost.localdomain systemd[1]: Reached target Graphical Interface.
Jul 31 11:51:57 localhost.localdomain systemd[1]: Starting Update UTMP about System Runlevel Changes...
Jul 31 11:51:57 localhost.localdomain systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Jul 31 11:51:57 localhost.localdomain systemd[1]: Started Update UTMP about System Runlevel Changes.
Jul 31 11:51:57 localhost.localdomain systemd[1]: Startup finished in 11.289s (firmware) + 6.630s (loader) + 1.242s (kernel) + 6.212s (initrd) + 10.985s (userspace) = 36.359s.
Jul 31 11:51:58 localhost.localdomain chronyd[1205]: Selected source 162.221.74.15
Jul 31 11:51:58 localhost.localdomain chronyd[1205]: System clock TAI offset set to 37 seconds
Jul 31 11:52:04 localhost.localdomain systemd[1]: NetworkManager-dispatcher.service: Succeeded.
Jul 31 11:52:26 localhost.localdomain systemd[1]: systemd-localed.service: Succeeded.
Jul 31 11:52:27 localhost.localdomain systemd[1]: systemd-hostnamed.service: Succeeded.
Jul 31 11:52:27 localhost.localdomain systemd[1]: fprintd.service: Succeeded.
-- Reboot --
Jul 31 11:53:56 localhost.localdomain kernel: printk: systemd: 16 output lines suppressed due to ratelimiting
Jul 31 11:53:56 localhost.localdomain kernel: audit: type=1404 audit(1627732435.848:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=selinux res=1
Jul 31 11:53:56 localhost.localdomain kernel: SELinux:  policy capability network_peer_controls=1
Jul 31 11:53:56 localhost.localdomain kernel: SELinux:  policy capability open_perms=1
...
Jul 31 11:53:56 localhost.localdomain kernel: input: Dell AIO WMI hotkeys as /devices/virtual/input/input7
Jul 31 11:53:56 localhost.localdomain kernel: usbcore: registered new interface driver btusb
Jul 31 11:53:56 localhost.localdomain kernel: Bluetooth: hci0: Bootloader revision 0.4 build 0 week 30 2018
Jul 31 11:53:56 localhost.localdomain kernel: Bluetooth: hci0: Device revision is 2
Jul 31 11:53:56 localhost.localdomain kernel: Bluetooth: hci0: Secure boot is enabled
Jul 31 11:53:56 localhost.localdomain kernel: Bluetooth: hci0: OTP lock is enabled
-- Reboot --
Jul 31 11:55:44 localhost.localdomain kernel: printk: systemd: 16 output lines suppressed due to ratelimiting
Jul 31 11:55:44 localhost.localdomain kernel: audit: type=1404 audit(1627732543.811:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=selinux res=1
Jul 31 11:55:44 localhost.localdomain kernel: SELinux:  policy capability network_peer_controls=1
Jul 31 11:55:44 localhost.localdomain kernel: SELinux:  policy capability open_perms=1
Jul 31 11:55:44 localhost.localdomain kernel: SELinux:  policy capability extended_socket_class=1
Jul 31 11:55:44 localhost.localdomain kernel: SELinux:  policy capability always_check_network=0
Jul 31 11:55:44 localhost.localdomain kernel: SELinux:  policy capability cgroup_seclabel=1
Jul 31 11:55:44 localhost.localdomain kernel: SELinux:  policy capability nnp_nosuid_transition=1
Jul 31 11:55:44 localhost.localdomain kernel: audit: type=1403 audit(1627732543.921:3): auid=4294967295 ses=4294967295 lsm=selinux res=1

That happened while the system was unattended. These are clearly reboots, which are different from logging out a user; in the excerpt above, the boots that follow them start logging at 11:53:56 and 11:55:44. In fact, they happen about every 2-5 minutes (based on a search of the log):

	Line 11920: -- Reboot --
	Line 13166: -- Reboot --
	Line 14416: -- Reboot --
	Line 15664: -- Reboot --
	Line 16988: -- Reboot --
	Line 21547: -- Reboot --
	Line 23678: -- Reboot --
	Line 25564: -- Reboot --
	Line 26191: -- Reboot --
	Line 26826: -- Reboot --
	Line 26904: -- Reboot --
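
For anyone repeating that search, here is a minimal sketch of how the boot boundaries could be counted from a plain-text export of the journal. It assumes the journal was saved with something like "journalctl > journal.txt" (the file name is hypothetical) and that each boundary appears as a line reading exactly "-- Reboot --", as in the excerpt above. The function names are my own, not part of any tool.

```shell
#!/bin/sh
# Count boot boundaries in a plain-text journal export.
# Assumption: each boundary is a line reading exactly "-- Reboot --".
count_reboots() {
    # -c prints the number of matching lines; -x requires a whole-line match.
    # The bare "--" ends option parsing so the pattern may start with dashes.
    grep -c -x -- '-- Reboot --' "$1"
}

# Print the last line logged before each boot boundary, which is often
# the most useful hint about what the machine was doing when it went down.
last_line_before_reboot() {
    awk '$0 == "-- Reboot --" { if (prev != "") print prev; next }
         { prev = $0 }' "$1"
}
```

With the export in place, "count_reboots journal.txt" prints the number of boundaries and "last_line_before_reboot journal.txt" prints one line per boundary. Running "journalctl --list-boots" gives a similar overview directly from the journal, one line per recorded boot.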

I’ve allegedly turned off the screen lock, so that shouldn’t be an issue. I’ve already adjusted the screen-blanking settings, and I’ve already configured the power management to never hibernate and to power down when I press the power button.

I’ve been using CentOS 5 through CentOS 7 for more than ten years, so I’m somewhat familiar with what the system should be doing.

I do see some indications of graphics driver issues in the journal and at the console:

Jul 30 18:22:12 localhost.localdomain gnome-shell[1972]: Error setting property 'Powered' on interface org.bluez.Adapter1: GDBus.Error:org.bluez.Error.Blocked: Blocked through rfkill (g-io-error-quark, 36)
...

In the midst of at least one shutdown I see this:
Jul 30 20:09:29 localhost.localdomain kernel: nouveau 0000:02:00.0: tmr: stalled at ffffffffffffffff
Jul 30 20:09:29 localhost.localdomain kernel: ------------[ cut here ]------------
Jul 30 20:09:29 localhost.localdomain kernel: nouveau 0000:02:00.0: timeout
Jul 30 20:09:29 localhost.localdomain kernel: WARNING: CPU: 0 PID: 1 at drivers/gpu/drm/nouveau/nvkm/subdev/bar/g84.c:38 g84_bar_flush+0xcc/0xe0 [nouveau]

I also see the following:

Jul 30 14:20:25 localhost.localdomain kernel: [Hardware Error]: event severity: fatal
Jul 30 14:20:25 localhost.localdomain kernel: [Hardware Error]:  Error 0, type: fatal
Jul 30 14:20:25 localhost.localdomain kernel: [Hardware Error]:   section type: unknown, 81212a96-09ed-4996-9471-8d729c8e69ed
Jul 30 14:20:25 localhost.localdomain kernel: [Hardware Error]:   section length: 0xc20
Jul 30 14:20:25 localhost.localdomain kernel: [Hardware Error]:   00000000: 00000201 00000000 00000000 01042001  ............. ..
... (~800 lines elided, all similar to these)
Jul 30 14:20:25 localhost.localdomain kernel: [Hardware Error]:   00000c00: ffffffff ffffffff ffffffff ffffffff  ................
Jul 30 14:20:25 localhost.localdomain kernel: [Hardware Error]:   00000c10: ffffffff ffffffff ffffffff ffffffff  ................

Here is some info about the fan, from the same log later today:

Jul 31 12:21:57 localhost.localdomain systemd[1]: geoclue.service: Succeeded.
Jul 31 12:42:47 localhost.localdomain kernel: nouveau 0000:02:00.0: therm: temperature (90 C) hit the 'fanboost' threshold
Jul 31 12:43:50 localhost.localdomain kernel: pcieport 0000:00:01.0: Multiple Corrected error received: 0000:00:01.0
Jul 31 12:43:50 localhost.localdomain kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jul 31 12:43:50 localhost.localdomain kernel: pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00000001/00002000
Jul 31 12:43:50 localhost.localdomain kernel: pcieport 0000:00:01.0:    [ 0] RxErr                 
Jul 31 12:43:50 localhost.localdomain systemd[1]: Starting dnf makecache...
Jul 31 12:43:51 localhost.localdomain kernel: xhci_hcd 0000:02:00.2: can't change power state from D3cold to D0 (config space inaccessible)
Jul 31 12:43:51 localhost.localdomain kernel: xhci_hcd 0000:02:00.2: can't change power state from D3hot to D0 (config space inaccessible)
Jul 31 12:43:51 localhost.localdomain kernel: xhci_hcd 0000:02:00.2: Controller not ready at resume -19
Jul 31 12:43:51 localhost.localdomain kernel: xhci_hcd 0000:02:00.2: PCI post-resume error -19!
Jul 31 12:43:51 localhost.localdomain kernel: xhci_hcd 0000:02:00.2: HC died; cleaning up

I don’t know what that last line means (“HC died; cleaning up”).

There is evidence elsewhere that the nouveau driver is incompatible with the NVIDIA driver. Perhaps this is the source of some or all of the instability?

Nouveau cannot be used when the nvidia driver is loaded. Likewise, if nouveau is loaded, the nvidia driver cannot be used. You are using either one or the other. This is not the source of your problem. Also:

Jul 30 18:22:12 localhost.localdomain gnome-shell[1972]: Error setting property 'Powered' on interface org.bluez.Adapter1: GDBus.Error:org.bluez.Error.Blocked: Blocked through rfkill (g-io-error-quark, 36)

That is a wireless or Bluetooth problem; rfkill has nothing to do with nouveau.

Your problem is this:

You have a hardware problem, as shown by the “Hardware Error” messages. I suggest looking inside your machine, reseating the memory, and booting from an ISO which has memtest on it so you can check your memory. Reseat any PCI cards, including your graphics card, network cards, or whatever else you have, in case they are not fitted properly.

It also doesn’t rule out that some of your hardware might even be faulty, and reseating components might not help, but it’s a good place to start.

“Your problem is this:”

This is really helpful, thank you.

So far, that complaint has occurred twice since Friday. I realize that I apparently omitted a key piece of context when I posted the log excerpt: the preceding line, saying that these were from BERT. Here’s the beginning of the most recent occurrence with the BERT notice included:

Jul 31 07:45:18 localhost.localdomain kernel: BERT: Error records from previous boot:
Jul 31 07:45:18 localhost.localdomain kernel: [Hardware Error]: event severity: fatal
Jul 31 07:45:18 localhost.localdomain kernel: [Hardware Error]:  Error 0, type: fatal
Jul 31 07:45:18 localhost.localdomain kernel: [Hardware Error]:   section type: unknown, 81212a96-09ed-4996-9471-8d729c8e69ed
Jul 31 07:45:18 localhost.localdomain kernel: [Hardware Error]:   section length: 0xc20
Jul 31 07:45:18 localhost.localdomain kernel: [Hardware Error]:   00000000: 00000201 00000000 00000000 01042001  ............. ..
...
Jul 31 07:45:18 localhost.localdomain kernel: [Hardware Error]:   00000c10: ffffffff ffffffff ffffffff ffffffff  ................
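
Since each of these records carries hundreds of lines of hex dump, it may help to pull out only the header lines. Here is a small sketch, assuming the journal has been exported to a text file (journal.txt is a hypothetical name) and that the records look like the ones quoted above. The function name is made up for illustration.

```shell
#!/bin/sh
# Show only the header lines of BERT / [Hardware Error] records,
# skipping the raw hex dump that follows each section header.
# Assumption: the records are formatted like the excerpt in this thread.
hw_error_summary() {
    grep -E 'BERT: |\[Hardware Error\]: +(event severity|Error [0-9]+, type|section type|section length)' "$1"
}
```

Calling "hw_error_summary journal.txt" should leave a handful of lines per record, which makes it easier to see how often the error recurs and whether the section type is always the same. On a live system, piping "journalctl -k" through the same grep pattern should give an equivalent view.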

I note an apparently relevant Bugzilla item:

...
It is just you're lucky to have a firmware that tells you extra information of the previous failure.
So if you often got this BERT log after system failures, then let's fix the failure first.
If you got this BERT log very occasionally, then you can ignore it, just like you can ignore the failure that causes this log. :) what do you think?

That thread was closed because of insufficient data last March. I notice that the referenced system was another Dell machine. Since mine is a brand-new system, I think I’ll open a ticket with Dell and see what they have to say.

For now, I’m going to resolve the nouveau/nvidia conflict and see where this lands.

I appreciate your attention and helpful response!

Just blacklist the module:

$ grep black /etc/default/grub
GRUB_CMDLINE_LINUX=" … module_blacklist=nouveau … "

$ cat /etc/modprobe.d/nouveau.conf
blacklist nouveau

$ depmod -a

and rebuild your grub config …
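
To check after rebooting that the parameter actually took effect, something like the following could be used. This is a sketch, not a tested recipe for this machine: the function name is made up, and the grubby and dracut commands in the comments are the usual RHEL 8 way of adding a kernel argument and regenerating the initramfs, which I am assuming applies to Rocky 8.4 as well.

```shell
#!/bin/sh
# Return 0 if a kernel command line blacklists the given module.
#   $1: kernel command line text
#   $2: module name
# module_blacklist= takes a comma-separated list, so each entry is checked.
cmdline_blacklists() {
    for arg in $1; do
        case "$arg" in
            module_blacklist=*)
                list=${arg#module_blacklist=}
                case ",$list," in
                    *",$2,"*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

# Commands that would apply the change (run as root). They are left as
# comments so that this script only inspects the system:
#   grubby --update-kernel=ALL --args="module_blacklist=nouveau"
#   dracut --force
#   reboot

# Report on the currently running kernel, if its command line is readable.
if [ -r /proc/cmdline ]; then
    if cmdline_blacklists "$(cat /proc/cmdline)" nouveau; then
        echo "nouveau is blacklisted on the kernel command line"
    else
        echo "nouveau is NOT blacklisted on the kernel command line"
    fi
fi
```

Once the module is blocked, "lsmod | grep nouveau" should print nothing.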

“Console means the black window, the terminal (cmd in Windows terms), but GNOME is the GUI, so which one do you log in to?”

I use “console” to mean the screen that appears on the monitor after the system has finished starting. It is generally a black screen with a password prompt for a default non-root user.

I’ve used the “settings” app to make the background a solid color, turn off screen-lock, never turn off the screen, and so on.

I’ve followed the various recommendations and turned off nouveau in the boot loader/grub (hat-tip to Ritov). That seems to have resolved the hardware error complaints and the reboots.

It appears that GNOME logs out any given session after some timeout period. My next step is to replace GNOME with KDE Plasma. When I settle on a GUI that I can tolerate, I’ll spend more time adjusting its configuration.

Now that I seem to have a stable configuration (upgrading the Rocky build and turning off nouveau seem to have solved those problems), I’ll return to the task of getting VMware Workstation Pro 16 running on this platform.

I returned the Dell XPS-8940 and got a full refund. I’ll be back after I get Rocky Linux installed on more cooperative hardware.
