Rocky 8 server freezes after 3-10 hours, journalctl killed. Kernel/IO issue?

selfhost · July 8, 2023, 5:06pm

Hi Rocky community!

I have a multi-purpose Linux server at home which has run Fedora/Fedora Server/CentOS/Rocky over the years without any major issues.

However, over the past couple of weeks, in the afternoon every day the server appears to freeze/lock up. I suspect a kernel/IO error.

Current server configuration:

100Mbps down, 10Mbps up CAT6 residential connection
3TB HDD: /dev/sdb, contains /
120GB SSD: /dev/sda, contains /mnt/SSD01_120G. Only stores MariaDB mysql databases

$ inxi -v4z:

System:
  Kernel: 4.18.0-477.15.1.el8_8.x86_64 arch: x86_64 bits: 64 compiler: gcc v: 8.5.0
    Console: pty pts/1 Distro: Rocky Linux release 8.8 (Green Obsidian) base: RHEL 8.8
Machine:
  Type: Desktop product: PRIME Z270-P v: N/A serial: <superuser required>
  Mobo: ASUSTeK model: PRIME Z270-P v: Rev X.0x serial: <superuser required>
    UEFI: American Megatrends v: 1205 date: 05/11/2018
CPU:
  Info: quad core model: Intel Core i7-7700K bits: 64 type: MT MCP arch: Kaby Lake rev: 9 cache:
    L1: 256 KiB L2: 1024 KiB L3: 8 MiB
  Speed (MHz): avg: 4400 min/max: 800/4500 cores: 1: 4400 2: 4400 3: 4400 4: 4400 5: 4400
    6: 4400 7: 4400 8: 4400 bogomips: 67200
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: Intel HD Graphics 630 vendor: ASUSTeK driver: i915 v: kernel arch: Gen-9.5
    bus-ID: 00:02.0
  Device-2: NVIDIA GP107 [GeForce GTX 1050 Ti] driver: nouveau v: kernel arch: Pascal
    bus-ID: 01:00.0 temp: 36.0 C
  Display: web server: X.org v: 1.20.11 with: Xwayland v: 21.1.3 driver: X: loaded: nvidia
    unloaded: fbdev,modesetting,nouveau,vesa gpu: nouveau tty: 220x59 resolution: 1920x1080
  API: OpenGL Message: GL data unavailable in console. Try -G --display
Network:
  Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: ASUSTeK PRIME B450M-A
    driver: r8169 v: kernel port: d000 bus-ID: 04:00.0
  IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  IF-ID-1: br-6b782f5b73d5 state: up speed: 10000 Mbps duplex: unknown mac: <filter>
  ...
  IF-ID-4: docker0 state: down mac: <filter>
  IF-ID-5: pterodactyl0 state: up speed: 10000 Mbps duplex: unknown mac: <filter>
  IF-ID-6: veth049e9bb state: up speed: 10000 Mbps duplex: full mac: <filter>
  ...
  IF-ID-19: virbr0 state: down mac: <filter>
Drives:
  Local Storage: total: 2.84 TiB used: 2.13 TiB (75.1%)
  ID-1: /dev/sda vendor: Kingston model: SUV400S37120G size: 111.79 GiB
  ID-2: /dev/sdb vendor: Toshiba model: HDWD130 size: 2.73 TiB
Partition:
  ID-1: / size: 2.68 TiB used: 2.12 TiB (79.3%) fs: ext4 dev: /dev/sdb3
  ID-2: /boot size: 973.4 MiB used: 312.7 MiB (32.1%) fs: ext4 dev: /dev/sdb2
  ID-3: /boot/efi size: 499.7 MiB used: 5.8 MiB (1.2%) fs: vfat dev: /dev/sdb1
  ID-4: swap-1 size: 8 GiB used: 2.9 GiB (36.3%) fs: swap dev: /dev/sdb4
Info:
  Processes: 317 Uptime: 6h 21m Memory: available: 31.27 GiB used: 18.32 GiB (58.6%) Init: systemd
  target: graphical (5) Compilers: gcc: 8.5.0 Packages: 51 note: see --rpm Shell: Zsh v: 5.5.1
  inxi: 3.3.27

Recent changes (chronological)

Enabled live kernel patching (updates) via cockpit
Half-installed postfix. Haven’t finished configuring and hardening. Not currently allowed through the firewall.
Changed certbot’s cronjob to execute at a more random time:
0 0,12 * * * python3 -c 'import random; import time; time.sleep(random.random() * 3600)' && certbot renew
Set up 15 cronjobs to make an API request to point my domains to my current IP, as my IP is dynamic:
0 0,12 * * * python3 -c 'import random; import time; time.sleep(random.random() * 3600)' && curl "https://api-endpoint.com/?...
Set up fstab to mount the SSD on boot:
UUID=... /mnt/SSD01_120G ext4 defaults 1 2
Moved /var/lib/mysql to /mnt/SSD01_120G/var/lib/mysql and changed the appropriate socket configurations in mariadb, php-fpm etc
Set up fail2ban to protect ssh, nginx, sftp (finally)
Started hosting endlessh on port 22 using docker (actual ssh is on another port in the 1024+ range)
Disabled live kernel patching via cockpit, and uninstalled kpatch kpatch-dnf for good measure

Server freezing symptoms which usually occur in the afternoon:

My Minecraft servers still accept player joins, but players have very unstable connectivity issues, until the servers stop accepting joins altogether
My webservers are very slow, until they stop loading at all
When attempting to SSH into the server, my key is accepted and I’m prompted to enter my TOTP code, but once entered nothing happens (no output via ssh -vvvvv)
The disk usage light on the front of the PC tower flashes only intermittently (normally it flashes multiple times per second)
Cannot log in via a monitor/keyboard at the local tty screen. Upon entering my username, no password prompt appears, and then the login times out after 60 seconds
The only way to regain control of the server is to hard reboot it using the physical power button on the PC case

Logs

All logs available here: Shared Folder
Due to image, link and character limits imposed on new forum accounts, I have placed all logs into this shared folder. Please note that hardened browsers with JavaScript Just-In-Time disabled may take up to a minute to initially ‘decrypt’ the link.

tty login screen

A monitor connected to the server shows logs appearing on the tty login screen when the server is in the frozen/halted state: Please see shared folder link above

Journalctl

Upon regaining control of the server after a forced reboot, the journalctl logs stop abruptly during the period when I experience the aforementioned symptoms, well before I reboot:
Command used to gather logs:
$ sudo journalctl --since "2023-06-04" | grep -B50 -A3 "\-\- Reboot \-\-" | grep -iv "unit-which-logs-personal-info" | sed -e 's/info-to-redact/redacted/g'
Command output: Please see shared folder link above

Journalctl log clarification

Example of a healthy/intentional shutdown/reboot:

Jul 04 22:10:36 host.domain.com systemd-journald[569]: Journal stopped
-- Reboot --
Jul 04 22:11:00 host.domain.com kernel: microcode: microcode updated early to revision 0xf0, date = 2021-11-12

Example of a system freeze where I manually force shutdown the machine via the power button several hours later:

Jul 04 17:37:25 host.domain.com sshd[682493]: error: kex_exchange_identification: Connection closed by remote host
-- Reboot --
Jul 04 19:30:28 host kernel: microcode: microcode updated early to revision 0xf0, date = 2021-11-12

The warning
WARNING: Failed starting API: listen tcp 127.0.0.1:8384: bind: address already in use
is likely caused by me using ssh user@host -L 8384:localhost:8384 to forward the server’s localhost Syncthing dashboard through SSH (a local port forward) so that I can view it from another device. I have been using this port-forward flag for months and the server has been fine, so my freezes are likely unrelated to this error message.
Crontab contains the job * * * * * php /var/www/pterodactyl/artisan schedule:run >> /dev/null 2>&1 which I believe causes most of the root logins which you see in the logs.
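
The difference between a clean shutdown and a freeze described above can be checked mechanically. The following is a minimal sketch, assuming the journal text uses the `-- Reboot --` marker and the `Journal stopped` message exactly as shown in this thread; the function name `find_unclean_reboots` is made up for illustration.

```shell
#!/bin/sh
# Read journalctl text output on stdin and print the last line logged before
# each "-- Reboot --" marker, but only when that line is NOT the clean
# "Journal stopped" message written by systemd-journald on shutdown.
find_unclean_reboots() {
    awk '
        /^-- Reboot --$/ {
            if (prev != "" && prev !~ /systemd-journald\[[0-9]+\]: Journal stopped$/)
                print prev
            prev = ""
            next
        }
        { prev = $0 }
    '
}

# Example use (not run here):
#   sudo journalctl --since "2023-06-04" | find_unclean_reboots
```

Each printed line is the last thing the journal managed to write before an unclean stop, which gives a timestamp for when logging died on each affected boot.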

dstat

I ran $ sudo dstat -tcdrgilmns to check if the freezing was due to a resource bottleneck. System, CPU, memory, disk and network resource usage, sampled every 5 seconds until the ‘freeze’, is in a colour-coded .ods spreadsheet available at the shared folder link above.

What I’ve tried that hasn’t worked

$ sudo dnf update --refresh -y && sudo reboot
Booting memtest, it PASSED.
Running a SMART ‘long’ test on the internal SSD and HDD. Both tests were successful
Uninstalling kmod-nvidia (journald presumably crashes system · Issue #14478 · systemd/systemd · GitHub) and rebooting

What I’ve tried very recently (result currently unknown)

Disabled ‘postfix’ service as it shows up in the last few lines of journalctl before journalctl is presumably killed, and I haven’t finished setting it up.

I’ve run out of ideas and can’t really let it run a liveusb for a few days or burn and build the whole server as I depend on it as my own mini cloud.

Please advise.

Cphusion · July 8, 2023, 5:30pm

It sounds similar to what I had with my Proxmox server a while back: it would stop responding and freeze after several hours. I thought the motherboard was failing, so I replaced it. Shortly afterwards the same thing started happening again: after several hours the system would freeze and stop responding. One time after I rebooted, I decided to check the dmesg output and noticed a lot of input/output errors for the OS disk. I replaced it and haven’t had issues since. So the first place I would check is the dmesg output, to see if you have “input/output errors” for your OS disk, because those errors didn’t show up in the syslog.

selfhost · July 8, 2023, 6:05pm

The server just froze again and my ssh session stopped responding so I rebooted, logged in and typed $ sudo dmesg as you suggested. However, I couldn’t find any input/output disk errors:

$ grep -Ei 'input|output|error|disk|sda|sdb' dmesg_2023-07-08_18-56_just-after-freeze-reboot.txt
[    0.000000] RAMDISK: [mem 0x5cbb7000-0x6003efff]
[    0.198211] VFS: Disk quotas dquot_6.6.0
[    0.717053] input: Sleep Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0E:00/input/input0
[    0.717119] input: Power Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input1
[    0.717182] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input2
[    0.939535] systemd[1]: Running in initial RAM disk.
[    1.174575] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input3
[    1.612227] sd 0:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/112 GiB)
[    1.612229] sd 1:0:0:0: [sdb] 5860533168 512-byte logical blocks: (3.00 TB/2.73 TiB)
[    1.612871] sd 0:0:0:0: [sda] 4096-byte physical blocks
[    1.613516] sd 1:0:0:0: [sdb] 4096-byte physical blocks
[    1.613527] sd 1:0:0:0: [sdb] Write Protect is off
[    1.614254] sd 0:0:0:0: [sda] Write Protect is off
[    1.614739] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    1.615220] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    1.615688] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.615694] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.616934]  sda: sda1
[    1.617621] sd 0:0:0:0: [sda] Attached SCSI disk
[    1.649048]  sdb: sdb1 sdb2 sdb3 sdb4
[    1.650410] sd 1:0:0:0: [sdb] Attached SCSI disk
[   24.781319] EXT4-fs (sdb3): mounted filesystem with ordered data mode. Opts: (null)
[   25.417367] printk: systemd: 18 output lines suppressed due to ratelimiting
[   29.514158] EXT4-fs (sdb3): re-mounted. Opts: (null)
[   29.686032] Adding 8388604k swap on /dev/sdb4.  Priority:-2 extents:1 across:8388604k FS
[   34.047225] input: PC Speaker as /devices/platform/pcspkr/input/input4
[   34.109364] input: Eee PC WMI hotkeys as /devices/platform/eeepc-wmi/input/input5
[   34.475998] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input6
[   34.476568] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input7
[   34.477097] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input8
[   34.477589] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input9
[   34.590276] snd_hda_codec_realtek hdaudioC0D0:    inputs:
[   34.608713] input: HDA Intel PCH Front Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input10
[   34.609490] input: HDA Intel PCH Rear Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input11
[   34.610244] input: HDA Intel PCH Line as /devices/pci0000:00/0000:00:1f.3/sound/card0/input12
[   34.611847] input: HDA Intel PCH Line Out as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13
[   34.612619] input: HDA Intel PCH Front Headphone as /devices/pci0000:00/0000:00:1f.3/sound/card0/input14
[   34.613474] input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input15
[   34.614329] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input16
[   34.616453] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input17
[   36.444913] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null)
[   36.656053] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
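
The grep pattern above also matches every `input:` device line, which buries anything relevant. A narrower filter might look like the sketch below. The pattern list is a hand-picked, non-exhaustive assumption based on common kernel storage messages, and `filter_io_errors` is a hypothetical name.

```shell
#!/bin/sh
# Read kernel log text on stdin and keep only lines that commonly indicate
# storage trouble: block-layer I/O errors, ext4 errors or warnings, ATA link
# exceptions or failed commands, SCSI medium errors, and hung-task warnings.
# Exits non-zero when nothing matches (standard grep behaviour).
filter_io_errors() {
    grep -Ei 'I/O error|blk_update_request|EXT4-fs (error|warning)|ata[0-9]+(\.[0-9]+)?: (exception|failed command)|medium error|blocked for more than [0-9]+ seconds'
}

# Example use (not run here):
#   sudo dmesg | filter_io_errors
#   sudo journalctl -k -b -1 | filter_io_errors   # kernel messages, previous boot
```

Note that the timestamps in the output above start at 0.000000, so that dmesg capture only covers the boot after the freeze. Kernel messages from the frozen boot would have to come from the journal (`journalctl -k -b -1`), and they may be missing if the disk had already stopped accepting writes.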

Cphusion · July 8, 2023, 7:34pm

Try booting into an older kernel, one that the system was running on before the freeze problems started to happen?

selfhost · July 8, 2023, 9:39pm

I’m not sure if I still have a ‘good’ kernel installed as I have performed a few updates whilst troubleshooting and my system only keeps the three latest installed.

I’ve set up GRUB to use my oldest non-rescue kernel on the next boot, which is 4.18.0-425.19.2.el8_7.x86_64 at index 2 (sudo grubby --info ALL):

$ uname -r
4.18.0-477.15.1.el8_8.x86_64
$ sudo grub2-reboot 2
$ sudo shutdown -r now
$ uname -r
4.18.0-425.19.2.el8_7.x86_64

I will report back tomorrow with whether another freeze occurred or not. Thank you for your time and experience.

selfhost · July 9, 2023, 1:49pm

Unfortunately the ‘freeze’ still occurred on kernel 4.18.0-425.19.2.

However, I believe I captured TTY logs very early in the freeze which may be useful for diagnosis:

Code: Unable to access opcode bytes at RIP 0x...

Edit: Why does the watchdog not detect the ‘freeze’ and force a reboot?
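
One possible answer to the watchdog question, offered as an assumption rather than a diagnosis: the kernel's soft and hard lockup watchdogs look for CPUs that stop scheduling or stop handling interrupts, whereas tasks stuck waiting on I/O or memory are tracked by the separate hung-task detector, which by default only logs a warning. The sketch below prints a sysctl fragment that turns such hangs into a panic followed by an automatic reboot. The function name and the file name in the usage comment are hypothetical; the sysctl keys are standard kernel settings.

```shell
#!/bin/sh
# Print a sysctl fragment that makes the kernel panic on a hung task and then
# reboot by itself. Values are illustrative and should be tuned before use.
hung_task_sysctl() {
    cat <<'EOF'
# Panic when a task stays in uninterruptible sleep past the timeout.
kernel.hung_task_panic = 1
# Seconds a task may stay blocked before it counts as hung.
kernel.hung_task_timeout_secs = 120
# Reboot automatically 30 seconds after any panic.
kernel.panic = 30
EOF
}

# Example use (not run here; the file name is an assumption):
#   hung_task_sysctl | sudo tee /etc/sysctl.d/90-hung-task.conf
#   sudo sysctl --system
```

This trades a silent freeze for an abrupt reboot, so it is more of a diagnostic and availability aid than a fix. A single slow but healthy I/O operation that exceeds the timeout would also trigger the panic.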

selfhost · July 9, 2023, 9:47pm

A friend suggested disabling SWAP due to the above stacktrace. I realised I had significantly over-allocated SWAP to servers running behind Pterodactyl. I’ve now disabled SWAP usage on all Pterodactyl servers and currently at 8 hours uptime, SWAP usage is at 1.98MiB/8GiB (<1%).

I have OOM killer enabled on my Pterodactyl servers, but it seems that may only apply to real memory, not SWAP.

We’ll see how this configuration goes. If it works, that suggests that SWAP overflowing was causing my server to lock up. Perhaps there’s an underlying memory leak in one of the Pterodactyl servers, in which case OOM killer should prevent Rocky from freezing up as I haven’t over-allocated RAM.
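
To see whether swap creeps back up before the next freeze, the usage figure quoted above can be logged over time. This is a minimal sketch, assuming input in the `/proc/meminfo` format (`SwapTotal:` and `SwapFree:` in kB); `swap_used_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# Read /proc/meminfo-formatted text on stdin and print swap usage as a
# percentage with one decimal place. Prints 0.0 when no swap is configured.
swap_used_pct() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END {
            if (total > 0) printf "%.1f\n", (total - free) * 100 / total
            else print "0.0"
        }
    '
}

# Example use (not run here): append a timestamped sample once a minute.
#   while sleep 60; do
#       echo "$(date '+%F %T') $(swap_used_pct < /proc/meminfo)%"
#   done >> swap-usage.log
```

If the log survives a freeze, the last few samples show whether swap was filling up at the time or stayed near zero, which would help confirm or rule out the swap theory.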
