Rocky 8 server freezes after 3-10 hours, journalctl killed. Kernel/IO issue?

Hi Rocky community!

I have a multi-purpose Linux server at home which has run Fedora/Fedora Server/CentOS/Rocky over the years without any major issues.

However, over the past couple of weeks, in the afternoon every day the server appears to freeze/lock up. I suspect a kernel/IO error.

Current server configuration:

  • 100Mbps down, 10Mbps up CAT6 residential connection
  • 3TB HDD: /dev/sdb, contains /
  • 120GB SSD: /dev/sda, contains /mnt/SSD01_120G. Only stores MariaDB mysql databases
  • $ inxi -v4z:
    System:
      Kernel: 4.18.0-477.15.1.el8_8.x86_64 arch: x86_64 bits: 64 compiler: gcc v: 8.5.0
        Console: pty pts/1 Distro: Rocky Linux release 8.8 (Green Obsidian) base: RHEL 8.8
    Machine:
      Type: Desktop product: PRIME Z270-P v: N/A serial: <superuser required>
      Mobo: ASUSTeK model: PRIME Z270-P v: Rev X.0x serial: <superuser required>
        UEFI: American Megatrends v: 1205 date: 05/11/2018
    CPU:
      Info: quad core model: Intel Core i7-7700K bits: 64 type: MT MCP arch: Kaby Lake rev: 9 cache:
        L1: 256 KiB L2: 1024 KiB L3: 8 MiB
      Speed (MHz): avg: 4400 min/max: 800/4500 cores: 1: 4400 2: 4400 3: 4400 4: 4400 5: 4400
        6: 4400 7: 4400 8: 4400 bogomips: 67200
      Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
    Graphics:
      Device-1: Intel HD Graphics 630 vendor: ASUSTeK driver: i915 v: kernel arch: Gen-9.5
        bus-ID: 00:02.0
      Device-2: NVIDIA GP107 [GeForce GTX 1050 Ti] driver: nouveau v: kernel arch: Pascal
        bus-ID: 01:00.0 temp: 36.0 C
      Display: web server: X.org v: 1.20.11 with: Xwayland v: 21.1.3 driver: X: loaded: nvidia
        unloaded: fbdev,modesetting,nouveau,vesa gpu: nouveau tty: 220x59 resolution: 1920x1080
      API: OpenGL Message: GL data unavailable in console. Try -G --display
    Network:
      Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: ASUSTeK PRIME B450M-A
        driver: r8169 v: kernel port: d000 bus-ID: 04:00.0
      IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
      IF-ID-1: br-6b782f5b73d5 state: up speed: 10000 Mbps duplex: unknown mac: <filter>
      ...
      IF-ID-4: docker0 state: down mac: <filter>
      IF-ID-5: pterodactyl0 state: up speed: 10000 Mbps duplex: unknown mac: <filter>
      IF-ID-6: veth049e9bb state: up speed: 10000 Mbps duplex: full mac: <filter>
      ...
      IF-ID-19: virbr0 state: down mac: <filter>
    Drives:
      Local Storage: total: 2.84 TiB used: 2.13 TiB (75.1%)
      ID-1: /dev/sda vendor: Kingston model: SUV400S37120G size: 111.79 GiB
      ID-2: /dev/sdb vendor: Toshiba model: HDWD130 size: 2.73 TiB
    Partition:
      ID-1: / size: 2.68 TiB used: 2.12 TiB (79.3%) fs: ext4 dev: /dev/sdb3
      ID-2: /boot size: 973.4 MiB used: 312.7 MiB (32.1%) fs: ext4 dev: /dev/sdb2
      ID-3: /boot/efi size: 499.7 MiB used: 5.8 MiB (1.2%) fs: vfat dev: /dev/sdb1
      ID-4: swap-1 size: 8 GiB used: 2.9 GiB (36.3%) fs: swap dev: /dev/sdb4
    Info:
      Processes: 317 Uptime: 6h 21m Memory: available: 31.27 GiB used: 18.32 GiB (58.6%) Init: systemd
      target: graphical (5) Compilers: gcc: 8.5.0 Packages: 51 note: see --rpm Shell: Zsh v: 5.5.1
      inxi: 3.3.27
    

Recent changes (chronological)

  • Enabled live kernel patching (updates) via cockpit
  • Half-installed postfix. Haven’t finished configuring and hardening. Not currently allowed through the firewall.
  • Changed certbot’s cronjob to execute at a more random time:
    0 0,12 * * * python3 -c 'import random; import time; time.sleep(random.random() * 3600)' && certbot renew
  • Set up 15 cron jobs to make an API request to point my domains to my current IP, as my IP is dynamic:
    0 0,12 * * * python3 -c 'import random; import time; time.sleep(random.random() * 3600)' && curl "https://api-endpoint.com/?...
  • Set up fstab to mount the SSD on boot:
    UUID=... /mnt/SSD01_120G ext4 defaults 1 2
  • Moved /var/lib/mysql to /mnt/SSD01_120G/var/lib/mysql and changed the appropriate socket configurations in mariadb, php-fpm etc
  • Set up fail2ban to protect ssh, nginx, sftp (finally)
  • Started hosting endlessh on port 22 using docker (actual ssh is on another port in the 1024+ range)
  • Disabled live kernel patching via cockpit, and uninstalled kpatch kpatch-dnf for good measure
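
For reference, the random delay used in the two cron entries above can also live in one small script instead of a python3 -c one-liner per job. This is only a minimal sketch: the helper names jitter_seconds and run_with_jitter are hypothetical, the one-hour window mirrors the original entries, and the actual certbot or API call is left to the caller.

```python
import random
import time


def jitter_seconds(max_delay=3600.0, rng=None):
    """Return a random delay in [0, max_delay) seconds, like random.random() * 3600."""
    if rng is None:
        rng = random.Random()
    return rng.random() * max_delay


def run_with_jitter(job, max_delay=3600.0, rng=None, sleep=time.sleep):
    """Sleep for a random delay, then call job() and return its result.

    `job` is any zero-argument callable (hypothetical: the renew or API call).
    `sleep` is injectable so the function can be exercised without waiting.
    """
    sleep(jitter_seconds(max_delay, rng))
    return job()
```

One script looping over all 15 domains after a single delay would also replace 15 separate python3 processes sleeping at the same time, although those sleeping processes are cheap and unlikely to matter for the freeze.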

Server freezing symptoms which usually occur in the afternoon:

  • My Minecraft servers still accept player joins, but players have very unstable connectivity issues, until the servers stop accepting joins altogether
  • My webservers are very slow, until they stop loading at all
  • When attempting to SSH into the server, my key is accepted and I’m prompted to enter my TOTP code, but once entered nothing happens (no output via ssh -vvvvv)
  • The disk usage light on the front of the PC tower flashes only intermittently (normally it flashes multiple times per second)
  • Cannot login via a monitor/keyboard at the local tty screen. Upon entering my username, no password prompt appears, and then the login times out after 60 seconds
  • The only way to regain control of the server is to hard reboot it using the physical power button on the PC case

Logs

All logs available here: Shared Folder
Due to image, link and character limits imposed on new forum accounts, I have placed all logs into this shared folder. Please note that hardened browsers with JavaScript Just-In-Time disabled may take up to a minute to initially ‘decrypt’ the link.

tty login screen

A monitor connected to the server shows logs appearing on the tty login screen when the server is in the frozen/halted state: Please see shared folder link above

Journalctl

Upon regaining control of the server after a forced reboot, the journalctl logs stop abruptly during the period when I experience the aforementioned symptoms, but well before I reboot:
Command used to gather logs:
$ sudo journalctl --since "2023-06-04" | grep -B50 -A3 "\-\- Reboot \-\-" | grep -iv "unit-which-logs-personal-info" | sed -e 's/info-to-redact/redacted/g'
Command output: Please see shared folder link above

Journalctl log clarification

  • Example of a healthy/intentional shutdown/reboot:

    Jul 04 22:10:36 host.domain.com systemd-journald[569]: Journal stopped
    -- Reboot --
    Jul 04 22:11:00 host.domain.com kernel: microcode: microcode updated early to revision 0xf0, date = 2021-11-12
    

  • Example of a system freeze where I manually force shutdown the machine via the power button several hours later:

    Jul 04 17:37:25 host.domain.com sshd[682493]: error: kex_exchange_identification: Connection closed by remote host
    -- Reboot --
    Jul 04 19:30:28 host kernel: microcode: microcode updated early to revision 0xf0, date = 2021-11-12
    
  • The warning
    WARNING: Failed starting API: listen tcp 127.0.0.1:8384: bind: address already in use
    is likely caused by me using ssh user@host -L 8384:localhost:8384 (a local port forward) to tunnel the server’s localhost syncthing dashboard through ssh and view it from another device. I have been using this port forward flag for months and the server has been fine, so my freezes are likely unrelated to this error message.

  • Crontab contains the job * * * * * php /var/www/pterodactyl/artisan schedule:run >> /dev/null 2>&1 which I believe causes most of the root logins which you see in the logs.
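
The clean versus unclean distinction described above can be checked mechanically across a long journal export. The following is a rough sketch, not a definitive tool: classify_reboots is a hypothetical helper, and it assumes the "-- Reboot --" marker and the "Journal stopped" message appear exactly as in the two examples.

```python
def classify_reboots(journal_lines):
    """Return a list of (last_line_before_reboot, clean) tuples.

    One tuple is produced per '-- Reboot --' marker. A reboot counts as clean
    when the last non-empty line before the marker ends with journald's
    'Journal stopped' message, as in the healthy example.
    """
    results = []
    previous = None
    for raw in journal_lines:
        line = raw.strip()
        if not line:
            continue
        if line == "-- Reboot --":
            if previous is not None:
                results.append((previous, previous.endswith("Journal stopped")))
            previous = None
            continue
        previous = line
    return results
```

Feeding it the saved journalctl output line by line gives, for every unclean reboot, the last message that made it to disk and its timestamp, which is the closest available estimate of when logging stopped.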

dstat

I ran $ sudo dstat -tcdrgilmns to check whether the freezing was due to a resource bottleneck. System, CPU, memory, disk and network resource usage, sampled every 5 seconds up to the ‘freeze’, is in a colour-coded .ods spreadsheet available at the shared folder link above.

What I’ve tried that hasn’t worked

What I’ve tried very recently (result currently unknown)

  • Disabled ‘postfix’ service as it shows up in the last few lines of journalctl before journalctl is presumably killed, and I haven’t finished setting it up.

I’ve run out of ideas and can’t really let it run a liveusb for a few days or burn and build the whole server as I depend on it as my own mini cloud.

Please advise.

It sounds similar to what I had with my Proxmox server a while back: it would stop responding and freeze after several hours. I thought the motherboard was failing, so I replaced it. Shortly afterwards the same thing started happening again: after several hours the system would freeze and stop responding. One time, after I rebooted, I decided to check the dmesg output and noticed a lot of input/output errors on the OS disk. I replaced it and haven’t had issues since. So the first place I would check is the dmesg output, to see if you have “input/output errors” for your OS disk, because those disk errors didn’t show up in the syslog.

The server just froze again and my ssh session stopped responding so I rebooted, logged in and typed $ sudo dmesg as you suggested. However, I couldn’t find any input/output disk errors:

$ grep -Ei 'input|output|error|disk|sda|sdb' dmesg_2023-07-08_18-56_just-after-freeze-reboot.txt
[    0.000000] RAMDISK: [mem 0x5cbb7000-0x6003efff]
[    0.198211] VFS: Disk quotas dquot_6.6.0
[    0.717053] input: Sleep Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0E:00/input/input0
[    0.717119] input: Power Button as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input1
[    0.717182] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input2
[    0.939535] systemd[1]: Running in initial RAM disk.
[    1.174575] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input3
[    1.612227] sd 0:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/112 GiB)
[    1.612229] sd 1:0:0:0: [sdb] 5860533168 512-byte logical blocks: (3.00 TB/2.73 TiB)
[    1.612871] sd 0:0:0:0: [sda] 4096-byte physical blocks
[    1.613516] sd 1:0:0:0: [sdb] 4096-byte physical blocks
[    1.613527] sd 1:0:0:0: [sdb] Write Protect is off
[    1.614254] sd 0:0:0:0: [sda] Write Protect is off
[    1.614739] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    1.615220] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    1.615688] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.615694] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.616934]  sda: sda1
[    1.617621] sd 0:0:0:0: [sda] Attached SCSI disk
[    1.649048]  sdb: sdb1 sdb2 sdb3 sdb4
[    1.650410] sd 1:0:0:0: [sdb] Attached SCSI disk
[   24.781319] EXT4-fs (sdb3): mounted filesystem with ordered data mode. Opts: (null)
[   25.417367] printk: systemd: 18 output lines suppressed due to ratelimiting
[   29.514158] EXT4-fs (sdb3): re-mounted. Opts: (null)
[   29.686032] Adding 8388604k swap on /dev/sdb4.  Priority:-2 extents:1 across:8388604k FS
[   34.047225] input: PC Speaker as /devices/platform/pcspkr/input/input4
[   34.109364] input: Eee PC WMI hotkeys as /devices/platform/eeepc-wmi/input/input5
[   34.475998] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input6
[   34.476568] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input7
[   34.477097] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input8
[   34.477589] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input9
[   34.590276] snd_hda_codec_realtek hdaudioC0D0:    inputs:
[   34.608713] input: HDA Intel PCH Front Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input10
[   34.609490] input: HDA Intel PCH Rear Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input11
[   34.610244] input: HDA Intel PCH Line as /devices/pci0000:00/0000:00:1f.3/sound/card0/input12
[   34.611847] input: HDA Intel PCH Line Out as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13
[   34.612619] input: HDA Intel PCH Front Headphone as /devices/pci0000:00/0000:00:1f.3/sound/card0/input14
[   34.613474] input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input15
[   34.614329] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input16
[   34.616453] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input17
[   36.444913] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null)
[   36.656053] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
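
The grep above mostly matches input-device and disk-detection lines from a normal boot. A narrower filter might look like the sketch below. The pattern list is illustrative rather than exhaustive, and find_disk_errors is a hypothetical helper name; the patterns are modelled on messages the kernel commonly prints for failing disks, ATA link problems, ext4 errors and hung tasks.

```python
import re

# Messages commonly logged for failing disks or filesystems.
# Illustrative only; a healthy disk can still hang without printing any of these.
DISK_ERROR_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"EXT4-fs (error|warning)", re.IGNORECASE),
    re.compile(
        r"\bata\d+(\.\d+)?: .*(exception|failed command|hard resetting link)",
        re.IGNORECASE,
    ),
    re.compile(r"(medium error|unrecovered read error)", re.IGNORECASE),
    re.compile(r"task .* blocked for more than \d+ seconds", re.IGNORECASE),
]


def find_disk_errors(dmesg_lines):
    """Return the lines that match any of the disk-error patterns."""
    return [
        line.rstrip("\n")
        for line in dmesg_lines
        if any(pattern.search(line) for pattern in DISK_ERROR_PATTERNS)
    ]
```

Note that dmesg run after a reboot only covers the current boot. Kernel messages from the boot that froze would have to come from journalctl -k -b -1, and only if journald was still able to write them to disk at the time.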

Try booting into an older kernel, one that the system was running on before the freeze problems started to happen?

I’m not sure if I still have a ‘good’ kernel installed as I have performed a few updates whilst troubleshooting and my system only keeps the three latest installed.

I’ve set up GRUB to use my oldest non-rescue kernel on the next boot, which is 4.18.0-425.19.2.el8_7.x86_64 at index 2 (sudo grubby --info ALL):

$ uname -r
4.18.0-477.15.1.el8_8.x86_64
$ sudo grub2-reboot 2
$ sudo shutdown -r now
$ uname -r
4.18.0-425.19.2.el8_7.x86_64

I will report back tomorrow with whether another freeze occurred or not. Thank you for your time and experience.

Unfortunately the ‘freeze’ still occurred on kernel 4.18.0-425.19.2.

However, I believe I captured TTY logs very early in the freeze which may be useful for diagnosis:

Code: Unable to access opcode bytes at RIP 0x...

Edit: Why does the watchdog not detect the ‘freeze’ and force a reboot?

A friend suggested disabling SWAP due to the above stacktrace. I realised I had significantly over-allocated SWAP to servers running behind Pterodactyl. I’ve now disabled SWAP usage on all Pterodactyl servers and currently, at 8 hours of uptime, SWAP usage is at 1.98 MiB/8 GiB (<1%).

I have OOM killer enabled on my Pterodactyl servers, but it seems that may only apply to real memory, not SWAP.

We’ll see how this configuration goes. If it works, that suggests that SWAP overflowing was causing my server to lock up. Perhaps there’s an underlying memory leak in one of the Pterodactyl servers, in which case OOM killer should prevent Rocky from freezing up as I haven’t over-allocated RAM.
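
To keep an eye on SWAP without checking by hand, the usage figure can be computed from /proc/meminfo. This is a minimal sketch with hypothetical helper names (parse_meminfo, swap_used_percent); it relies only on the SwapTotal and SwapFree fields the kernel exposes in that file.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_percent(meminfo):
    """Return swap usage as a percentage, or 0.0 when no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - meminfo.get("SwapFree", 0)) / total


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        print(f"{swap_used_percent(parse_meminfo(handle.read())):.1f}% swap used")
```

Run from cron or a loop with the output appended to a file, this would show whether SWAP climbs steadily in the hours before a freeze, which is what the memory-leak theory predicts.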