Hi Rocky community!
I have a multi-purpose Linux server at home which has run Fedora/Fedora Server/CentOS/Rocky over the years without any major issues.
However, for the past couple of weeks the server appears to freeze/lock up every afternoon. I suspect a kernel/IO error.
Current server configuration:
- 100Mbps down, 10Mbps up CAT6 residential connection
- 3TB HDD: `/dev/sdb`, contains `/`
- 120GB SSD: `/dev/sda`, contains `/mnt/SSD01_120G`. Only stores MariaDB (MySQL) databases

`$ inxi -v4z`:

```
System:    Kernel: 4.18.0-477.15.1.el8_8.x86_64 arch: x86_64 bits: 64 compiler: gcc v: 8.5.0
           Console: pty pts/1 Distro: Rocky Linux release 8.8 (Green Obsidian) base: RHEL 8.8
Machine:   Type: Desktop product: PRIME Z270-P v: N/A serial: <superuser required>
           Mobo: ASUSTeK model: PRIME Z270-P v: Rev X.0x serial: <superuser required>
           UEFI: American Megatrends v: 1205 date: 05/11/2018
CPU:       Info: quad core model: Intel Core i7-7700K bits: 64 type: MT MCP arch: Kaby Lake rev: 9
           cache: L1: 256 KiB L2: 1024 KiB L3: 8 MiB
           Speed (MHz): avg: 4400 min/max: 800/4500 cores: 1: 4400 2: 4400 3: 4400 4: 4400
           5: 4400 6: 4400 7: 4400 8: 4400 bogomips: 67200
           Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:  Device-1: Intel HD Graphics 630 vendor: ASUSTeK driver: i915 v: kernel arch: Gen-9.5
           bus-ID: 00:02.0
           Device-2: NVIDIA GP107 [GeForce GTX 1050 Ti] driver: nouveau v: kernel arch: Pascal
           bus-ID: 01:00.0 temp: 36.0 C
           Display: web server: X.org v: 1.20.11 with: Xwayland v: 21.1.3 driver: X: loaded: nvidia
           unloaded: fbdev,modesetting,nouveau,vesa gpu: nouveau tty: 220x59 resolution: 1920x1080
           API: OpenGL Message: GL data unavailable in console. Try -G --display
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: ASUSTeK PRIME B450M-A
           driver: r8169 v: kernel port: d000 bus-ID: 04:00.0
           IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
           IF-ID-1: br-6b782f5b73d5 state: up speed: 10000 Mbps duplex: unknown mac: <filter>
           ...
           IF-ID-4: docker0 state: down mac: <filter>
           IF-ID-5: pterodactyl0 state: up speed: 10000 Mbps duplex: unknown mac: <filter>
           IF-ID-6: veth049e9bb state: up speed: 10000 Mbps duplex: full mac: <filter>
           ...
           IF-ID-19: virbr0 state: down mac: <filter>
Drives:    Local Storage: total: 2.84 TiB used: 2.13 TiB (75.1%)
           ID-1: /dev/sda vendor: Kingston model: SUV400S37120G size: 111.79 GiB
           ID-2: /dev/sdb vendor: Toshiba model: HDWD130 size: 2.73 TiB
Partition: ID-1: / size: 2.68 TiB used: 2.12 TiB (79.3%) fs: ext4 dev: /dev/sdb3
           ID-2: /boot size: 973.4 MiB used: 312.7 MiB (32.1%) fs: ext4 dev: /dev/sdb2
           ID-3: /boot/efi size: 499.7 MiB used: 5.8 MiB (1.2%) fs: vfat dev: /dev/sdb1
           ID-4: swap-1 size: 8 GiB used: 2.9 GiB (36.3%) fs: swap dev: /dev/sdb4
Info:      Processes: 317 Uptime: 6h 21m Memory: available: 31.27 GiB used: 18.32 GiB (58.6%)
           Init: systemd target: graphical (5) Compilers: gcc: 8.5.0 Packages: 51 note: see --rpm
           Shell: Zsh v: 5.5.1 inxi: 3.3.27
```
Recent changes (chronological)
- Enabled live kernel patching (updates) via cockpit
- Half-installed `postfix`. I haven't finished configuring and hardening it; it is not currently allowed through the firewall.
- Changed certbot's cronjob to execute at a more random time:
  `0 0,12 * * * python3 -c 'import random; import time; time.sleep(random.random() * 3600)' && certbot renew`
- Set up 15 cronjobs that make an API request to point my domains at my current IP, since my IP is dynamic:
  `0 0,12 * * * python3 -c 'import random; import time; time.sleep(random.random() * 3600)' && curl "https://api-endpoint.com/?...`
- Set up fstab to mount the SSD on boot:
  `UUID=... /mnt/SSD01_120G ext4 defaults 1 2`
- Moved `/var/lib/mysql` to `/mnt/SSD01_120G/var/lib/mysql` and changed the appropriate socket configurations in MariaDB, PHP-FPM, etc.
- Set up fail2ban to protect SSH, nginx, and SFTP (finally)
- Started hosting endlessh on port 22 using docker (actual ssh is on another port in the 1024+ range)
- Disabled live kernel patching via cockpit, and uninstalled `kpatch kpatch-dnf` for good measure
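One thing I'm unsure about regarding the fstab entry above: as I understand systemd's behaviour, a plain `defaults` entry with a non-zero fsck pass means boot will wait on (and fsck) the SSD, and drop to emergency mode if the drive ever fails to appear. A more failure-tolerant variant that some setups use would look like this (illustrative only; the UUID placeholder is kept, and `x-systemd.device-timeout` is a systemd mount option):

```
# Mount the SSD if present, but don't fail boot when it's missing;
# stop waiting for the device after 10 seconds.
UUID=... /mnt/SSD01_120G ext4 defaults,nofail,x-systemd.device-timeout=10s 1 2
```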
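As an aside on the cron entries above: the random-delay trick can also be done in plain shell, dropping the Python dependency. This is only a sketch of the same idea, not what I currently run; it assumes the job's shell is bash or zsh, since `$RANDOM` is not guaranteed in a strict POSIX `sh`:

```shell
# Pick a random delay between 0 and 3599 seconds (matching the one-hour
# window in the cron jobs above). $RANDOM yields 0-32767, so the modulo
# makes the distribution slightly uneven, which is fine for jitter.
delay=$(( RANDOM % 3600 ))
echo "delaying ${delay}s before the real job"
# sleep "$delay" && certbot renew    # the real job would follow the delay
```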
Server freezing symptoms which usually occur in the afternoon:
- My Minecraft servers still accept player joins, but players experience very unstable connections, until the servers stop accepting joins altogether
- My webservers are very slow, until they stop loading at all
- When attempting to SSH into the server, my key is accepted and I'm prompted for my TOTP code, but once entered nothing happens (no output even with `ssh -vvvvv`)
- The disk usage light on the front of the PC tower flashes only slowly and periodically (normally it flashes multiple times per second)
- Cannot log in via a monitor/keyboard at the local tty screen. Upon entering my username, no password prompt appears, and the login times out after 60 seconds
- The only way to regain control of the server is to hard reboot it using the physical power button on the PC case
Logs
All logs available here: Shared Folder
Due to image, link and character limits imposed on new forum accounts, I have placed all logs into this shared folder. Please note that hardened browsers with JavaScript Just-In-Time disabled may take up to a minute to initially ‘decrypt’ the link.
tty login screen
A monitor connected to the server shows logs appearing on the tty login screen when the server is in the frozen/halted state: Please see shared folder link above
Journalctl
Upon regaining control of the server after a forced reboot, I find that the journalctl logs abruptly stop during the period when I experience the aforementioned symptoms, well before the reboot:
Command used to gather logs:

```
$ sudo journalctl --since "2023-06-04" | grep -B50 -A3 "\-\- Reboot \-\-" | grep -iv "unit-which-logs-personal-info" | sed -e 's/info-to-redact/redacted/g'
```
Command output: Please see shared folder link above
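To show what that pipeline does, here is the same grep/sed pattern run on a tiny made-up journal excerpt (smaller context windows and a placeholder redaction string; all sample lines are invented):

```shell
# Build a toy journal excerpt (three made-up lines).
printf '%s\n' \
  'Jul 04 22:10:36 host systemd-journald[569]: Journal stopped' \
  '-- Reboot --' \
  'Jul 04 22:11:00 host kernel: secret-host-detail boot continues' \
  > sample.log

# Keep 1 line of context before and after each "-- Reboot --" marker
# (the real command keeps 50 before / 3 after), then redact a placeholder
# string, mirroring the sed step of the real command.
grep -B1 -A1 -- '-- Reboot --' sample.log | sed -e 's/secret-host-detail/redacted/g'
```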
Journalctl log clarification
- Example of a healthy/intentional shutdown/reboot:
  ```
  Jul 04 22:10:36 host.domain.com systemd-journald[569]: Journal stopped
  -- Reboot --
  Jul 04 22:11:00 host.domain.com kernel: microcode: microcode updated early to revision 0xf0, date = 2021-11-12
  ```
- Example of a system freeze where I manually forced a shutdown via the power button several hours later:
  ```
  Jul 04 17:37:25 host.domain.com sshd[682493]: error: kex_exchange_identification: Connection closed by remote host
  -- Reboot --
  Jul 04 19:30:28 host kernel: microcode: microcode updated early to revision 0xf0, date = 2021-11-12
  ```
- The warning `WARNING: Failed starting API: listen tcp 127.0.0.1:8384: bind: address already in use` is likely caused by me using `ssh user@host -L 8384:localhost:8384` to forward the server's localhost Syncthing dashboard through SSH so I can view it from another device. I have been using this port forward for months and the server has been fine, so my freezes are likely unrelated to this error message.
- Crontab contains the job `* * * * * php /var/www/pterodactyl/artisan schedule:run >> /dev/null 2>&1`, which I believe causes most of the `root` logins you see in the logs.
dstat
I ran `$ sudo dstat -tcdrgilmns` to check whether the freezing was due to a resource bottleneck. System, CPU, memory, disk, and network usage, sampled every 5 seconds up until the 'freeze', are in a colour-coded .ods spreadsheet available at the shared folder link above.
What I’ve tried that hasn’t worked
- `$ sudo dnf update --refresh -y && sudo reboot`
- Booting `memtest`; it PASSED
- Running a SMART ‘long’ test on the internal SSD and HDD; both tests were successful
- Uninstalling kmod-nvidia (journald presumably crashes system · Issue #14478 · systemd/systemd · GitHub) and rebooting
What I’ve tried very recently (result currently unknown)
- Disabled the `postfix` service, as it shows up in the last few lines of journalctl before journald is presumably killed, and I haven't finished setting it up.
I've run out of ideas, and I can't really run the machine from a live USB for a few days or tear down and rebuild the whole server, as I depend on it as my own mini cloud.
Please advise.