Stability issues with SAS2008 (LSI 9211-8i) controllers?

My old not-a-server (Core i5-750; that’s how old it is) is running CentOS 6. I figured it was time to upgrade, and since Rocky 8 has been working well in a few VMs running on that host, I decided that would be the way to go.

The first issue I hit was that my disk controllers are no longer supported by RH. Fortunately this is a common issue (since they’re effectively the same as some of the older Dell PERC controllers) and the ELRepo mpt3sas driver still includes the mpt2sas support needed.

So far so good.

And the machine ran happily for 2 days.

And then froze solid. Nothing appeared on the console, nothing in the logs, it wouldn’t respond to any key presses. Basically solid-wedge.

Hard power cycle, everything came back up. 3 hours later another wedge. Repeat… and again 3 hours later, wedge.

The only thing I can think of that was different between the 2 days of reliability and the 3 wedges is that I was doing intensive I/O, rsync’ing level0 backups from a RAID-6 on the SAS2008 controllers to a RAID0 on two external USB disks.

The first hang happened 10 minutes after the cron job started. After a reboot I restarted the rsync, and it hung again. Same thing on the third attempt.

After three hangs I reverted the system to CentOS 6 (I’d got new SSDs for Rocky 8 so it was just an SSD swap) and then the rsync completed without error (as it had been doing for the past 2+ years, and even longer on different external disks).

So my gut feeling is that the mpt2sas component of the mpt3sas driver in ELRepo isn’t quite stable and hangs during times of heavy I/O.

Has anyone else seen this?

You might want to test-install kernel-ml from ELRepo.

The mpt3sas module in el-8.5 is 37.101.00.00. That in the current kernel-ml (5.15.x) is 39.100.00.00. So it may be worth a try.
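If the elrepo-release package is already installed, `dnf --enablerepo=elrepo-kernel install kernel-ml` is the usual way to pull that kernel in. To confirm which driver version actually loaded after the reboot, something like the sketch below may help. The function name `mpt3sas_version` is made up for illustration; it just picks the version out of the driver's "mpt3sas version X loaded" kernel log line.

```shell
# Print the mpt3sas driver version found in kernel log text on stdin.
# (Hypothetical helper name; it only parses text and touches no hardware.)
mpt3sas_version() {
    sed -n 's/.*mpt3sas version \([0-9.]*\) loaded.*/\1/p' | head -n 1
}

# Typical use on the live system (dmesg may need root on some setups):
#   dmesg | mpt3sas_version
# Demo on a sample log line, using the el-8.5 version quoted above:
echo 'mpt3sas version 37.101.00.00 loaded' | mpt3sas_version
```

On a running system, `modinfo -F version mpt3sas` should report the version of the module on disk for the running kernel.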

How about trying to do high IO without the USB disks - just within the SAS2008?
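One way to do that would be a plain sequential read from the array, discarding the data, so the USB side is not involved at all. This is only a sketch: `read_stress` is a made-up helper name and `/dev/md0` is an assumed device name, so substitute whatever the RAID-6 actually is.

```shell
# read_stress PATH [MIB]
# Sequentially read up to MIB mebibytes (default 1024) from PATH and discard them.
# Read-only: nothing is written to PATH.
read_stress() {
    src=$1
    mib=${2:-1024}
    dd if="$src" of=/dev/null bs=1M count="$mib" 2>/dev/null
}

# Example (assumed device name; reading a block device usually needs root):
#   read_stress /dev/md0 102400    # read the first 100 GiB of the array
```

Reading more data than the machine has RAM should keep the page cache from hiding the disk I/O.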

Fingers crossed… we’ll see how stable this is!

% uname -sr ; dmesg | grep mpt3sas | head -1
Linux 5.15.6-1.el8.elrepo.x86_64
[    1.868474] mpt3sas version 39.100.00.00 loaded

Thanks for the tip.

So far so good; it’s been up over a day and performed the weekly backup process without a hitch.

Now whether that’s the mpt3sas driver or some other change between RH-4.18 and mainline 5.15 kernels I can’t tell!

That’s great news.

It could be due to the newer version of the mpt3sas driver, but you’re right, that’s hard to tell.

A second busy weekend succeeded with no hangs. Yay!

Maybe related, maybe a coincidence…

On Saturday one of the drives in the RAID6 started to fail, and eventually dropped out of the array with unrecoverable errors. It may be that the power cycling while I was doing this work stressed it ('cos it was 7.5 years old!).

But it makes me wonder if it might have been playing up earlier and triggered an issue in the older driver (causing a hang) that the newer one correctly handled.

Replacing the disk caused a lot of I/O (since it’s 8*4TB in RAID6): a continuous 60MB/s rebuild speed (as reported by mdstat) for 18 hours, with no problems reported. So that’s another sign of stability with this kernel.
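As a rough sanity check on those numbers: a rebuild has to write the whole replacement disk, so 4TB at 60MB/s should take about 18.5 hours, which matches. The sketch below does that arithmetic. It assumes decimal units (4TB = 4*10^12 bytes, 60MB/s = 60*10^6 bytes/s), and `rebuild_hours` is just an illustrative name.

```shell
# rebuild_hours SIZE_BYTES SPEED_BYTES_PER_SEC
# Print the estimated rebuild time in hours, to one decimal place.
rebuild_hours() {
    awk -v size="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", size / rate / 3600 }'
}

# One 4TB member disk rebuilt at a steady 60MB/s (decimal units assumed):
rebuild_hours 4000000000000 60000000    # prints 18.5
```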