'perf record' with stall counters on aarch64 / a64fx soft locks cores

We’re running Rocky 8.7 on Fujitsu A64FX (ARMv8) and trying to benchmark with HPCToolkit, which assumes perf counter support. The perf install is stock.

20:41:25 root@compute104.godzilla:~ # grep -i version /etc/os-release
VERSION="8.7 (Green Obsidian)"

19:54:35 root@compute104.godzilla:~ # rpm -q perf opencsd slang

19:54:39 root@compute104.godzilla:~ # uname -a 
Linux compute104 4.18.0-425.19.2.el8_7.aarch64 #1 SMP Tue Apr 4 19:39:14 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

Writing certain counters fails after 800 B written (unparsable by ‘perf report’), and the CPU core associated with the task hangs in a soft lock on [perf-exec] (watchdog timer), which never returns and is uninterruptible. Raising or disabling the watchdog threshold reduces the log output, but the core remains soft-locked for at least ~15 minutes. No other cores are impacted, and the surrounding system remains relatively healthy wrt memory, run queues, etc. Stack tracing with [v]child following from above perf-exec halts at perf-exec; according to the syslogs and audit logs, perf-exec never consumes cycles or changes wchan or state.

The initial failure is intermittent; all failures have occurred within the first four immediately consecutive attempts, most on the second. This does not change when writing to the default “perf.data” vs a named “-o” target, or when writing to NFS vs local NVMe vs tmpfs vs ramfs. The minimal reproducer is the combination of the ‘record’ function and one of the indicated counters:

# The following can be run a minimum of 10 immediately consecutive times without issue:
perf stat -e <any combination of cpu-cycles, stalled-cycles-backend, stalled-cycles-frontend, ea_memory, ea_l2> id
perf record -e <any combination of cpu-cycles, ea_memory, ea_l2> id

# The following cause immediate soft lock on the executing core, typically but not always on the second attempt
perf record -e <any combination including stalled-cycles-frontend, stalled-cycles-backend, ea_core> id

Reading counters always succeeds. Repeated series of up to 10 consecutive ‘perf stat’ runs, whether manual single invocations or launched via MPI at any scale/placement from 1 to 48 cores, show no failures and capture all counters.
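For completeness, the failing ‘perf record’ sequence can be driven with a small loop like the one below. The file names are illustrative, the event list is the failing set from above, and on an affected A64FX node the executing core should soft-lock (typically on attempt two), so run with console access:

```shell
#!/bin/sh
# Sketch of the minimal reproducer. File names are hypothetical;
# 'id' is just a short-lived target process, as in the examples above.
if command -v perf >/dev/null 2>&1; then
    for i in 1 2 3 4; do
        echo "attempt $i"
        perf record -e stalled-cycles-frontend,stalled-cycles-backend \
            -o "perf.attempt$i.data" id || break
    done
else
    echo "perf not installed; skipping"
fi
```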

No failures occur on AMD 7302P or Intel E5645. We do not have other ARM platforms on which to test.
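For reference, the watchdog adjustments mentioned above were along these lines; these are the standard soft-lockup sysctls on the stock 4.18 kernel, and the values here are illustrative. Neither change frees the locked core, it only quiets the console/syslog spew:

```shell
# Raise the soft-lockup reporting threshold (seconds; the detector
# fires at roughly 2x this value; default is 10).
sysctl kernel.watchdog_thresh=60

# Or disable the soft-lockup detector entirely.
sysctl kernel.soft_watchdog=0
```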

Outstanding updates for the compute image as of now:

20:53:07 root@compute103.godzilla:~ # dnf update
Last metadata expiration check: 0:05:00 ago on Mon May  8 20:48:12 2023.
Dependencies resolved.
 Package                                        Architecture                          Version                                                           Repository                               Size
 emacs                                          aarch64                               1:26.1-7.el8_7.1                                                  appstream                               3.1 M
 emacs-common                                   aarch64                               1:26.1-7.el8_7.1                                                  appstream                                38 M
 emacs-filesystem                               noarch                                1:26.1-7.el8_7.1                                                  baseos                                   69 k
 kmod-yfs                                       aarch64                               2021.05-                                  auristor                                523 k
 libwebp                                        aarch64                               1.0.0-8.el8_7                                                     appstream                               245 k
 yfs                                            aarch64                               2021.05-27.el8                                                    auristor                                3.0 M
 yfs-client                                     aarch64                               2021.05-27.el8                                                    auristor                                 75 k
 yfs-dumptools                                  aarch64                               2021.05-27.el8                                                    auristor                                 45 k
 yfs-fuse                                       aarch64                               2021.05-27.el8                                                    auristor                                 16 k
 yfs-pam                                        aarch64                               2021.05-27.el8                                                    auristor                                 20 k

Transaction Summary
Upgrade  10 Packages

Total download size: 45 M
Is this ok [y/N]: n
Operation aborted.

Please re-run your tests using the latest version of Rocky Linux 8, which is 8.7 (and soon to be 8.8). 8.4 is no longer in support and there have been many bug fixes, enhancements, and security patches since then.

If you are able to reliably repeat this on 8.7, we can attempt to reproduce this issue on our own ARM hardware that we have available.

Release Version Guide.

Sorry for the confusion, the compute nodes were installed at 8.4; they’re currently 8.7. Edited…

Another counter trio traditionally used for benchmarking shows the same pattern: ea_memory and ea_l2 do not trigger the issue, but ea_core does. OP updated.
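For anyone checking a node, a quick way to confirm which of these events the kernel’s PMU driver actually exposes (generic sketch, not specific to our image):

```shell
# List the PMU events relevant to this report; falls back to a message
# on machines where perf is absent or the A64FX events don't exist.
perf list 2>/dev/null | grep -iE 'ea_(core|memory|l2)|stalled-cycles' \
    || echo "no matching events (or perf not installed)"
```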

Verified with 8.8 as well (kernel 4.18.0-425.19.2.el8_7.aarch64).

Ditto for kernel 4.18.0-477.10.1.el8_8.aarch64.

Ditto for the “Ookami” cluster at Stony Brook; their staff report users consistently taking down nodes while profiling, i.e. with Forge MAP + perf.

Please attempt your tests on RHEL 8.8. If you cannot reliably reproduce the tests on RHEL with the same hardware configurations, this will be something we can further address ourselves.

You mean Rocky 8.8, yes? The version that’s already documented in the OP?

No, Red Hat Enterprise Linux 8.8, which is what Rocky Linux 8.8 is based on. You can obtain a developer subscription for free and test that way.

Understood. FYI that will take some time; I’ll have to scare up a spare node with storage and physical console ports offsite. Onsite resources are unavailable, no storage, no consoles. ETA for another ARMv8 platform is probably “well out”.