Rocky Linux 8.9 kernel driver crashing

We have a volume change-block-tracking kernel driver running on Rocky Linux 8.9 (kernel 4.18.0-513.11.1.el8_9.x86_64). The driver is part of a backup product. It crashes randomly after running for a couple of weeks. The driver tries to allocate a memory page from an interrupt handler; under memory pressure the kernel starts swapping to satisfy that allocation request, which puts the calling thread to sleep. Since the allocation was made from an interrupt handler, the kernel crashes with:
‘BUG: scheduling while atomic: swapper/8/0/0x00000100’
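Judging from the trace below (`copy_block_endIO` → `_spbm_set_bit` → `__alloc_pages_nodemask` → `try_to_free_pages`), the allocation is being made with a sleepable GFP mask from I/O-completion context. The sketch below is hypothetical — the function name `spbm_grow()` is illustrative, not actual hcpdriver code — and shows the usual pattern: `GFP_ATOMIC` (never sleeps, but can fail) for interrupt context:

```c
/*
 * Hypothetical sketch (not the actual hcpdriver source); spbm_grow()
 * is an illustrative name. In a bio completion handler the CPU is in
 * softirq/interrupt context, so no allocation that can enter direct
 * reclaim may be used there.
 */
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *spbm_grow(void)
{
	/*
	 * BUG in completion context: GFP_KERNEL lets the allocator enter
	 * direct reclaim (try_to_free_pages -> pageout -> swap), which
	 * sleeps -> "BUG: scheduling while atomic".
	 */
	/* struct page *pg = alloc_page(GFP_KERNEL); */

	/*
	 * Correct for interrupt context: GFP_ATOMIC never sleeps but can
	 * fail under memory pressure, so the caller must tolerate NULL,
	 * e.g. by deferring the bitmap update to process context via a
	 * workqueue.
	 */
	return alloc_page(GFP_ATOMIC | __GFP_NOWARN);
}
```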

Stack trace:

[1809981.082982]  schedule_timeout+0x281/0x320
[1809981.083307]  ? tick_sched_do_timer+0x50/0x50
[1809981.083624]  ? timerqueue_add+0x5d/0x70
[1809981.083954]  __down+0x9a/0x100
[1809981.084268]  ? mempool_alloc+0x67/0x180
[1809981.084576]  down+0x43/0x50
[1809981.084887]  cow_request+0x50/0x3e0 [hcpdriver]
[1809981.085182]  generic_make_request_no_check+0xe1/0x330
[1809981.085471]  ? get_swap_bio+0xf0/0xf0
[1809981.085773]  submit_bio+0x3c/0x160
[1809981.086071]  __swap_writepage+0x126/0x410
[1809981.086363]  ? __frontswap_store+0x6e/0x100
[1809981.086781]  pageout+0xf8/0x300
[1809981.086925]  shrink_page_list+0xa64/0xce0
[1809981.087060]  shrink_inactive_list+0x1d3/0x440
[1809981.087195]  shrink_lruvec+0x472/0x6c0
[1809981.087322]  shrink_node+0x23f/0x710
[1809981.087451]  do_try_to_free_pages+0xd8/0x4c0
[1809981.087567]  try_to_free_pages+0xf3/0x1d0
[1809981.087674]  __alloc_pages_slowpath+0x3e2/0xcd0
[1809981.087776]  __alloc_pages_nodemask+0x2e2/0x330
[1809981.087892]  _spbm_set_bit+0xce/0xf0 [hcpdriver]
[1809981.087998]  copy_block_endIO+0x106/0x140 [hcpdriver]
[1809981.088098]  blk_update_request+0x224/0x380

(Module names from the “Modules linked in:” line were interleaved with the call trace in the copied output and have been separated out: ata_generic, ata_piix, libata, virtio_net, net_failover, virtio_console, serio_raw, virtio_scsi, failover, fuse.)

Another observation: there is plenty of available RAM on the server, but swap usage is quite high:

# free -m
              total        used        free      shared  buff/cache   available
Mem:          31874        8567        1522        1084       21784       21037
Swap:          4095        3952         143

vmstat also shows large si and so values:

# vmstat 5

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 1385204 1298240  23908 21819588    0    1   327    89    1    4 13  2 85  0  0
 2  0 1385204 1248284  23920 21820996    0    1    69   440 3891 2068  5  1 93  0  0
 0  0 1385204 1398112  23936 21844788    0    0   142  3914 4606 2338  8  2 90  0  0
 0  0 1385204 1467488  23948 21836300    1    0    19  2779 4586 3020  2  2 96  0  0
 5  0 1385204 1012792  23960 21829296    1    0   325  1478 8017 4103 15  4 81  0  0
 4  0 1385204 898440  23984 21832412    1    0   815  3369 10205 6203 13  5 81  0  0
 2  0 1385204 607164  23996 21835268    0    0   316   876 8866 6159 12  4 83  0  0
 6  0 1385204 554660  24008 21835652    2    0   277  2924 10144 6843 22  4 74  0  0
 4  0 1392796 752560  23988 21473952  173 1498   498  3758 9383 5968 21  4 74  0  0
 2  0 1392796 813024  23996 21477152   38    0   554  1504 7311 4731 14  3 83  0  0
 1  0 1392796 948984  24012 21478472    2    0    69  1919 6331 3273 13  3 85  0  0
 1  0 1392796 1199608  24024 21477056   58    0   314  1464 6425 3310 11  3 86  0  0
 6  0 1392796 796268  24036 21501316    7    0  7042   714 11087 9705 22  4 74  1  0
 5  0 1392796 841080  24044 21509028    3    0  2091   628 12856 10641 26  5 69  0  0
 2  0 1392796 941220  24024 21508596    2    0   298   629 8456 4770 20  4 76  0  0
17  0 1392796 885560  24032 21510056    5    0   343  2141 9011 4256 18  4 78  0  0
 7  0 1392796 751744  24012 21525492    3    0  1505  3122 9949 4485 23  4 73  0  0
 4  0 1392796 943564  24020 21529588    2    0  1064   580 8037 3911 17  3 80  0  0
 1  0 1392796 1292224  24028 21534248    4    0   941   614 7570 3748 12  3 85  0  0
 0  0 1392796 1377832  24040 21539912    1    0   806   492 5995 3104  9  2 89  0  0
 4  0 1392796 1423332  24044 21681100    3    0 28362  1280 7227 3536 11  2 87  0  0
 0  0 1392796 1385924  24056 21683172    0    0  1627  1460 8594 4094 16  3 81  0  0
 7  0 1392796 783432  24232 21701956    4    0  1977  2361 10333 3767 23  6 71  0  0
 7  0 1392796 704344  24236 21692504    2    0   281   733 13335 4664 46  7 47  0  0
 8  0 1396240 522628  24052 21596924    4  658  1911  2592 14787 5540 52  7 41  0  0
 9  0 1399056 491696  23624 21504848   11  551  2494  2749 13363 5591 44  6 49  0  0
12  0 1401104 630448  23188 21409940   14  420   622  1842 13004 5947 42  5 53  0  0

Checking per-process swap usage shows that mariadbd and named are using the most swap:
      Pid      Swap Usage  Command
     2975374        22596 kB  httpd
     2975432        22692 kB  httpd
     2975431        22980 kB  httpd
     3067599        25532 kB  httpd
     3067543        25564 kB  httpd
     3067598        25732 kB  httpd
         609        33144 kB  systemd-journal
     3049731        37852 kB  httpd
     2479641        39860 kB  httpd
     2131101        40576 kB  spamd
        2722       720508 kB  mariadbd
        3147       786352 kB  named
---------------------------------

I’m pretty sure the kernel driver is not leaking memory. The driver is stable and has been in the field for several years. It protected the same servers when they were running CentOS 7. When the customer upgraded to Rocky Linux 8.x, we started facing these crashes:
‘BUG: scheduling while atomic: swapper/8/0/0x00000100’

Please point me in the right direction for further investigation.

Thanks,
~ Ashish K.

Well, Rocky 8 isn’t CentOS 7, so there is still every possibility that the driver/kernel module is the problem. How stable is the server without this module loaded? Unload the kernel module/driver and see how the server performs. If the problem goes away, the driver is the cause, and you need to raise the issue with the vendor of that driver/module.

see also
https://stackoverflow.com/questions/3537252/how-to-solve-bug-scheduling-while-atomic-swapper-0x00000103-0-cpu0-in-ts