We have a volume change block tracking kernel driver running on RockyLinux 8.9 (4.18.0-513.11.1.el8_9.x86_64). This driver is used for backup product. The driver crashes randomly after running for couple of weeks. The driver tries to allocate memory page from an interrupt handler, the kernel starts a swapping activity for that memory allocation request, resulting into putting the driver thread to sleep. But since the memory allocation call was made from interrupt handler the kernel crashes with error: ‘BUG: scheduling while atomic: swapper/8/0/0x00000100’
Stack trace:
[1809981.082982] schedule_timeout+0x281/0x320
[1809981.083123] ata_generic
[1809981.083307] ? tick_sched_do_timer+0x50/0x50
[1809981.083442] ata_piix
[1809981.083624] ? timerqueue_add+0x5d/0x70
[1809981.083762] libata
[1809981.083954] __down+0x9a/0x100
[1809981.084088] virtio_net
[1809981.084268] ? mempool_alloc+0x67/0x180
[1809981.084399] net_failover
[1809981.084576] down+0x43/0x50
[1809981.084706] virtio_console
[1809981.084887] cow_request+0x50/0x3e0 [hcpdriver]
[1809981.085012] serio_raw
[1809981.085182] generic_make_request_no_check+0xe1/0x330
[1809981.085306] virtio_scsi
[1809981.085471] ? get_swap_bio+0xf0/0xf0
[1809981.085593] failover
[1809981.085773] submit_bio+0x3c/0x160
[1809981.085900] fuse
[1809981.086071] __swap_writepage+0x126/0x410
[1809981.086191]
[1809981.086363] ? __frontswap_store+0x6e/0x100
**[1809981.086781] pageout+0xf8/0x300**
**[1809981.086925] shrink_page_list+0xa64/0xce0**
**[1809981.087060] shrink_inactive_list+0x1d3/0x440**
**[1809981.087195] shrink_lruvec+0x472/0x6c0**
**[1809981.087322] shrink_node+0x23f/0x710**
**[1809981.087451] do_try_to_free_pages+0xd8/0x4c0**
**[1809981.087567] try_to_free_pages+0xf3/0x1d0**
**[1809981.087674] __alloc_pages_slowpath+0x3e2/0xcd0**
[1809981.087776] __alloc_pages_nodemask+0x2e2/0x330
[1809981.087892] _spbm_set_bit+0xce/0xf0 [hcpdriver]
[1809981.087998] copy_block_endIO+0x106/0x140 [hcpdriver]
[1809981.088098] blk_update_request+0x224/0x380
Another observation: there plenty of available RAM on the server, but swap usage is pretty high:
# free -m
total used free shared buff/cache available
Mem: 31874 8567 1522 1084 21784 21037
Swap: 4095 3952 143
vmstat also shows huge si and so values.
# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 1385204 1298240 23908 21819588 0 1 327 89 1 4 13 2 85 0 0
2 0 1385204 1248284 23920 21820996 0 1 69 440 3891 2068 5 1 93 0 0
0 0 1385204 1398112 23936 21844788 0 0 142 3914 4606 2338 8 2 90 0 0
0 0 1385204 1467488 23948 21836300 1 0 19 2779 4586 3020 2 2 96 0 0
5 0 1385204 1012792 23960 21829296 1 0 325 1478 8017 4103 15 4 81 0 0
4 0 1385204 898440 23984 21832412 1 0 815 3369 10205 6203 13 5 81 0 0
2 0 1385204 607164 23996 21835268 0 0 316 876 8866 6159 12 4 83 0 0
6 0 1385204 554660 24008 21835652 2 0 277 2924 10144 6843 22 4 74 0 0
4 0 1392796 752560 23988 21473952 173 1498 498 3758 9383 5968 21 4 74 0 0
2 0 1392796 813024 23996 21477152 38 0 554 1504 7311 4731 14 3 83 0 0
1 0 1392796 948984 24012 21478472 2 0 69 1919 6331 3273 13 3 85 0 0
1 0 1392796 1199608 24024 21477056 58 0 314 1464 6425 3310 11 3 86 0 0
6 0 1392796 796268 24036 21501316 7 0 7042 714 11087 9705 22 4 74 1 0
5 0 1392796 841080 24044 21509028 3 0 2091 628 12856 10641 26 5 69 0 0
2 0 1392796 941220 24024 21508596 2 0 298 629 8456 4770 20 4 76 0 0
17 0 1392796 885560 24032 21510056 5 0 343 2141 9011 4256 18 4 78 0 0
7 0 1392796 751744 24012 21525492 3 0 1505 3122 9949 4485 23 4 73 0 0
4 0 1392796 943564 24020 21529588 2 0 1064 580 8037 3911 17 3 80 0 0
1 0 1392796 1292224 24028 21534248 4 0 941 614 7570 3748 12 3 85 0 0
0 0 1392796 1377832 24040 21539912 1 0 806 492 5995 3104 9 2 89 0 0
4 0 1392796 1423332 24044 21681100 3 0 28362 1280 7227 3536 11 2 87 0 0
0 0 1392796 1385924 24056 21683172 0 0 1627 1460 8594 4094 16 3 81 0 0
7 0 1392796 783432 24232 21701956 4 0 1977 2361 10333 3767 23 6 71 0 0
7 0 1392796 704344 24236 21692504 2 0 281 733 13335 4664 46 7 47 0 0
8 0 1396240 522628 24052 21596924 4 658 1911 2592 14787 5540 52 7 41 0 0
9 0 1399056 491696 23624 21504848 11 551 2494 2749 13363 5591 44 6 49 0 0
12 0 1401104 630448 23188 21409940 14 420 622 1842 13004 5947 42 5 53 0 0
When checked mariadb and named are using maximum swap space.
Pid Swap Usage Command
2975374 22596 kB httpd
2975432 22692 kB httpd
2975431 22980 kB httpd
3067599 25532 kB httpd
3067543 25564 kB httpd
3067598 25732 kB httpd
609 33144 kB systemd-journal
3049731 37852 kB httpd
2479641 39860 kB httpd
2131101 40576 kB spamd
2722 720508 kB mariadbd
3147 786352 kB named
---------------------------------
I’m pretty sure the kernel driver is not leaking memory. The driver is stable and is in field for several years. It protected same servers when they were running CentOS 7. When customer upgraded to RockyLinux 8.x, we started facing this crash issues:
‘BUG: scheduling while atomic: swapper/8/0/0x00000100’
Please point me the direction for further investigation.
Thanks,
~ Ashish K.