XFS Internal error: xfs_trans_cancel at line 957 of file fs/xfs/xfs_trans.c

Hi all,
I am using XFS on Rocky Linux 8. I have three bind mounts under the same folder. I keep hitting this XFS internal error roughly every 24 hours. Every time, I have to unmount, run xfs_repair, and remount to recover, which interrupts my service and leads to data loss.
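For reference, the recovery I run each time looks roughly like this (a rough sketch; it assumes the filesystem and the three bind mounts are all defined in /etc/fstab, and uses the placeholder folder names from the info below):

umount /folder/a /folder/b /folder/c   # drop the bind mounts first
umount /folder                         # then the XFS filesystem itself
xfs_repair /dev/vdc                    # repair the underlying device
mount -a                               # remount everything from /etc/fstab
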
Would anyone have any best practice or solution for this? Thanks.

Here is some basic info:
– XFS on /folder (1.5T total), with three bind mounts: /folder/a (~440G), /folder/b (~101G), /folder/c (~10G)
– OS release: Rocky Linux 8; xfsprogs version: xfsprogs.x86_64 5.0.0-12.el8
– error log:
Jan 2 13:52:05 hybrid01 kernel: XFS (vdc): Internal error xfs_trans_cancel at line 957 of file fs/xfs/xfs_trans.c. Caller xfs_free_file_space+0x174/0x280 [xfs]
Jan 2 13:52:05 hybrid01 kernel: CPU: 13 PID: 2337066 Comm: dir /sensorsdat Tainted: G W OE -------- - - 4.18.0-553.16.1.el8_10.x86_64 #1
Jan 2 13:52:05 hybrid01 kernel: Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
Jan 2 13:52:05 hybrid01 kernel: Call Trace:
Jan 2 13:52:05 hybrid01 kernel: dump_stack+0x41/0x60
Jan 2 13:52:05 hybrid01 kernel: xfs_trans_cancel+0xad/0x130 [xfs]
Jan 2 13:52:05 hybrid01 kernel: xfs_free_file_space+0x174/0x280 [xfs]
Jan 2 13:52:05 hybrid01 kernel: xfs_file_fallocate+0x14a/0x480 [xfs]
Jan 2 13:52:05 hybrid01 kernel: vfs_fallocate+0x140/0x280
Jan 2 13:52:05 hybrid01 kernel: ioctl_preallocate+0x93/0xc0
Jan 2 13:52:05 hybrid01 kernel: do_vfs_ioctl+0x626/0x690
Jan 2 13:52:05 hybrid01 kernel: ? syscall_trace_enter+0x1ff/0x2d0
Jan 2 13:52:05 hybrid01 kernel: ksys_ioctl+0x64/0xa0
Jan 2 13:52:05 hybrid01 kernel: __x64_sys_ioctl+0x16/0x20
Jan 2 13:52:05 hybrid01 kernel: do_syscall_64+0x5b/0x1a0
Jan 2 13:52:05 hybrid01 kernel: entry_SYSCALL_64_after_hwframe+0x66/0xcb
Jan 2 13:52:05 hybrid01 kernel: RIP: 0033:0x7f17bb01522b
Jan 2 13:52:05 hybrid01 kernel: Code: 73 01 c3 48 8b 0d 5d 6c 39 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d 6c 39 00 f7 d8 64 89 01 48
Jan 2 13:52:05 hybrid01 kernel: RSP: 002b:00007f179dd70138 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan 2 13:52:05 hybrid01 kernel: RAX: ffffffffffffffda RBX: 00000000222d5ac0 RCX: 00007f17bb01522b
Jan 2 13:52:05 hybrid01 kernel: RDX: 00007f179dd70190 RSI: 000000004030582b RDI: 00000000000098e2
Jan 2 13:52:05 hybrid01 kernel: RBP: 00007f179dd701f0 R08: 864fd6762cc61123 R09: 0000000068f54c02
Jan 2 13:52:05 hybrid01 kernel: R10: 0000000000000050 R11: 0000000000000246 R12: 0000000006f93000
Jan 2 13:52:05 hybrid01 kernel: R13: 0000000000003000 R14: 0000000003c23624 R15: 00007f179dd702d8
Jan 2 13:52:05 hybrid01 kernel: XFS (vdc): Internal error xfs_trans_cancel at line 957 of file fs/xfs/xfs_trans.c. Caller xfs_free_file_space+0x174/0x280 [xfs]
Jan 2 13:52:05 hybrid01 kernel: CPU: 1 PID: 2337071 Comm: dir /sensorsdat Tainted: G W OE -------- - - 4.18.0-553.16.1.el8_10.x86_64 #1
Jan 2 13:52:05 hybrid01 kernel: Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
Jan 2 13:52:05 hybrid01 kernel: Call Trace:
Jan 2 13:52:05 hybrid01 kernel: dump_stack+0x41/0x60
Jan 2 13:52:05 hybrid01 kernel: xfs_trans_cancel+0xad/0x130 [xfs]
Jan 2 13:52:05 hybrid01 kernel: xfs_free_file_space+0x174/0x280 [xfs]
Jan 2 13:52:05 hybrid01 kernel: xfs_file_fallocate+0x14a/0x480 [xfs]
Jan 2 13:52:05 hybrid01 kernel: ? futex_wake+0x144/0x160
Jan 2 13:52:05 hybrid01 kernel: vfs_fallocate+0x140/0x280
Jan 2 13:52:05 hybrid01 kernel: ioctl_preallocate+0x93/0xc0
Jan 2 13:52:05 hybrid01 kernel: do_vfs_ioctl+0x626/0x690
Jan 2 13:52:05 hybrid01 kernel: ? syscall_trace_enter+0x1ff/0x2d0
Jan 2 13:52:05 hybrid01 kernel: ksys_ioctl+0x64/0xa0
Jan 2 13:52:05 hybrid01 kernel: __x64_sys_ioctl+0x16/0x20
Jan 2 13:52:05 hybrid01 kernel: do_syscall_64+0x5b/0x1a0
Jan 2 13:52:05 hybrid01 kernel: entry_SYSCALL_64_after_hwframe+0x66/0xcb
Jan 2 13:52:05 hybrid01 kernel: RIP: 0033:0x7f17bb01522b
Jan 2 13:52:05 hybrid01 kernel: Code: 73 01 c3 48 8b 0d 5d 6c 39 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d 6c 39 00 f7 d8 64 89 01 48
Jan 2 13:52:05 hybrid01 kernel: RSP: 002b:00007f17b3fae138 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan 2 13:52:05 hybrid01 kernel: RAX: ffffffffffffffda RBX: 0000000018ff49e0 RCX: 00007f17bb01522b
Jan 2 13:52:05 hybrid01 kernel: RDX: 00007f17b3fae190 RSI: 000000004030582b RDI: 00000000000071a6
Jan 2 13:52:05 hybrid01 kernel: RBP: 00007f17b3fae1f0 R08: ad732b924664055e R09: 00000000bb9703a7
Jan 2 13:52:05 hybrid01 kernel: R10: 0000000000000050 R11: 0000000000000246 R12: 0000000007be1000
Jan 2 13:52:05 hybrid01 kernel: R13: 0000000000003000 R14: 0000000003c23624 R15: 00007f17b3fae2d8
Jan 2 13:52:05 hybrid01 kernel: XFS (vdc): Corruption of in-memory data (0x8) detected at xfs_trans_cancel+0xc6/0x130 [xfs] (fs/xfs/xfs_trans.c:958). Shutting down filesystem
Jan 2 13:52:05 hybrid01 kernel: XFS (vdc): Please unmount the filesystem and rectify the problem(s)
Jan 2 13:52:05 hybrid01 systemd[1]: Started Process Core Dump (PID 2337089/UID 0).
Jan 2 13:52:05 hybrid01 systemd-coredump[2337090]: Resource limits disable core dumping for process 3179881 (replica_server).
Jan 2 13:52:05 hybrid01 systemd-coredump[2337090]: Process 3179881 (replica_server) of user 7001 dumped core.

There are two ways to work through this: software and hardware. Hardware you can’t look into because it’s the cloud. However, you can dig into the filesystem a bit to try to weed out issues.

You’ve given information on how much storage is being used under each mount point, but not the filesystem sizes themselves. You’ve also not said whether this issue is happening with every mount point or just one (for example, the largest one at 1.5T would likely have a higher chance of presenting issues).

Jan 2 13:52:05 hybrid01 kernel: XFS (vdc): Internal error xfs_trans_cancel at line 957 of file fs/xfs/xfs_trans.c. Caller xfs_free_file_space+0x174/0x280 [xfs]

This is talking about freeing file space, so you will need to confirm that you actually have space available. Check df -h and xfs_info /mountpoint, and optionally check df -i for inode usage.

[root@router ~]# df -h /var
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/rockyvg-var  150G   96G   55G  64% /var
[root@router ~]# df -i /var
Filesystem                Inodes  IUsed    IFree IUse% Mounted on
/dev/mapper/rockyvg-var 78643200 302583 78340617    1% /var
[root@router ~]# xfs_info /var
meta-data=/dev/mapper/rockyvg-var isize=512    agcount=4, agsize=9830400 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=39321600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=19200, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

You could also verify if there’s fragmentation.

[root@router ~]# xfs_db -c frag -r /dev/rockyvg/var
actual 273084, ideal 260529, fragmentation factor 4.60%
Note, this number is largely meaningless.
Files on this filesystem average 1.05 extents per file

If you have a large amount of fragmentation, you can use xfs_fsr /mountpoint.
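If you do run it, something like this keeps the run bounded and shows what it is doing (just a sketch; the mount point is a placeholder and -t is the run-time limit in seconds):

xfs_fsr -v -t 3600 /mountpoint   # verbose, stop after roughly an hour

xfs_fsr is an online defragmenter, so it can be run while the filesystem is mounted and in use.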

I would start with those basic steps first and go from there.

Thank you for the reply.
Here is the further information:

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        1.5T  552G  949G  37% /sensorsmounts/hybriddata

df -i
Filesystem         Inodes  IUsed     IFree IUse% Mounted on
/dev/vdc        157286400 101327 157185073    1% /sensorsmounts/hybriddata

xfs_info /dev/vdc
meta-data=/dev/vdc               isize=512    agcount=4, agsize=98304000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=393216000, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=192000, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

xfs_db -c frag -r /dev/vdc
actual 1264007, ideal 484698, fragmentation factor 61.65%
Note, this number is largely meaningless.
Files on this filesystem average 2.61 extents per file

====
It seems the fragmentation factor is too high.

Then you want to run the defragmentation command that @nazunalika gave you:

xfs_fsr /mountpoint

for wherever /dev/vdc is mounted.
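If you are not sure where that is, findmnt will tell you, e.g. (a sketch):

findmnt -n -o TARGET /dev/vdc                   # print the mount point for /dev/vdc
xfs_fsr -v "$(findmnt -n -o TARGET /dev/vdc)"   # defragment that mount point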

When you ran xfs_repair, did you save the console output each time, and do you have the most recent?
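If not, next time it happens you could capture it with something like this (a sketch; the log path is just an example):

xfs_repair /dev/vdc 2>&1 | tee /root/xfs_repair-$(date +%F).log

That way there is a record of exactly what xfs_repair found and fixed each time.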