Gluster self-heal daemon won’t start

Hello,

We’re setting up a test environment with just 2 Gluster servers before we commit to buying more equipment. The servers (test-phys1 and test-phys2) appear to be working properly from the client’s perspective, but for the life of me I can’t figure out why the self-heal daemon won’t start on either server.

I’ve done a ton of googling but can’t find an answer. The only real error I see in the logs is:
Dict get failed [{Key=cluster.server-quorum-type}, {errno=2}, {error=No such file or directory}]
I’m not sure how serious that is. I know having only two servers can cause split-brain issues, but I didn’t think it would affect the self-heal daemon.
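
I haven’t set cluster.server-quorum-type (or any other quorum option) on the volume myself, so I’m assuming the missing key is just the default. Is checking it with something like the command below the right way to see whether it even matters? (volume1 is the volume from the status output below.)
gluster volume get volume1 cluster.server-quorum-type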

Any help would be greatly appreciated.

[root@test-phys1 ~]# gluster peer status
Number of Peers: 1

Hostname: test-phys2
Uuid: a629951e-716d-4288-b9ba-45da1e9b4713
State: Peer in Cluster (Connected)
[root@test-phys1 ~]# gluster volume status
Status of volume: volume1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick test-phys1:/data/glusterfs/volume1/bri
ck0                                         59621     0          Y       6652 
Brick test-phys2:/data/glusterfs/volume1/bri
ck0                                         54040     0          Y       2965 
Self-heal Daemon on localhost               N/A       N/A        N       N/A  
Self-heal Daemon on test-phys2               N/A       N/A        N       N/A  
 
Task Status of Volume volume1
------------------------------------------------------------------------------
There are no active volume tasks
cat /var/log/glusterfs/glustershd.log
....
[2023-04-29 21:36:13.715667 +0000] I [MSGID: 100030] [glusterfsd.c:2769:main] 0-/usr/sbin/glusterfs: Started running version [{arg=/usr/sbin/glusterfs}, {version=10.0}, {cmdlinestr=/usr/sbin/glusterfs -s localhost --volfile-id shd/volume1 -p /var/run/gluster/shd/volume1/volume1-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/9d4d25f5a390def8.socket --xlator-option *replicate*.node-uuid=b5961bb7-68ea-4982-8aac-0fb3164d213a --process-name glustershd --client-pid=-6}] 
[2023-04-29 21:36:13.716915 +0000] I [glusterfsd.c:2448:daemonize] 0-glusterfs: Pid of current running process is 6961
[2023-04-29 21:36:13.718869 +0000] W [MSGID: 101248] [gf-io-uring.c:406:gf_io_uring_setup] 0-io: Current kernel doesn't support I/O URing interface. [Function not implemented]
[2023-04-29 21:36:13.725150 +0000] I [socket.c:916:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 11
[2023-04-29 21:36:13.731814 +0000] I [MSGID: 101190] [event-epoll.c:668:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}] 
[2023-04-29 21:36:13.731892 +0000] I [MSGID: 101190] [event-epoll.c:668:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}] 
[2023-04-29 21:36:13.736684 +0000] I [glusterfsd-mgmt.c:2676:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2023-04-29 21:36:13.736717 +0000] I [glusterfsd-mgmt.c:2713:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2023-04-29 21:36:13.736880 +0000] W [glusterfsd.c:1459:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xfb4b) [0x7fdefed92b4b] -->/usr/sbin/glusterfs(+0x14611) [0x55a1dea14611] -->/usr/sbin/glusterfs(cleanup_and_exit+0x58) [0x55a1dea087e8] ) 0-: received signum (1), shutting down 
[2023-04-29 21:36:14.648603 +0000] I [MSGID: 100030] [glusterfsd.c:2769:main] 0-/usr/sbin/glusterfs: Started running version [{arg=/usr/sbin/glusterfs}, {version=10.0}, {cmdlinestr=/usr/sbin/glusterfs -s localhost --volfile-id shd/volume1 -p /var/run/gluster/shd/volume1/volume1-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/9d4d25f5a390def8.socket --xlator-option *replicate*.node-uuid=b5961bb7-68ea-4982-8aac-0fb3164d213a --process-name glustershd --client-pid=-6}] 
[2023-04-29 21:36:14.649825 +0000] I [glusterfsd.c:2448:daemonize] 0-glusterfs: Pid of current running process is 6979
[2023-04-29 21:36:14.652339 +0000] W [MSGID: 101248] [gf-io-uring.c:406:gf_io_uring_setup] 0-io: Current kernel doesn't support I/O URing interface. [Function not implemented]
[2023-04-29 21:36:14.658736 +0000] I [socket.c:916:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 11
[2023-04-29 21:36:14.665377 +0000] I [MSGID: 101190] [event-epoll.c:668:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}] 
[2023-04-29 21:36:14.665472 +0000] I [MSGID: 101190] [event-epoll.c:668:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}] 
[2023-04-29 21:36:14.668680 +0000] I [glusterfsd-mgmt.c:2676:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2023-04-29 21:36:14.668713 +0000] I [glusterfsd-mgmt.c:2713:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2023-04-29 21:36:14.668886 +0000] W [glusterfsd.c:1459:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0xfb4b) [0x7f50c3e97b4b] -->/usr/sbin/glusterfs(+0x14611) [0x5642c7414611] -->/usr/sbin/glusterfs(cleanup_and_exit+0x58) [0x5642c74087e8] ) 0-: received signum (1), shutting down 
cat /var/log/glusterfs/glusterd.log
.......
[2023-04-29 21:36:13.656047 +0000] I [MSGID: 100030] [glusterfsd.c:2769:main] 0-/usr/sbin/glusterd: Started running version [{arg=/usr/sbin/glusterd}, {version=10.0}, {cmdlinestr=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO}]
[2023-04-29 21:36:13.657220 +0000] I [glusterfsd.c:2448:daemonize] 0-glusterfs: Pid of current running process is 6940
[2023-04-29 21:36:13.659218 +0000] W [MSGID: 101248] [gf-io-uring.c:406:gf_io_uring_setup] 0-io: Current kernel doesn't support I/O URing interface. [Function not implemented]
[2023-04-29 21:36:13.662753 +0000] I [MSGID: 106478] [glusterd.c:1472:init] 0-management: Maximum allowed open file descriptors set to 65536
[2023-04-29 21:36:13.662823 +0000] I [MSGID: 106479] [glusterd.c:1548:init] 0-management: Using /var/lib/glusterd as working directory
[2023-04-29 21:36:13.662838 +0000] I [MSGID: 106479] [glusterd.c:1554:init] 0-management: Using /var/run/gluster as pid file working directory
[2023-04-29 21:36:13.668373 +0000] I [socket.c:974:__socket_server_bind] 0-socket.management: process started listening on port (24007)
[2023-04-29 21:36:13.671087 +0000] I [socket.c:916:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 14
[2023-04-29 21:36:13.671835 +0000] I [MSGID: 106059] [glusterd.c:1940:init] 0-management: max-port override: 60999
[2023-04-29 21:36:13.673326 +0000] I [MSGID: 106228] [glusterd.c:486:glusterd_check_gsync_present] 0-glusterd: geo-replication module not installed in the system [No such file or directory]
[2023-04-29 21:36:13.673475 +0000] I [MSGID: 106513] [glusterd-store.c:2124:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 100000
[2023-04-29 21:36:13.673799 +0000] W [MSGID: 106204] [glusterd-store.c:3163:glusterd_store_update_volinfo] 0-management: Unknown key: tier-enabled
[2023-04-29 21:36:13.673861 +0000] W [MSGID: 106204] [glusterd-store.c:3163:glusterd_store_update_volinfo] 0-management: Unknown key: brick-0
[2023-04-29 21:36:13.673875 +0000] W [MSGID: 106204] [glusterd-store.c:3163:glusterd_store_update_volinfo] 0-management: Unknown key: brick-1
[2023-04-29 21:36:13.674093 +0000] I [MSGID: 106544] [glusterd.c:153:glusterd_uuid_init] 0-management: retrieved UUID: b5961bb7-68ea-4982-8aac-0fb3164d213a
[2023-04-29 21:36:13.676144 +0000] I [MSGID: 106498] [glusterd-handler.c:3632:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2023-04-29 21:36:13.676228 +0000] W [MSGID: 106061] [glusterd-handler.c:3425:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2023-04-29 21:36:13.676263 +0000] I [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2023-04-29 21:36:13.684467 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:258:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}, {errno=2}, {error=No such file or directory}]
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 1024
  8:     option max-port 60999
  9:     option event-threads 1
 10:     option ping-timeout 0
 11:     option transport.socket.listen-port 24007
 12:     option transport.socket.read-fail-log off
 13:     option transport.socket.keepalive-interval 2
 14:     option transport.socket.keepalive-time 10
 15:     option transport-type socket
 16:     option working-directory /var/lib/glusterd
 17: end-volume
 18:

+------------------------------------------------------------------------------+
[2023-04-29 21:36:13.684640 +0000] I [glusterd-utils.c:6914:glusterd_brick_start] 0-management: starting a fresh brick process for brick /data/glusterfs/volume1/brick0
[2023-04-29 21:36:13.685895 +0000] I [MSGID: 101190] [event-epoll.c:668:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}] 
[2023-04-29 21:36:13.687891 +0000] I [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2023-04-29 21:36:13.692304 +0000] I [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2023-04-29 21:36:13.692554 +0000] I [MSGID: 106131] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: quotad already stopped 
[2023-04-29 21:36:13.692589 +0000] I [MSGID: 106568] [glusterd-svc-mgmt.c:266:glusterd_svc_stop] 0-management: quotad service is stopped 
[2023-04-29 21:36:13.692647 +0000] I [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2023-04-29 21:36:13.692861 +0000] I [MSGID: 106131] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: bitd already stopped 
[2023-04-29 21:36:13.692888 +0000] I [MSGID: 106568] [glusterd-svc-mgmt.c:266:glusterd_svc_stop] 0-management: bitd service is stopped 
[2023-04-29 21:36:13.692943 +0000] I [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2023-04-29 21:36:13.693167 +0000] I [MSGID: 106131] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: scrub already stopped 
[2023-04-29 21:36:13.693202 +0000] I [MSGID: 106568] [glusterd-svc-mgmt.c:266:glusterd_svc_stop] 0-management: scrub service is stopped 
[2023-04-29 21:36:13.693270 +0000] I [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2023-04-29 21:36:13.693498 +0000] I [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2023-04-29 21:36:13.695705 +0000] I [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2023-04-29 21:36:13.695918 +0000] I [MSGID: 106131] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: glustershd already stopped 
[2023-04-29 21:36:13.721537 +0000] I [MSGID: 106496] [glusterd-handshake.c:955:__server_getspec] 0-management: Received mount request for volume volume1.test-phys1.data-glusterfs-volume1-brick0 
[2023-04-29 21:36:14.620570 +0000] I [MSGID: 106496] [glusterd-handshake.c:955:__server_getspec] 0-management: Received mount request for volume /volume1 
[2023-04-29 21:36:14.621797 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:470:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: a629951e-716d-4288-b9ba-45da1e9b4713, host: test-phys2, port: 0 
[2023-04-29 21:36:14.625348 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:258:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}, {errno=2}, {error=No such file or directory}] 
[2023-04-29 21:36:14.625493 +0000] I [glusterd-utils.c:6814:glusterd_brick_start] 0-management: discovered already-running brick /data/glusterfs/volume1/brick0
[2023-04-29 21:36:14.627965 +0000] I [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2023-04-29 21:36:14.625616 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:258:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}, {errno=2}, {error=No such file or directory}] 
[2023-04-29 21:36:14.631468 +0000] I [MSGID: 106492] [glusterd-handler.c:2730:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a629951e-716d-4288-b9ba-45da1e9b4713 
[2023-04-29 21:36:14.631558 +0000] I [MSGID: 106502] [glusterd-handler.c:2777:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend 
[2023-04-29 21:36:14.631677 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:683:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a629951e-716d-4288-b9ba-45da1e9b4713 
[2023-04-29 21:36:14.632008 +0000] I [MSGID: 106163] [glusterd-handshake.c:1502:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 100000 
[2023-04-29 21:36:14.633478 +0000] I [MSGID: 106490] [glusterd-handler.c:2547:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a629951e-716d-4288-b9ba-45da1e9b4713 
[2023-04-29 21:36:14.633691 +0000] E [MSGID: 106061] [glusterd-utils.c:5301:glusterd_get_global_server_quorum_ratio] 0-management: Dict get failed [{Key=cluster.server-quorum-ratio}, {errno=2}, {error=No such file or directory}] 
[2023-04-29 21:36:14.633717 +0000] E [MSGID: 106061] [glusterd-utils.c:5301:glusterd_get_global_server_quorum_ratio] 0-management: Dict get failed [{Key=cluster.server-quorum-ratio}, {errno=2}, {error=No such file or directory}] 
[2023-04-29 21:36:14.634285 +0000] I [MSGID: 106493] [glusterd-handler.c:3821:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to test-phys2 (0), ret: 0, op_ret: 0 
[2023-04-29 21:36:14.635820 +0000] I [MSGID: 106492] [glusterd-handler.c:2730:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a629951e-716d-4288-b9ba-45da1e9b4713 
[2023-04-29 21:36:14.635860 +0000] I [MSGID: 106502] [glusterd-handler.c:2777:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend 
[2023-04-29 21:36:14.635937 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:683:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a629951e-716d-4288-b9ba-45da1e9b4713 
[2023-04-29 21:36:17.129359 +0000] I [MSGID: 106061] [glusterd-utils.c:10726:glusterd_volume_status_copy_to_op_ctx_dict] 0-management: Dict get failed [{Key=count}] 
[2023-04-29 21:36:17.130067 +0000] I [MSGID: 106499] [glusterd-handler.c:4373:__glusterd_handle_status_volume] 0-management: Received status volume req for volume volume1 

GlusterFS needs quorum, which effectively means 3 nodes; with only two nodes you can’t meet quorum. I believe there are ways to run Gluster as a 2-node setup, but I wouldn’t recommend it. The problem you’ll have is that if one node goes down, you won’t be able to mount the Gluster volumes. I’ve run into this before: at least 2 nodes had to be up for the mounts to stay mounted and accessible.
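
To be concrete about what I mean by a 2-node setup: as far as I remember it comes down to relaxing the quorum options on the volume, roughly like the two commands below. Untested on Gluster 10 and not something I’d run outside a lab, since it leaves split-brain entirely up to you to manage:
gluster volume set volume1 cluster.server-quorum-type none
gluster volume set volume1 cluster.quorum-type none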

Thank you for the reply. I knew having just 2 nodes in the cluster wasn’t a great idea, but I thought it would work for testing. I followed the guide here: [Clustering-GlusterFS - Documentation]
and they show the self-heal daemon running with just 2 nodes. I’m just worried that if I have 3 nodes in the cluster and one fails, I’ll be back in this situation. I should mention I’m running Gluster 10 instead of Gluster 9 like the guide.

A 3-node setup will tolerate a single-node failure, so you would still have two nodes running. Yes, it will complain that the cluster is down a node, which means you have to fix the broken node or introduce a new one so that there are 3 again.
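
If you do go to 3 nodes, extending your existing replica-2 volume is normally just a peer probe plus an add-brick with the replica count bumped to 3, then a full heal to populate the new brick. The hostname test-phys3 and the brick path below are only placeholders for whatever your third box ends up being:
gluster peer probe test-phys3
gluster volume add-brick volume1 replica 3 test-phys3:/data/glusterfs/volume1/brick0
gluster volume heal volume1 full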

Well… You were right. I spun up a VM to act as the 3rd Gluster server and everything worked. I’m surprised it doesn’t report more clearly that that’s the issue, and that it even tries to run the self-heal process.
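
For anyone who finds this thread later, the quick way to confirm is just re-running the status command from my first post (the Self-heal Daemon rows should show online) plus a heal info to make sure nothing is left pending:
gluster volume status volume1
gluster volume heal volume1 info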

Anyway, thank you for your help!
