Configure corosync with pacemaker on Rocky 9

In my test cluster, the other node is always shown as offline, even though there is a connection between the two systems. I can manage resources with pcs on both systems. Can anybody help me further?

Configuration:

yum config-manager --set-enabled highavailability
yum install corosync pacemaker pcs -y
passwd hacluster
systemctl enable pcsd.service; systemctl start pcsd.service
pcs host auth server-1 server-2
pcs cluster setup test_cluster server-1 server-2
pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24 op monitor interval=30s
firewall-cmd --zone=public --add-port=5404/udp --permanent
firewall-cmd --zone=public --add-port=2224/tcp --permanent
firewall-cmd --permanent --add-service=high-availability
pcs cluster enable --all
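(Rules added with --permanent only become active after a reload, so something like the following is presumably also needed: firewall-cmd --reload)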
[root@server-2 ~]# pcs cluster enable --all
server-2: Cluster Enabled
server-1: Cluster Enabled

[root@server-2 ~]# pcs status
Cluster name: test_cluster
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: server-2 (version 2.1.6-10.1.el9_3-6fdc9deea29) - partition WITHOUT quorum
  * Last updated: Fri Mar 29 00:27:44 2024 on server-2
  * Last change:  Fri Mar 29 00:01:34 2024 by root via cibadmin on server-2
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ server-2 ]
  * OFFLINE: [ server-1 ]

Full List of Resources:
  * vip	(ocf:heartbeat:IPaddr2):	 Started server-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

I cannot find a problem in my configuration; I even tried disabling firewalld.

server-1

corosync-cfgtool -s
Local node ID 2, transport knet
LINK ID 0 udp
	addr	= 127.0.0.1
	status:
		nodeid:          1:	disconnected
		nodeid:          2:	localhost

server-2

corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0 udp
	addr	= 127.0.0.1
	status:
		nodeid:          1:	localhost
		nodeid:          2:	disconnected

Authorization without errors:

pcs host auth server-1 server-2 -u hacluster -p 'hacluster' --debug
...
  "log": [
    "I, [2024-04-06T11:42:40.671772 #2799]  INFO -- : PCSD Debugging enabled\n",
    "D, [2024-04-06T11:42:40.671809 #2799] DEBUG -- : Detected systemd is in use\n",
    "I, [2024-04-06T11:42:40.936876 #2799]  INFO -- : Connecting to: https://server-2:2224/remote/auth\n",
    "I, [2024-04-06T11:42:40.937914 #2799]  INFO -- : Connecting to: https://server-1:2224/remote/auth\n",
    "I, [2024-04-06T11:42:41.238872 #2799]  INFO -- : Sending config 'known-hosts' version 7 1766fcd806cf6cbdd6e98db9e68dc68e6f44d7ae to nodes: server-2, server-1\n",
    "I, [2024-04-06T11:42:41.238986 #2799]  INFO -- : SRWT Node: server-2 Request: set_configs\n",
    "I, [2024-04-06T11:42:41.239614 #2799]  INFO -- : Connecting to: https://server-2:2224/remote/set_configs\n",
    "I, [2024-04-06T11:42:41.240279 #2799]  INFO -- : SRWT Node: server-1 Request: set_configs\n",
    "I, [2024-04-06T11:42:41.240783 #2799]  INFO -- : Connecting to: https://server-1:2224/remote/set_configs\n",
    "I, [2024-04-06T11:42:41.360234 #2799]  INFO -- : Sending config response from server-1: {\"status\"=>\"ok\", \"result\"=>{\"known-hosts\"=>\"accepted\"}}\n",
    "I, [2024-04-06T11:42:41.360276 #2799]  INFO -- : Sending config response from server-2: {\"status\"=>\"ok\", \"result\"=>{\"known-hosts\"=>\"accepted\"}}\n"
  ]
}

--Debug Stdout End--
--Debug Stderr Start--

--Debug Stderr End--

server-2: Authorized
server-1: Authorized

what have I missed here?

Hi,
As a test: remove the floating IP resource, then stop the firewall and disable it.
Reboot the servers one by one.
If the correct interface address is used for communication between the servers, corosync will show the connection.
If not, check which interface is used for the communication.
Post the result.
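For example, something along these lines (vip is the resource created earlier):

pcs resource delete vip
systemctl stop firewalld
systemctl disable firewalld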

I forgot, please also post:
cat /etc/corosync/corosync.conf
cat /etc/hosts

corosync-cmapctl | grep member
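It may also help to check how each node resolves its own name and the other node's name, and which addresses the interfaces actually have, for example:

getent hosts server-1 server-2
ip -br addr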


Corosync uses UDP port 5404, and according to nmap this port is not open.

I’m building the cluster on a hypervisor; I would like to test the settings before moving the whole thing to hardware. I’m currently using VirtualBox, and the Rocky 9 systems have two interfaces: a NAT adapter and a host-only adapter with its own network. I have already deactivated the NAT interface and started the servers one after the other, but that didn’t help.

Is there a reason why the UDP port is not usable, or could it be VirtualBox?
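(One quick way to see which UDP port and address corosync is actually bound to would be ss -ulnp | grep corosync. If the socket is bound to 127.0.0.1, nmap from the other node will report the port as closed even though corosync is running.)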

[root@server-1 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: test_cluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    cluster_uuid: f04a8d87115946078f857bfd9d638e1b
}

nodelist {
    node {
        ring0_addr: server-2
        name: server-2
        nodeid: 1
    }

    node {
        ring0_addr: server-1
        name: server-1
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}

[root@server-1 ~]# cat /etc/hosts 
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 server-1
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.112 server-2

[root@server-1 ~]# corosync-cmapctl | grep member
runtime.members.2.config_version (u64) = 0
runtime.members.2.ip (str) = r(0) ip(127.0.0.1) 
runtime.members.2.join_count (u32) = 1
runtime.members.2.status (str) = joined


[root@server-1 ~]# nmap -n -PN -sT -sU -p 22,2224,2225,5404,21064 server-2
Starting Nmap 7.92 ( https://nmap.org ) at 2024-04-06 16:50 CEST
Nmap scan report for server-2 (192.168.1.112)
Host is up (0.00040s latency).

PORT      STATE  SERVICE
22/tcp    open   ssh
2224/tcp  open   efi-mg
2225/tcp  closed rcip-itu
5404/tcp  closed hpoms-dps-lstn
21064/tcp closed unknown
22/udp    closed ssh
2224/udp  closed efi-mg
2225/udp  closed rcip-itu
5404/udp  closed hpoms-dps-lstn
21064/udp closed unknown

[root@server-1 ~]# systemctl status firewalld.service
○ firewalld.service - firewalld - dynamic firewall daemon
     Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; preset: en>
     Active: inactive (dead)
       Docs: man:firewalld(1)
       
[root@server-1 ~]# systemctl status corosync.service 
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; preset: disa>
     Active: active (running) since Sat 2024-04-06 16:30:44 CEST; 26min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 1210 (corosync)
      Tasks: 9 (limit: 16930)
     Memory: 145.9M
        CPU: 7.204s
     CGroup: /system.slice/corosync.service
             └─1210 /usr/sbin/corosync -f

Apr 06 16:30:44 server-1 corosync[1210]:   [KNET  ] host: host: 1 has no activ>
Apr 06 16:30:44 server-1 corosync[1210]:   [QUORUM] Sync members[1]: 2
Apr 06 16:30:44 server-1 corosync[1210]:   [QUORUM] Sync joined[1]: 2
Apr 06 16:30:44 server-1 corosync[1210]:   [TOTEM ] A new membership (2.23) wa>
Apr 06 16:30:44 server-1 corosync[1210]:   [VOTEQ ] Waiting for all cluster me>
Apr 06 16:30:44 server-1 corosync[1210]:   [VOTEQ ] Waiting for all cluster me>
Apr 06 16:30:44 server-1 corosync[1210]:   [VOTEQ ] Waiting for all cluster me>
Apr 06 16:30:44 server-1 corosync[1210]:   [QUORUM] Members[1]: 2
Apr 06 16:30:44 server-1 corosync[1210]:   [MAIN  ] Completed service synchron>
Apr 06 16:30:44 server-1 systemd[1]: Started Corosync Cluster Engine.


======================================================================================

[root@server-2 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: test_cluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    cluster_uuid: f04a8d87115946078f857bfd9d638e1b
}

nodelist {
    node {
        ring0_addr: server-2
        name: server-2
        nodeid: 1
    }

    node {
        ring0_addr: server-1
        name: server-1
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
[root@server-2 ~]# cat /etc/hosts 
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 server-2
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.111 server-1


[root@server-2 ~]# corosync-cmapctl | grep member
runtime.members.1.config_version (u64) = 0
runtime.members.1.ip (str) = r(0) ip(127.0.0.1) 
runtime.members.1.join_count (u32) = 1
runtime.members.1.status (str) = joined

[root@server-2 ~]# nmap -n -PN -sT -sU -p 22,2224,2225,5404,21064 server-1
Starting Nmap 7.92 ( https://nmap.org ) at 2024-04-06 16:50 CEST
Nmap scan report for server-1 (192.168.1.111)
Host is up (0.00044s latency).

PORT      STATE  SERVICE
22/tcp    open   ssh
2224/tcp  open   efi-mg
2225/tcp  closed rcip-itu
5404/tcp  closed hpoms-dps-lstn
21064/tcp closed unknown
22/udp    closed ssh
2224/udp  closed efi-mg
2225/udp  closed rcip-itu
5404/udp  closed hpoms-dps-lstn
21064/udp closed unknown

[root@server-2 ~]# systemctl status firewalld.service
○ firewalld.service - firewalld - dynamic firewall daemon
     Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; preset: en>
     Active: inactive (dead)
       Docs: man:firewalld(1)

[root@server-2 ~]# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; preset: disa>
     Active: active (running) since Sat 2024-04-06 16:31:55 CEST; 26min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 1210 (corosync)
      Tasks: 9 (limit: 16930)
     Memory: 145.9M
        CPU: 7.429s
     CGroup: /system.slice/corosync.service
             └─1210 /usr/sbin/corosync -f

Apr 06 16:31:55 server-2 corosync[1210]:   [KNET  ] host: host: 2 has no activ>
Apr 06 16:31:55 server-2 corosync[1210]:   [QUORUM] Sync members[1]: 1
Apr 06 16:31:55 server-2 corosync[1210]:   [QUORUM] Sync joined[1]: 1
Apr 06 16:31:55 server-2 corosync[1210]:   [TOTEM ] A new membership (1.23) wa>
Apr 06 16:31:55 server-2 corosync[1210]:   [VOTEQ ] Waiting for all cluster me>
Apr 06 16:31:55 server-2 corosync[1210]:   [VOTEQ ] Waiting for all cluster me>
Apr 06 16:31:55 server-2 corosync[1210]:   [VOTEQ ] Waiting for all cluster me>
Apr 06 16:31:55 server-2 corosync[1210]:   [QUORUM] Members[1]: 1
Apr 06 16:31:55 server-2 corosync[1210]:   [MAIN  ] Completed service synchron>
Apr 06 16:31:55 server-2 systemd[1]: Started Corosync Cluster Engine.

I tested the UDP connections with a Python example (a small server and client script), and it works in my environment. The problem must be related to Rocky Linux or my configuration. I’m drawing a blank. :melting_face:
Do you have any idea?
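For reference, an equivalent quick test without Python, assuming the nmap-ncat package is installed, would be:

ncat -u -l 5405                     (on server-2)
echo test | ncat -u server-2 5405   (on server-1)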

I think I found the problem. I had used:

pcs cluster setup test_cluster --start server-2 server-1 --force

If I change the following value inside corosync.conf to the actual IP address, both servers come online:

ring0_addr: server-*
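
With the actual addresses (192.168.1.112 for server-2 and 192.168.1.111 for server-1, as in the hosts files above), the working nodelist then looks roughly like this:

nodelist {
    node {
        ring0_addr: 192.168.1.112
        name: server-2
        nodeid: 1
    }

    node {
        ring0_addr: 192.168.1.111
        name: server-1
        nodeid: 2
    }
}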

The hosts file looks like:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 server-2
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.1.112 server-2
192.168.1.111 server-1

I noticed this problem with netstat. :grimacing:

udp        0      127.0.0.1:5405     0.0.0.0:*                           0          28920      2809/corosync 

Sorry for the late reply!
I’m trying to catch up, but I haven’t set up a cluster in 6-7 years, so I may have forgotten some things.
First, let’s clarify the situation exactly, please:
Do the two servers have a direct LAN-to-LAN connection, or does the connectivity between them go through a switch?
The question is not how many interfaces the servers have, but how the service links (used for synchronization and communication within the cluster) connect the two nodes (direct cable or through a switch).
From what I’ve read so far, if I understand the problem correctly, it probably starts with the interface and with the initial creation and authentication of the cluster.
The hosts file contains the addresses and names of both nodes; that is good.
The question is whether the interface those addresses are assigned to is the service interface. If so, it is correct; if not, it must be corrected.
After that, the passwordless SSH connection between the two nodes should go only through the designated service interface, using the addresses specified in the hosts file.
Check the passwordless connection between the two nodes! It is important to verify it both by name and by address, from each node to the other (a quick check is shown after the host.conf snippet below). It is also important to make sure that each node reads the hosts file first and DNS only afterwards:

cat /etc/host.conf

order hosts bind
multi on
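
A quick way to check the passwordless connection, for example from server-1 (and the same in the other direction from server-2):

ssh root@server-2 hostname
ssh root@192.168.1.112 hostname

Both should return the hostname without asking for a password.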

If all of that is in place, what remains is the authentication and the initial creation of the cluster:

pcs cluster auth server-1 server-2 --force
pcs cluster auth 192.168.1.111 192.168.1.112 --force
pcs cluster setup --name test_cluster server-1 server-2
pcs cluster start --all
pcs cluster enable --all
pcs cluster sync --all
pcs status

pcs status --full
pcs resource --full
pcs constraint --full

Then check with netstat -uanlp or, equivalently, ss -uanlp | grep 5405.

Communication must be to the service interface address, not localhost!
That’s what I think.