I am trying to deploy Ceph Quincy with ceph-ansible on Rocky 9. I am running into problems and I don’t know where to look for the cause.
PS: I did the same deployment on Rocky 8 with ceph-ansible for the Pacific release, on the same hardware, and it worked perfectly.
I have 3 controller nodes running the mons, mgrs, MDSs and RGWs,
and 27 OSD nodes with 4 NVMe disks (OSDs) each.
I am using a 10Gb network with jumbo frames.
The deployment starts without issues: the 3 monitors are created correctly, then the 3 managers, and then the OSDs are prepared and formatted. Up to this point everything works fine. But when the “wait for all osd to be up” task runs, which means starting all the OSD containers on all OSD nodes, things go south: the monitors fall out of quorum, ceph -s takes a long time to respond, not all OSDs get activated, and the deployment eventually fails.
cluster 2023-03-06T12:00:26.431947+0100 mon.controllera (mon.0) 3864 : cluster [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum controllera,controllerc
cluster 2023-03-06T12:00:26.431953+0100 mon.controllera (mon.0) 3865 : cluster [WRN] mon.controllerb (rank 1) addr [v2:22.214.171.124:3300/0,v1:126.96.36.199:6789/0] is down (out of quorum)
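To pull the out-of-quorum monitors out of log lines like these, a minimal Python sketch could look like the following. The regular expression is an assumption based only on the two lines quoted above, and `down_monitors` is a hypothetical helper name, not part of Ceph.

```python
import re

# Assumed format, taken from the "is down (out of quorum)" line quoted above.
DOWN_MON = re.compile(
    r"mon\.(?P<name>\S+) \(rank (?P<rank>\d+)\) addr \S+ is down \(out of quorum\)"
)


def down_monitors(log_text):
    """Return (name, rank) pairs for every monitor reported out of quorum."""
    return [(m.group("name"), int(m.group("rank")))
            for m in DOWN_MON.finditer(log_text)]


# The two cluster log entries from this post.
SAMPLE_LOG = (
    "cluster 2023-03-06T12:00:26.431947+0100 mon.controllera (mon.0) 3864 : "
    "cluster [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum controllera,controllerc\n"
    "cluster 2023-03-06T12:00:26.431953+0100 mon.controllera (mon.0) 3865 : "
    "cluster [WRN] mon.controllerb (rank 1) addr "
    "[v2:22.214.171.124:3300/0,v1:126.96.36.199:6789/0] is down (out of quorum)\n"
)

if __name__ == "__main__":
    for name, rank in down_monitors(SAMPLE_LOG):
        print(f"mon.{name} (rank {rank}) is out of quorum")
```

Run over a longer stretch of the cluster log, this would show whether it is always the same monitor that drops out or whether they take turns.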
The monitor container on 2 of my controller nodes stays at 100% CPU utilization.
CONTAINER ID   NAME                   CPU %    MEM USAGE / LIMIT     MEM %   NET I/O   BLOCK I/O        PIDS
068e4e55f299   ceph-mon-controllera   99.91%   58.12MiB / 376.1GiB   0.02%   0B / 0B   122MB / 85.3MB   28   <-----------------
87730f89420d   ceph-mgr-controllera   0.32%    408.2MiB / 376.1GiB   0.11%   0B / 0B   181MB / 0B       35
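To spot the busy containers across all controller nodes without reading each stats listing by hand, something like the Python sketch below might help. It assumes the column order shown above (12-character container ID, then name, then CPU %); `hot_containers` and the 90% threshold are hypothetical choices for illustration.

```python
import re

# Assumed row shape: <12 hex chars> <name> <cpu>% ... as in the stats output above.
ROW = re.compile(r"\b(?P<id>[0-9a-f]{12})\s+(?P<name>\S+)\s+(?P<cpu>\d+(?:\.\d+)?)%")


def hot_containers(stats_text, threshold=90.0):
    """Return (name, cpu_percent) for containers at or above the CPU threshold."""
    hot = []
    for m in ROW.finditer(stats_text):
        cpu = float(m.group("cpu"))
        if cpu >= threshold:
            hot.append((m.group("name"), cpu))
    return hot


# The two rows quoted in this post.
SAMPLE_STATS = """\
CONTAINER ID   NAME                   CPU %    MEM USAGE / LIMIT     MEM %   NET I/O   BLOCK I/O        PIDS
068e4e55f299   ceph-mon-controllera   99.91%   58.12MiB / 376.1GiB   0.02%   0B / 0B   122MB / 85.3MB   28
87730f89420d   ceph-mgr-controllera   0.32%    408.2MiB / 376.1GiB   0.11%   0B / 0B   181MB / 0B       35
"""

if __name__ == "__main__":
    for name, cpu in hot_containers(SAMPLE_STATS):
        print(f"{name}: {cpu}% CPU")
```

Note that in the listing above the monitor uses only 58.12MiB of 376.1GiB, so whatever the sketch flags here is CPU rather than memory.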
Could this be a resource problem, i.e. the monitor containers not having enough resources (CPU, RAM, etc.) to handle all the OSDs that are being started?
If so, how can I verify this?
Thanks in advance.