Error deploying Ceph Quincy using ceph-ansible 7 on Rocky 9

Hi,

I am trying to deploy Ceph Quincy using ceph-ansible on Rocky 9. I am running into problems and I don't know where to look for the cause.

PS: I did the same deployment on Rocky 8, using ceph-ansible for Pacific, on the same hardware, and it worked perfectly.

I have 3 controller nodes (mon, mgr, mds and rgw)
and 27 OSD nodes, each with 4 NVMe disks (one OSD per disk).
I am using a 10 Gb network with jumbo frames.
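Since jumbo frames are in play, it may be worth ruling out an MTU mismatch before digging into Ceph itself. A minimal sketch of the check (the peer IP below is taken from the monitor address in the log further down; substitute any of your mon/OSD addresses):

```shell
# Largest ICMP payload that fits in a 9000-byte MTU:
# 9000 - 20 (IP header) - 8 (ICMP header) = 8972 bytes.
MTU=9000
PAYLOAD=$((MTU - 28))
echo "$PAYLOAD"   # prints 8972

# With "don't fragment" set, the ping only succeeds if every hop
# (NICs and switches) honors the jumbo MTU:
# ping -c 3 -M do -s "$PAYLOAD" 20.1.0.27
```

If the ping fails while a default-size ping works, traffic between the nodes is silently dropping large frames, which can look exactly like mons timing out under load.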

The deployment starts without issues: the 3 monitors are created correctly, then the 3 managers, and after that the OSDs are prepared and formatted. Up to this point everything works fine. But when the "wait for all osd to be up" task is launched, which starts all OSD containers on all OSD nodes, things go south: the monitors fall out of quorum, ceph -s takes a long time to respond, not all OSDs get activated, and the deployment eventually fails.

cluster 2023-03-06T12:00:26.431947+0100 mon.controllera (mon.0) 3864 : cluster [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum controllera,controllerc 
cluster 2023-03-06T12:00:26.431953+0100 mon.controllera (mon.0) 3865 : cluster [WRN]     mon.controllerb (rank 1) addr [v2:20.1.0.27:3300/0,v1:20.1.0.27:6789/0] is down (out of quorum)
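One classic cause of mons dropping out of quorum like this is clock skew between the controllers. A rough cross-node check, assuming passwordless ssh between the controllers (the loop and that assumption are mine; hostnames are from the log above):

```shell
# Print each controller's UTC epoch time side by side so skew is
# easy to eyeball. BatchMode/ConnectTimeout keep ssh from hanging.
for h in controllera controllerb controllerc; do
  t=$(ssh -o BatchMode=yes -o ConnectTimeout=2 "$h" date -u +%s.%N 2>/dev/null) \
    || t=unreachable
  echo "$h $t"
done
```

If quorum comes back, `ceph time-sync-status` reports skew as Ceph itself sees it; mons start warning when it exceeds `mon_clock_drift_allowed` (0.05 s by default).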

The monitor container on 2 of my 3 controller nodes stays at 100% CPU utilization.

CONTAINER ID   NAME                   CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O        PIDS
068e4e55f299   ceph-mon-controllera   99.91%    58.12MiB / 376.1GiB   0.02%     0B / 0B   122MB / 85.3MB   28  <-----------------
87730f89420d   ceph-mgr-controllera   0.32%     408.2MiB / 376.1GiB   0.11%     0B / 0B   181MB / 0B       35
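To see what a pegged mon process is actually doing, one rough approach is to look at per-thread CPU usage inside the container's process tree. A sketch (the `busiest_threads` helper name is mine; the container name is from the stats output above):

```shell
# List the top CPU-consuming threads of a process by PID.
busiest_threads() {
  # ps -L shows one row per thread; pcpu is that thread's CPU%.
  ps -L -o tid,pcpu,comm -p "$1" | sort -k2 -rn | head -5
}

# For the monitor container, resolve its host-side PID first
# (requires docker on the host; podman inspect works the same way):
# MON_PID=$(docker inspect --format '{{.State.Pid}}' ceph-mon-controllera)
# busiest_threads "$MON_PID"
```

Seeing which thread (e.g. the messenger vs. the paxos/rocksdb side) is spinning would help distinguish "mon overwhelmed by 108 OSDs booting at once" from a network or disk problem.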

Could this be a resource problem, i.e. the monitor containers not having enough resources (CPU, RAM, etc.) to handle all the OSDs being started at once?
If so, how can I verify this?

Thanks in advance.

Regards.

I had a similar issue when I was testing Ceph Quincy on Rocky 9 VMs via cephadm. In my case it was a resource issue; once I moved the VMs to better machines, it worked fine. I still have some issues with PGs and EC, but that one is on me; I'm still figuring out the right calculation for that. Does ceph-ansible deploy VMs, or containers on VMs?