Performance or bottleneck analysis tools

Hi,

My system has four 40GbE ports. Four threads are pinned to four cores using pthread_setaffinity_np(). Each thread has a single UDP socket, uses its own independently allocated buffer, and sends data with sendto(). However, while a single thread achieves a send throughput of around 5 GB/s, running all four threads simultaneously reduces the throughput to around 4 GB/s per thread.
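For reference, here is a minimal sketch of the per-thread work described above, assuming Linux with glibc. The helper names (pin_to_core, elapsed_sec, send_rate) are made up for illustration and are not from my real code. To keep the sketch free of pthread linkage it pins with sched_setaffinity(0, ...), which on Linux affects the calling thread, rather than pthread_setaffinity_np(). It also reports the ratio of thread CPU time to wall time, which may hint at whether a sender is CPU-bound or waiting.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Pin the calling thread to one core. Returns 0 on success, -1 on error. */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Seconds elapsed between two timespec readings. */
double elapsed_sec(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Hypothetical helper: send `count` datagrams of `len` bytes to ip:port from a
 * privately allocated buffer. Returns wall-clock bytes per second, or -1.0 on
 * error. If cpu_frac is non-NULL it receives (thread CPU time / wall time). */
double send_rate(const char *ip, uint16_t port, size_t len, long count,
                 double *cpu_frac)
{
    assert(len > 0 && len <= 65507);  /* largest UDP payload over IPv4 */

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
        return -1.0;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1.0;
    char *buf = calloc(1, len);
    if (buf == NULL) {
        close(fd);
        return -1.0;
    }

    struct timespec w0, w1, c0, c1;
    clock_gettime(CLOCK_MONOTONIC, &w0);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c0);
    double sent = 0.0;
    for (long i = 0; i < count; i++) {
        ssize_t n = sendto(fd, buf, len, 0,
                           (struct sockaddr *)&dst, sizeof(dst));
        if (n < 0) {
            sent = -1.0;
            break;
        }
        sent += (double)n;
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c1);
    clock_gettime(CLOCK_MONOTONIC, &w1);

    free(buf);
    close(fd);
    if (sent < 0.0)
        return -1.0;

    double wall = elapsed_sec(w0, w1);
    if (cpu_frac != NULL)
        *cpu_frac = elapsed_sec(c0, c1) / wall;
    return sent / wall;
}
```

Each thread would call pin_to_core() once and then send_rate() with its own destination. Comparing the returned rate and CPU fraction for one thread against four threads is one rough way to see whether the senders are saturating their cores or stalling elsewhere.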

I’d like to identify the cause of this performance degradation.
Are there any tools or methods for this?

Thank you.

In my view, this may be related to the type of cores and to the concurrent or parallel programming model used. Single-threaded (ST) versus multi-threaded (MT) processing, the available infrastructure, and its limits all play a part, and the processing architecture may behave differently from the default. I use gunicorn, and this is its guidance:

Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second. Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with.
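As a worked instance of that rule of thumb, here is a small sketch in C (to match the question rather than gunicorn itself). The function name suggested_workers is made up for illustration, and sysconf(_SC_NPROCESSORS_ONLN) is only one assumed way of obtaining the core count.

```c
#include <stdio.h>
#include <unistd.h>

/* Starting worker count from the quoted rule: (2 x num_cores) + 1. */
long suggested_workers(long num_cores)
{
    return 2 * num_cores + 1;
}

/* Print the suggestion for the cores currently online on this machine. */
void print_suggestion(void)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores > 0)
        printf("%ld cores -> %ld workers\n", cores, suggested_workers(cores));
}
```

For the four cores in the question, suggested_workers(4) gives 9.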

But the number of workers is not the only factor; the processing load on each core also matters.