How to set the total number of CPU cores for a job directly, regardless of the number of hosts

I know the general resource request syntax. For example, three resource chunks are defined in the script, with 32 cores per chunk/node, e.g.:
#PBS -l select=3:ncpus=32

But a normal user may want to specify just the total number of CPU cores, without caring how many chunks or nodes will be used. How can I do that?
For example, how do I directly request a total of 96 CPU cores, regardless of how many resource chunks or vnodes are needed?
#PBS -l ncpus=96 is converted to
#PBS -l select=1:ncpus=96
#PBS -l place=pack
and since one host has only 32 cores, the job fails with “Insufficient amount of resource: ncpus”.

The following is the failed job’s info:
Job Id: 41.poc-self-master
Job_Name = pbs-test-job
Job_Owner = hpcuser@poc-self-login
job_state = Q
queue = workq
server = poc-self-master
Checkpoint = u
ctime = Wed Dec 22 17:21:41 2021
Error_Path = poc-self-login:/home/nfs/pbs-test-job.e41
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Dec 22 17:21:41 2021
Output_Path = poc-self-login:/home/nfs/pbs-test-job.o41
Priority = 0
qtime = Wed Dec 22 17:21:41 2021
Rerunable = True
Resource_List.ncpus = 96
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.select = 1:ncpus=96
substate = 10
Variable_List = PBS_O_HOME=/home/hpcuser,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=hpcuser,
PBS_O_PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/
bin:/home/hpcuser/.local/bin:/home/hpcuser/bin,
PBS_O_MAIL=/var/spool/mail/hpcuser,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/home/nfs,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=poc-self-login
comment = Not Running: Insufficient amount of resource: ncpus
etime = Wed Dec 22 17:21:41 2021
Submit_arguments = pbs_mpi_test_script
project = _pbs_project_default

You could try -l select=96:ncpus=1 -l place=free. This would find CPUs for the job wherever they were available. It might give poor performance if the CPUs of the job need to talk to each other a lot (e.g., with MPI).
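A minimal submission sketch of that request (the application name is a placeholder, and the mpirun line assumes an Open MPI-style launcher; how ranks are started depends on your MPI stack):

#!/bin/bash
#PBS -N pbs-test-job
#PBS -l select=96:ncpus=1
#PBS -l place=free
cd "$PBS_O_WORKDIR"
# 96 one-CPU chunks; the scheduler places them on whatever hosts have free CPUs
mpirun -np 96 --hostfile "$PBS_NODEFILE" ./my_mpi_app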

Got it, thank you for your reply. Sorry, I have a new question.
There are 2 hosts, poc-self-master and poc-self-com001, each with 32 cores and 64 HT threads.

case1:
#PBS -l select=2:ncpus=32:mpiprocs=32
#PBS -l place=free
Because of place=free, all 64 CPUs of the 2 chunks come from poc-self-master (HT, 64 threads). Output:
resources_used.cpupercent = 4549
resources_used.cput = 00:01:31
resources_used.mem = 0kb
resources_used.ncpus = 64
resources_used.vmem = 0kb
resources_used.walltime = 00:00:02
job_state = F
queue = workq
server = poc-self-master
Checkpoint = u
ctime = Fri Dec 24 11:13:51 2021
Error_Path = poc-self-login:/home/nfs/pbs-test-job.e86
exec_host = poc-self-master/0*32+poc-self-master/1*32
exec_vnode = (poc-self-master:ncpus=32)+(poc-self-master:ncpus=32)

case2:
#PBS -l select=2:ncpus=32:mpiprocs=32
#PBS -l place=scatter
Because of place=scatter, one chunk comes from poc-self-master and the other comes from poc-self-com001. Output:
resources_used.cpupercent = 100
resources_used.cput = 00:00:01
resources_used.mem = 428kb
resources_used.ncpus = 64
resources_used.vmem = 12908kb
resources_used.walltime = 00:00:00
job_state = F
queue = workq
server = poc-self-master
Checkpoint = u
ctime = Fri Dec 24 11:19:12 2021
Error_Path = poc-self-login:/home/nfs/pbs-test-job.e87
exec_host = poc-self-master/0*32+poc-self-com001/0*32
exec_vnode = (poc-self-master:ncpus=32)+(poc-self-com001:ncpus=32)

Question:
case1 (uses 64 HT threads on master):
resources_used.cpupercent = 4549
resources_used.cput = 00:01:31

case2 (32 cores on master, 32 cores on com001):
resources_used.cpupercent = 100
resources_used.cput = 00:00:01

I know something about hyper-threading (HT), but I am still confused.
Why are cpupercent and cput so different between the two cases?
I am even starting to wonder whether I have misunderstood cput and cpupercent.

Your test case ran too quickly. Try a test that runs for several minutes and see if you get more reasonable results. (Also, are you sure the two-node case ran correctly? The walltime is so short it looks like it might have failed on startup.)

It is interesting that cramming all the processes onto one node via hyperthreading was so slow. Where I used to work, we seldom used hyperthreading, and this is a good example of why.

Thank you for your reply. I think the two cases ran correctly; they ran a hello-world example like the one below, which only calls MPI_Init and MPI_Finalize, and that is why each case finished so quickly:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    int rank, size, len, name_len;
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(processor_name, &name_len);
    MPI_Get_library_version(version, &len);

    /* Each rank prints its rank, the total number of ranks, and the host it runs on. */
    printf("Hello, world, I am %d of %d on %s, (%s, %d)\n",
           rank, size, processor_name, version, len);

    MPI_Finalize();

    return 0;
}
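(For completeness, a typical compile-and-launch sketch for this test; the compiler wrapper and mpirun options are Open MPI-style assumptions and vary by MPI implementation:)

mpicc -O2 mpi_hello.c -o mpi_hello
# in the PBS script, one rank per mpiprocs slot listed in the node file:
mpirun -np 64 --hostfile "$PBS_NODEFILE" ./mpi_hello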

I think HT should be disabled outright; or, leaving it enabled should also be OK as long as I run only 32 MPI processes on a host that has 32 cores (64 HT threads), because the system will automatically use just the 32 physical cores rather than the HT threads.
Am I right? Thank you!
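(For reference, a quick way to check the core/thread layout on a host:)

lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
# "Thread(s) per core: 2" means HT is on; "CPU(s)" counts HT threads, not physical cores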

Usually HT is disabled in the BIOS on most HPC clusters, unless some applications gain a speedup from enabling it. Mostly this is application specific.

You are mostly correct. Modern Linux kernels do a reasonable job of spreading CPU load across HT resources. So, it’s okay to leave HT enabled, provided you don’t have more active threads than the number of real cores. I believe this will be only slightly slower than disabling HT in the BIOS, depending on the processor hardware version.

[My memory is not positive on the next parts.]
This is what we did at the site where I worked. We configured the MoMs so that they reported ncpus as the number of cores, so jobs would not use hyperthreading by accident. If a job wanted hyperthreading, it requested ncpus of half what it really wanted and then just used all the threads. (Nodes were exclusive.) When we wanted to run benchmarks for the Top 500 ranking, we turned the secondary threads off after boot by writing 0 to /sys/devices/system/cpu/cpuX/online for each HT thread. This was almost as good as disabling HT in the BIOS.
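A rough sketch of that offlining step, assuming the usual Linux sysfs topology layout (run as root; thread_siblings_list names each core's sibling threads):

for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    # the first ID listed in thread_siblings_list is the core's primary thread
    primary=$(sed 's/[-,].*//' "$cpu/topology/thread_siblings_list")
    id=${cpu##*cpu}
    # offline every thread that is not its core's primary thread
    [ "$id" != "$primary" ] && echo 0 > "$cpu/online"
done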


Thank you for your reply. I will disable HT in the BIOS at this stage. I think StarCCM+ should just use the real cores. Thank you!

Thank you for your reply.
1. The first part of your reply is very useful to me; it confirmed my view that the OS kernel uses the (real) cores sensibly as long as I don't have more active threads/processes than real cores.
2. I will disable HT in the BIOS directly. I'm not sure whether pbs_mom reports real cores by default; I will check that soon.
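(A quick check, using the node names from this thread; the qmgr override is a sketch of the usual way to pin ncpus to physical cores if MoM reports HT threads:)

pbsnodes poc-self-com001 | grep resources_available.ncpus
# if this shows 64 (HT threads) rather than 32 physical cores, it can be overridden:
qmgr -c "set node poc-self-com001 resources_available.ncpus = 32"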

Just to make things more complicated:

Before you commit to turning HT off in the BIOS, you might run some benchmarks first. That is, there are many processes running on a typical Linux install other than your application. Some of these will need the CPU from time to time (e.g., kernel threads, device drivers, including the one for your interconnect, the PBS MoM herself, timers, …). If you run with HT off and run as many application threads per host as you have cores, then every time one of these other threads needs the CPU, the application thread on that CPU gets completely suspended. On the other hand, with HT enabled, the application thread just runs a little slower for a time while the system process uses the other thread for that core. Which way is better is very application dependent.

So, to tune your system for StarCCM+, you could come up with a representative run that takes about 10 minutes (adjust the number of iteration steps to get that time). Then run that job three or four times under each of the following conditions:

  • HT enabled in the BIOS, ncpus/mpiprocs = # of cores.

  • HT disabled in the BIOS, ncpus/mpiprocs = # of cores.

  • HT disabled in the BIOS, ncpus/mpiprocs = (# of cores - 1).

The last one sets aside a core for systemy threads.

I’m assuming you run the nodes in exclusive mode, so the benchmark of interest is job walltime.
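For example, a sketch of the submission script for the third condition on the two 32-core hosts discussed above (the solver launch is left as a comment because StarCCM+'s exact command-line options depend on your version and MPI):

#!/bin/bash
#PBS -N starccm-bench
#PBS -l select=2:ncpus=31:mpiprocs=31
#PBS -l place=scatter:excl
#PBS -l walltime=00:30:00
cd "$PBS_O_WORKDIR"
NP=$(wc -l < "$PBS_NODEFILE")    # 2 nodes x 31 ranks = 62, leaving one core per node free
# launch StarCCM+ here with $NP ranks, using $PBS_NODEFILE as the machine file
# then compare resources_used.walltime (qstat -x -f <jobid>) across the three conditions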

Thank you very much! I learned a lot from you.
In my case, there is only one job at a time, which I think is effectively exclusive mode.
Does “# of cores” in the 3 cases above always mean the number of real cores (32), not HT threads (64)? Is that right?

Correct. I usually distinguish between a core and the (usually) two execution threads that can run simultaneously on the core.

Note that in the third test case, you might need to tell StarCCM+ that there are fewer ranks than in the other cases. I don’t know if it figures that out on its own.

Thank you for your reply.
I think StarCCM+ can't figure out the ranks on its own; what I need to do is set ncpus and mpiprocs in the PBS script.
I will share the test results with you when the tests are done. Thank you!