Adding options to control scheduler multi-threading

Background/motivation: We are adding multi-threading to the scheduler. To that end, I received feedback to add a manual admin lever to control the number of threads that the scheduler will launch.

EDD:
https://pbspro.atlassian.net/wiki/spaces/PD/pages/1344208899/Adding+options+to+control+scheduler+multi-threading

Please provide feedback. Thanks.

Will the scheduler always run the number of threads specified, or is this a maximum limit? If it is the latter, I would suggest naming the parameter sched_max_threads.

The scheduler will always run the number specified; it's an admin control, so the assumption is that admins know their system best!

The current name works for me. Thanks for the explanation.


Could you please comment on the following:
0. What are the reasons for multi-threading the scheduler?

  1. Would having more cores increase the multi-threaded scheduler's performance?
  2. Would having more GHz per core increase the multi-threaded scheduler's performance?
  3. Both 1 and 2?
  4. More and faster RAM and disk:
    - are they required to cope with the multi-threaded scheduler's speed?
    - do they provide any advantage to the multi-threaded scheduler?

We found multiple performance bottlenecks in the scheduler that were good candidates for parallelization, so this is being done to improve performance.

Yes and no: it only makes sense to parallelize when there is enough work to be partitioned. If there isn't (e.g., the site doesn't have many nodes or jobs), then the scheduler will only scale out as much as is helpful, which means it might use just one worker thread while the others remain asleep. So throwing cores at it won't necessarily improve performance. This is how it is designed right now; the code is still under review, so it might change. If it does, I'll update my answer here.
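To illustrate that design (the real code is still under review, and every name below is made up for this sketch), idle workers would block on a condition variable and wake only when work is queued, roughly like this:

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical unit of scheduling work; real fields omitted */
    struct task {
        struct task *next;
    };

    static void run_task(struct task *t) { (void)t; /* placeholder */ }

    /* Queue shared by all worker threads */
    struct work_queue {
        pthread_mutex_t lock;
        pthread_cond_t  has_work;   /* signaled when a task is queued */
        struct task    *head;       /* NULL means nothing to do */
        int             shutdown;
    };

    static void *worker_main(void *arg)
    {
        struct work_queue *q = arg;

        for (;;) {
            pthread_mutex_lock(&q->lock);
            /* An idle worker sleeps here and costs essentially no CPU */
            while (q->head == NULL && !q->shutdown)
                pthread_cond_wait(&q->has_work, &q->lock);
            if (q->shutdown) {
                pthread_mutex_unlock(&q->lock);
                return NULL;
            }
            struct task *t = q->head;
            q->head = t->next;
            pthread_mutex_unlock(&q->lock);
            run_task(t);
        }
    }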

Always

Faster RAM and disk will help a multi-threaded scheduler just as they would help our current single-threaded scheduler. More RAM isn't required to cope with the multi-threaded scheduler's speed: it's not multi-processing, where RAM utilization can grow after copy-on-write(s). The threads share common memory, except for some minor thread-local storage, so it shouldn't bloat RAM usage.
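As a toy demonstration of that point (this is generic pthreads code, not PBS code): all threads see the same globals, and only data explicitly marked thread-local gets a per-thread copy:

    #include <pthread.h>
    #include <stdio.h>

    static int shared_counter;          /* one copy, visible to every thread */
    static __thread int tls_counter;    /* one copy per thread (TLS) */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *touch(void *arg)
    {
        (void)arg;
        tls_counter++;                  /* private to this thread */
        pthread_mutex_lock(&lock);
        shared_counter++;               /* shared, so it needs the lock */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, touch, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        /* Prints 4: memory is shared, not duplicated per thread */
        printf("shared=%d\n", shared_counter);
        return 0;
    }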


Thank you @agrawalravi90. Much appreciated.

  • If core pinning were supported with multi-threading, it would be useful, so that cores could be dedicated to the scheduler on a system hosting multiple tenant services.

Thanks for the suggestion @adarsh. I might be wrong, but I think core pinning might negatively affect the CPU utilization of the machine. As I mentioned before, the scheduler's threads will sleep until work is available for them, and there might not be enough work to engage all of them. So, if we pin cores, those cores may sit idle if the OS doesn't schedule other programs on them. The scheduler also runs in cycles, so those cores might also sit idle between sched cycles. I think we should let the OS do its job.

The attribute proposed in this EDD should help admins restrict how many threads the scheduler will create, thereby controlling the scheduler's footprint on a host system that might be running other services.

Please let me know what you think.


Love the idea of multithreading the scheduler (with the usual concern that this also makes debugging and code maintenance way more complex) - cool!

Regarding a new qmgr Sched object attribute "sched_threads": this got me thinking. Although there is power in making everything configurable via qmgr, qmgr settings imply a lot of additional capabilities (and effort and backward compatibility) that may not be warranted, e.g., the qmgr interface, passing the data between server and scheduler, the ability to change the value on the fly without restarting anything, and an API to change the value. What about picking a "great default" and having a way to override it that is possibly harder, but possible, e.g., a command line arg or env variable? (If there is some standard way to express this type of setting in Linux, that would be ideal.) I'm not saying qmgr is bad, just that it seems like overkill in this case...

Thx!

Thanks for the feedback Bill. I kind of jumped the gun and implemented the code for a qmgr attribute since this came about from the code review of multi-threading; totally my bad. If you think that a qmgr-controlled attribute is going to cause maintenance issues going forward, then I can revert it and implement it as a pbs_sched option/pbs.conf variable instead. Please let me know if you'd like me to change it. Thanks again!

Thank you, I am completely in line with your thought process and understanding. I also feel that if the scheduler is multi-threaded, it should automatically (via AI / deep learning) scale the number of threads up and down depending on the number of cores, the number of nodes, the number of jobs, and how busy each core is (so as not to slow down other services), in addition to a manual setting for the number of threads in a configuration file.

I would also recommend an option to enable or disable core pinning (as an opt-in setting). This might help a multi-scheduler setup, and would be a step toward having core pinning capabilities in PBS Pro that do not depend on MPI-based core pinning.

I am not sure how this would play out if a server host with N cores is completely dedicated to PBS Server/Scheduler/Comm: would performance still differ between the existing single-threaded scheduler and the multi-threaded one? Also, will the scheduler continue to be stateless as before?

Thanks again for your explanation; it is indeed a big step forward. :+1:

I'm not sure core pinning would buy us much. In a latency-sensitive distributed application it is important for the processes to avoid jitter of any kind, which includes preventing the kernel from moving processes between cores. In this case, the scheduler is multi-threaded but not distributed, so jitter becomes less of an issue. Depending on what else is running on the node, it's likely best to let the kernel manage where processes are assigned based on the configured process scheduling policy.

I'm not saying there wouldn't be potential benefits, but I'm not sure the benefits are worth the effort. It might be best to explore this further down the road if need be.
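For anyone curious what such a knob would control under the hood, pinning a thread to a core on Linux would presumably involve something like pthread_setaffinity_np (illustrative only; nothing like this is in the EDD):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a single core (Linux-specific).
     * Returns 0 on success, or an errno value on failure. */
    static int pin_self_to_core(int core)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        if (pin_self_to_core(0) != 0)
            return 1;
        /* From here on, the kernel will only run this thread on core 0 */
        return 0;
    }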


Thank you @mkaro and @agrawalravi90. I now understand and have the answers/reasoning for multi-threading and core pinning.


Hi all,

I've taken @billnitzberg's advice and changed the design to be a pbs_sched option instead of a qmgr attribute, as that will make it easier to maintain; plus, it's not really a behavioral configuration of the scheduler, so I don't think it makes sense as a qmgr setting. But please let me know what you think.

I've also made it a 'max' setting to provide flexibility for changing the threading logic in the future.
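For example, an admin could start the scheduler with something like the following (the value 8 is just for illustration; see the EDD for the exact syntax and default):

    pbs_sched -t 8    # allow the scheduler to create at most 8 threads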

Please provide feedback. Thanks!

My only comment is that we should use a more reasonable range for the -t parameter. Personally, I wouldn't want any service to start more than 1024 threads on a node, but that also depends on the node. Could we devise a function based on core count and memory to define what the maximum should be on a given system? For example...

min((core_count * 10), (free_mem / 2^23), 1024)
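Reading free_mem as available bytes, the free_mem / 2^23 term allows roughly one thread per 8 MiB, which is about the default pthread stack size on Linux. A quick sketch of how that heuristic could be computed (the function name is made up):

    #include <unistd.h>

    /* Hypothetical cap: min(core_count * 10, free_mem / 2^23, 1024) */
    static long max_sched_threads(void)
    {
        long cores = sysconf(_SC_NPROCESSORS_ONLN);
        long free_bytes = sysconf(_SC_AVPHYS_PAGES) * sysconf(_SC_PAGESIZE);
        long cap = cores * 10;
        long mem_cap = free_bytes >> 23;    /* one thread per 8 MiB free */

        if (mem_cap < cap)
            cap = mem_cap;
        if (cap > 1024)
            cap = 1024;
        return (cap < 1) ? 1 : cap;
    }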

PBS is registered as a systemd service under /sys/fs/cgroup/memory/system.slice/pbs.service and I confirmed the tasks file contains the PID of the scheduler. Do we need to worry about systemd if we start a large number of threads?

Thanks for your input, Mike. By default, right now, we will create as many threads as there are cores, or fewer (depending on what people agree on in the code review phase). So this is just to give admins a manual way to configure that limit; if they choose to shoot themselves in the foot, that's really on them. Having said that, I'm open to suggestions on what this upper limit should be, but I don't want us to write complicated logic for determining it; it should be pretty obvious to an admin what a reasonable number of cores to give to a service is. I chose 99,999 because there are already many-core chips like https://en.wikichip.org/wiki/pezy/pezy-scx/pezy-sc2, which has 2048 cores.

Here's what I want to avoid... a novice admin sets a large thread count on a system with few resources, which ends up hanging the system and forcing a hard reboot. When PBS comes up after the reboot, it hangs the system again. Eventually they figure out what the problem is, PBS Pro gets the blame, and it gets replaced with a competing product. I'll leave it to you to figure out the answer, but I thought I would share my concern.

Thanks for sharing your concern, Mike! I think what we can do to avoid such a situation is to cap the number of threads at the core count. In certain situations having more threads than cores can help, but I think we can cap it at the core count for now and revisit this if we change the parallelization logic in the scheduler.
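In code, that cap would amount to something as simple as this (a sketch; the name is hypothetical):

    #include <unistd.h>

    /* Clamp the admin-requested thread count to the number of online cores */
    static long effective_threads(long requested)
    {
        long cores = sysconf(_SC_NPROCESSORS_ONLN);
        return (requested > cores) ? cores : requested;
    }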


OK, I modified the EDD to reflect this; please let me know if you are OK with it.

Seems reasonable, nice and simple. Thanks @agrawalravi90.