Background/motivation: We are adding multi-threading to the scheduler. To that end, I got feedback to add a manual admin lever to control the number of threads that the scheduler will launch.
Please provide feedback. Thanks.
Will the scheduler always run the number of threads specified, or is this a maximum limit? If it is the latter, I would suggest naming the parameter sched_max_threads.
The scheduler will always run the number specified; it's an admin control, so the assumption is that admins know their system best!
The current name works for me. Thanks for the explanation.
Could you please comment:
0. What are the reasons for multi-threading the scheduler?
We found multiple performance bottlenecks in the scheduler which were good candidates for parallelization, so this is being done for performance improvement.
Yes and no; it only makes sense to parallelize when there is enough work to be partitioned. If there isn't (e.g., the site doesn't have a lot of nodes or many jobs), then the scheduler will only scale out as much as is helpful, which means it might use just 1 worker thread if there isn't enough work, while the others remain asleep. So, throwing cores at it won't necessarily improve its performance. This is how it is designed right now; the code is still under review, so it might change. If it does, I'll update my answer here.
Always
Faster RAM and disk will help a multi-threaded scheduler just like they would help our current single-threaded scheduler. More RAM isn't required to keep up with the multi-threaded scheduler; it's not multi-processing, where RAM utilization can grow after copy-on-write(s). The threads share common memory, except for minor thread-local storage, so it shouldn't bloat RAM usage.
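Not from the PBS Pro source, just a minimal sketch of the behavior described above: worker threads only pick up work when it exists, so with few tasks most threads do nothing, and extra cores don't help. The task payload (doubling a number) is a stand-in for real scheduling work.

```python
import queue
import threading

def run_cycle(tasks, max_threads):
    """Run one 'scheduling cycle' with up to max_threads workers.
    Workers that find the queue empty simply stop, so only as many
    threads as there is work ever do anything useful."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            with lock:
                results.append(task * 2)  # stand-in for real scheduling work

    for t in tasks:
        work.put(t)
    threads = [threading.Thread(target=worker) for _ in range(max_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

Running `run_cycle([1, 2, 3], 8)` starts 8 threads but only 3 tasks exist, so most threads exit immediately, which is the point being made about idle workers.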
Thank you @agrawalravi90. Much appreciated.
Thanks for the suggestion @adarsh. I might be wrong, but I think core pinning might negatively affect CPU utilization of the machine. As I mentioned before, the scheduler's multiple threads will sleep until work is available for them, and there might not be enough work to engage all of them. So, if we pin cores, we might have those cores sitting idle if the OS doesn't schedule other programs on them. The scheduler also runs in cycles, so those cores might also sit idle in between sched cycles. I think we should let the OS do its job.
The attribute being proposed in this EDD should help admins restrict how many threads the scheduler will create, thereby controlling the scheduler's footprint on the host system, which might have other services running.
Please let me know what you think.
Love the idea of multithreading the scheduler (with the usual concern that this also makes debugging and code maintenance way more complex). Cool!
Regarding a new qmgr Sched object attribute 'sched_threads': this got me thinking. Although there is power in making everything configurable via qmgr, qmgr settings imply a lot of additional capabilities (and effort and backward compatibility) that may not be warranted, e.g., the qmgr interface, passing the data between server & scheduler, the ability to change the value on the fly without restarting anything, an API to change the value. What about picking a 'great default' and having a way to override it that is possibly harder, but possible, e.g., a command line arg or env variable? (If there is some standard way to express this type of setting in Linux, that would be ideal.) I'm not saying qmgr is bad, just that it seems like overkill in this case...
Thx!
Thanks for the feedback Bill. I kind of jumped the gun and implemented the code for a qmgr attribute since this came about from the code review of multi-threading, totally my bad. If you think that doing it as a qmgr controlled attribute is going to cause maintenance issues going forward then I can revert it and implement it as a pbs_sched option/pbs.conf variable instead. Please let me know if youād like me to change it. Thanks again!
Thank you, I am completely in line with your thought process and understanding. I also feel that if the scheduler is multi-threaded, it should automatically (via AI / deep learning) scale the number of threads up and down depending on the number of cores, nodes, and jobs, and how busy each core is (so as not to slow down other services), plus allow manual setup of the number of threads in a configuration file.
I would also recommend giving an option to enable or disable core pinning. This might help multi-scheduler setups, and would be a step forward toward having some core pinning capabilities in PBS Pro that are not dependent on MPI-based core pinning.
I am not sure how this would affect the case where "the server host with N cores is completely dedicated to PBS Server/Scheduler/Comm": would performance still differ between the existing single-threaded scheduler and the multi-threaded scheduler? Also, will this continue to be stateless as before?
Thanks again for your explanation, it is indeed a big step forward.
I'm not sure core pinning would buy us much. In a latency-sensitive distributed application it is important for the processes to avoid jitter of any kind. This includes preventing the kernel from moving processes between cores. In this case, the scheduler is multi-threaded but not distributed, so jitter becomes less of an issue. Depending on what else is running on the node, it's likely best to let the kernel manage where processes are assigned based on the configured process scheduling policy.
I'm not saying there wouldn't be potential benefits, but I'm not sure the benefits are worth the effort. It might be best to explore this further down the road if need be.
Thank you @mkaro and @agrawalravi90. I understand now and have the answers/reasoning for multi-threading and core pinning.
Hi all,
I've taken @billnitzberg's advice and changed the design to be a pbs_sched option instead of a qmgr attribute, as it'll make it easier to maintain; plus, it's not really a behavioral configuration of the scheduler, so I think it doesn't make sense to make it a qmgr setting. But please let me know what you think.
I've also made it a 'max' setting to provide flexibility in changing the threading logic in the future.
Please provide feedback. Thanks!
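To make the proposal concrete, here is a hypothetical sketch of how a pbs_sched command-line option for the thread maximum might be parsed and validated. The `-t` flag name and the 1 to 99,999 range mirror values discussed in this thread; they are assumptions for illustration, not the actual PBS Pro interface.

```python
import argparse

def parse_sched_args(argv):
    """Hypothetical pbs_sched-style option parsing for a max-threads
    setting. Returns the requested maximum, or None if the option was
    not given (meaning: use the scheduler's built-in default)."""
    parser = argparse.ArgumentParser(prog="pbs_sched")
    # '-t' and the bounds below are assumptions taken from this thread
    parser.add_argument("-t", "--max-threads", type=int, default=None)
    args = parser.parse_args(argv)
    if args.max_threads is not None and not (1 <= args.max_threads <= 99999):
        raise ValueError("max threads must be between 1 and 99999")
    return args.max_threads
```

The idea is that omitting the option leaves thread selection entirely to the scheduler, while a bad value fails fast at startup instead of misbehaving later.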
My only comment is that we should use a more reasonable range for the -t parameter. Personally, I wouldn't want any service to start more than 1024 threads on a node, but that also depends on the node. Could we devise a function based on core count and memory to define what the maximum should be on a given system? For example...
min((core_count * 10), (free_mem / 2^23), 1024)
PBS is registered as a systemd service under /sys/fs/cgroup/memory/system.slice/pbs.service and I confirmed the tasks file contains the PID of the scheduler. Do we need to worry about systemd if we start a large number of threads?
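The suggested heuristic above can be sketched as follows. Note this is just the formula from this post made executable: at most 10 threads per core, one thread per 8 MiB (2^23 bytes) of free memory, and a hard ceiling of 1024. The parameter defaults are assumptions (the standard library has no portable "free memory" call), so both values can be passed explicitly.

```python
import os

def suggested_thread_cap(core_count=None, free_mem_bytes=None):
    """min(core_count * 10, free_mem / 2^23, 1024), per the heuristic
    proposed above. Defaults are illustrative assumptions only."""
    if core_count is None:
        core_count = os.cpu_count() or 1
    if free_mem_bytes is None:
        free_mem_bytes = 2**33  # assumed 8 GiB free; no portable stdlib query
    return int(min(core_count * 10, free_mem_bytes // 2**23, 1024))
```

For example, an 8-core box with 8 GiB free caps at 80 threads (core-bound), while a 200-core box with plenty of memory caps at the 1024 ceiling.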
Thanks for your inputs Mike. By default, right now, we will create as many threads as the number of cores, or fewer (depending on what people agree with in the code review phase). So, this is just to give admins a manual way to configure that limit. If they choose to shoot themselves in the foot, well, it's really their fault. Having said that, I'm open to suggestions on what this upper limit should be, but I don't want us to write complicated logic for determining it; it should be pretty obvious to an admin what's a reasonable number of cores to give to a service. I chose 99,999 because there are already many-core chips like https://en.wikichip.org/wiki/pezy/pezy-scx/pezy-sc2, which has 2048 cores.
Here's what I want to avoid... a novice admin sets a large core count on a system with few resources, which ends up hanging the system and forcing a hard reboot. When PBS comes up after the reboot, it hangs the system again. Eventually they figure out what the problem is, and PBS Pro gets the blame and gets replaced with a competing product. I'll leave it to you to figure out the answer, but I thought I would share my concern.
Thanks for sharing your concern Mike! I think what we can do to avoid such a situation is to cap the number of threads at the core count. In certain situations having more threads than the core count can help, but I think we can cap it at core count for now and revisit this if we change the parallelization logic in the scheduler.
OK, I modified the EDD to reflect this; please let me know if you are OK with it.