Decreasing resources of a running job

fnevgeny · September 17, 2024, 10:51am

Hello,

I wonder why it isn’t possible with qalter to decrease the resources of a running job.

I believe it would be very useful in certain scenarios. Say, my typical jobs are multithreaded, working in parallel on similar data chunks. However, due to the statistical spread of the data, each thread takes a different amount of time to finish. So quite often, while the majority of the threads finish more or less synchronously, a few (out of tens) continue massaging the bits for quite some time. The result is a mostly idle node. The same goes for the allocated memory. So, ideally, the job itself should be able to signal to PBS that some resources are no longer needed and can be used for other jobs. Or, at least, it could be done by the user or an outside monitoring tool.
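To put a rough number on the idle time described above, here is a minimal sketch. It assumes each thread holds one core until the slowest thread finishes; the function name and the thread timings are hypothetical, not measurements from a real job.

```python
def idle_fraction(finish_times):
    """Fraction of allocated core-time left idle.

    Assumption: one core per thread, and all cores stay allocated
    until the slowest thread has finished.
    """
    wall = max(finish_times)              # job ends with the slowest thread
    allocated = wall * len(finish_times)  # core-hours held by the job
    busy = sum(finish_times)              # core-hours actually used
    return (allocated - busy) / allocated


# Hypothetical spread: 30 threads finish after 1 hour, 2 stragglers need 4 hours.
times = [1.0] * 30 + [4.0] * 2
print(f"{idle_fraction(times):.0%}")  # prints 70%
```

With this made-up spread, about 70% of the core-time held by the job is idle, which is the capacity a shrink request would hand back to the scheduler.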

Best,
Evgeny

adarsh · September 17, 2024, 4:32pm

The system would become far too dynamic, and the scheduler (or multiple schedulers) would be busy working out, from cycle to cycle, where jobs could run.

There are some tools and configuration options available, such as:

pbs_release_nodes_on_stageout, which releases the sister nodes of a multi-node (MPI) job.
The runone feature, which is analogous to OR functionality with respect to qsub requests.
Also, most application users do a pre-run on their input decks (profiling them) to estimate the resources required to run their job; this helps them make a better-optimized request and better use of resources.

+1. Conversely, it would be useful if the user could submit a job without asking for any resources, and PBS would figure out how similar jobs ran in the past (which succeeded, which failed), assign the resource request based on what it learnt, and then schedule the job.

speleolinux · September 18, 2024, 1:21am

That sounds like you’re angling for adding something like AI into PBS. Hopefully that won’t occur.

adarsh · September 18, 2024, 6:46am

Some external tools might do it: GenAI

This is a common question from some application specialists: why do we have to request resources? It has to figure that out and submit the jobs.

speleolinux · September 18, 2024, 11:10pm

Thanks for the link, Adarsh. From a brief read, it looks like a very thorough piece of work. The conclusion says “This specific prediction task still remains challenging, as only partially successful results have been achieved on the collected data.” Hence it’s not likely to be seen soon. I still think the researcher is the one best placed to understand their job characteristics and to split their jobs up appropriately, along with their resources. They just need to be motivated to do that, with help from good post-job analytics.

adarsh · September 19, 2024, 9:03am

+1 @speleolinux – second that

Source · September 23, 2024, 10:37am

However, I would like to point out that you COULD do a qalter (shrink only) on some resources while a job is running, e.g. walltime.
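As a rough illustration of the shrink-only qalter on walltime, the sketch below builds the command line in Python. The job ID and the new walltime are hypothetical placeholders, and whether the server accepts the change depends on site policy and the user's privileges.

```python
import shlex
from typing import List


def build_qalter_walltime_cmd(job_id: str, walltime: str) -> List[str]:
    """Return the argv for asking PBS to change a job's walltime.

    Uses the standard `qalter -l walltime=HH:MM:SS <job_id>` form.
    For a running job the new value is expected to be lower than the
    current one (shrink only).
    """
    return ["qalter", "-l", f"walltime={walltime}", job_id]


# Hypothetical job ID and target walltime, for illustration only.
cmd = build_qalter_walltime_cmd("1234.pbsserver", "01:30:00")
print(shlex.join(cmd))

# To actually submit the change on a host with PBS installed:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to log or review the request before it reaches the server.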
