New custom resource permission flag "m"

Hello,

I would like to introduce a new permission flag “m” for custom resources, which will allow the admin to decide whether a resource is accessible to MoM hooks.
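For illustration only, a hedged sketch of how an admin might attach the proposed flag when defining a custom resource (the resource name “foo” is made up, and the exact accepted flag combinations are defined in the EDD; the rest is standard qmgr syntax):

    # create a custom string resource and mark it as visible to MoM hooks
    qmgr -c "create resource foo type=string,flag=m"
    # or add "m" alongside an existing flag such as host-level "h"
    qmgr -c "set resource foo flag=hm"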

The EDD can be seen here: https://pbspro.atlassian.net/wiki/spaces/PD/pages/1034158087/Ability+to+include+requested+custom+resources+in+mom+hook+input+file.

Please take a look and share your thoughts.

Thanks,
Ashwath

Looks good @ashwathraop, thank you!

Is there a reason why an access restriction is needed? Why not send all job resources to the mom hook?

A large custom string resource might be one example of something to avoid sending to MoM. I think it’s best to leave it up to the administrator to decide. Good point though.

Thank you for a clear design document. Looks good.

The design is pretty good. There are a couple of minor things I’ll point out:

  • Is it not valid to use this flag with the ‘n’ or ‘f’ flags?
  • I believe the ‘-=’ operator can be used to remove the ‘m’ flag. We should make sure though. IIRC, when you use ‘-=’ on a string attribute, PBS will remove the first occurrence. In our case, there will only be one ‘m’ occurrence, so we should be safe.

There might be something to @agrawalravi90’s comment. We could confuse newbie hook writers when they expect their resources to be accessible by mom hooks. They’d need to hunt through the docs to realize they needed to add an ‘m’ flag to their resources.

@mkaro’s comment has merit too. Maybe we should test this out? We could create a test with 1000 string resources, and submit 1000 jobs requesting all 1000 resources with 1000+ character strings. We can then measure how long it takes to send these jobs to mom, and run the same test without sending the resources to mom.

If it turns out that it’s fast, then we can drop the ‘m’ flag and make things easier on PBS hook writers.

I personally think the flag will be necessary, but I think we should do the test. It shouldn’t be that hard to write and run the test.
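For what it’s worth, here is a rough shell sketch of the kind of test being described (the resource names, the ‘x’ filler value, and the sleep job are all illustrative, not from the design):

    #!/bin/sh
    # create 1000 custom string resources (hypothetical names)
    for i in $(seq 1 1000); do
        qmgr -c "create resource stress_str$i type=string"
    done
    # build a 1000+ character value and a job script that requests every resource
    val=$(printf 'x%.0s' $(seq 1 1200))
    {
        echo '#!/bin/sh'
        for i in $(seq 1 1000); do
            echo "#PBS -l stress_str$i=$val"
        done
        echo 'sleep 10'
    } > stress_job.sh
    # submit 1000 such jobs; repeat without the resources and compare send times
    for j in $(seq 1 1000); do
        qsub stress_job.sh
    done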

Bhroam

  • To unset flag “m”, the admin can either overwrite the flag value or do an “unset resource <name> flag”.

Be aware, doing this will remove all other flags that were set for that resource.
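For concreteness, a hedged sketch of the two options being discussed (the resource name “foo” and the flag values are illustrative):

    # option 1: overwrite the flag value, keeping only the flags you still want
    qmgr -c "set resource foo flag=h"
    # option 2: unset the attribute entirely -- note this clears ALL flags on the resource
    qmgr -c "unset resource foo flag"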


It appears I’m incorrect on this point. I should have tested it first. I could have sworn it worked. Strangely enough, if you do a qmgr -c 's r foo flag -= h', it performs a full set operation to ‘h’, and that removes the other flags.

Looking further, the -= works as I expected for attributes on other objects. I set resources_available.site = abc on the server. I then did a qmgr -c 's s resources_available.site -= b' and it was then set to ‘ac’, as I expected. I guess resources are special.
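Restating the observations above as explicit commands (the resource and attribute values are just the ones from this experiment):

    # on a resource, "-=" on the flag attribute behaves like a full set
    qmgr -c "set resource foo flag -= h"                  # ends up doing a plain set to 'h'
    # on other objects' string attributes, "-=" removes the matching piece
    qmgr -c "set server resources_available.site = abc"
    qmgr -c "set server resources_available.site -= b"    # leaves 'ac'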

Bhroam

I have the same comment as @lisa-altair.
“unset resource <name> flag” will unset all flags.

I guess I missed the n and f flags. Added them now. Thanks for catching it.

Regarding removing a flag, I am not sure we have a straightforward way to do it. Hence I wrote that one can either overwrite the flag value, or unset all the flags and then set the required ones.

@crjayadev, @subhasisb, and I did have a similar discussion on what @mkaro mentioned. Our concern was that sending all the resources to MoM might slow down the send_job operation when there are too many custom resources. I like your idea of testing it and seeing what the data shows. I’ll try it and get back with results.

-Ashwath

@lisa-altair and @neha.padole, I added a note stating the concern you guys had.


Looks good, thanks @ashwathraop!

I tested with around 950 custom resources being set on a job within a runjob hook. Each resource was set to a string value of 5000 characters. Then I looked for the “Job Run” (type 23) record in the server log and the “Session id” (type 5) message in the mom log. I did not see any time difference; the job was sent within the same second.
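In case it helps anyone reproduce this, the comparison was simply between the timestamps of two log records for the same job, roughly like the following (the job id and log paths are hypothetical):

    # server side: when the job was sent to MoM
    grep '123.myserver;Job Run' /var/spool/pbs/server_logs/20190315
    # mom side: the corresponding "Session id" record for the same job
    grep '123.myserver' /var/spool/pbs/mom_logs/20190315 | grep 'Session id'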

So we may not see a performance impact at send_job. Having said that, we might see problems with execution hooks. At MoM we write all the job attributes, including resources, to an input file, and later pbs_python reads this file and loads them into the relevant data structures. So as the file grows we can expect some delay here. Also, we have a few blocking hook events at MoM, like execjob_begin, execjob_preterm and execjob_end.

A third factor is that the job’s data also gets written to the mom_priv/jobs/<jobid>.JB file, so these files will grow in size as well.
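A quick way to see that effect would be to watch the size of the job file on the MoM while such a job runs (the path and job id here are illustrative):

    # the job file under mom_priv grows with the amount of resource data sent to MoM
    ls -lh /var/spool/pbs/mom_priv/jobs/123.myserver.JB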

I was discussing this with @bayucan and we came to the conclusion that we will need this “m” flag to control which resources go to MoM and which do not.

@ashwathraop I think 5000 characters is on the lower side. Some sites could put a few MB of data on the attributes, and TPP is not good at moving large chunks of data; rather, it is designed to move a large number of small packets fast. So I think if there are a lot of moms and throughput is really high (as performance keeps increasing), the hit on overall performance will be quite significant (you won’t be able to measure this with a single job send, of course).

Good discussion! Just to echo a bit of what @subhasisb is saying, we should all remember that PBS Professional is really fast at really big scale, and we want to keep it that way, so performance should be measured at scale. That means looking at how this would impact a workload with 1M jobs a day (or more) running on a system with 1000 (or maybe 5000 or 10,000) MOMs…

Thx!

@subhasisb and @billnitzberg, based on your points I did a few tests. They are not at the scale of 1000+ moms or 1M jobs, but I could see the effect of sending all the resources to MoM at a much lower scale.

The first test was the best combination I could achieve with the resources I had available.

No. of jobs   Resources per job   Data per custom resource   Total Resource_List data per job   send_job time for a single job
10000         1                   500 KB                     500 KB                              0 seconds
1000          5                   500 KB                     2.5 MB                              0 seconds
1000          10                  500 KB                     5 MB                                0 seconds
1000          50                  500 KB                     25 MB                               0 seconds
500           100                 500 KB                     50 MB                               1 second
100           250                 500 KB                     75 MB                               2 seconds

Once the total data per job crossed the 25 MB mark there was a delay of 1 second, and at 75 MB it became 2 seconds.


Next, I tested the more usual scenario of 10 resources, each carrying 4 MB of data, and introduced a mom hook with the execjob_end event to see its effect on job execution. All jobs are single-node 10-second sleep jobs. In all 3 cases I had 50 nodes running 50 jobs, and the server had the same load in terms of jobs.
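For reference, a hedged sketch of how such an execjob_end hook could be wired up (the hook name and script file are mine, not from the test):

    # create an execjob_end hook and import a minimal Python body that just accepts the event
    qmgr -c "create hook end_probe event=execjob_end"
    qmgr -c "import hook end_probe application/x-python default end_probe.py"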

Case 1: With 10 resources, each having 4 MB of data:

[root@blrentperf03 work]# grep '25.blrentperf03' /SSD_STORAGE/mom25/mom_logs/20190315
03/15/2019 00:17:35;0008;pbs_mom;Job;25.blrentperf03;Started, pid = 48928
03/15/2019 00:17:45;0080;pbs_mom;Job;25.blrentperf03;task 00000001 terminated
03/15/2019 00:17:45;0008;pbs_mom;Job;25.blrentperf03;Terminated
03/15/2019 00:17:45;0100;pbs_mom;Job;25.blrentperf03;task 00000001 cput= 0:00:00
03/15/2019 00:17:45;0008;pbs_mom;Job;25.blrentperf03;kill_job
03/15/2019 00:17:45;0100;pbs_mom;Job;25.blrentperf03;blrentperf03 cput= 0:00:00 mem=0kb
03/15/2019 00:17:45;0008;pbs_mom;Job;25.blrentperf03;no active tasks
03/15/2019 00:17:45;0100;pbs_mom;Job;25.blrentperf03;Obit sent
03/15/2019 00:17:51;0080;pbs_mom;Job;25.blrentperf03;delete job request received
03/15/2019 00:17:51;0008;pbs_mom;Job;25.blrentperf03;kill_job

Total time since job execution to clean up: 6 seconds.


Case 2: With 10 resources, each having 4 MB of data, and an execjob_end hook:

[root@blrentperf03 work]# grep '75.blrentperf03' /SSD_STORAGE/mom25/mom_logs/20190315
03/15/2019 00:34:22;0008;pbs_mom;Job;75.blrentperf03;Started, pid = 1700
03/15/2019 00:34:32;0080;pbs_mom;Job;75.blrentperf03;task 00000001 terminated
03/15/2019 00:34:32;0008;pbs_mom;Job;75.blrentperf03;Terminated
03/15/2019 00:34:32;0100;pbs_mom;Job;75.blrentperf03;task 00000001 cput= 0:00:00
03/15/2019 00:34:32;0008;pbs_mom;Job;75.blrentperf03;kill_job
03/15/2019 00:34:32;0100;pbs_mom;Job;75.blrentperf03;blrentperf03 cput= 0:00:00 mem=0kb
03/15/2019 00:34:32;0008;pbs_mom;Job;75.blrentperf03;no active tasks
03/15/2019 00:34:32;0100;pbs_mom;Job;75.blrentperf03;Obit sent
03/15/2019 00:34:38;0080;pbs_mom;Job;75.blrentperf03;delete job request received
03/15/2019 00:34:41;0008;pbs_mom;Job;75.blrentperf03;no active tasks
03/15/2019 00:34:42;0008;pbs_mom;Job;75.blrentperf03;kill_job

Total time since job execution to clean up: 10 seconds.


Case 3: With no resources and an execjob_end hook:

[root@blrentperf03 work]# grep '125.blrentperf03' /SSD_STORAGE/mom25/mom_logs/20190315
03/15/2019 00:38:51;0008;pbs_mom;Job;125.blrentperf03;Started, pid = 2363
03/15/2019 00:39:01;0080;pbs_mom;Job;125.blrentperf03;task 00000001 terminated
03/15/2019 00:39:01;0008;pbs_mom;Job;125.blrentperf03;Terminated
03/15/2019 00:39:01;0100;pbs_mom;Job;125.blrentperf03;task 00000001 cput= 0:00:00
03/15/2019 00:39:01;0008;pbs_mom;Job;125.blrentperf03;kill_job
03/15/2019 00:39:01;0100;pbs_mom;Job;125.blrentperf03;blrentperf03 cput= 0:00:00 mem=0kb
03/15/2019 00:39:01;0008;pbs_mom;Job;125.blrentperf03;no active tasks
03/15/2019 00:39:01;0100;pbs_mom;Job;125.blrentperf03;Obit sent
03/15/2019 00:39:01;0080;pbs_mom;Job;125.blrentperf03;delete job request received
03/15/2019 00:39:01;0008;pbs_mom;Job;125.blrentperf03;no active tasks
03/15/2019 00:39:02;0008;pbs_mom;Job;125.blrentperf03;kill_job

Total time since job execution to clean up: 1 second.

I feel hooks are going to be affected more than send_job from the server to mother superior. Since we dump all the attributes to a file and pbs_python then reads that file to load the attributes again, the job’s start-to-finish time is certainly going to see some delay as the number of resources sent to MoM increases.


@ashwathraop the test results look quite good and reasonable.


I suspect larger payloads would compress very well and bring down the transfer times. Sending in parallel will dramatically increase scalability.

Thanks for the analysis. Can you please measure how the delay changes as we increase the number of resources sent from 1 to 10? We are going to be sending a limited number of resources with the ‘m’ flag, but not all, so it’ll be worth knowing how the problem scales; then we’ll be able to advise admins on how many big resources to put the ‘m’ flag on.

Another thing: will it matter whether the system has InfiniBand vs normal Ethernet? Most sites have IB, right? It might be useful to know how bad the delays get with IB. Just a thought.

Many (not all) sites that utilize InfiniBand do so for application traffic only and run an independent Ethernet network for all other services (including PBS Pro) to keep the traffic segregated and maximize application performance.