Asynchronous Job Status Notification

brewlius-cesar · January 26, 2018, 4:08pm

For a while now I’ve been kicking around the idea of having an async notification system for job statuses. I’ve seen many applications which are built on top of pbs that submit jobs and then continually poll the server to query the status of their submitted job(s) at some fixed interval. This seems a little inefficient.

I’m envisioning a notification system that is offered by IFL where client applications can “subscribe” to jobs which they wish to be notified about, and the server simply sends back status information in an event-driven manner.

I’ve looked around, but not seen anything currently in place that provides this functionality. Is it already available? Is this something that folks would be interested in?

mkaro · January 26, 2018, 4:29pm

@brewlius-cesar : Thank you for your suggestion! Personally, I like the idea. Let’s see how many others in the community are in favor of it.

gabe · January 27, 2018, 5:31pm

Seconded. Many, many applications poll the server every second or two to get job status. One concern: the ‘client’ (the app waiting to hear back from the server about job status) would need to keep a connection open waiting for the message, as the client might be on a Windows system that doesn’t allow incoming connections.

brewlius-cesar · January 28, 2018, 3:13am

@gabe yup, this is exactly what I had in mind: the client keeps a socket open to the server, sitting on a multiplexer or similar. When the client “subscribes” it passes a callback that runs when the server sends a status update. On the server side all that’s needed is a subscription list and a socket for each client.

mkaro · January 29, 2018, 6:58pm

Let’s leave a few more days for feedback and then file an RFE to track the work. Please continue to discuss the technical aspects of the design so we may add them to the ticket once it is opened.

brewlius-cesar · January 30, 2018, 5:26am

I’m pretty new to PBSPro development, but here goes a high-level sketch:

Client Side (say pbs_jobsubscribe() )
- User interface is basically the same as pbs_statjob() but accepts a callback function whose argument is the result of a statjob call
- On first call pbs_jobsubscribe() opens a persistent TCP connection to the server and spawns a thread
- Future calls just add/modify the callback pointer (add if new job id for subscription, modify if already existing)
- The spawned thread sits and waits for status replies from the server
- Once the job is no longer being tracked by the server it sends a finalize message and the callback is removed
Server Side
- Accept requests for subscriptions on the main server socket
- Track socket and subscription settings (job id to track, perhaps attr list) in a subscription list
- First time clients get their socket added to the subscription list
- Future requests modify existing entry
- Send notifications to subscribers when a job’s attributes change (perhaps when job is saved to db?)
- When a job is no longer tracked send a finalize message (3-way comm for this?)
- Cleanup dead subscriptions
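
As a rough illustration of the bookkeeping in the two lists above (not PBS code), here is an in-memory Python sketch. The names SubscriptionList, subscribe, notify, finalize and drop_client are made up for this example, and the sockets and the client thread are left out entirely.

```python
from collections import defaultdict
from typing import Callable, Dict

# All names here are hypothetical; none of this is an existing IFL API.
Callback = Callable[[str, Dict[str, str]], None]


class SubscriptionList:
    """Maps each job id to the callbacks of the clients subscribed to it."""

    def __init__(self) -> None:
        self._subs: Dict[str, Dict[str, Callback]] = defaultdict(dict)

    def subscribe(self, client_id: str, job_id: str, callback: Callback) -> None:
        # Add the entry if it is new, overwrite the callback if it already exists.
        self._subs[job_id][client_id] = callback

    def notify(self, job_id: str, attrs: Dict[str, str]) -> None:
        # Called when a job's attributes change; fan out to every subscriber.
        for callback in list(self._subs.get(job_id, {}).values()):
            callback(job_id, attrs)

    def finalize(self, job_id: str) -> int:
        # The job is no longer tracked: drop its subscribers and report how many.
        return len(self._subs.pop(job_id, {}))

    def drop_client(self, client_id: str) -> None:
        # Clean up a dead subscription, e.g. after the client's socket closes.
        for subscribers in self._subs.values():
            subscribers.pop(client_id, None)
```

A second subscribe() call for the same client and job simply replaces the earlier callback, which matches the "add if new, modify if already existing" behaviour described above.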

I’d be interested in helping out with the development, but could use the mentorship of a seasoned PBSPro hacker.

subhasisb · January 31, 2018, 6:49am

@brewlius-cesar your idea is great and we have heard this request from many customers. In fact, we had done some work on this already and have a proof-of-concept that we demonstrated some time back. But the Altair team was diverted to other enhancements, so we have not been able to productize that feature so far. The POC did much of what you describe: we had an API that let the client register for the events it was interested in, and a callback function was invoked when an event occurred. (We also had the client keep a connection open to the server, and the server already multiplexes our client sockets.)

I feel we will greatly benefit if we can discuss the use cases (requirements) in detail.

Do we care only about job status changes, or more? If we only update job status changes, then when a client gets notified about a state change like job went from Q to R, would the client still not hit the server with qstats every 2 seconds to see how the job is progressing?

Thus, do we (in addition to status changes, especially to reduce the need for repeated qstat’s) need:

- to be notified about job resource usage changes
- to be notified when any attribute in the job changes

And, when we discuss a notification system, we have often heard a need for more than jobs. Does the community see a requirement for notifications on changes in other objects, like reservations or queues, or even nodes?

dtalcott · January 31, 2018, 5:36pm

Rather than clutter the server with all this event generating, I suggest an entirely separate process that statuses the server once every minute or so and collects everything anyone could be interested in (essentially, everything the scheduler asks for). This is the process that clients subscribe to, that watches for changes in status, and notifies the clients of such changes.
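
A minimal Python sketch of the change-detection step such a process would need, assuming each poll yields a mapping of job id to attribute values. How the snapshot is collected from the server and how subscribers are notified are left out, and diff_snapshots is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple

# Assumed shape of one poll result: job id -> attribute name -> value.
Snapshot = Dict[str, Dict[str, str]]
# (job id, attribute name, old value or None, new value or None)
Change = Tuple[str, str, Optional[str], Optional[str]]


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[Change]:
    """Return every attribute that differs between two consecutive polls."""
    changes: List[Change] = []
    for job_id in sorted(set(old) | set(new)):
        before = old.get(job_id, {})
        after = new.get(job_id, {})
        for name in sorted(set(before) | set(after)):
            if before.get(name) != after.get(name):
                # None marks an attribute (or a whole job) that appeared or vanished.
                changes.append((job_id, name, before.get(name), after.get(name)))
    return changes
```

The watcher would keep the previous snapshot, call diff_snapshots() after each poll, and forward each change to whichever clients subscribed to that job.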

By moving this out of the server, development can be much more dynamic.

We are doing something like this already.

https://www.nas.nasa.gov/hecc/support/kb/using-the-mynas-mobile-app_465.html

subhasisb · February 1, 2018, 10:29am

Yes Dale, that is a great suggestion. In fact in our multiple server architecture, we do exactly that. We have a server that stats the database almost every x seconds to find events and publishes those as per subscriptions. The only gotcha with that design is that we can skip states, which is probably acceptable?

subhasisb · February 1, 2018, 10:31am

Of course, the other way to relieve the PBS server of this work is to put a highly performant message queueing system in front of PBS (such as Kafka, which can handle even clickstream-type data throughput). That way PBS can just write to Kafka and be done. Kafka can do all the heavy lifting of managing subscriptions, topics, etc., and has all kinds of advanced features like load balancing, failover, and multiple consumption semantics such as at-most-once or at-least-once delivery.
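
To illustrate the difference between the two consumption semantics, here is a toy in-memory model in Python. It does not use the Kafka client API; the Consumer class and its methods are hypothetical, and the only point is where the offset commit sits relative to the processing step.

```python
from typing import Callable, List


class Consumer:
    """Reads an in-memory log and tracks a committed offset (toy model only)."""

    def __init__(self, log: List[str]) -> None:
        self.log = log
        self.committed = 0  # index of the next message to deliver

    def poll_at_most_once(self, handler: Callable[[str], None]) -> None:
        # Commit BEFORE processing: if the handler fails, the message is
        # already marked as consumed and is never delivered again.
        while self.committed < len(self.log):
            message = self.log[self.committed]
            self.committed += 1
            handler(message)

    def poll_at_least_once(self, handler: Callable[[str], None]) -> None:
        # Commit AFTER processing: if the handler fails, the offset is
        # unchanged and the same message is delivered again on the next poll.
        while self.committed < len(self.log):
            handler(self.log[self.committed])
            self.committed += 1
```

With a handler that fails once on a job's "R" update, the at-most-once consumer never sees that update, while the at-least-once consumer gets it again on the next poll. The price of at-least-once is that a handler may see the same update twice, so it should tolerate duplicates.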

guillaumeeb · February 1, 2018, 11:56am

For us at CNES, we really need such a mechanism. We had some problems in the past with some workflow managers issuing too many qstat calls and overloading the PBS server. Job status changes will be sufficient here.

We implemented some hooks to send these status changes to a message queuing system (RabbitMQ), but this is still a prototype. It is the same idea @subhasisb mentioned about Kafka. We’d love to use Kafka, but we need more precise management of user authorisation for topic subscription (we need a topic per user). Anyway, I think relying on such a system is a good idea instead of developing one from scratch.

brewlius-cesar · February 1, 2018, 11:59pm

Whoa, lots of feedback, cool!

We also have at our site a separate process that stats the server every x seconds and then keeps this in a Redis map for end-user application consumption. This works ok, but we have 6 sites across the world and ours is the only one who has this, so our in-house software can’t be deployed to use this functionality until everyone gets on the same page, and even then if we need to deploy off-site then we will require other installations to have this same setup. There is also a small latency issue of fixed interval polling which will add up with lots of jobs (we have a couple applications that do a few thousand submissions over the course of their run). As far as performance goes, if we proxy the subscription tasks off to a few worker threads I’d be surprised if there would be much of a performance hit, but our cluster is quite small compared to others (~2500 cores).

I’ve never heard of Kafka before, sounds cool. Would there be something in IFL that wraps the Kafka interface, though, or would the Kafka API be exposed directly to programmers?

Also, to get back to @subhasisb’s first questions:

- I would say any job attribute change could be monitored; the attributes of interest could be passed when the subscription is requested, just like in a pbs_statjob() call.
- Yes, it would be nice to have this functionality for queues/nodes/etc. I thought about this as I was going through some of the IFL code.

brewlius-cesar · February 27, 2018, 1:01pm

Well, I hope my comment didn’t scare folks off.

@subhasisb is the prototype code for this feature available in the public PBSPro repo?

mkaro · February 27, 2018, 6:59pm

@brewlius-cesar you didn’t scare anyone off.

We’re all busy fielding a multitude of things. This topic is still quite valid.

subhasisb · April 2, 2018, 3:41am

Sorry for the delay in response @brewlius-cesar. The prototype code is not yet available publicly. We hope to start some work on that soon. I agree it is a very valid and important feature for everybody.

brewlius-cesar · April 20, 2018, 12:22pm

Ok, well I was thinking about starting some work on this myself. But, if the intent is to have this done internally at your organization then I fear my efforts may be a waste of time…

subhasisb · April 24, 2018, 1:47pm

I am fairly sure we will not be able to get back to productizing our proof-of-concept code anytime soon. Our architecture was based on multiple PBS servers, and thus we first need to get that done.

It might be mutually beneficial to start some discussion on what you need exactly and maybe discuss if there are shorter term vs longer term implementation choices…

subhasisb · May 15, 2018, 6:16am

Hi @brewlius-cesar. Any thoughts on how we can proceed in the short term with the notification feature? We would love to get your work/thoughts into the mainstream.

regards,
Subhasis
