For a while now I’ve been kicking around the idea of an async notification system for job statuses. I’ve seen many applications built on top of PBS that submit jobs and then continually poll the server at some fixed interval to query the status of their submitted job(s). This seems a little inefficient.
I’m envisioning a notification system that is offered by IFL where client applications can “subscribe” to jobs which they wish to be notified about, and the server simply sends back status information in an event-driven manner.
I’ve looked around, but not seen anything currently in place that provides this functionality. Is it already available? Is this something that folks would be interested in?
Seconded. Many, many applications poll the server every second or two to get job status. One concern: the ‘client’ (the app waiting to hear back from the server about job status) would need to keep a connection open waiting for the message, as the client might be on a Windows system that doesn’t allow incoming connections.
@gabe yup, this is exactly what I had in mind: the client keeps a socket open to the server, sitting on a multiplexer or similar. When the client “subscribes”, it passes a callback that runs when the server sends a status update. On the server side, all that’s needed is a subscription list and a socket for each client.
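The client side of that idea can be sketched roughly as follows. This is only an illustration, not IFL code: `JobEventClient`, `push_event`, the newline-delimited JSON wire format, and the job IDs are all assumptions made up for the example, and a local `socket.socketpair()` stands in for the real connection to the server.

```python
import json
import selectors
import socket


class JobEventClient:
    """Hypothetical client: one open socket, callbacks keyed by job id."""

    def __init__(self, sock):
        self.sock = sock
        self.callbacks = {}
        self.buffer = b""
        self.selector = selectors.DefaultSelector()
        self.selector.register(sock, selectors.EVENT_READ)

    def subscribe(self, job_id, callback):
        # A real client would also send a subscribe request to the server here.
        self.callbacks[job_id] = callback

    def poll(self, timeout=1.0):
        """Wait for data, then dispatch every complete line to its callback."""
        dispatched = 0
        for _key, _mask in self.selector.select(timeout):
            self.buffer += self.sock.recv(4096)
            while b"\n" in self.buffer:
                line, self.buffer = self.buffer.split(b"\n", 1)
                event = json.loads(line)
                callback = self.callbacks.get(event["job_id"])
                if callback is not None:
                    callback(event)
                    dispatched += 1
        return dispatched


def push_event(server_sock, job_id, state):
    """Stand-in for the server: write one newline-delimited JSON event."""
    payload = json.dumps({"job_id": job_id, "state": state})
    server_sock.sendall(payload.encode() + b"\n")


# Demo: a socket pair replaces the real client/server connection.
server_end, client_end = socket.socketpair()
client = JobEventClient(client_end)
received = []
client.subscribe("123.pbs", received.append)

push_event(server_end, "123.pbs", "R")
push_event(server_end, "999.pbs", "R")  # no subscriber, so the client drops it
client.poll()
print(received)
```

Because the client opens the connection and only ever reads from it, nothing here requires the client machine to accept incoming connections.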
Let’s leave a few more days for feedback and then file an RFE to track the work. Please continue to discuss the technical aspects of the design so we may add them to the ticket once it is opened.
@brewlius-cesar your idea is great, and we have heard this request from many customers. In fact, we have already done some work on this and have a proof of concept that we demonstrated some time back. But the Altair team was diverted to other enhancements, so we have not been able to productize the feature so far. The POC did much of what you describe: we had an API that let the client register for events of interest, and a callback function was invoked when an event occurred. (We also had the client keep a connection open to the server, and the server already multiplexes our client sockets.)
I feel we would benefit greatly from discussing the use cases (requirements) in detail.
Do we care only about job status changes, or more? If we only send job status changes, then when a client is notified about a state change, such as a job going from Q to R, would the client not still hit the server with qstats every 2 seconds to see how the job is progressing?
Thus, to reduce the need for repeated qstats, do we need (in addition to status changes):
- to be notified about job resource usage changes
- to be notified when any attribute of the job changes?
And when we discuss a notification system, we have often heard of a need for more than jobs. Does the community see any requirement for notifications about changes in other objects, like reservations, queues, or even nodes?
Rather than clutter the server with all this event generating, I suggest an entirely separate process that statuses the server once every minute or so and collects everything anyone could be interested in (essentially, everything the scheduler asks for). This is the process that clients subscribe to, that watches for changes in status, and notifies the clients of such changes.
By moving this out of the server, development can be much more dynamic.
Yes Dale, that is a great suggestion. In fact, in our multiple-server architecture, we do exactly that. We have a server that stats the database every x seconds or so to find events and publishes them according to the subscriptions. The only gotcha with that design is that we can skip states: if a job changes state more than once between two polls, only the last state is seen. That is probably acceptable?
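A minimal sketch of such a polling watcher, assuming it only tracks job states: `StatusWatcher`, `diff_snapshots`, and the `fetch_states` callable are hypothetical names for this example, and the two hard-coded snapshots stand in for real status queries.

```python
def diff_snapshots(previous, current):
    """Return (job_id, old_state, new_state) for every job whose state changed."""
    changes = []
    for job_id, new_state in current.items():
        old_state = previous.get(job_id)
        if old_state != new_state:
            changes.append((job_id, old_state, new_state))
    for job_id, old_state in previous.items():
        if job_id not in current:
            changes.append((job_id, old_state, None))  # job left the server
    return changes


class StatusWatcher:
    """Hypothetical separate process: polls, diffs, notifies subscribers."""

    def __init__(self, fetch_states):
        self.fetch_states = fetch_states  # assumed to return {job_id: state}
        self.subscribers = {}             # job_id -> list of callbacks
        self.last = {}

    def subscribe(self, job_id, callback):
        self.subscribers.setdefault(job_id, []).append(callback)

    def poll_once(self):
        current = dict(self.fetch_states())
        changes = diff_snapshots(self.last, current)
        self.last = current
        for change in changes:
            for callback in self.subscribers.get(change[0], []):
                callback(change)
        return changes


# The job really goes Q -> R -> E, but the watcher only samples twice.
snapshots = iter([{"1.pbs": "Q"}, {"1.pbs": "E"}])
watcher = StatusWatcher(lambda: next(snapshots))
seen = []
watcher.subscribe("1.pbs", seen.append)
watcher.poll_once()
watcher.poll_once()
print(seen)
```

The subscriber sees the job appear in Q and then move to E; the R state that fell between the two polls is never reported.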
Of course, the other way to relieve the PBS server of this work is to put a highly performant message queueing system in front of PBS (e.g. Kafka, which can handle even clickstream-scale data throughput). That way PBS can just write to Kafka and be done. Kafka can do all the heavy lifting of managing subscriptions, topics, etc., and has all kinds of advanced features such as load balancing, failover, and multiple delivery semantics for consumers, like at-most-once or at-least-once consumption.
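For readers unfamiliar with those two delivery semantics, the difference comes down to whether the consumer records its position before or after handling a message. The sketch below is an in-memory illustration only and does not use the Kafka API; `Log`, `consume`, `make_handler`, and `run_until_done` are hypothetical names, and the "crash" is simulated with an exception.

```python
class Log:
    """In-memory stand-in for a message stream with one committed offset."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0


def consume(log, handler, commit_first):
    """Process from the committed offset; stop at the first simulated crash."""
    while log.committed < len(log.records):
        offset = log.committed
        record = log.records[offset]
        if commit_first:
            log.committed = offset + 1  # at-most-once: commit, then process
        try:
            handler(record)
        except RuntimeError:
            return  # consumer "crashed"; a restart resumes from log.committed
        if not commit_first:
            log.committed = offset + 1  # at-least-once: process, then commit


def make_handler(delivered, crash_on, crash_before_work):
    """Handler that crashes once on `crash_on`, before or after doing its work."""
    crashed = []

    def handler(record):
        first_crash = record == crash_on and not crashed
        if first_crash and crash_before_work:
            crashed.append(True)
            raise RuntimeError("crash before work")
        delivered.append(record)
        if first_crash:
            crashed.append(True)
            raise RuntimeError("crash after work")

    return handler


def run_until_done(log, handler, commit_first):
    """Restart the consumer after each crash until the log is drained."""
    while log.committed < len(log.records):
        consume(log, handler, commit_first)


at_most_once = []
run_until_done(Log(["Q", "R", "E"]),
               make_handler(at_most_once, "R", crash_before_work=True),
               commit_first=True)

at_least_once = []
run_until_done(Log(["Q", "R", "E"]),
               make_handler(at_least_once, "R", crash_before_work=False),
               commit_first=False)

print(at_most_once)
print(at_least_once)
```

With commit-first, the update that was in flight during the crash is lost; with process-first, it is delivered twice. For job status notifications, the second case means clients would need to tolerate duplicate events.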
Speaking for us at CNES: we really need such a mechanism. We had problems in the past with some workflow managers issuing too many qstat calls and overloading the PBS server. Job status changes will be sufficient here.
We implemented some hooks to send these status changes to a message queuing system (RabbitMQ), but this is still a prototype. It is the same idea that @subhasisb mentioned with Kafka. We’d love to use Kafka, but we need more precise management of user authorisation for topic subscription (we need a topic per user). Anyway, I think relying on such a system is a good idea instead of developing one from scratch.
We also have a separate process at our site that stats the server every x seconds and keeps the result in a Redis map for end-user applications to consume. This works OK, but we have 6 sites across the world and ours is the only one that has this, so our in-house software can’t be deployed to use this functionality until everyone gets on the same page. Even then, if we need to deploy off-site, we will require other installations to have the same setup. There is also a small latency cost from fixed-interval polling, which adds up with lots of jobs (we have a couple of applications that do a few thousand submissions over the course of their run). As far as performance goes, if we proxy the subscription tasks off to a few worker threads, I’d be surprised if there were much of a performance hit, but our cluster is quite small compared to others (~2500 cores).
I’ve never heard of Kafka before; sounds cool. Would there be something in IFL that wraps the Kafka interface, though, or would the Kafka API be exposed directly to programmers?
I would say any job attribute change could be monitored. The attributes of interest could be passed when the subscription is requested, just like in a pbs_statjob() call.
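One way to picture a subscription that carries its own attribute list is sketched below. `AttributeSubscription` and its `offer` method are hypothetical names for this example, and the attribute values are made up; this is not an existing IFL interface.

```python
class AttributeSubscription:
    """Hypothetical subscription: fire only when a watched attribute changes."""

    def __init__(self, job_id, attributes, callback):
        self.job_id = job_id
        self.attributes = frozenset(attributes)
        self.callback = callback

    def offer(self, job_id, old_attrs, new_attrs):
        """Notify if any watched attribute differs; return True if notified."""
        if job_id != self.job_id:
            return False
        changed = {
            name: (old_attrs.get(name), new_attrs.get(name))
            for name in self.attributes
            if old_attrs.get(name) != new_attrs.get(name)
        }
        if changed:
            self.callback(job_id, changed)
            return True
        return False


events = []
sub = AttributeSubscription(
    "42.pbs",
    ["job_state", "resources_used.walltime"],
    lambda job_id, changed: events.append((job_id, changed)),
)

before = {"job_state": "Q", "resources_used.cput": "00:00:00"}
cput_only = {"job_state": "Q", "resources_used.cput": "00:00:05"}
running = {"job_state": "R", "resources_used.cput": "00:00:05"}

sub.offer("42.pbs", before, cput_only)   # cput is not watched: no event
sub.offer("42.pbs", cput_only, running)  # job_state changed: one event
print(events)
```

Filtering on the notifying side keeps traffic proportional to what each client asked for, rather than to every attribute update on the job.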
Yes, it would be nice to have this functionality for queues, nodes, etc. I thought about this as I was going through some of the IFL code.
Sorry for the delay in responding, @brewlius-cesar. The prototype code is not yet publicly available. We hope to start some work on that soon. I agree it is a very valid and important feature for everybody.
Ok, well I was thinking about starting some work on this myself. But, if the intent is to have this done internally at your organization then I fear my efforts may be a waste of time…
We probably won’t be able to get back to productizing the proof-of-concept code anytime soon. Our architecture was based on multiple PBS servers, so we first need to get that done.
It might be mutually beneficial to start some discussion on what you need exactly and maybe discuss if there are shorter term vs longer term implementation choices…
Hi @brewlius-cesar. Any thoughts on how we can proceed in the short term with the notification feature? We would love to get your work/thoughts into the mainstream.