Thanks @adarsh, there are plenty of things that we are already doing with Elastic. The “PBS accounting and metrics and Elastic stack” link is the closest to what we have, but in this case here’s what I’m trying to do:
I’m building an alternative to qstat. I want a lightweight command-line tool that users/developers can run as frequently as they want without adding load on the scheduler.
The output of the CLI tool would be very similar to the output of qstat when you call ‘qstat <job_id>’. It might not have all of the fields, but it needs to have the majority of them.
Here’s what we have today:
- We have filebeat forwarding the server_log to logstash. Logstash ingests the log messages and parses events such as “Job Run”, “Queued Job” and “Job Exited”, and it manages to get some fields out of each log entry (e.g. exec_host, qtime, etime).
- We have a hook registered on the RUNJOB and MOVEJOB events where we use the API to extract more fields such as Output_Path, Submit_arguments, etc. We then inject the data into Elasticsearch via the REST API.
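To make the second bullet concrete, here is a minimal sketch of the "inject into Elasticsearch via the REST API" step. It is not our actual hook code: the URL, the index name `pbs-jobs` and the function names are placeholders I made up for illustration, and it assumes an Elasticsearch version that accepts `PUT /<index>/_doc/<id>`.

```python
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: where Elasticsearch listens
ES_INDEX = "pbs-jobs"             # hypothetical index name


def build_index_request(job_id, attributes, es_url=ES_URL, index=ES_INDEX):
    """Build (but do not send) a PUT request that indexes one job document.

    Using the job id as the document id means a later event for the same
    job replaces the earlier document instead of creating a duplicate.
    """
    doc = dict(attributes)
    doc["job_id"] = job_id
    url = f"{es_url}/{index}/_doc/{urllib.parse.quote(job_id, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def index_job(job_id, attributes):
    """Send the document; the short timeout keeps a slow ES from stalling the caller."""
    req = build_index_request(job_id, attributes)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

Inside the hook, `attributes` would be a plain dict of whatever fields were pulled from the job (Output_Path, Submit_arguments, ...), converted to strings first.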
The qstat alternative (the CLI tool) accesses the Elasticsearch DB via the REST API with very low latency, which allows our users and their scripts to call it as much as they want without affecting the scheduler.
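Roughly, the lookup side of the CLI tool could look like the sketch below. Again this is only an illustration under assumptions: the URL and index name are placeholders, the function names are made up, and the output layout is loosely modelled on the `Job Id:` / `name = value` style that `qstat -f` prints.

```python
import json
import urllib.parse
import urllib.request


def fetch_job(job_id, es_url="http://localhost:9200", index="pbs-jobs"):
    """Fetch one job document by id; ES returns the stored fields under "_source".

    es_url and index are placeholder values. A missing job makes ES answer
    404, which urlopen raises as urllib.error.HTTPError.
    """
    url = f"{es_url}/{index}/_doc/{urllib.parse.quote(job_id, safe='')}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["_source"]


def format_job(job_id, fields):
    """Render the fields as a 'Job Id:' header followed by indented name = value lines."""
    lines = [f"Job Id: {job_id}"]
    for name, value in fields.items():
        lines.append(f"    {name} = {value}")
    return "\n".join(lines)


# Usage (needs a reachable Elasticsearch):
#   print(format_job("1234.pbs01", fetch_job("1234.pbs01")))
```

The scheduler is never contacted here; every call is a single GET by document id against Elasticsearch.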
The limitation here is that we only get the larger set of fields (e.g. Output_Path & Submit_arguments) in the RUNJOB hook, i.e. when the job starts running. We don’t have these fields while the job is in the Queued state.
I’m looking for a way to extract these fields right off the bat and have them injected into our Elasticsearch DB, so that the qstat alternative can access them while the job is still queued.
Right now I’m thinking of extracting the data within logstash right after the job is queued. Logstash has a JDBC plugin, so I can query the PostgreSQL database for the “attributes” column, e.g.
SELECT attributes FROM pbs.job WHERE ji_jobid = '<job_id>';
The query above would be called about 2-3 seconds after each job is submitted to the queue.
If you are worried about putting stress on the “production server”, would it make sense to enable master-slave replication?
Do you have any experience doing so?
Thanks,
Roy