Performance penalty for accessing the psql DB

Hi Guys,
In our company we’ve built a log analysis platform based on Elasticsearch. The server_logs and accounting_logs are ingested into Elasticsearch, and we have dashboards and tools that query the data in the Elasticsearch DB.

There’s a limitation on the amount of data you get in the logs. For example, when a job is submitted and stays queued, you don’t get access to Output_Path and Error_Path. To get that data you would need to call qstat every time a new job is submitted, and I would like to avoid doing that because it would put stress on the scheduler.

Instead, I would like to access the psql DB once each time a new job is submitted and extract all of the job’s fields from it.

What are your thoughts? Is that a scalable solution? Would that also put pressure on the scheduler?
After the data is extracted from psql and injected into Elasticsearch, our dashboards and tools would use the Elasticsearch DB and the psql DB would be left alone; it’s only one query per job submission.
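To make the idea concrete, here is a minimal sketch of the injection side (Python; the endpoint, the index name “pbs-jobs”, and the example field values are assumptions for illustration, not our actual setup):

    import json
    import urllib.parse
    import urllib.request

    ES_URL = "http://localhost:9200/pbs-jobs/_doc/"  # hypothetical endpoint and index

    def index_job(job_id, attributes):
        # One PUT per submission; using the job id as the document id makes
        # re-indexing the same job idempotent.
        req = urllib.request.Request(
            ES_URL + urllib.parse.quote(job_id),
            data=json.dumps(attributes).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status

    # e.g. index_job("1234.pbsserver", {"Output_Path": "host:/home/roy/job.out"})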

I would like to hear your thoughts on the subject. I don’t want to bog down the scheduler, but I do want to improve our data analytics platform.

Thanks,
Roy

That’s right, it would add load on the PBS Server.

I am not sure whether accessing the database externally while the system is in production is a recommended method. Instead, you could take a snapshot every 30 seconds and gather information from it.
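For example, a rough sketch of the snapshot approach, assuming a PBS Pro version where qstat supports JSON output (‘qstat -f -F json’) and a hypothetical Elasticsearch endpoint:

    import json
    import subprocess
    import time
    import urllib.request

    def snapshot_jobs():
        # One qstat call per interval for *all* jobs, instead of one per job.
        out = subprocess.run(["qstat", "-f", "-F", "json"],
                             capture_output=True, text=True, check=True).stdout
        return json.loads(out).get("Jobs", {})

    while True:
        for job_id, attrs in snapshot_jobs().items():
            req = urllib.request.Request(
                "http://localhost:9200/pbs-jobs/_doc/" + job_id,  # hypothetical
                data=json.dumps(attrs).encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="PUT",
            )
            urllib.request.urlopen(req, timeout=5).read()
        time.sleep(30)  # fixed load on the server, independent of job count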

Any repeated, high-frequency, or watch-style qstat / q* commands will put pressure on the PBS Server.
If you could let us know the end goal of using Elasticsearch, that would be helpful. The data in the Postgres database is updated periodically based on changes to the configuration of queues, nodes, jobs, job history, etc.

Please refer to the links below; they might be helpful.

Thank you

Thanks @adarsh, there are plenty of things we are already doing with Elastic. The “PBS accounting and metrics and Elastic stack” link is the closest to what we have; however, here’s what I’m trying to do in this case:

I’m building an alternative to qstat: a lightweight command-line tool that users/developers can execute as frequently as they want without adding load on the scheduler.

The output of the CLI tool would be very similar to the output of ‘qstat <job_id>’; it might not have all of the fields, but it needs to have the majority of them.

Here’s what we have today:

  1. We have Filebeat forwarding the server_log to Logstash; Logstash ingests the log messages, parses events such as “Job Run”, “Queued Job”, and “Job Exited”, and manages to extract some fields from the log entry (e.g. exec_host, qtime, etime).
  2. We have a hook registered on the RUNJOB and MOVEJOB events, where we use the API to extract more fields such as Output_Path, Submit_arguments, etc., and then inject the data into Elasticsearch via the REST API (a stripped-down sketch of such a hook follows this list).
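A stripped-down sketch of such a hook body might look like this (the Elasticsearch endpoint is hypothetical, and it assumes a PBS version whose hooks run Python 3 and can reach the network):

    import json
    import urllib.request
    import pbs

    e = pbs.event()
    job = e.job

    # Pull the extra attributes that never show up in the server_log.
    doc = {
        "job_id": job.id,
        "Output_Path": str(job.Output_Path),
        "Error_Path": str(job.Error_Path),
        "Submit_arguments": str(job.Submit_arguments),
    }

    try:
        req = urllib.request.Request(
            "http://localhost:9200/pbs-jobs/_doc/" + job.id,  # hypothetical
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req, timeout=2).read()  # keep the hook fast
    except Exception:
        pass  # never reject or delay the job because analytics failed

    e.accept()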

The qstat alternative (the CLI tool) accesses the Elasticsearch DB via the REST API with very low latency, which allows our users and their scripts to call it as often as they want without affecting the scheduler.
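For reference, the read path is just one GET against Elasticsearch; a sketch of the CLI tool (endpoint and index hypothetical):

    import json
    import sys
    import urllib.request

    job_id = sys.argv[1]
    url = "http://localhost:9200/pbs-jobs/_doc/" + job_id  # hypothetical
    with urllib.request.urlopen(url, timeout=2) as resp:
        doc = json.load(resp)["_source"]

    # Print in a qstat-like "attribute = value" layout.
    print("Job Id: " + job_id)
    for name, value in sorted(doc.items()):
        print("    %s = %s" % (name, value))

Since this touches only Elasticsearch, users can call it in tight loops without the PBS server ever seeing the traffic.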

The limitation here is that we only get the larger set of fields (e.g. Output_Path & Submit_arguments) in the RUNJOB hook (when the job starts running); we don’t have these fields while the job is in the Queued state.

I’m looking for a way to extract these fields right off the bat and have them injected into our Elasticsearch DB, so that the qstat alternative tool can access them while the job is still in the queued state.

Right now I’m thinking of extracting the data right after the job is queued, within Logstash; Logstash has a JDBC plugin, so I can query psql for the “attributes” column, e.g.
SELECT attributes FROM pbs.job WHERE ji_jobid = '<job_id>';

The query above would be called about 2-3 seconds after each job is submitted to the queue.
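If the JDBC plugin turns out to be awkward, the same lookup is trivial from Python with psycopg2 (the connection details here are guesses; check your datastore’s port, database name, and credentials). Note the bound parameter, which also sidesteps quoting issues:

    import psycopg2

    # Connection parameters are assumptions; adjust for your PBS datastore.
    conn = psycopg2.connect(host="pbsserver", port=15007,
                            dbname="pbs_datastore", user="postgres")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT attributes FROM pbs.job WHERE ji_jobid = %s",
                    ("1234.pbsserver",))
        row = cur.fetchone()
        print(row[0] if row else "job not found")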
If you are worried about putting stress on the “production server”, would it make sense to enable master-slave replication?

Do you have any experience doing so?

Thanks,
Roy

Very impressive indeed. You have touched upon all the important points and filled in the blanks. Spot on! I do not have experience with replication; copying @subhasisb @agrawalravi90 for any insights they can share on this.

Yeah - some sites use Postgres replication (of the PBS database) for backups; you can use it as a hot standby (or read-only) database, of course. You would want to set the replication to asynchronous mode so as not to hurt production performance.
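Nothing PBS-specific is needed; roughly, for streaming replication on PostgreSQL 12+ (hosts and the replication user are placeholders, and leaving synchronous_standby_names empty keeps it asynchronous):

    # primary postgresql.conf
    wal_level = replica
    max_wal_senders = 5
    synchronous_standby_names = ''    # empty = asynchronous replication

    # primary pg_hba.conf: allow the standby to connect
    host  replication  replicator  standby_host/32  md5

    # standby postgresql.conf (plus an empty standby.signal file in the data dir)
    primary_conninfo = 'host=primary_host user=replicator password=...'
    hot_standby = on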

  • Subhasis

Thank you @subhasisb. Do you have any blog/tutorial online that describes the process?

Nothing special actually. The online tutorials from the Postgres community work great. If you face any issues setting this up, perhaps I can try to help…