Routing Queue issue

djkingsley · January 18, 2021, 10:28pm

We have four small development clusters and one larger production cluster located across the country. Each of the clusters has the same version of OpenPBS installed, configured, and working well as a standalone system.

What I am trying to do now is set up a routing queue from each of the small development clusters to the larger production cluster.

I created a routing queue called routeProduction as follows on one of the development clusters:

create queue routeProduction
set queue routeProduction queue_type = Route
set queue routeProduction route_destinations = execution_queue@production-server.domainname
set queue routeProduction enabled = True
set queue routeProduction started = True

When I submit my job to the routeProduction queue on the development cluster, the job appears to start transferring to the production cluster (state ‘T’), but then ends up sitting on the development cluster in state ‘Q’.

The development cluster server logs show:

01/18/2021 17:00:18;0008;Server@dev-sever;Job;5462.dev-sever;Job Queued at request of username@dev-server.domain, owner = username@dev-server.domain, job name = myJobName, queue = routeProduction
01/18/2021 17:00:18;0001;Server@dev-sever;Svr;Server@dev-sever;send_job, production-server_ip: h_errno=1
01/18/2021 17:00:18;0100;Server@dev-sever;Job;5462.dev-sever;dequeuing from routeProduction, state T
01/18/2021 17:00:18;0008;Server@dev-sever;Job;5462.dev-sever;send of job to execution_queue@production-server.domain failed error = 15009
01/18/2021 17:00:19;0100;Server@dev-sever;Req;;Type 0 request received from username@dev-server.domain, sock=19
01/18/2021 17:00:19;0100;Server@dev-sever;Req;;Type 95 request received from username@dev-server.domain, sock=20
01/18/2021 17:00:19;0100;Server@dev-sever;Req;;Type 21 request received from username@dev-server.domain, sock=19
01/18/2021 17:00:19;0100;Server@dev-sever;Req;;Type 19 request received from username@dev-server.domain, sock=19
01/18/2021 17:00:20;0008;Server@dev-sever;Job;5462.dev-sever;send of job to execution_queue@production-server.domain failed error = 15009
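
For anyone who wants to pull these failures out of a longer server log, here is a minimal sketch. It assumes the semicolon-separated layout seen in the lines above (timestamp, event code, server, object type, object name, message). The names LogEntry, parse_line, and failed_sends are hypothetical helpers for illustration and are not part of OpenPBS.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    """One server log line, split into its six semicolon-separated fields."""
    timestamp: str
    event: str
    server: str
    obj_type: str
    obj_name: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    # Split on the first five semicolons only, so any semicolon inside
    # the free-text message stays part of the message.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None  # not in the assumed six-field layout
    return LogEntry(*parts)


# Matches the wording seen in the log excerpt above (an assumption about
# the message text, not a documented format).
ERROR_RE = re.compile(r"failed error = (\d+)")


def failed_sends(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (job_id, error_code) for each 'send of job ... failed' record."""
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry.obj_type != "Job":
            continue
        match = ERROR_RE.search(entry.message)
        if match:
            yield entry.obj_name, int(match.group(1))


# Two lines copied from the development server log above.
sample = [
    "01/18/2021 17:00:18;0008;Server@dev-sever;Job;5462.dev-sever;"
    "send of job to execution_queue@production-server.domain failed error = 15009",
    "01/18/2021 17:00:19;0100;Server@dev-sever;Req;;"
    "Type 0 request received from username@dev-server.domain, sock=19",
]

print(list(failed_sends(sample)))
```

Run against a full log file, this would list each job whose routing attempt failed along with the numeric error code, which makes it easier to match entries against the reject lines on the destination server.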

On the production cluster, I see the following in the server logs:

01/18/2021 17:08:04;0100;Server@production-server;Req;;Type 1 request received from root@dev-server_ip, sock=18
01/18/2021 17:08:04;0080;Server@production-server;Req;req_reject;Reject reply code=15009, aux=0, type=1, from root@dev-server_ip
01/18/2021 17:08:06;0100;Server@production-server;Req;;Type 1 request received from root@dev-server_ip, sock=18
01/18/2021 17:08:06;0080;Server@production-server;Req;req_reject;Reject reply code=15009, aux=0, type=1, from root@dev-server_ip

Error 15009 seems to indicate that the job already exists. Any ideas where to look?

One other note: I can submit the job directly to the execution queue on the production cluster using qsub. I can also ssh and scp to and from both clusters without problems.

The logs have been sanitized.

Thanks for any insights,
Dennis
