Qdel command is very delayed. No feedback from the system

Greetings,

I have OpenPBS installed on a workstation. I have noticed that the qdel command is extremely delayed (10-15 seconds). Also, the job status does not change once the command has been submitted.
The job is deleted successfully after that interval, so it is not a huge issue. However, it is inconvenient not knowing whether the job has been deleted and having to wait after every single command.

Please check DNS resolution for the PBS_SERVER host set in /etc/pbs.conf.
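
One way to check this is to time the name lookup directly. The following is only a minimal sketch, assuming a Linux host with getent available and a pbs.conf made of plain KEY=VALUE lines; the function names pbs_server_from_conf and time_lookup are illustrative and not part of PBS.

```shell
#!/bin/sh
# Print the value of PBS_SERVER from a pbs.conf-style file (KEY=VALUE lines).
# pbs_server_from_conf is a hypothetical helper name, not a PBS tool.
pbs_server_from_conf() {
    conf="${1:-/etc/pbs.conf}"
    # Keep only the text after "PBS_SERVER=" and use the first match.
    sed -n 's/^PBS_SERVER=//p' "$conf" | head -n 1
}

# Time a lookup through the system resolver. getent follows
# /etc/nsswitch.conf, so /etc/hosts and DNS are both consulted.
# Whole-second resolution is enough to spot a 10-15 second stall.
time_lookup() {
    start=$(date +%s)
    if getent hosts "$1"; then
        status="resolved"
    else
        status="NOT resolved"
    fi
    end=$(date +%s)
    echo "$1: $status in $((end - start)) s"
}

# Only run the checks when the configuration file is present.
if [ -r /etc/pbs.conf ]; then
    server=$(pbs_server_from_conf /etc/pbs.conf)
    if [ -n "$server" ]; then
        time_lookup "$server"
    fi
    time_lookup "$(hostname)"
fi
```

If either lookup takes several seconds, or the name does not resolve at all, name resolution is a likely place to look for the delay.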

Run the strace command below on your qdel to find the bottleneck:

strace qdel jobid
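
Because strace writes its trace to stderr by default, it is usually easier to send it to a file with -o and to add timing information so the slow call stands out. A sketch using the standard strace options -f, -tt, -T and -o; the file name qdel.trace and the helper slowest_syscalls are only illustrative.

```shell
#!/bin/sh
# Record the trace to a file:
#   -f   follow child processes
#   -tt  prefix each line with the time of day, including microseconds
#   -T   append the time spent in each syscall as <seconds>
#   -o   write the trace to a file instead of stderr
# Usage (jobid is a placeholder):
#   strace -f -tt -T -o qdel.trace qdel jobid

# List the N slowest syscalls in a trace that was recorded with -T.
# slowest_syscalls is a hypothetical helper name.
slowest_syscalls() {
    trace="$1"
    n="${2:-10}"
    # Each line ends in "<0.000123>"; print that duration first,
    # then sort numerically in descending order.
    awk 'match($0, /<[0-9]+\.[0-9]+>$/) {
             print substr($0, RSTART + 1, RLENGTH - 2), $0
         }' "$trace" | sort -rn | head -n "$n"
}

# Only summarize when a trace file is present in the current directory.
if [ -r qdel.trace ]; then
    slowest_syscalls qdel.trace 10
fi
```

A call that accounts for most of the 10-15 seconds (for example a long poll or connect) shows up at the top of the list, which narrows down where qdel is waiting.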

Dear adarsh,

The DNS resolution should be fine: PBS_SERVER has the same name that I get from list server in qmgr.
When I run the strace command I get a very long output (I would say 100-200 lines).
These are the first lines:

sysadmin@Precision-7920-Tower:~/testMDX/newtest$ strace qdel 5007 > error.log
execve("/opt/pbs/bin/qdel", ["qdel", "5007"], 0x7ffc9ef48e68 /* 92 vars */) = 0
brk(NULL) = 0x55e5271a0000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fffc362e9e0) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f930f089000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/intel/oneapi/vpl/2022.0.0/lib/tls/haswell/avx512_1/x86_64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/opt/intel/oneapi/vpl/2022.0.0/lib/tls/haswell/avx512_1/x86_64", 0x7fffc362dc30) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/intel/oneapi/vpl/2022.0.0/lib/tls/haswell/avx512_1/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/opt/intel/oneapi/vpl/2022.0.0/lib/tls/haswell/avx512_1", 0x7fffc362dc30) = -1 ENOENT (No such file or directory)

The openat and stat lines repeat in a loop many times after this.

These are the last few lines:
getsockopt(3, SOL_TCP, TCP_NODELAY, [0], [4]) = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
write(3, "PKTV1\0\0\0\0\0<+2+22+21+8sysadmin+0+"..., 71) = 71
poll([{fd=3, events=POLLIN}], 1, 10800000) = 1 ([{fd=3, revents=POLLIN}])
read(3, "PKTV1\0\0\0\0\0", 11) = 11
poll([{fd=3, events=POLLIN}], 1, 10800000) = 1 ([{fd=3, revents=POLLIN}])
read(3, "+2+2+0+0+6+0+1+02+20Precision-79"..., 42) = 42
write(3, "PKTV1\0\0\0\0\0\35+2+23+100+8sysadmin+1"..., 40) = 40
poll([{fd=3, events=POLLIN}], 1, 10800000) = 1 ([{fd=3, revents=POLLIN}])
read(3, "PKTV1\0\0\0\0\0\20", 11) = 11
poll([{fd=3, events=POLLIN}], 1, 10800000) = 1 ([{fd=3, revents=POLLIN}])
read(3, "+2+2+0+02+11+0+0", 16) = 16
write(3, "PKTV1\0\0\0\0\0\22+2+22+59+8sysadmin", 29) = 29
read(3, "", 1) = 0
close(3) = 0
exit_group(0) = ?
+++ exited with 0 +++

As I said, the output is quite large, so I am not going to post all of it. If you want me to look for something specific, let me know.
The good thing is that after I run the strace command and get the output, the job is deleted successfully without delay.
