"Event can't be found in new server to be duplicated"

I’m seeing a lot of the message “Event can’t be found in new server to be duplicated” in sched_logs.

Can anyone point me to the part of the documentation that explains what this might mean?

Hey @datakid,
This is one of those messages that you shouldn’t see. I can explain what is going on from the code’s point of view, but translating that into what is actually happening at your site is going to be hard.

From time to time the scheduler needs a scratch copy of its PBS universe. It duplicates the universe so we can make modifications to the copy without touching the real version. The message you are seeing comes from that duplication process. Part of the PBS universe is a calendar of events. These events are when jobs run, jobs end, etc. Part of this event structure is a pointer to the object that the event pertains to. The message you are seeing is for jobs and reservations: we are trying to duplicate a job/resv start/end event, we look for that event’s object (the job or resv) in the duplicated PBS universe, and we can’t find it.
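To make the description above concrete, here is a minimal Python sketch of duplicating a universe whose calendar events point at jobs. It is only an illustration of the idea, not the scheduler's actual code (which is C); the names `Job`, `Event`, `Universe`, `dup_event` and `dup_universe` are made up for this example, and the sketch simply drops an event it cannot re-point.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str


@dataclass
class Event:
    kind: str    # e.g. "run" or "end"
    time: int
    target: Job  # the object this event pertains to


@dataclass
class Universe:
    jobs: Dict[str, Job]
    calendar: List[Event]


def dup_event(old: Event, new_jobs: Dict[str, Job]) -> Optional[Event]:
    """Copy one event and re-point it at the matching job in the copy."""
    target = new_jobs.get(old.target.name)
    if target is None:
        # The case the log message describes: the copy has no such object.
        return None
    return Event(old.kind, old.time, target)


def dup_universe(old: Universe) -> Universe:
    """Build a scratch copy that shares no objects with the original."""
    new_jobs = {name: Job(job.name) for name, job in old.jobs.items()}
    calendar = []
    for event in old.calendar:
        copy = dup_event(event, new_jobs)
        if copy is None:
            print("Event can't be found in new server to be duplicated")
            continue  # this sketch just skips the event
        calendar.append(copy)
    return Universe(new_jobs, calendar)


# One normal job, plus an event whose job is missing from the job table.
known = Job("1.headnode")
missing = Job("2.headnode")
real = Universe({"1.headnode": known},
                [Event("run", 10, known), Event("end", 20, missing)])
scratch = dup_universe(real)
print(len(scratch.calendar))
```

The point of the lookup is that the copied event must reference the copied job, never the original one, so changes to the scratch universe cannot leak back.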

Like I said above, this is one of those messages you shouldn’t see. There should be no reason we can’t find the job/resv we are looking for.

Can I ask more about your site? What version are you running? Do you use reservations? Do you use job arrays? Do you use backfilling? Do you use preemption?

Bhroam

Hi Bhroam, thanks for the reply. I’m new to this job and was looking through the logs when the PBS process stopped responding last night.

I am pretty sure every time I see it in the logs, it’s in relation to arrays.
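For anyone wanting to check the array hunch rather than eyeball it, a small Python sketch like the following could tally the message per day and by whether the line names an array job. It assumes the usual semicolon-separated PBS log layout (date and time first, object name in the fifth field) and that the object name carries the job id; both are assumptions to verify against your own sched_logs, and `tally` is just a name chosen for this example.

```python
import re
import sys
from collections import Counter
from typing import Iterable, Tuple

# Matched without the apostrophe so quoting differences do not matter.
MESSAGE = "be found in new server to be duplicated"
ARRAY_ID = re.compile(r"\d+\[\d*\]")  # 123[] (array) or 123[4] (subjob)


def tally(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count matching lines per day and per job kind (array vs other)."""
    per_day: Counter = Counter()
    per_kind: Counter = Counter()
    for line in lines:
        if MESSAGE not in line:
            continue
        fields = line.split(";", 5)
        day = fields[0].split(" ")[0]  # assumed "MM/DD/YYYY HH:MM:SS"
        per_day[day] += 1
        name = fields[4] if len(fields) > 5 else ""  # assumed object name
        per_kind["array" if ARRAY_ID.search(name) else "other"] += 1
    return per_day, per_kind


if __name__ == "__main__":
    # Pass one or more scheduler log files on the command line.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            days, kinds = tally(handle)
        print(path, dict(days), dict(kinds))
```

If nearly every hit lands in the "array" bucket, that would back up the impression; a mix would suggest arrays are a coincidence.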

Yes to reservations, but really only for maintenance, so they are unlikely to be the cause.

PBSPro 19.1.1

We use preemption but not backfilling.

Cheers!

Hey @datakid,
From what you told me, there are two places this might show up. The first is when we’re confirming reservations. The second is in preemption. From what you say, it sounds like it’s more likely to be preemption that is causing the problem.

How often do you see this log message? When this message happens, it can cause some havoc. The function that dups the event list doesn’t expect a NULL, so it adds the NULL to the list, basically chopping the list in half. We’ll leak the front half of the list and return the second half.
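One way a NULL in the middle can chop a list like this is sketched below in Python, with `None` standing in for NULL. This is a guess at the general shape of the bug, not the scheduler's actual code; `Node`, `naive_dup` and `careful_dup` are names invented for the illustration.

```python
from typing import Callable, List, Optional


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.next: Optional["Node"] = None


def build(names: List[str]) -> Optional[Node]:
    """Build a singly linked list from a list of names."""
    head = tail = None
    for name in names:
        node = Node(name)
        if tail is None:
            head = node
        else:
            tail.next = node
        tail = node
    return head


def to_names(head: Optional[Node]) -> List[str]:
    out = []
    while head is not None:
        out.append(head.name)
        head = head.next
    return out


def naive_dup(head: Optional[Node],
              copy_one: Callable[[Node], Optional[Node]]) -> Optional[Node]:
    """Copy a list without expecting copy_one to ever return None."""
    new_head = prev = None
    node = head
    while node is not None:
        new = copy_one(node)
        if prev is None:
            new_head = new   # after a None, this restarts the list
        else:
            prev.next = new  # a None here terminates the front half
        prev = new           # a None here forgets the front half
        node = node.next
    return new_head


def careful_dup(head: Optional[Node],
                copy_one: Callable[[Node], Optional[Node]]) -> Optional[Node]:
    """Same loop, but a failed copy is skipped instead of linked in."""
    new_head = prev = None
    node = head
    while node is not None:
        new = copy_one(node)
        if new is not None:
            if prev is None:
                new_head = new
            else:
                prev.next = new
            prev = new
        node = node.next
    return new_head


def copy_one(node: Node) -> Optional[Node]:
    # Pretend the lookup fails for the event named "X".
    return None if node.name == "X" else Node(node.name)


events = build(["a", "b", "X", "c", "d"])
print(to_names(naive_dup(events, copy_one)))
print(to_names(careful_dup(events, copy_one)))
```

In the naive version the copies of "a" and "b" are built and then become unreachable, which in C would be the leaked front half; only the nodes after the failed copy are returned.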

If reservations and calendaring aren’t in the picture, this doesn’t mean much. Preemption itself doesn’t use the calendar except to make sure we don’t bump into other run events. The only run events you will have on your calendar will be reservations. If you have a near-term reservation, preemption might run a high-priority job that conflicts with the reservation.

I am unsure what is causing this. Next time you see this issue, can you run pbs_snapshot and attach the output to a GitHub issue or open a Jira issue? It will make looking into this issue easier.

Bhroam

Bhroam,

Thanks for the reply.

We have come up against a problem. I’m not 100% sure of the solution, but I think I see what the problem is.

[root@headnode ~]# pbs_snapshot -o /tmp/ -H headnode -l DEBUG --daemon-logs=7 --accounting-logs=7
2019-10-21 14:44:24,736 INFOCLI2 headnode: ssh headnode python -c "import os;print [False, os.environ['PBS_CONF_FILE']]['PBS_CONF_FILE' in os.environ]"
root@headnode's password:
root@headnode's password:
root@headnode's password:

It looks like pbs_snapshot is trying to ssh into itself.

We have ssh disabled for root on the headnode (where the PBS server is running). Admins can sudo up to root (as I have here), but ssh as root won’t work. I’m not seeing anything in the pbs_snapshot help to address this.
[edit] I also get this result when I run as sudo with --with-sudo

Ok, we tried to trick it with a dirty little PATH/ssh hack:

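#!/bin/bash
# Fake "ssh" placed ahead of the real one in PATH:
# drop the hostname argument and run the rest locally.
shift 1
exec "$@"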

But this just fails further down the snapshot. From the stdout:

2019-10-21 15:16:02,876 ERROR    err: ["qmgr: Illegal operation: 'list", "Try 'help' if you are having trouble."]
2019-10-21 15:16:02,877 DEBUG    rc: 1
Traceback (most recent call last):
  File "/opt/pbs/unsupported/fw/bin/pbs_snapshot.py", line 315, in <module>
    with_sudo=with_sudo) as snap_utils:
  File "/opt/pbs/unsupported/fw/ptl/utils/pbs_snaputils.py", line 247, in __enter__
    self.with_sudo)
  File "/opt/pbs/unsupported/fw/ptl/utils/pbs_snaputils.py", line 366, in __init__
    self.custom_rscs = self.server.parse_resources()
  File "/opt/pbs/unsupported/fw/ptl/lib/pbs_testlib.py", line 7687, in parse_resources
    self.manager(MGR_CMD_LIST, RSC)
  File "/opt/pbs/unsupported/fw/ptl/lib/pbs_testlib.py", line 6676, in manager
    post=self._disconnect, conn=c)
ptl.lib.pbs_testlib.PbsManagerError: rc=1, rv=False, msg=["qmgr: Illegal operation: 'list", "Try 'help' if you are having trouble."]

From the log that is as yet un-tarred, we are seeing:

2019-10-21 15:16:02,836 INFOCLI  headnode: ssh headnode /opt/pbs/bin/qmgr -c 'list resource'
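A possible reading of the `Illegal operation: 'list` error, offered as a guess rather than a confirmed diagnosis: a real ssh joins its arguments and lets a shell on the far side re-parse them, which strips the single quotes, while a wrapper that does `exec "$@"` passes each argument through untouched. The Python sketch below uses `shlex.split` to stand in for that shell parse; the `literal_argv` list assumes the caller had already embedded the quotes in the argument, which is not verified here.

```python
import shlex

# The command exactly as it appears in the log line.
logged = "/opt/pbs/bin/qmgr -c 'list resource'"

# Roughly what a shell behind a real ssh would hand to qmgr:
# the quotes group the two words and are then removed.
via_shell = shlex.split(logged)
print(via_shell)

# What an exec "$@" wrapper would hand over if the quotes were
# already part of the argument (assumption about the caller).
literal_argv = ["/opt/pbs/bin/qmgr", "-c", "'list resource'"]

# The first word of that argument still starts with a quote,
# which lines up with the token named in the qmgr error.
first_token = literal_argv[2].split()[0]
print(first_token)
```

If that reading is right, the wrapper would have to push the joined command back through a shell to behave like ssh, which is more trouble than it is worth for a local capture.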

Hey @agrawalravi90 (he wrote pbs_snapshot), can you suggest something here?

Bhroam

It should be smarter, sorry about that. If you are capturing the local node itself, you don’t need to provide the -H option, so please try running it without that.

That worked, thank you. There is concern about posting this data to a public git repo issue board. Is there somewhere more secure I can send it? I’ve included the last 7 days. We aren’t seeing the issue at the moment, but there is plenty of it in the logs from those seven days. Is that enough, or should I keep an eye on the logs and re-run when it’s actually happening?