PBS allows a TensorFlow job to run on one node but terminates it when it runs on another node

Hello Contributors,

I submitted a TensorFlow job on my HPC server and observed the following:
When the job runs on compute node 1, it gets terminated, but when it runs on any other node it completes successfully. Is the problem due to the scheduler or an issue with the compute node itself?
Below is the error message.

WARNING:tensorflow:From inception_reimplement_v2.py:236: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

2018-12-10 11:27:39.691795: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-10 11:27:41.422854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:02:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-12-10 11:27:41.731735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:82:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-12-10 11:27:41.731816: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1
2018-12-10 11:27:42.881204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-10 11:27:42.881258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0 1
2018-12-10 11:27:42.881269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N N
2018-12-10 11:27:42.881273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1: N N
2018-12-10 11:27:42.883945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15127 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:02:00.0, compute capability: 6.0)
2018-12-10 11:27:43.023229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 15127 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0, compute capability: 6.0)
2018-12-10 11:27:45.151367: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.194114: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.222603: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.250662: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.279407: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.307738: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-12-10 11:27:45.882243: E tensorflow/stream_executor/cuda/cuda_dnn.cc:455] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-12-10 11:27:45.882846: W ./tensorflow/stream_executor/stream.h:2023] attempting to perform DNN operation using StreamExecutor without DNN support
2018-12-10 11:27:46.007526: E tensorflow/stream_executor/cuda/cuda_dnn.cc:455] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-12-10 11:27:46.007594: F tensorflow/core/kernels/conv_ops.cc:713] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)
/var/spool/pbs/mom_priv/jobs/80112.master1.local.SC: line 14: 57413 Aborted python3.5 inception_reimplement_v2.py
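
For comparing the failing node against a working one, a small check along the following lines can be run on each node and the outputs diffed. This is only a sketch, assuming the same Python 3.5 / TensorFlow 1.x install on every node; the script and its checks are illustrative and not part of the original job.

  # node_gpu_check.py -- hypothetical helper, not from the original job.
  # Prints the TensorFlow build info and the devices TensorFlow can actually
  # initialise, so a failing node can be compared against a working one.
  import tensorflow as tf
  from tensorflow.python.client import device_lib

  print("TensorFlow version:", tf.__version__)
  print("Built with CUDA:", tf.test.is_built_with_cuda())

  # On the broken node this call may fail or log the same
  # CUBLAS_STATUS_NOT_INITIALIZED / CUDNN_STATUS_INTERNAL_ERROR messages.
  for dev in device_lib.list_local_devices():
      print(dev.name, dev.physical_device_desc)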

Could you please describe your setup, and also share the output of the below commands:

  1. tracejob
  2. qstat -fx

<job_id>.<server_name>.SC -> the job script, if one was provided at qsub

It seems the job script executing on the node aborted due to an issue at line 14.

Hi Adarsh. I found out that the CUDA version being used was not compatible with the TensorFlow version. A CUDA update solved it.
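
For anyone hitting the same thing, a quick sanity check along these lines exercises cuBLAS and cuDNN, which were the calls failing above. This is a sketch assuming the same TensorFlow 1.x API as the original job, not the original training script.

  # Hypothetical sanity check, not from the original thread: exercises cuBLAS
  # (matmul) and cuDNN (conv2d) on GPU:0, the two libraries that failed above.
  import numpy as np
  import tensorflow as tf

  with tf.device("/device:GPU:0"):
      a = tf.constant(np.random.rand(64, 64), dtype=tf.float32)
      b = tf.constant(np.random.rand(64, 64), dtype=tf.float32)
      mm = tf.matmul(a, b)  # goes through cuBLAS

      img = tf.constant(np.random.rand(1, 32, 32, 3), dtype=tf.float32)
      kern = tf.constant(np.random.rand(3, 3, 3, 8), dtype=tf.float32)
      conv = tf.nn.conv2d(img, kern, strides=[1, 1, 1, 1], padding="SAME")  # cuDNN

  with tf.Session() as sess:
      sess.run([mm, conv])
      print("cuBLAS and cuDNN ops ran on GPU:0 without errors")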
