Hello,
I found the problem. We use a Supermicro server with an onboard PCI switch, so the PCI IDs are “wrong”.
GPUs: {'nvidia4': '00000000:08:00.0', 'nvidia5': '00000000:0B:00.0', 'nvidia6': '00000000:0C:00.0', 'nvidia7': '00000000:0D:00.0', 'nvidia0': '00000000:04:00.0', 'nvidia1': '00000000:05:00.0', 'nvidia2': '00000000:06:00.0', 'nvidia3': '00000000:07:00.0', 'nvidia8': '00000000:0E:00.0', 'nvidia9': '00000000:0F:00.0'}
But the devices have different IDs in the PCI device list:
'card3': {'realpath': '/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:0c.0/0000:06:00.0/drm/card3', 'major': 226, 'type': 'c', 'numa_node': 0, 'device': '/dev/dri/card3', 'bus_id': '0000:00:02.0', 'minor': 3},
'card2': {'realpath': '/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:08.0/0000:05:00.0/drm/card2', 'major': 226, 'type': 'c', 'numa_node': 0, 'device': '/dev/dri/card2', 'bus_id': '0000:00:02.0', 'minor': 2},
'card1': {'realpath': '/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:04.0/0000:04:00.0/drm/card1', 'major': 226, 'type': 'c', 'numa_node': 0, 'device': '/dev/dri/card1', 'bus_id': '0000:00:02.0', 'minor': 1},
'card0': {'realpath': '/sys/devices/pci0000:00/0000:00:1c.7/0000:11:00.0/0000:12:00.0/drm/card0', 'major': 226, 'type': 'c', 'numa_node': 0, 'device': '/dev/dri/card0', 'bus_id': '0000:00:1c.7', 'minor': 0},
'card7': {'realpath': '/sys/devices/pci0000:00/0000:00:03.0/0000:09:00.0/0000:0a:08.0/0000:0c:00.0/drm/card7', 'major': 226, 'type': 'c', 'numa_node': 0, 'device': '/dev/dri/card7', 'bus_id': '0000:00:03.0', 'minor': 7},
'card6': {'realpath': '/sys/devices/pci0000:00/0000:00:03.0/0000:09:00.0/0000:0a:04.0/0000:0b:00.0/drm/card6', 'major': 226, 'type': 'c', 'numa_node': 0, 'device': '/dev/dri/card6', 'bus_id': '0000:00:03.0', 'minor': 6},
'card5': {'realpath': '/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:14.0/0000:08:00.0/drm/card5', 'major': 226, 'type': 'c', 'numa_node': 0, 'device': '/dev/dri/card5', 'bus_id': '0000:00:02.0', 'minor': 5},
'card4': {'realpath': '/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:10.0/0000:07:00.0/drm/card4', 'major': 226, 'type': 'c', 'numa_node': 0, 'device': '/dev/dri/card4', 'bus_id': '0000:00:02.0', 'minor': 4},
'card10': {'realpath': '/sys/devices/pci0000:00/0000:00:03.0/0000:09:00.0/0000:0a:14.0/0000:0f:00.0/drm/card10', 'major': 226, 'type': 'c', 'numa_node': 0, 'device': '/dev/dri/card10', 'bus_id': '0000:00:03.0', 'minor': 10},
'card8': {'realpath': '/sys/devices/pci0000:00/0000:00:03.0/0000:09:00.0/0000:0a:0c.0/0000:0d:00.0/drm/card8', 'major': 226, 'type': 'c', 'numa_node': 0, 'device': '/dev/dri/card8', 'bus_id': '0000:00:03.0', 'minor': 8},
Maybe it is a bug in nvidia-smi, but I think I need to find a workaround inside the hook.
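
One possible direction for such a workaround, as a rough sketch only: in the dump above, the 'bus_id' field holds the root port in front of the switch (e.g. 0000:00:02.0), while the last PCI address in each 'realpath' (e.g. 0000:06:00.0 for card3) appears to line up with what nvidia-smi reports. Assuming that holds on this kind of board, the hook could match on that last address instead. The function names below are hypothetical and not part of the existing hook.

```python
import re

# Matches a PCI address in sysfs style: domain:bus:device.function,
# for example 0000:06:00.0.
PCI_ADDR = re.compile(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]")


def normalize_smi_bus_id(bus_id):
    """Turn an nvidia-smi style ID ('00000000:0B:00.0') into sysfs style ('0000:0b:00.0')."""
    domain, rest = bus_id.split(":", 1)
    # nvidia-smi prints an 8-digit domain; sysfs uses the last 4 digits, lowercase.
    return (domain[-4:] + ":" + rest).lower()


def leaf_pci_address(realpath):
    """Return the last PCI address found in a sysfs path (the device behind the switch)."""
    matches = PCI_ADDR.findall(realpath)
    return matches[-1].lower() if matches else None


def match_gpus_to_cards(gpus, cards):
    """Map each nvidia device name to the DRM card with the same leaf PCI address."""
    by_addr = {leaf_pci_address(info["realpath"]): name for name, info in cards.items()}
    return {gpu: by_addr.get(normalize_smi_bus_id(bus_id)) for gpu, bus_id in gpus.items()}
```

With the values posted above, nvidia2 ('00000000:06:00.0') would be paired with card3 this way; a GPU with no matching card maps to None.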