Recently, I had the experience of working with an air-gapped Red Hat OpenShift cluster designed for AI, with several NVIDIA DGX servers (8 x A100 GPUs each) as worker nodes. As part of the installation, we were asked to implement the InfiniBand and RDMA capabilities of the setup. Needless to say, we were successful; if we weren't, I wouldn't be writing this blog!
The following are the steps needed to implement the InfiniBand and RDMA features.
This procedure assumes the following:
- You have already installed the NFD and NVIDIA GPU Operators and deployed their respective CRs.
- You have correctly set up the network connections from the IB interfaces to the switch when the DGX servers were installed.
You can find all YAMLs used in this document in this repo.
1. Install the SR-IOV Network Operator via OperatorHub. To do this, find the operator in the OperatorHub using the OpenShift UI and install it with the default cluster-wide setup.
Note: If you are in a disconnected environment, you can add it to your cluster using this procedure for oc-mirror.
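If you prefer the CLI over the UI, a minimal sketch of the equivalent installation objects follows (the channel and catalog source names can vary by OpenShift version; in a disconnected cluster, point source at your mirrored catalog):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator
  namespace: openshift-sriov-network-operator
spec:
  channel: stable
  name: sriov-network-operator
  source: redhat-operators # use your mirrored catalog source when disconnected
  sourceNamespace: openshift-marketplace
```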
2. Create SriovIBNetwork (eight of them):
```yaml
kind: SriovIBNetwork
apiVersion: sriovnetwork.openshift.io/v1
metadata:
  name: ibnetwork{0..7}
  namespace: openshift-sriov-network-operator
spec:
  ipam: |-
    {
      "type": "whereabouts",
      "range": "192.168.{0..7}.X/24"
    }
  networkNamespace: default
  resourceName: rdma_sw{0..7}
  linkState: enable
```
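The {0..7} placeholders denote eight nearly identical objects. A sketch of a shell loop that renders and applies all of them (it substitutes .0 for the X placeholder in the range; adjust to your addressing plan):

```bash
# Create ibnetwork0..ibnetwork7, one per IB interface
for i in {0..7}; do
  oc apply -f - <<EOF
kind: SriovIBNetwork
apiVersion: sriovnetwork.openshift.io/v1
metadata:
  name: ibnetwork${i}
  namespace: openshift-sriov-network-operator
spec:
  ipam: |-
    {
      "type": "whereabouts",
      "range": "192.168.${i}.0/24"
    }
  networkNamespace: default
  resourceName: rdma_sw${i}
  linkState: enable
EOF
done
```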
3. Create SriovNetworkNodePolicy (eight of them):
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw{0..7}
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  nicSelector:
    pfNames:
    - ib{0..7}
  deviceType: netdevice
  numVfs: 8
  priority: 99
  resourceName: rdma_sw{0..7}
  isRdma: true
  linkType: IB
```
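Before moving on, it is worth checking that the operator has applied the policies and that the VFs are advertised as node resources; a sketch of the checks I would run:

```bash
# Each node state should report syncStatus: Succeeded
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml | grep syncStatus

# The rdma_sw* resources should now appear among the node's allocatable resources
$ oc describe node <dgx-node> | grep openshift.io/rdma_sw
```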
4. Install the NVIDIA network-operator.
Do the following for a disconnected environment:
- Download the helm chart and open it.
- Download and save the needed operator image.
- Update the values file and install.
```bash
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm pull nvidia/network-operator --untar
```
Bring the container image into your internal container registry, then update the Helm values.yaml file with your internal container registry to ensure that the image is pulled from your registry and not the Internet:
```bash
$ cd network-operator
$ helm install network-operator .
```
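For reference, the override usually lives under the operator image keys in values.yaml; the sketch below shows the idea, but verify the exact key names against the values.yaml shipped with your chart version:

```yaml
# values.yaml excerpt (key names are an assumption; check your chart)
operator:
  repository: CONTAINERS_REGISTRY/nvcr.io/nvidia/cloud-native
  image: network-operator
```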
5. Create a NicClusterPolicy to install the MOFED driver.
For a disconnected environment, build an image containing the package requirements for the MOFED driver, because the driver container compiles, builds, and installs the driver dynamically:
- Identify the kernel of the DGX worker nodes.
- Create a Dockerfile that uses the original image in the FROM field and installs the required support packages, as shown below:
```dockerfile
# Replace --releasever=8.4 with your RHEL release repository
# Replace the kernel version with the kernel version of your worker nodes ($(uname -r))
# The Dockerfile below is good for OCP 4.10.42
# Kernel packages can be downloaded from the Red Hat website:
# https://access.redhat.com/downloads/content/package-browser
FROM nvcr.io/nvidia/mellanox/mofed:23.04-0.5.3.3.1-rhcos4.10-amd64
COPY kernel-core-4.18.0-305.65.1.el8_4.x86_64.rpm /root/
COPY kernel-headers-4.18.0-305.65.1.el8_4.x86_64.rpm /root/
COPY kernel-devel-4.18.0-305.65.1.el8_4.x86_64.rpm /root/
RUN dnf clean all && \
    dnf install --releasever=8.4 /root/kernel-core-4.18.0-305.65.1.el8_4.x86_64.rpm -y && \
    dnf install --releasever=8.4 /root/kernel-headers-4.18.0-305.65.1.el8_4.x86_64.rpm -y && \
    dnf install --releasever=8.4 /root/kernel-devel-4.18.0-305.65.1.el8_4.x86_64.rpm -y && \
    dnf install --releasever=8.4 elfutils-libelf-devel kernel-rpm-macros createrepo numactl-libs -y
```
When deploying the NicClusterPolicy YAML below, the operator automatically appends a suffix for your OCP version (here, -rhcos4.10-amd64) to the image version, so tag your image in an air-gapped environment accordingly.
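For example, preparing and pushing the image might look like the following sketch (podman and the registry path are assumptions; adjust to your environment):

```bash
# Build the image with the kernel packages baked in; the tag carries the
# driver version plus the OCP suffix the operator expects
$ podman build -t CONTAINERS_REGISTRY/openshift/nvidia/mofed-offline-v2:23.04-0.5.3.3.1-rhcos4.10-amd64 .

# Push it to the internal registry referenced by the NicClusterPolicy
$ podman push CONTAINERS_REGISTRY/openshift/nvidia/mofed-offline-v2:23.04-0.5.3.3.1-rhcos4.10-amd64
```

With the image pushed, apply the NicClusterPolicy: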
```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
  namespace: network-operator
spec:
  ofedDriver:
    env:
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: 'true'
    - name: UNLOAD_STORAGE_MODULES
      value: 'true'
    - name: CREATE_IFNAMES_UDEV
      value: 'true'
    # When using an internal air-gapped registry with the prepared image
    image: mofed-offline-v2
    repository: CONTAINERS_REGISTRY/openshift/nvidia
    version: "23.04-0.5.3.3.1"
    # end of air-gapped section
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      drain:
        enable: true
        force: false
        podSelector: ""
        timeoutSeconds: 300
        deleteEmptyDir: false
```
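After applying the policy, the MOFED driver pods build and load the driver on each DGX node, which can take several minutes; a sketch of how to watch the rollout (the resources namespace varies by operator version):

```bash
# Watch the MOFED driver pods compile and come up
$ oc get pods -A | grep mofed

# The policy reports "ready" once the driver is loaded everywhere
$ oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'
```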
6. Enable RDMA in the nvidia-gpu-operator clusterpolicy (make sure the rdma section exists in the driver block and is set to true):
```bash
$ oc edit clusterpolicy gpu-cluster-policy -n nvidia-gpu-operator
```

```yaml
..............
spec:
...................
  driver:
    certConfig:
      name: ""
    enabled: true
    kernelModuleConfig:
      name: ""
    licensingConfig:
      configMapName: ""
      nlsEnabled: false
    rdma:
      enabled: true
    repoConfig:
      configMapName: ""
    rollingUpdate:
      maxUnavailable: ""
    virtualTopology:
......................
```
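Equivalently, the same change can be made without the editor; a one-line patch sketch:

```bash
# Enable RDMA in the driver block of the ClusterPolicy
$ oc patch clusterpolicy gpu-cluster-policy -n nvidia-gpu-operator \
    --type merge -p '{"spec":{"driver":{"rdma":{"enabled":true}}}}'
```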
7. Ensure that the mpi-operator is installed. Find it here.
If it is not installed, apply the YAML at mpi-operator/mpi-operator/deploy/v2beta1/mpi-operator.yaml after updating the image locations.
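In a disconnected environment, a sketch of one way to rewrite the image references inline while applying (the CONTAINERS_REGISTRY prefix is an assumption):

```bash
# Point the mpioperator/* images at the internal registry and apply
$ sed 's|image: mpioperator/|image: CONTAINERS_REGISTRY/mpioperator/|' \
    mpi-operator/mpi-operator/deploy/v2beta1/mpi-operator.yaml | oc apply -f -
```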
8. Run an MPIJob in a namespace.
Note: To use the image provided by NVIDIA, you need to grant the privileged SCC to the default service account in the namespace:
```bash
$ oc project myProject
$ oc adm policy add-scc-to-user privileged -z default
```
The CONTAINERS_REGISTRY variable must be replaced with an internal registry containing the image, or point to one on the Internet:
```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1 # should equal the number of GPUs per worker
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: CONTAINERS_REGISTRY/nvcr.io/nvidia/tensorflow:21.12-tf1-py3
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2" # should equal the total number of GPUs (# of workers x # of GPUs per worker)
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_IB_DISABLE=0
            - -x
            - NCCL_NET_GDR_LEVEL=2
            - -x
            - TF_ALLOW_IOLIBS=1
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -mca
            - btl_tcp_if_include
            - eth0
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --batch_size=768
            - --model=resnet152
            - --variable_update=horovod
            - --use_fp16=true
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: |-
              [
                { "name": "ibnetwork0", "namespace": "default" },
                { "name": "ibnetwork1", "namespace": "default" },
                { "name": "ibnetwork2", "namespace": "default" },
                { "name": "ibnetwork3", "namespace": "default" },
                { "name": "ibnetwork4", "namespace": "default" },
                { "name": "ibnetwork5", "namespace": "default" },
                { "name": "ibnetwork6", "namespace": "default" },
                { "name": "ibnetwork7", "namespace": "default" }
              ]
        spec:
          containers:
          - image: CONTAINERS_REGISTRY/nvcr.io/nvidia/tensorflow:21.12-tf1-py3
            name: tensorflow-benchmarks
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]
            resources:
              limits:
                nvidia.com/gpu: 1
                openshift.io/rdma_sw0: 1
                openshift.io/rdma_sw1: 1
                openshift.io/rdma_sw2: 1
                openshift.io/rdma_sw3: 1
                openshift.io/rdma_sw4: 1
                openshift.io/rdma_sw5: 1
                openshift.io/rdma_sw6: 1
                openshift.io/rdma_sw7: 1
```
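After applying the MPIJob, a sketch of how I would watch it come up and follow the benchmark (the namespace and file name are assumptions):

```bash
# Submit the job and watch the launcher and workers start
$ oc apply -f mpijob.yaml -n myProject
$ oc get pods -n myProject -w

# Benchmark output streams to the launcher pod
$ oc logs -f tensorflow-benchmarks-launcher -n myProject
```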
In the case of an air-gapped environment, after you apply the MPIJob, edit the launcher pod to replace image: mpioperator/kubectl-delivery:latest with the URL of your image's container registry within your disconnected environment:
```bash
$ oc edit pod tensorflow-benchmarks-launcher
```
In the pod spec, replace the image: field with the location of the kubectl-delivery:latest image in your internal registry. Once you have done this, the pod enters the Init stage and then, once all workers are ready, moves to Running. All logs going forward appear in the launcher pod.
Note the following options:
- To disable GPUDirect RDMA, change NCCL_NET_GDR_LEVEL=2 to NCCL_NET_GDR_LEVEL=0.
- To disable InfiniBand, change NCCL_IB_DISABLE=0 to NCCL_IB_DISABLE=1 and remove the network annotations from the Worker section.
To confirm that InfiniBand is being used, look for lines like the following in the log of the launcher pod:
```
…….
tensorflow-benchmarks-worker-2:23:88 [0] NCCL INFO Channel 00 : 1[bd000] -> 2[90000] [receive] via NET/IBext/4/GDRDMA
tensorflow-benchmarks-worker-0:23:88 [0] NCCL INFO Channel 00 : 7[87000] -> 0[87000] [receive] via NET/IBext/4/GDRDMA
tensorflow-benchmarks-worker-3:23:88 [0] NCCL INFO Channel 00 : 2[90000] -> 3[b7000] [receive] via NET/IBext/6/GDRDMA
tensorflow-benchmarks-worker-1:23:88 [0] NCCL INFO Channel 00 : 0[87000] -> 1[bd000] [receive] via NET/IBext/6/GDRDMA
tensorflow-benchmarks-worker-4:23:88 [0] NCCL INFO Channel 00 : 3[b7000] -> 4[90000]
…………………..
```
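A quick way to scan for this from the command line (pod name from the step above):

```bash
# Count the NCCL channels established over GPUDirect RDMA
$ oc logs tensorflow-benchmarks-launcher -n myProject | grep -c GDRDMA
```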
In my trials, I have seen more than a 10x improvement in model training speed.
Wrap up
These steps implement the InfiniBand and RDMA features in air-gapped Red Hat OpenShift environments. Be sure to use the linked YAML files and repositories to help you get started with your own setup.
About the author
Senior Cloud Architect working with a variety of customers in the defense sector. Extensive experience designing and implementing OpenShift ML air-gapped environments and CI/CD solutions, as well as integrating with third-party software.