Recently, I had the experience of working with an air-gapped Red Hat OpenShift cluster designed for AI, with several NVIDIA DGX servers (8 x A100 GPUs each) as worker nodes. As part of the installation, we were asked to implement the InfiniBand and RDMA capabilities of the setup. Needless to say, we were successful; if we weren't, I wouldn't be writing this blog!
The following are the steps needed to implement the InfiniBand and RDMA features.
This procedure assumes the following:
- You have already installed the NFD and NVIDIA GPU Operators and deployed their respective CRs.
- You have correctly set up the network connections from the IB interfaces to the switch when the DGX servers were installed.
You can find all YAMLs used in this document in this repo.
1. Install the SR-IOV Network Operator via OperatorHub. To do this, find the operator in the OperatorHub using the OpenShift UI and install it with the default cluster-wide setup.
Note: If you are in a disconnected environment, you can add it to your cluster using this procedure for oc-mirror.
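If you prefer the CLI over the UI, a minimal sketch of the equivalent installation objects follows (the channel and catalog source names can vary by OpenShift version; in a disconnected cluster, point source at your mirrored catalog):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator
  namespace: openshift-sriov-network-operator
spec:
  channel: stable
  name: sriov-network-operator
  source: redhat-operators # use your mirrored catalog source when disconnected
  sourceNamespace: openshift-marketplace
```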
2. Create SriovIBNetwork (eight of them):
```yaml
kind: SriovIBNetwork
apiVersion: sriovnetwork.openshift.io/v1
metadata:
  name: ibnetwork{0..7}
  namespace: openshift-sriov-network-operator
spec:
  ipam: |-
    {
      "type": "whereabouts",
      "range": "192.168.{0..7}.X/24"
    }
  networkNamespace: default
  resourceName: rdma_sw{0..7}
  linkState: enable
```
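The {0..7} placeholders denote eight nearly identical objects. A sketch of a shell loop that renders and applies all of them (it substitutes .0 for the X placeholder in the range; adjust to your addressing plan):

```bash
# Create ibnetwork0..ibnetwork7, one per IB interface
for i in {0..7}; do
  oc apply -f - <<EOF
kind: SriovIBNetwork
apiVersion: sriovnetwork.openshift.io/v1
metadata:
  name: ibnetwork${i}
  namespace: openshift-sriov-network-operator
spec:
  ipam: |-
    {
      "type": "whereabouts",
      "range": "192.168.${i}.0/24"
    }
  networkNamespace: default
  resourceName: rdma_sw${i}
  linkState: enable
EOF
done
```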
3. Create SriovNetworkNodePolicy (eight of them):
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw{0..7}
  namespace: openshift-sriov-network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  nicSelector:
    pfNames:
    - ib{0..7}
  deviceType: netdevice
  numVfs: 8
  priority: 99
  resourceName: rdma_sw{0..7}
  isRdma: true
  linkType: IB
```
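Before moving on, it is worth checking that the operator has applied the policies and that the VFs are advertised as node resources; a sketch of the checks I would run:

```bash
# Each node state should report syncStatus: Succeeded
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml | grep syncStatus

# The rdma_sw* resources should now appear among the node's allocatable resources
$ oc describe node <dgx-node> | grep openshift.io/rdma_sw
```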
4. Install the NVIDIA network-operator.
Do the following for a disconnected environment:
- Download the helm chart and open it.
- Download and save the needed operator image.
- Update the values file and install.
```bash
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm pull nvidia/network-operator --untar
```
Bring the container image into your internal container registry, then update the Helm values.yaml file with your internal container registry to ensure that the image is pulled from your registry and not the Internet:
```bash
$ cd network-operator
$ helm install network-operator .
```
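For reference, the override usually lives under the operator image keys in values.yaml; the sketch below shows the idea, but verify the exact key names against the values.yaml shipped with your chart version:

```yaml
# values.yaml excerpt (key names are an assumption; check your chart)
operator:
  repository: CONTAINERS_REGISTRY/nvcr.io/nvidia/cloud-native
  image: network-operator
```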
5. Create a NicClusterPolicy to install the MOFED driver.
For a disconnected environment, build an image containing the package requirements for the MOFED driver, because the driver container compiles, builds, and installs the driver dynamically:
- Identify the kernel of the DGX worker nodes.
- Create a Dockerfile that uses the original image in the FROM field and installs the required support packages, as shown below:
```dockerfile
# Replace --releasever=8.4 with your RHEL release repository
# Replace the kernel version with the kernel version of your worker nodes ($(uname -r))
# The Dockerfile below is good for OCP 4.10.42
# Kernel packages can be downloaded from the Red Hat website:
# https://access.redhat.com/downloads/content/package-browser
FROM nvcr.io/nvidia/mellanox/mofed:23.04-0.5.3.3.1-rhcos4.10-amd64
COPY kernel-core-4.18.0-305.65.1.el8_4.x86_64.rpm /root/
COPY kernel-headers-4.18.0-305.65.1.el8_4.x86_64.rpm /root/
COPY kernel-devel-4.18.0-305.65.1.el8_4.x86_64.rpm /root/
RUN dnf clean all && \
    dnf install --releasever=8.4 /root/kernel-core-4.18.0-305.65.1.el8_4.x86_64.rpm -y && \
    dnf install --releasever=8.4 /root/kernel-headers-4.18.0-305.65.1.el8_4.x86_64.rpm -y && \
    dnf install --releasever=8.4 /root/kernel-devel-4.18.0-305.65.1.el8_4.x86_64.rpm -y && \
    dnf install --releasever=8.4 elfutils-libelf-devel kernel-rpm-macros createrepo numactl-libs -y
```
When deploying the NicClusterPolicy YAML below, the operator automatically appends a suffix for your OCP version (here, -rhcos4.10-amd64) to the image version, so tag your image in an air-gapped environment accordingly.
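For example, preparing and pushing the image might look like the following sketch (podman and the registry path are assumptions; adjust to your environment):

```bash
# Build the image with the kernel packages baked in; the tag carries the
# driver version plus the OCP suffix the operator expects
$ podman build -t CONTAINERS_REGISTRY/openshift/nvidia/mofed-offline-v2:23.04-0.5.3.3.1-rhcos4.10-amd64 .

# Push it to the internal registry referenced by the NicClusterPolicy
$ podman push CONTAINERS_REGISTRY/openshift/nvidia/mofed-offline-v2:23.04-0.5.3.3.1-rhcos4.10-amd64
```

With the image pushed, apply the NicClusterPolicy: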
```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
  namespace: network-operator
spec:
  ofedDriver:
    env:
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: 'true'
    - name: UNLOAD_STORAGE_MODULES
      value: 'true'
    - name: CREATE_IFNAMES_UDEV
      value: 'true'
    # When using an internal air-gapped registry with the prepared image
    image: mofed-offline-v2
    repository: CONTAINERS_REGISTRY/openshift/nvidia
    version: "23.04-0.5.3.3.1"
    # end of air-gapped section
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      drain:
        enable: true
        force: false
        podSelector: ""
        timeoutSeconds: 300
        deleteEmptyDir: false
```
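After applying the policy, the MOFED driver pods build and load the driver on each DGX node, which can take several minutes; a sketch of how to watch the rollout (the resources namespace varies by operator version):

```bash
# Watch the MOFED driver pods compile and come up
$ oc get pods -A | grep mofed

# The policy reports "ready" once the driver is loaded everywhere
$ oc get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'
```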
6. Enable RDMA in the nvidia-gpu-operator clusterpolicy (make sure the rdma section exists in the driver block and is set to true):
```bash
$ oc edit clusterpolicy gpu-cluster-policy -n nvidia-gpu-operator
```

```yaml
..............
spec:
...................
  driver:
    certConfig:
      name: ""
    enabled: true
    kernelModuleConfig:
      name: ""
    licensingConfig:
      configMapName: ""
      nlsEnabled: false
    rdma:
      enabled: true
    repoConfig:
      configMapName: ""
    rollingUpdate:
      maxUnavailable: ""
    virtualTopology:
......................
```
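Equivalently, the same change can be made without the editor; a one-line patch sketch:

```bash
# Enable RDMA in the driver block of the ClusterPolicy
$ oc patch clusterpolicy gpu-cluster-policy -n nvidia-gpu-operator \
    --type merge -p '{"spec":{"driver":{"rdma":{"enabled":true}}}}'
```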
7. Ensure that the mpi-operator is installed. Find it here.
If it is not installed, apply the YAML at mpi-operator/mpi-operator/deploy/v2beta1/mpi-operator.yaml after updating the image locations.
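In a disconnected environment, a sketch of one way to rewrite the image references inline while applying (the CONTAINERS_REGISTRY prefix is an assumption):

```bash
# Point the mpioperator/* images at the internal registry and apply
$ sed 's|image: mpioperator/|image: CONTAINERS_REGISTRY/mpioperator/|' \
    mpi-operator/mpi-operator/deploy/v2beta1/mpi-operator.yaml | oc apply -f -
```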
8. Run an MPIJob in a namespace.
Note: To use the image provided by NVIDIA, you need to grant the privileged SCC to the default service account in the namespace:
```bash
$ oc project myProject
$ oc adm policy add-scc-to-user privileged -z default
```
The CONTAINERS_REGISTRY variable must be replaced with an internal registry containing the image, or point to one on the Internet:
```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1 # should equal the number of GPUs per worker
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: CONTAINERS_REGISTRY/nvcr.io/nvidia/tensorflow:21.12-tf1-py3
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2" # should equal the total number of GPUs (# of workers x # of GPUs per worker)
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_IB_DISABLE=0
            - -x
            - NCCL_NET_GDR_LEVEL=2
            - -x
            - TF_ALLOW_IOLIBS=1
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -mca
            - btl_tcp_if_include
            - eth0
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --batch_size=768
            - --model=resnet152
            - --variable_update=horovod
            - --use_fp16=true
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: |-
              [
                { "name": "ibnetwork0", "namespace": "default" },
                { "name": "ibnetwork1", "namespace": "default" },
                { "name": "ibnetwork2", "namespace": "default" },
                { "name": "ibnetwork3", "namespace": "default" },
                { "name": "ibnetwork4", "namespace": "default" },
                { "name": "ibnetwork5", "namespace": "default" },
                { "name": "ibnetwork6", "namespace": "default" },
                { "name": "ibnetwork7", "namespace": "default" }
              ]
        spec:
          containers:
          - image: CONTAINERS_REGISTRY/nvcr.io/nvidia/tensorflow:21.12-tf1-py3
            name: tensorflow-benchmarks
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]
            resources:
              limits:
                nvidia.com/gpu: 1
                openshift.io/rdma_sw0: 1
                openshift.io/rdma_sw1: 1
                openshift.io/rdma_sw2: 1
                openshift.io/rdma_sw3: 1
                openshift.io/rdma_sw4: 1
                openshift.io/rdma_sw5: 1
                openshift.io/rdma_sw6: 1
                openshift.io/rdma_sw7: 1
```
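After applying the MPIJob, a sketch of how I would watch it come up and follow the benchmark (the namespace and file name are assumptions):

```bash
# Submit the job and watch the launcher and workers start
$ oc apply -f mpijob.yaml -n myProject
$ oc get pods -n myProject -w

# Benchmark output streams to the launcher pod
$ oc logs -f tensorflow-benchmarks-launcher -n myProject
```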
In the case of an air-gapped environment, after you apply the MPIJob, edit the launcher pod to replace image: mpioperator/kubectl-delivery:latest with the URL of your image's container registry within your disconnected environment:
```bash
$ oc edit pod tensorflow-benchmarks-launcher
```
In the pod spec, replace the image: field with the location of the kubectl-delivery:latest image in your internal registry. Once you have done this, the pod enters the Init stage and then, once all workers are ready, moves to Running. All logs going forward appear in the launcher pod.
Note the following options:
- To disable GPUDirect RDMA, change NCCL_NET_GDR_LEVEL=2 to NCCL_NET_GDR_LEVEL=0.
- To disable InfiniBand, change NCCL_IB_DISABLE=0 to NCCL_IB_DISABLE=1 and remove the network annotations from the Worker section.
To confirm that InfiniBand is being used, look for lines like the following in the log of the launcher pod:
```
…….
tensorflow-benchmarks-worker-2:23:88 [0] NCCL INFO Channel 00 : 1[bd000] -> 2[90000] [receive] via NET/IBext/4/GDRDMA
tensorflow-benchmarks-worker-0:23:88 [0] NCCL INFO Channel 00 : 7[87000] -> 0[87000] [receive] via NET/IBext/4/GDRDMA
tensorflow-benchmarks-worker-3:23:88 [0] NCCL INFO Channel 00 : 2[90000] -> 3[b7000] [receive] via NET/IBext/6/GDRDMA
tensorflow-benchmarks-worker-1:23:88 [0] NCCL INFO Channel 00 : 0[87000] -> 1[bd000] [receive] via NET/IBext/6/GDRDMA
tensorflow-benchmarks-worker-4:23:88 [0] NCCL INFO Channel 00 : 3[b7000] -> 4[90000]
…………………..
```
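A quick way to scan for this from the command line (pod name from the step above):

```bash
# Count the NCCL channels established over GPUDirect RDMA
$ oc logs tensorflow-benchmarks-launcher -n myProject | grep -c GDRDMA
```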
In my trials, I have seen more than a 10x improvement in model training speed.
Wrap up
These steps implement the InfiniBand and RDMA features in air-gapped Red Hat OpenShift environments. Be sure to use the linked YAML files and repositories to help you get started with your own setup.
About the author
Senior Cloud Architect working with a variety of customers in the defense sector. Extensive experience designing and implementing OpenShift ML air-gapped environments and CI/CD solutions, as well as integrating with third-party software.