Edge AI with NVIDIA GPUs and SmartNICs

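The first post described how to integrate the NVIDIA GPU Operator and Network Operator using preinstalled drivers.

This post covers the following tasks:

Cleaning up the preinstalled driver integration

Installing the Network Operator with a custom driver container

Installing the GPU Operator with a custom driver container

NVIDIA driver integration

The preinstalled driver integration method suits edge deployments that require signed drivers for secure and measured boot. Use the driver container method when the edge nodes have an immutable operating system. Driver containers are also appropriate when not all edge nodes have accelerators.

Cleaning up the preinstalled driver integration

First, uninstall the previous configuration and reboot to clear the preinstalled drivers.

1. Delete the test pods and network attachments.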

$ kubectl delete pod roce-shared-pod
pod "roce-shared-pod" deleted
$ kubectl delete macvlannetwork roce-shared-macvlan-network
macvlannetwork.mellanox.com "roce-shared-macvlan-network" deleted

2. Uninstall the Network Operator Helm chart.

$ helm delete -n network-operator network-operator
release "network-operator" uninstalled

3. Uninstall MOFED to remove the preinstalled drivers and libraries.

$ rmmod nvidia_peermem
$ /etc/init.d/openibd stop
Unloading HCA driver: [ OK ]
$ cd ~/MLNX_OFED_LINUX-5.4-1.0.3.0-rhel7.9-x86_64
$ ./uninstall.sh

4. Remove the GPU test pod.

$ kubectl delete pod cuda-vectoradd
pod "cuda-vectoradd" deleted

5. Uninstall the NVIDIA Linux driver.

$ ./NVIDIA-Linux-x86_64-470.57.02.run --uninstall

6. Remove the GPU Operator.

$ helm uninstall gpu-operator-1634173044

7. Reboot.

$ sudo shutdown -r now
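
After the node comes back up, it can be worth confirming that the preinstalled driver modules were not reloaded before moving on. The sketch below is one way to check; the function name leftover_driver_modules is hypothetical, and it assumes that mlx_compat is only present when MOFED is installed. The in-kernel mlx5 modules may still load after the cleanup, so they are deliberately not flagged.

```shell
# Read lsmod-formatted text on standard input and print the names of
# modules that suggest the preinstalled drivers are still present.
# Assumption: any module whose name starts with "nvidia" comes from the
# NVIDIA Linux driver, and "mlx_compat" ships only with MOFED.
leftover_driver_modules() {
    awk '$1 ~ /^nvidia/ || $1 == "mlx_compat" { print $1 }'
}
```

Running lsmod | leftover_driver_modules on the rebooted host prints one module name per line, and no output means neither driver stack was detected.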

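Installing the Network Operator with a custom driver container

This section describes the steps for installing the Network Operator with a custom driver container.

The driver build script that runs in the container image needs access to the kernel development packages for the target kernel. In this example, the kernel development packages are served by an Apache web server.

After the container is built, upload it to a repository that the Network Operator Helm chart can access from the host.

The GPU Operator uses the same web server in the next section to build the custom GPU Operator driver container.

1. Install the Apache web server and start it.

$ sudo firewall-cmd --state
not running
$ sudo yum install createrepo yum-utils httpd -y
$ systemctl start httpd.service && systemctl enable httpd.service && systemctl status httpd.service
● httpd.service - The Apache HTTP Server
   Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2021-10-20 18:10:43 EDT; 4h 45min ago
...

2. Create a mirror of the upstream CentOS 7 Base package repository. The custom package repository requires 500 GB of free space on the /var partition. Downloading all the CentOS Base packages to the web server can take 10 minutes or more.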

$ cd /var/www/html
$ mkdir -p repos/centos/7/x86_64/os
$ reposync -p /var/www/html/repos/centos/7/x86_64/os/ --repo=base --download-metadata -m
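
Because the mirror needs 500 GB on /var, a quick free-space check before starting the sync can save a failed download. This is a minimal sketch, assuming a POSIX df; the function name has_free_gb is hypothetical.

```shell
# Succeed when the file system that holds path $1 has at least $2 GB
# (counted as 1024 * 1024 KB) available.
has_free_gb() {
    # df -Pk prints one data line per file system in 1024-byte blocks;
    # the fourth column is the available space.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}
```

For example, has_free_gb /var 500 || echo "not enough space on /var" reports a shortfall without starting the download.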

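3. Copy the Linux kernel source files into the Base packages directory on the web server. This example assumes that the custom kernel was compiled as an RPM using rpmbuild.

$ cd repos/centos/7/x86_64/os
$ sudo cp ~/rpmbuild/RPMS/x86_64/*.rpm .

The Network Operator requires the following files:

kernel-headers-${KERNEL_VERSION}
kernel-devel-${KERNEL_VERSION}

The GPU Operator additionally requires these files: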

gcc-${GCC_VERSION}
elfutils-libelf.x86_64
elfutils-libelf-devel.x86_64

$ for i in $(rpm -q kernel-headers kernel-devel elfutils-libelf elfutils-libelf-devel gcc | grep -v "not installed"); do ls $i*; done
kernel-headers-3.10.0-1160.42.2.el7.custom.x86_64.rpm
kernel-devel-3.10.0-1160.42.2.el7.custom.x86_64.rpm
elfutils-libelf-0.176-5.el7.x86_64.rpm
elfutils-libelf-devel-0.176-5.el7.x86_64.rpm
gcc-4.8.5-44.el7.x86_64.rpm
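
The same check can be made against the files in the repository directory rather than the packages installed on the host, which is closer to what the driver containers will actually download. The sketch below assumes the file naming shown in the listing above; the function name missing_rpms is hypothetical.

```shell
# Print the name of each required package that has no RPM file in
# directory $1. $2 is the kernel version string without the
# architecture suffix, for example 3.10.0-1160.45.1.el7.custom
missing_rpms() {
    dir=$1
    kver=$2
    for name in "kernel-headers-$kver" "kernel-devel-$kver" \
                elfutils-libelf elfutils-libelf-devel gcc; do
        case $name in
            # Kernel packages must match the exact kernel version.
            kernel-*) set -- "$dir/$name"*.rpm ;;
            # Require a digit after the name so that, for example,
            # gcc-gfortran does not count as gcc.
            *)        set -- "$dir/$name"-[0-9]*.rpm ;;
        esac
        # An unmatched glob stays literal, so the -e test fails.
        [ -e "$1" ] || echo "$name"
    done
}
```

An empty result from missing_rpms . "$KERNEL_VERSION", run inside the Packages directory, means that every file listed above is present.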

4. Browse to the web repository to make sure that it is accessible over HTTP.

$ elinks http://localhost/repos/centos/7/x86_64/os --dump
                       Index of /repos/centos/7/x86_64/os

   [1][ICO]        [2]Name             [3]Last modified   [4]Size [5]Description
   --------------------------------------------------------------------------
   [6][PARENTDIR]  [7]Parent Directory                      -
   [8][DIR]        [9]base/            2021-10-21 22:55     -
   [10][DIR]       [11]extras/         2021-10-02 00:29     -
   --------------------------------------------------------------------------

References

   Visible links
   2. http://localhost/repos/centos/7/x86_64/os/?C=N;O=D
   3. http://localhost/repos/centos/7/x86_64/os/?C=M;O=A
   4. http://localhost/repos/centos/7/x86_64/os/?C=S;O=A
   5. http://localhost/repos/centos/7/x86_64/os/?C=D;O=A
   7. http://localhost/repos/centos/7/x86_64/
   9. http://localhost/repos/centos/7/x86_64/os/base/
   11. http://localhost/repos/centos/7/x86_64/os/extras/

5. The MOFED driver container image is built from the source code in the Mellanox/ofed-docker repository on GitHub. Clone the ofed-docker repository.

$ git clone https://github.com/Mellanox/ofed-docker.git
$ cd ofed-docker/

6. Create a build directory for the custom driver container.

$ mkdir centos
$ cd centos/

7. Create a Dockerfile that installs the MOFED dependencies and source archive into a CentOS 7.9 base image. Specify the MOFED and CentOS versions. The heredoc delimiter is quoted so that the shell passes the ${...} references and line continuations through unchanged for Docker to resolve.

$ cat << 'EOF' | tee Dockerfile
FROM centos:centos7.9.2009

ARG D_OFED_VERSION="5.4-1.0.3.0"
ARG D_OS_VERSION="7.9"
ARG D_OS="rhel${D_OS_VERSION}"
ENV D_OS=${D_OS}
ARG D_ARCH="x86_64"
ARG D_OFED_PATH="MLNX_OFED_LINUX-${D_OFED_VERSION}-${D_OS}-${D_ARCH}"
ENV D_OFED_PATH=${D_OFED_PATH}

ARG D_OFED_TARBALL_NAME="${D_OFED_PATH}.tgz"
ARG D_OFED_BASE_URL="https://www.mellanox.com/downloads/ofed/MLNX_OFED-${D_OFED_VERSION}"
ARG D_OFED_URL_PATH="${D_OFED_BASE_URL}/${D_OFED_TARBALL_NAME}"

ARG D_WITHOUT_FLAGS="--without-rshim-dkms --without-iser-dkms --without-isert-dkms --without-srp-dkms --without-kernel-mft-dkms --without-mlnx-rdma-rxe-dkms"
ENV D_WITHOUT_FLAGS=${D_WITHOUT_FLAGS}

# Download and extract tarball
WORKDIR /root
RUN yum install -y curl && (curl -sL ${D_OFED_URL_PATH} | tar -xzf -)

RUN yum install -y atk \
    cairo \
    ethtool \
    gcc-gfortran \
    git \
    gtk2 \
    iproute \
    libnl3 \
    libxml2-python \
    lsof \
    make \
    net-tools \
    numactl-libs \
    openssh-clients \
    openssh-server \
    pciutils \
    perl \
    python-devel \
    redhat-rpm-config \
    rpm-build \
    tcl \
    tcsh \
    tk \
    wget

ADD ./entrypoint.sh /root/entrypoint.sh

ENTRYPOINT ["/root/entrypoint.sh"]
EOF
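
The ARG chain in the Dockerfile resolves to a single tarball URL, and it can help to see that value before starting a long build. The sketch below mirrors the same substitutions in shell; the function name ofed_url is hypothetical, and nothing here checks that the URL is actually reachable.

```shell
# Compose the MOFED tarball URL the same way the Dockerfile ARGs do.
# $1 = MOFED version, $2 = OS version, $3 = architecture
ofed_url() {
    d_os="rhel$2"
    d_ofed_path="MLNX_OFED_LINUX-$1-$d_os-$3"
    echo "https://www.mellanox.com/downloads/ofed/MLNX_OFED-$1/$d_ofed_path.tgz"
}

# Values taken from the Dockerfile in this example.
ofed_url 5.4-1.0.3.0 7.9 x86_64
```

With these values the archive name is MLNX_OFED_LINUX-5.4-1.0.3.0-rhel7.9-x86_64.tgz, which matches the directory name used when uninstalling MOFED earlier in this post.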

8. Modify the RHEL entrypoint.sh script included in the ofed-docker repository to install the custom kernel source packages from the web server. Specify the path to the base/Packages directory on the web server in the _install_prerequisites() function.

In this example, 10.150.168.20 is the IP address of the web server created earlier in this section.

$ cp ../rhel/entrypoint.sh .
$ cat entrypoint.sh
...
# Install the kernel modules header/builtin/order files and generate the kernel version string.
_install_prerequisites() {

    echo "Installing dependencies"
    yum -y --releasever=7 install createrepo elfutils-libelf-devel kernel-rpm-macros numactl-libs initscripts grubby linux-firmware libtool

    echo "Installing Linux kernel headers..."
    rpm -ivh http://10.150.168.20/repos/centos/7/x86_64/os/base/Packages/kernel-3.10.0-1160.45.1.el7.custom.x86_64.rpm
    rpm -ivh http://10.150.168.20/repos/centos/7/x86_64/os/base/Packages/kernel-devel-3.10.0-1160.45.1.el7.custom.x86_64.rpm
    rpm -ivh http://10.150.168.20/repos/centos/7/x86_64/os/base/Packages/kernel-headers-3.10.0-1160.45.1.el7.custom.x86_64.rpm

    # Prevent depmod from giving a WARNING about missing files
    touch /lib/modules/${KVER}/modules.order
    touch /lib/modules/${KVER}/modules.builtin

    depmod ${KVER}
...
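
The three rpm lines repeat the same server path and kernel version, so a kernel update means editing each one. One possible way to keep them in step is to derive the URLs from two values, sketched below. The function name kernel_rpm_urls is hypothetical and is not part of the ofed-docker script, and the x86_64 suffix is an assumption that matches the file names in this example.

```shell
# Print the download URL for each kernel package the driver build needs.
# $1 = URL of the Packages directory on the web server
# $2 = kernel version string without the architecture suffix
kernel_rpm_urls() {
    for pkg in kernel kernel-devel kernel-headers; do
        echo "$1/$pkg-$2.x86_64.rpm"
    done
}
```

Calling kernel_rpm_urls http://10.150.168.20/repos/centos/7/x86_64/os/base/Packages 3.10.0-1160.45.1.el7.custom prints the same three URLs that are hard-coded in the function above.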

9. The OFED driver container mounts a directory from the host file system to share driver files. Create the directory.

$ mkdir -p /run/mellanox/drivers

10. Upload the new CentOS driver image to a registry. This example uses an NGC private registry. Log in to the registry.

$ sudo yum install -y podman
$ sudo podman login nvcr.io
Username: $oauthtoken
Password: *****************************************
Login Succeeded!

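11. Use Podman to build the driver container image.

$ sudo podman build --no-cache --tag nvcr.io/nv-ngc5g/mofed-5.4-1.0.3.0:centos7-amd64 .

12. Confirm that the image is tagged, then push it to the registry.

$ sudo podman images nvcr.io | grep mofed
nvcr.io/nv-ngc5g/mofed-5.4-1.0.3.0 centos7-amd64 d61e555bddda 2 minutes ago 1.13 GB
$ sudo podman push nvcr.io/nv-ngc5g/mofed-5.4-1.0.3.0:centos7-amd64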

13. Override the values.yaml file included in the NVIDIA Network Operator Helm chart to install the custom driver image. Specify the image name, repository, and version of the custom driver container.

$ cat << EOF | sudo tee roce_shared_values_driver.yaml
nfd:
  enabled: false
deployCR: true
ofedDriver:
  deploy: true
  image: mofed
  repository: nvcr.io/nv-ngc5g
  version: 5.4-1.0.3.0
sriovDevicePlugin:
  deploy: false
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      deviceIDs: [101d]
      ifNames: [ens13f0]
EOF
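
The vendors and deviceIDs values have to match the adapter installed in the node. As a sketch, the IDs can be read from the bracketed vendor:device pair at the end of each lspci -nn line. The function name mellanox_pci_ids is hypothetical, and it assumes that the adapter reports the Mellanox vendor ID 15b3 used in the values file above.

```shell
# Read `lspci -nn` output on standard input and print the distinct
# "vendor device" ID pairs of Mellanox (vendor 15b3) devices.
mellanox_pci_ids() {
    sed -n 's/.*\[\(15b3\):\([0-9a-f]\{4\}\)\].*/\1 \2/p' | sort -u
}
```

Running lsmod is not needed here; lspci -nn | mellanox_pci_ids is enough, and ip link lists the interface names that are candidates for ifNames.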

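14. Install the NVIDIA Network Operator with the new values.

$ helm install -f ./roce_shared_values_driver.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator

15. View the pods deployed by the Network Operator. The MOFED pod should be in the Running state. This is the custom driver container. Compiling the driver can take several minutes before the pod starts.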

$ kubectl -n nvidia-network-operator-resources get pods
NAME READY STATUS RESTARTS AGE
cni-plugins-ds-zr9kf 1/1 Running 0 10m
kube-multus-ds-w57rz 1/1 Running 0 10m
mofed-centos7-ds-cbs74 1/1 Running 0 10m
rdma-shared-dp-ds-ch8m2 1/1 Running 0 2m27s
whereabouts-z947f 1/1 Running 0 10m
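
While the driver is still compiling, the MOFED pod sits in an init or pending state. A small filter over the kubectl get pods table makes it easier to see which pods are still outstanding. This is a sketch only; the function name pods_not_ready is hypothetical, and it treats both Running and Completed as finished states.

```shell
# Read a `kubectl get pods` table on standard input and print the names
# of pods whose STATUS column is neither Running nor Completed.
pods_not_ready() {
    # NR > 1 skips the header row; $3 is the STATUS column.
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}
```

For example, kubectl -n nvidia-network-operator-resources get pods | pods_not_ready prints nothing once every pod in the listing above has settled.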

16. Verify that the MOFED drivers are loaded on the host.

$ lsmod | egrep '^ib|^mlx|^rdma'
rdma_ucm 27022 0
rdma_cm 65212 1 rdma_ucm
ib_ipoib 124872 0
ib_cm 53085 2 rdma_cm,ib_ipoib
ib_umad 27744 0
mlx5_ib 384793 0
mlx5_core 1360822 1 mlx5_ib
ib_uverbs 132833 2 mlx5_ib,rdma_ucm
ib_core 357959 8 rdma_cm,ib_cm,iw_cm,mlx5_ib,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
mlx_compat 55063 11 rdma_cm,ib_cm,iw_cm,auxiliary,mlx5_ib,ib_core,ib_umad,ib_uverbs,mlx5_core,rdma_ucm,ib_ipoib
mlxfw 22321 1 mlx5_core

17. The root file system of the driver container should be bind-mounted to the /run/mellanox/drivers directory on the host.

$ ls /run/mellanox/drivers
anaconda-post.log bin boot dev etc home host lib lib64 media mnt opt proc root run sbin srv sys tmp usr var

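Installing the GPU Operator with a custom driver container

This section describes the steps for installing the GPU Operator with a custom driver container.

As with the Network Operator, the driver build script that the GPU Operator container runs needs access to the development packages for the target kernel.

This example uses the same web server that delivered the development packages to the Network Operator in the previous section.

After the container is built, upload it to a repository that the GPU Operator Helm chart can access from the host. As in the Network Operator example, the GPU Operator uses the private registry on NGC.

1. Build the custom driver container.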

$ cd ~
$ git clone https://gitlab.com/nvidia/container-images/driver.git
$ cd driver/centos7

2. Update the CentOS Dockerfile to use driver version 470.74. Comment out the unused arguments.

$ grep ARG Dockerfile
ARG BASE_URL=http://us.download.nvidia.com/XFree86/Linux-x86_64
#ARG BASE_URL=https://us.download.nvidia.com/tesla
ARG DRIVER_VERSION=470.74
ARG DRIVER_TYPE=passthrough
ARG VGPU_LICENSE_SERVER_TYPE=FNE
ARG PUBLIC_KEY=''
#ARG PUBLIC_KEY=empty
ARG PRIVATE_KEY

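3. Build the GPU driver container image and push it to NGC.

$ sudo podman build --no-cache --tag nvcr.io/nv-ngc5g/driver:470.74-centos7 .
$ sudo podman push nvcr.io/nv-ngc5g/driver:470.74-centos7

4. View the GPU driver container image.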

$ podman images nvcr.io | grep 470
nvcr.io/nv-ngc5g/driver 470.74-centos7 630f0f8e77f5 2 minutes ago 1.28 GB

5. Verify that the following files are present in the custom repository created for the Network Operator installation:

elfutils-libelf.x86_64
elfutils-libelf-devel.x86_64
kernel-headers-${KERNEL_VERSION}
kernel-devel-${KERNEL_VERSION}
gcc-${GCC_VERSION}

These files are needed to compile the driver for the custom kernel image.

$ cd /var/www/html/repos/centos/7/x86_64/os/base/Packages/
$ for i in $(rpm -q kernel-headers kernel-devel elfutils-libelf elfutils-libelf-devel gcc | grep -v "not installed"); do ls $i*; done
kernel-headers-3.10.0-1160.45.1.el7.custom.x86_64.rpm
kernel-devel-3.10.0-1160.45.1.el7.custom.x86_64.rpm
elfutils-libelf-0.176-5.el7.x86_64.rpm
elfutils-libelf-devel-0.176-5.el7.x86_64.rpm
gcc-4.8.5-44.el7.x86_64.rpm

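6. Unlike the Network Operator, the GPU Operator uses a custom Yum repository configuration file. Create a Yum repo file that references the custom mirror repository. The heredoc delimiter is quoted so that the shell leaves $releasever and $basearch for Yum to expand.

$ cd /var/www/html/repos
$ cat << 'EOF' | sudo tee custom-repo.repo
[base]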
name=CentOS Linux $releasever - Base
baseurl=http://10.150.168.20/repos/centos/$releasever/$basearch/os/base/
gpgcheck=0
enabled=1
EOF

7. The GPU Operator uses a Kubernetes ConfigMap to configure the custom repository. The ConfigMap must be available in the gpu-operator-resources namespace. Create the namespace and the ConfigMap.

$ kubectl create ns gpu-operator-resources
$ kubectl create configmap repo-config -n gpu-operator-resources --from-file=/var/www/html/repos/custom-repo.repo
configmap/repo-config created
$ kubectl describe cm -n gpu-operator-resources repo-config
Name: repo-config
Namespace: gpu-operator-resources
Labels: <none>
Annotations: <none>

Data
====
custom-repo.repo:
----
[base]
name=CentOS Linux $releasever - Base
baseurl=http://10.150.168.20/repos/centos/$releasever/$basearch/os/base/
gpgcheck=0
enabled=1
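
Yum substitutes $releasever and $basearch when it reads the repo file, so the baseurl only works if the expanded path exists on the web server. The sketch below performs the same substitution by hand so that the result can be compared with the mirror layout. The function name expand_baseurl is hypothetical, and the values 7 and x86_64 are assumptions that match the CentOS 7 host in this example.

```shell
# Expand the Yum variables in a baseurl string.
# $1 = baseurl template, $2 = release version, $3 = base architecture
expand_baseurl() {
    # [$] matches a literal dollar sign in the template.
    printf '%s\n' "$1" | sed -e "s/[\$]releasever/$2/g" -e "s/[\$]basearch/$3/g"
}

# Single quotes keep the shell from expanding the Yum variables itself.
expand_baseurl 'http://10.150.168.20/repos/centos/$releasever/$basearch/os/base/' 7 x86_64
```

The expanded URL points at the repos/centos/7/x86_64/os/base/ directory that was created when mirroring the Base repository.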

8. Install the GPU Operator Helm chart. Specify the custom repository location, the custom driver version, and the custom driver image name and location.

$ helm install nvidia/gpu-operator --generate-name --set driver.repoConfig.configMapName=repo-config --set driver.repoConfig.destinationDir=/etc/yum.repos.d --set driver.image=driver --set driver.repository=nvcr.io/nv-ngc5g --set-string driver.version="470.74" --set toolkit.version=1.7.1-centos7 --set operator.defaultRuntime=crio

9. View the deployed pods.

$ kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-r6kq6 1/1 Running 0 3m33s
nvidia-container-toolkit-daemonset-62pbj 1/1 Running 0 3m33s
nvidia-cuda-validator-ljd5l 0/1 Completed 0 119s
nvidia-dcgm-9nsfx 1/1 Running 0 3m33s
nvidia-dcgm-exporter-zm82v 1/1 Running 0 3m33s
nvidia-device-plugin-daemonset-bp66r 1/1 Running 0 3m33s
nvidia-device-plugin-validator-8pbmv 0/1 Completed 0 108s
nvidia-driver-daemonset-4tx24 1/1 Running 0 3m33s
nvidia-mig-manager-kvcgc 1/1 Running 0 3m32s
nvidia-operator-validator-g9xz5 1/1 Running 0 3m33s

10. Verify that the driver is loaded.

$ lsmod | grep nvidia
nvidia_modeset 1195268 0
nvidia_uvm 995356 0
nvidia 35237551 114 nvidia_modeset,nvidia_uvm
drm 456166 5 ast,ttm,drm_kms_helper,nvidia

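11. Run nvidia-smi from the driver daemonset pod.

$ kubectl exec -n gpu-operator-resources nvidia-driver-daemonset-4tx24 -- nvidia-smi
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)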
Thu Oct 28 02:37:50 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:23:00.0 Off |                    0 |
| N/A   25C    P0    32W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:E6:00.0 Off |                    0 |
| N/A   27C    P0    32W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
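
A quick sanity check at this point is that the driver version reported inside the pod matches the version passed to Helm. The sketch below pulls the version out of the nvidia-smi banner; the function name smi_driver_version is hypothetical, and the parsing assumes the banner format shown above.

```shell
# Read nvidia-smi output on standard input and print the driver version
# taken from the "Driver Version:" field of the banner line.
smi_driver_version() {
    sed -n 's/.*Driver Version: *\([0-9.][0-9.]*\).*/\1/p' | head -n 1
}
```

Piping the nvidia-smi output from the driver pod through smi_driver_version should print the value given to --set-string driver.version in the Helm command. Running nvidia-smi --query-gpu=driver_version --format=csv,noheader inside the pod is an alternative that avoids parsing the banner.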

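The NVIDIA peer memory driver that enables GPUDirect RDMA is not built automatically.

Repeat this process to build a custom nvidia-peermem driver container.

This additional step is needed for any Linux operating system that the nvidia-peermem installer in the GPU Operator does not yet support.

The future of NVIDIA accelerators

NVIDIA accelerators help future-proof an edge AI investment against the exponential growth of sensor data. NVIDIA Operators are cloud-native software that streamlines accelerator deployment and management on Kubernetes. The operators support popular Kubernetes platforms out of the box and can be customized to support alternative platforms.

About the author

Jacob Liberman is a product manager in the enterprise and edge acceleration group at NVIDIA. He draws on more than 20 years of technical computing experience to deliver high-performance, cloud-native edge AI solutions. Previously, he held product management and engineering roles at Red Hat, AMD, and Dell.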

Reviewing editor: 法人
