
Using NVIDIA GPUs in K8S - 筑梦之路


Prerequisites

Install the GPU driver appropriate for your operating system and confirm that the card is recognized correctly, for example with the check below:
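A minimal sanity check, assuming the driver installed cleanly:

  # The driver is working if nvidia-smi lists the card(s)
  nvidia-smi
  # The nvidia kernel modules should also be loaded
  lsmod | grep nvidia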

GPU container creation flow

containerd --> containerd-shim --> nvidia-container-runtime --> nvidia-container-runtime-hook --> libnvidia-container --> runc --> container-process

GPU driver installation

  # Ubuntu
  apt-get update
  apt-get install gcc make
  ## CUDA 10.1
  wget -c https://ops-software-binary-1255440668.cos.ap-chengdu.myqcloud.com/nvidia/NVIDIA-Linux-x86_64-430.50.run
  bash NVIDIA-Linux-x86_64-430.50.run
  ## CUDA 10.2
  wget -c https://ops-software-binary-1255440668.cos.ap-chengdu.myqcloud.com/nvidia/NVIDIA-Linux-x86_64-440.100.run
  bash NVIDIA-Linux-x86_64-440.100.run
  ## CUDA 11
  wget -c https://ops-software-binary-1255440668.cos.ap-chengdu.myqcloud.com/nvidia/NVIDIA-Linux-x86_64-450.66.run
  bash NVIDIA-Linux-x86_64-450.66.run
  ## CUDA 11.4
  wget -c https://ops-software-binary-1255440668.cos.ap-chengdu.myqcloud.com/nvidia/NVIDIA-Linux-x86_64-470.57.02.run
  bash NVIDIA-Linux-x86_64-470.57.02.run
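For unattended installs, the .run installer accepts non-interactive flags; a sketch (flag names can vary between driver versions, so check your installer's --help first):

  # --silent accepts the license and suppresses prompts;
  # --dkms registers the module with DKMS so it is rebuilt on kernel upgrades
  bash NVIDIA-Linux-x86_64-470.57.02.run --silent --dkms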

Install the NVIDIA container runtime

  # Reference: https://nvidia.github.io/nvidia-container-runtime/
  # Online install on Ubuntu
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  cat > /etc/apt/sources.list.d/nvidia-docker.list <<'EOF'
  deb https://nvidia.github.io/libnvidia-container/ubuntu16.04/$(ARCH) /
  deb https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/$(ARCH) /
  deb https://nvidia.github.io/nvidia-docker/ubuntu16.04/$(ARCH) /
  EOF
  apt-get update
  apt-get install nvidia-container-runtime
  # Online install on CentOS
  distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
  DIST=$(sed -n 's/releasever=//p' /etc/yum.conf)
  DIST=${DIST:-$(. /etc/os-release; echo $VERSION_ID)}
  sudo rpm -e gpg-pubkey-f796ecb0
  sudo gpg --homedir /var/lib/yum/repos/$(uname -m)/$DIST/nvidia-docker/gpgdir --delete-key f796ecb0
  sudo yum makecache
  yum -y install nvidia-container-runtime
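Before wiring the runtime into Docker or containerd, it is worth confirming it landed correctly; a quick check:

  # Both binaries should resolve on PATH
  which nvidia-container-runtime nvidia-container-cli
  # nvidia-container-cli can query the driver directly
  nvidia-container-cli info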

Configure Docker/containerd

  # Docker configuration
  cat /etc/docker/daemon.json
  {
    "registry-mirrors": [
      "https://wlzfs4t4.mirror.aliyuncs.com"
    ],
    "max-concurrent-downloads": 10,
    "log-driver": "json-file",
    "log-level": "warn",
    "log-opts": {
      "max-size": "10m",
      "max-file": "3"
    },
    "data-root": "/data/var/lib/docker",
    "bip": "169.254.31.1/24",
    "default-runtime": "nvidia",
    "runtimes": {
      "nvidia": {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": []
      }
    }
  }
  systemctl restart docker
  # containerd configuration
  cat /etc/containerd/config.toml
  # Adjust everything else to your own needs; only the GPU-related settings are shown here
  [plugins]
    [plugins."io.containerd.grpc.v1.cri"]
      [plugins."io.containerd.grpc.v1.cri".containerd]
        # ------------------- change start -------------------------------------
        default_runtime_name = "nvidia"
        # ------------------- change end ---------------------------------------
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
          # ------------------- addition start ---------------------------------
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
            privileged_without_host_devices = false
            runtime_engine = ""
            runtime_root = ""
            runtime_type = "io.containerd.runc.v2"
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
              BinaryName = "/usr/bin/nvidia-container-runtime"
          # ------------------- addition end -----------------------------------
  systemctl restart containerd.service
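With the default runtime switched to nvidia, a throwaway CUDA container is the quickest end-to-end test of the whole chain. A sketch, assuming a CUDA base image whose tag matches your driver (adjust the tag to your environment):

  # Docker: relies on default-runtime=nvidia; should print the same table as nvidia-smi on the host
  docker run --rm nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
  # containerd: pull and run the same image with GPU 0 attached
  ctr images pull docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04
  ctr run --rm --gpus 0 docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 gpu-test nvidia-smi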

Option 1: the official NVIDIA device plugin

[Allocation by number of cards; each card is used exclusively]

Example of allocating GPU resources in an application YAML:

  resources:
    limits:
      nvidia.com/gpu: '1'
    requests:
      nvidia.com/gpu: '1'

Here, '1' means one GPU card is used.
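For context, a minimal Pod that claims one whole card might look like the sketch below (name and image are placeholders; for extended resources such as nvidia.com/gpu, setting only the limit is enough, as the request defaults to it):

  apiVersion: v1
  kind: Pod
  metadata:
    name: cuda-test
  spec:
    restartPolicy: Never
    containers:
    - name: cuda
      image: nvidia/cuda:11.0.3-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: '1'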

Enabling GPU support in Kubernetes

  # cat nvidia-device-plugin.yaml
  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: nvidia-device-plugin-daemonset
    namespace: kube-system
  spec:
    selector:
      matchLabels:
        name: nvidia-device-plugin-ds
    updateStrategy:
      type: RollingUpdate
    template:
      metadata:
        labels:
          name: nvidia-device-plugin-ds
      spec:
        tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        # Mark this pod as a critical add-on; when enabled, the critical add-on
        # scheduler reserves resources for critical add-on pods so that they can
        # be rescheduled after a failure.
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        priorityClassName: "system-node-critical"
        containers:
        - image: ycloudhub.com/middleware/nvidia-gpu-device-plugin:v0.12.3
          name: nvidia-device-plugin-ctr
          env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
        volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
  # Apply the YAML file and check
  kubectl apply -f nvidia-device-plugin.yaml
  kubectl get po -n kube-system | grep nvidia
  kubectl describe nodes ycloud
  ......
  Capacity:
    cpu:                32
    ephemeral-storage:  458291312Ki
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             131661096Ki
    nvidia.com/gpu:     2
    pods:               110
  Allocatable:
    cpu:                32
    ephemeral-storage:  422361272440
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             131558696Ki
    nvidia.com/gpu:     2
    pods:               110
  ......

Option 2: a third-party plugin

[Allocation by GPU memory size; cards are shared]

  # Official Aliyun repo: https://github.com/AliyunContainerService/gpushare-device-plugin/
  resources:
    limits:
      aliyun.com/gpu-mem: '3'
    requests:
      aliyun.com/gpu-mem: '3'
  # Here, 3 is the amount of GPU memory to use, in GiB
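Because allocation is by memory size rather than by whole cards, several Pods like the sketch below can be packed onto one physical GPU as long as their gpu-mem requests fit (name and image are placeholders):

  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-share-test
  spec:
    restartPolicy: Never
    containers:
    - name: cuda
      image: nvidia/cuda:11.0.3-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          aliyun.com/gpu-mem: '3'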

Installing the gpushare-scheduler-extender plugin

Reference:

https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md

1. Modify the kube-scheduler configuration

  # Create /etc/kubernetes/scheduler-policy-config.json
  {
    "kind": "Policy",
    "apiVersion": "v1",
    "extenders": [
      {
        "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
        "filterVerb": "filter",
        "bindVerb": "bind",
        "enableHttps": false,
        "nodeCacheCapable": true,
        "managedResources": [
          {
            "name": "aliyun.com/gpu-mem",
            "ignoredByScheduler": false
          }
        ],
        "ignorable": false
      }
    ]
  }
  # Edit the /etc/systemd/system/kube-scheduler.service file and add the --policy-config-file flag
  cat /etc/systemd/system/kube-scheduler.service
  [Unit]
  Description=Kubernetes Scheduler
  Documentation=https://github.com/GoogleCloudPlatform/kubernetes

  [Service]
  ExecStart=/usr/local/bin/kube-scheduler \
    --address=127.0.0.1 \
    --master=http://127.0.0.1:8080 \
    --leader-elect=true \
    --v=2 \
    --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
  Restart=on-failure
  RestartSec=5

  [Install]
  WantedBy=multi-user.target
  # Restart the service
  systemctl daemon-reload
  systemctl restart kube-scheduler.service
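After restarting, it is worth confirming the scheduler came back up and picked up the policy file; a simple check:

  systemctl status kube-scheduler.service
  # Search the startup logs for any policy/extender errors
  journalctl -u kube-scheduler.service | grep -i policy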

2. Deploy gpushare-schd-extender

  curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
  kubectl apply -f gpushare-schd-extender.yaml

3. Deploy the device plugin

  # Label the target node with "gpushare=true"
  kubectl label node <target_node> gpushare=true
  # For example:
  kubectl label node mynode gpushare=true
  # Deploy the device plugin
  wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
  kubectl apply -f device-plugin-rbac.yaml
  wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
  kubectl apply -f device-plugin-ds.yaml
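A quick way to confirm both the extender and the device plugin are running, and that labeled nodes now advertise the extended resource (pod names will differ per cluster):

  kubectl get po -n kube-system | grep gpushare
  # Nodes labeled gpushare=true should now report aliyun.com/gpu-mem capacity
  kubectl describe node <target_node> | grep aliyun.com/gpu-mem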

4. Install the kubectl-inspect-gpushare plugin to view GPU usage

  cd /usr/bin/
  wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
  chmod u+x /usr/bin/kubectl-inspect-gpushare
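Once the binary is on PATH, kubectl discovers it as a plugin; usage is roughly as follows:

  # Cluster-wide overview of GPU memory allocation per node
  kubectl inspect gpushare
  # Per-pod detail
  kubectl inspect gpushare -d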

The content above is for reference only.
