PyTorch model compression and pruning: how to "prune" with TensorRT

Author: 不正经 | 2024-04-02 06:06:52

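Ways to speed up a PyTorch model: 1. Model quantization: half precision. 2. Model "pruning": TensorRT (strictly speaking, TensorRT optimizes inference rather than removing weights).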

1. Model quantization

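PyTorch provides a convenient trick: switching to half precision. It directly speeds up inference and reduces GPU memory usage, with only a negligible loss in accuracy.

When I worked on hardware acceleration, I tried weights and biases at several precisions. On an FPGA, running MNIST handwritten digit recognition at 8-bit and 16-bit precision reached about the same accuracy while saving half of the resources. The same idea carries over to GPUs: change PyTorch's default 32-bit floats to 16-bit floats.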
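To get a feel for what is given up when going from 32-bit to 16-bit floats, here is a small sketch that uses only Python's standard `struct` module (format code `e` is IEEE 754 half precision). The helper name `roundtrip` and the sample value are made up for illustration; this is not part of the workflow described in this post.

```python
import struct

def roundtrip(value, fmt):
    """Pack a Python float into the given binary format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

weight = 0.1234567  # an arbitrary example weight (assumption, not from the post)

fp32 = roundtrip(weight, "<f")  # stored as a 32-bit float
fp16 = roundtrip(weight, "<e")  # stored as a 16-bit float

print("bytes per value:", struct.calcsize("<f"), "vs", struct.calcsize("<e"))
print("fp32 rounding error:", abs(fp32 - weight))
print("fp16 rounding error:", abs(fp16 - weight))
```

Each value takes half the storage, and the rounding error grows but stays small for a weight of this magnitude.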

All it takes is:

model.half()

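Note 1: I do this before loading the model onto the GPU, i.e. before model.cuda(). (PyTorch accepts either order, but converting first means only the half-size weights get copied to the GPU.) The rough sequence is: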


model.half()
model.cuda()
model.eval()

Note 2: once the model is in half precision, the input has to be converted to half precision as well. The rough sequence is:


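import torch

model.half()   # convert the weights to FP16
model.cuda()
model.eval()

# image: a NumPy array already shaped the way the model expects
img = torch.from_numpy(image).float()
img = img.cuda()
img = img.half()   # the input dtype must match the model's dtype

with torch.no_grad():   # inference only, no gradients needed
    res = model(img)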
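Forgetting one of the two conversions is an easy mistake, so a small helper that copies dtype and device from the model itself can help. This is only a sketch: `match_model` is a hypothetical helper name, and the tiny `nn.Linear` is a stand-in model used purely to show the dtype handling (no forward pass is run).

```python
import torch
import torch.nn as nn

def match_model(model, x):
    """Return x moved to the device and dtype of the model's first parameter."""
    p = next(model.parameters())
    return x.to(device=p.device, dtype=p.dtype)

# Tiny stand-in model (assumption), converted to half precision.
model = nn.Linear(4, 2).half().eval()

x = torch.ones(1, 4)        # float32 by default
x = match_model(model, x)   # now float16, on the same device as the model
print(x.dtype, x.device)
```

With this in place, the same inference code works whether or not model.half() was called.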

In my local tests, speed improved by 25% to 35%, GPU memory usage dropped by 40% to 60%, and accuracy was almost unchanged. For reference only.
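If you want to reproduce this kind of comparison yourself, a timing loop along the following lines is one way to do it. It is a sketch, not the script used for the numbers above: `time_per_call` and its parameters are hypothetical names, and the workload is a CPU-only stand-in so the snippet runs anywhere.

```python
import time

def time_per_call(fn, warmup=10, iters=100, sync=None):
    """Average wall-clock seconds per call of fn().

    sync: optional zero-argument callable run before reading the clock,
    e.g. torch.cuda.synchronize when fn launches GPU work.
    """
    for _ in range(warmup):   # warm-up calls are not timed
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()                # wait for queued GPU work to finish
    return (time.perf_counter() - start) / iters

# CPU-only stand-in workload (assumption).
avg = time_per_call(lambda: sum(range(1000)), warmup=5, iters=50)
print(f"{avg * 1e3:.4f} ms per call")
```

For a GPU model the call would look like time_per_call(lambda: model(img), sync=torch.cuda.synchronize). CUDA kernels run asynchronously, so without the synchronization the clock may stop before the GPU has finished.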

2. Model pruning

TensorRT: introduction and installation

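TensorRT is NVIDIA's official C++ inference acceleration toolkit, much like OpenVINO is for Intel. It supports many AI frameworks, such as TensorFlow, PyTorch, Caffe, and MXNet. Some popular frameworks also get special treatment: for PyTorch there is a direct conversion tool, torch2trt (covered later in this post).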

Link 0: https://developer.nvidia.com/tensorrt

Link 1: https://github.com/NVIDIA/TensorRT

Link 2: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html

Link 3: https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/

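TensorRT works mainly by optimizing the model for inference on the target GPU, allocating hardware resources sensibly so that the computation runs with a higher degree of parallelism. Download the TensorRT build that matches the version you need: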

https://developer.nvidia.com/compute/machine-learning/tensorrt/secure/6.0/GA_6.0.1.5/tars/TensorRT-6.0.1.5.Ubuntu-16.04.x86_64-gnu.cuda-10.1.cudnn7.6.tar.gz

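The link above requires logging in to an NVIDIA account first, which is a bit of a hassle. You only have to do it once, though, so it is no big deal.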

Extract it:

tar -xvzf TensorRT-6.0.1.5.Ubuntu-16.04.x86_64-gnu.cuda-10.1.cudnn7.6.tar.gz

Export the environment variables:

export TRT_RELEASE=`pwd`/TensorRT-6.0.1.5
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$TRT_RELEASE/lib

Install the Python module:

cd TensorRT-6.0.1.5/python/
pip install tensorrt-6.0.1.5-cp37-none-linux_x86_64.whl

Done.

Converting a PyTorch model to TensorRT

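The library used here is torch2trt. Before installing it, make sure that TensorRT is already installed and that your cudatoolkit version matches the one your PyTorch build expects (I only found this out after a long debugging session; hopefully you will not run into it).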
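A quick way to see which versions are actually in play is to print them before installing torch2trt. The sketch below is only an assumption about what is worth checking; `environment_report` is a hypothetical helper, and entries stay None when a package cannot be imported.

```python
import importlib

def environment_report():
    """Collect version info; an entry is None when the package is not importable."""
    report = {"torch": None, "torch_cuda": None,
              "cuda_available": None, "tensorrt": None}
    try:
        torch = importlib.import_module("torch")
        report["torch"] = torch.__version__
        report["torch_cuda"] = torch.version.cuda   # CUDA version PyTorch was built with
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        pass
    try:
        trt = importlib.import_module("tensorrt")
        report["tensorrt"] = trt.__version__
    except ImportError:
        pass
    return report

for name, value in environment_report().items():
    print(f"{name}: {value}")
```

The value to compare against the TensorRT package name (cuda-10.1 in the tarball above) is torch_cuda.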

See: https://github.com/NVIDIA-AI-IOT/torch2trt

Normally you can just follow the README.md there. Alternatively, here are the steps from my own trial and error. Install:


 
git clone https://github.com/NVIDIA-AI-IOT/torch2trt
cd torch2trt
python setup.py install

Here is the original demo:


import torch
from torch2trt import torch2trt
from torchvision.models.alexnet import alexnet

# create some regular pytorch model...
model = alexnet(pretrained=True).eval().cuda()

# create example data
x = torch.ones((1, 3, 224, 224)).cuda()

# convert to TensorRT feeding sample data as input
model_trt = torch2trt(model, [x])
 
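This first loads the PyTorch model onto CUDA, then defines a sample input x (mainly to specify the input shape; ones or zeros both work). model_trt is the converted TensorRT model. If the code above runs without errors, the conversion succeeded.

One small pitfall: the original model and the TensorRT model may each take up GPU memory, i.e. two copies (I may be overthinking this; I did not run further experiments). To avoid that, save the TensorRT model once and load it directly the next time you run inference: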
 
 
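# save the converted model once
torch.save(model_trt.state_dict(), 'alexnet_trt.pth')

# later, at inference time: load it without the original PyTorch model
from torch2trt import TRTModule

model_trt = TRTModule()
model_trt.load_state_dict(torch.load('alexnet_trt.pth'))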
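The save and load steps can be wrapped into a single "load if cached, otherwise convert" helper. This is a sketch built only from the torch2trt calls shown above; `load_or_build_trt`, `make_model`, and the argument order are hypothetical names of my own, and I have not measured how much memory it actually saves.

```python
import os

def load_or_build_trt(path, make_model, example_inputs):
    """Load a saved torch2trt module from path, or convert a fresh model and save it.

    make_model: zero-argument callable returning the PyTorch model in eval mode
    on the GPU; it is only called when no saved file exists yet.
    """
    # Imported here so this file can be imported on machines without TensorRT.
    import torch
    from torch2trt import torch2trt, TRTModule

    if os.path.exists(path):
        model_trt = TRTModule()
        model_trt.load_state_dict(torch.load(path))
        return model_trt

    model = make_model()   # the original model is built only on a cache miss
    model_trt = torch2trt(model, example_inputs)
    torch.save(model_trt.state_dict(), path)
    return model_trt
```

With the demo above, a call could look like load_or_build_trt('alexnet_trt.pth', lambda: alexnet(pretrained=True).eval().cuda(), [x]). Keep in mind that a saved TensorRT engine is tied to the GPU and TensorRT version it was built with, so the cached file should be rebuilt after changing either.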

Inference works the same way as with the original PyTorch model:


 
y = model(x)
y_trt = model_trt(x)

# check the output against PyTorch
print(torch.max(torch.abs(y - y_trt)))
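Printing the maximum difference is fine for a quick look. For an automated check, a tolerance-based comparison may be more convenient. In the sketch below, `compare_outputs` is a hypothetical helper, the tolerances are arbitrary assumptions, and the two tensors are stand-ins for y and y_trt so that it runs without a GPU.

```python
import torch

def compare_outputs(y_ref, y_test, atol=1e-3, rtol=1e-3):
    """Return (max absolute difference, whether all elements are within tolerance)."""
    y_ref = y_ref.float()     # compare in float32 even if one side is FP16
    y_test = y_test.float()
    max_abs = torch.max(torch.abs(y_ref - y_test)).item()
    close = torch.allclose(y_test, y_ref, atol=atol, rtol=rtol)
    return max_abs, close

# Stand-in tensors (assumption); in practice pass y and y_trt.
y = torch.tensor([0.10, 0.70, 0.20])
y_trt = torch.tensor([0.1002, 0.6999, 0.2001])

max_abs, close = compare_outputs(y, y_trt)
print(max_abs, close)
```

How tight the tolerances should be depends on the model and on whether reduced precision is used, so treat the defaults as placeholders.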

Experiment

I applied TensorRT acceleration to a CNN I trained myself and got the following results:

Setup	Per-frame inference time
baseline	16.7 ms
baseline + TensorRT	8 ms

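The difference in accuracy is less than 0.1%. Inference is about twice as fast (16.7 ms down to 8 ms per frame), so the TensorRT speedup is quite decent.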
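For completeness, here is the arithmetic behind "about twice as fast", using only the two latencies from the table. The frames-per-second figures assume frames are processed one at a time, back to back, which is my simplification.

```python
baseline_ms = 16.7   # per-frame latency without TensorRT (from the table)
trt_ms = 8.0         # per-frame latency with TensorRT (from the table)

speedup = baseline_ms / trt_ms
reduction = 1 - trt_ms / baseline_ms

print(f"speedup: {speedup:.2f}x")
print(f"latency reduction: {reduction:.0%}")
print(f"throughput: {1000 / baseline_ms:.0f} -> {1000 / trt_ms:.0f} frames/s")
```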
