
Model Deployment | A Hands-On Tutorial on Accelerating PyTorch with TensorRT

How to use TensorRT with your models

Author | 伯恩legacy  Editor | 计算机视觉工坊

Original link: https://zhuanlan.zhihu.com/p/88318324

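1. Introduction

TensorRT is NVIDIA's framework for accelerating model inference: it makes a trained model run faster at test time. For example, if your model takes 50 ms per image, it might take only 10 ms with TensorRT. The exact speedup is not guaranteed, but it is usually substantial. The painful part of TensorRT is that some model operations are unsupported, or are supported only incompletely. For those cases you either write a plugin yourself or wait for an official update.

The mainstream frameworks for training deep learning models today include TensorFlow, PyTorch, MXNet, and Caffe. This post covers only PyTorch. For TensorFlow, see "TensorRT部署深度学习模型" (https://zhuanlan.zhihu.com/p/84125533), which shows how to deploy TensorRT from C++. The principle is the same in every case: a TensorFlow model has to be converted from pb to UFF; a PyTorch model has to be converted from pth to ONNX; a Caffe model needs no conversion, because TensorRT can read Caffe models directly; and an MXNet model also has to be converted to ONNX.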
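
The conversion routes listed above can be written down as a small lookup table. This is only an illustrative sketch: the table name and the helper function are made up for this example, and the entries simply restate the paragraph above.

```python
from typing import Optional

# Intermediate format each framework is converted to before TensorRT parses it,
# as described in the text. None means TensorRT reads the model directly.
INTERMEDIATE_FORMAT = {
    "tensorflow": "uff",
    "pytorch": "onnx",
    "mxnet": "onnx",
    "caffe": None,
}


def intermediate_format(framework: str) -> Optional[str]:
    """Return the intermediate format for a framework name (case-insensitive)."""
    key = framework.lower()
    if key not in INTERMEDIATE_FORMAT:
        raise ValueError(f"no route listed for framework: {framework!r}")
    return INTERMEDIATE_FORMAT[key]


print(intermediate_format("PyTorch"))
print(intermediate_format("caffe"))
```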

This tutorial therefore walks through converting a PyTorch model to TensorRT in both Python and C++, to help newcomers to TensorRT get started quickly.

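2. Installing TensorRT

Installing TensorRT is not difficult, and I recommend the latest version. I use CentOS, so I usually follow this guide:

CentOS TensorRT installation guide
https://tbr8.org/how-to-install-tensorrt-on-centos/

After installation, check that import tensorrt succeeds in Python, and build the official sampleMNIST example. If both work, the installation succeeded.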
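
As a quick sanity check of the Python side, something like the following can report which packages are visible to the interpreter without actually importing them. It is a minimal sketch: the helper name is made up here, and the package list (tensorrt, pycuda, onnx) is just the set of packages used later in this post.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Packages used later in this tutorial.
for name in ("tensorrt", "pycuda", "onnx"):
    status = "found" if is_importable(name) else "MISSING"
    print(f"{name}: {status}")
```

A "found" result only shows that the package is installed; running import tensorrt and the sample build is still the real test.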

[Figure: tensorrt imported successfully in the Python environment]

[Figure: running the official MNIST sample]

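3. Converting a PyTorch model to TensorRT in Python

In Python there are two routes from a PyTorch model to TensorRT: convert the PyTorch pt model to ONNX and then to TensorRT, or convert the pt model to TensorRT directly.

First, convert the pt model to ONNX. This requires the onnx package, which you can install with pip install onnx. Taking ResNet50 as the example, the code is: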

 
 
import torchvision
import torch
import onnx

print(torch.__version__)

input_name = ['input']
output_name = ['output']
# Dummy input that fixes the exported shape: batch 1, 3 channels, 224x224.
# A plain tensor is enough; the Variable wrapper is no longer needed.
dummy_input = torch.randn(1, 3, 224, 224).cuda()
# eval() so that BatchNorm and Dropout are exported in inference mode.
model = torchvision.models.resnet50(pretrained=True).cuda().eval()
torch.onnx.export(model, dummy_input, 'resnet50.onnx',
                  input_names=input_name, output_names=output_name, verbose=True)

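The code above starts from the pretrained ResNet50 in torchvision and converts the pt model to resnet50.onnx. The ONNX input is named 'input', the output is named 'output', and the input image is 3-channel 224x224. The batch size here is 1, but you could equally use 3, 4, 5, and so on. Running the code produces a file named resnet50.onnx.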
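
To make the shape and batch size concrete, here is a small sketch of how much memory one float32 input or output buffer needs. The helper name is made up for illustration, and 4 bytes per element assumes float32, which is what the inference code in this post uses.

```python
from math import prod


def tensor_nbytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor of the given shape (float32 by default)."""
    return prod(shape) * bytes_per_element


print(tensor_nbytes((1, 3, 224, 224)))  # input, batch size 1  -> 602112
print(tensor_nbytes((8, 3, 224, 224)))  # input, batch size 8  -> 4816896
print(tensor_nbytes((1, 1000)))         # output, batch size 1 -> 4000
```

These are the same sizes that the buffer allocation code later in the post computes from the engine bindings.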

It is a good idea to check the generated ONNX file:

import onnx

test = onnx.load('resnet50.onnx')
onnx.checker.check_model(test)
print("==> Passed")

Next, compare the results of the PyTorch model and TensorRT:

 
 
# Note: this script uses the implicit-batch builder API of older TensorRT releases
# (builder.max_batch_size, build_cuda_engine, execute_async with batch_size).
import pycuda.autoinit  # creates the CUDA context on import
import numpy as np
import pycuda.driver as cuda
import tensorrt as trt
import torch
import os
import time
import cv2
import torchvision

filename = 'test.jpg'
max_batch_size = 1
onnx_model_path = 'resnet50.onnx'

TRT_LOGGER = trt.Logger()  # This logger is required to build an engine


def get_img_np_nchw(filename):
    """Read an image and return a normalized array of shape (1, 3, 224, 224)."""
    image = cv2.imread(filename)
    image_cv = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image_cv = cv2.resize(image_cv, (224, 224))
    miu = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    img_np = np.array(image_cv, dtype=float) / 255.
    r = (img_np[:, :, 0] - miu[0]) / std[0]
    g = (img_np[:, :, 1] - miu[1]) / std[1]
    b = (img_np[:, :, 2] - miu[2]) / std[2]
    img_np_t = np.array([r, g, b])
    img_np_nchw = np.expand_dims(img_np_t, axis=0)
    return img_np_nchw


class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        """Within this context, host_mem means the CPU memory and device_mem means the GPU memory."""
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream


def get_engine(max_batch_size=1, onnx_file_path="", engine_file_path="",
               fp16_mode=False, int8_mode=False, save_engine=False):
    """Attempts to load a serialized engine if available, otherwise builds a new TensorRT engine and saves it."""

    def build_engine(max_batch_size, save_engine):
        """Takes an ONNX file and creates a TensorRT engine to run inference with"""
        with trt.Builder(TRT_LOGGER) as builder, \
                builder.create_network() as network, \
                trt.OnnxParser(network, TRT_LOGGER) as parser:
            builder.max_workspace_size = 1 << 30  # Your workspace size
            builder.max_batch_size = max_batch_size
            builder.fp16_mode = fp16_mode  # Default: False
            builder.int8_mode = int8_mode  # Default: False
            if int8_mode:
                # To be updated
                raise NotImplementedError

            # Parse model file
            if not os.path.exists(onnx_file_path):
                quit('ONNX file {} not found'.format(onnx_file_path))
            print('Loading ONNX file from path {}...'.format(onnx_file_path))
            with open(onnx_file_path, 'rb') as model:
                print('Beginning ONNX file parsing')
                # parse() returns False on failure; report the parser errors instead of continuing.
                if not parser.parse(model.read()):
                    for i in range(parser.num_errors):
                        print(parser.get_error(i))
                    quit('Failed to parse ONNX file {}'.format(onnx_file_path))
            print('Completed parsing of ONNX file')

            print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
            engine = builder.build_cuda_engine(network)
            print("Completed creating Engine")

            if save_engine:
                with open(engine_file_path, "wb") as f:
                    f.write(engine.serialize())
            return engine

    if os.path.exists(engine_file_path):
        # If a serialized engine exists, load it instead of building a new one.
        print("Reading engine from file {}".format(engine_file_path))
        with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())
    else:
        return build_engine(max_batch_size, save_engine)


def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer data from CPU to the GPU.
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]


def postprocess_the_outputs(h_outputs, shape_of_output):
    h_outputs = h_outputs.reshape(*shape_of_output)
    return h_outputs


img_np_nchw = get_img_np_nchw(filename)
img_np_nchw = img_np_nchw.astype(dtype=np.float32)

# These two modes are dependent on hardware
fp16_mode = False
int8_mode = False
trt_engine_path = './model_fp16_{}_int8_{}.trt'.format(fp16_mode, int8_mode)
# Build an engine
engine = get_engine(max_batch_size, onnx_model_path, trt_engine_path, fp16_mode, int8_mode)
# Create the context for this engine
context = engine.create_execution_context()
# Allocate buffers for input and output
inputs, outputs, bindings, stream = allocate_buffers(engine)  # input, output: host # bindings

# Do inference
shape_of_output = (max_batch_size, 1000)
# Load data into the page-locked host buffer (copy, so the buffer stays page-locked)
np.copyto(inputs[0].host, img_np_nchw.reshape(-1))
# np.copyto(inputs[1].host, ...) for multiple inputs

t1 = time.time()
trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)  # numpy data
t2 = time.time()
feat = postprocess_the_outputs(trt_outputs[0], shape_of_output)
print('TensorRT ok')

model = torchvision.models.resnet50(pretrained=True).cuda()
resnet_model = model.eval()
input_for_torch = torch.from_numpy(img_np_nchw).cuda()
t3 = time.time()
feat_2 = resnet_model(input_for_torch)
t4 = time.time()
feat_2 = feat_2.cpu().data.numpy()
print('Pytorch ok!')

mse = np.mean((feat - feat_2)**2)
print("Inference time with the TensorRT engine: {}".format(t2-t1))
print("Inference time with the PyTorch model: {}".format(t4-t3))
print('MSE Error = {}'.format(mse))
print('All completed!')

The output is:

TensorRT ok
Pytorch ok!
Inference time with the TensorRT engine: 0.0037250518798828125
Inference time with the PyTorch model: 0.3574800491333008
MSE Error = 3.297184357139993e-12

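In this run the PyTorch ResNet50 took about 357 ms, which seems odd, although I could not find anything wrong with the code. A likely explanation is that the PyTorch timing covers a single, first forward pass, which includes one-time CUDA and cuDNN initialization, so the two timings are not directly comparable. What the run does show is that the TensorRT inference output and the PyTorch forward output differ very little. The code comes from https://github.com/RizhaoCai/PyTorch_ONNX_TensorRT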
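
For a fairer timing comparison, a common approach is to run a few warm-up iterations first and then average over many runs, synchronizing the GPU before reading the clock. The helper below is a minimal sketch of that idea, not part of the original code; the function name and default counts are arbitrary choices.

```python
import time
from statistics import mean


def benchmark(fn, warmup=10, iters=50, sync=None):
    """Return the average seconds per call of fn(), measured after warm-up calls.

    sync is an optional zero-argument callable that blocks until queued GPU
    work has finished; it is called before each clock reading.
    """
    def wait():
        if sync is not None:
            sync()

    for _ in range(warmup):
        fn()
    wait()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        wait()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Stand-in workload so the sketch runs anywhere.
print(benchmark(lambda: sum(range(1000)), warmup=2, iters=5))
```

With the script above, the PyTorch side could be timed as benchmark(lambda: resnet_model(input_for_torch), sync=torch.cuda.synchronize). The TensorRT side needs no sync argument, because do_inference already synchronizes its stream.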

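Next is the direct route in Python, converting the PyTorch model straight to TensorRT. The reference code is NVIDIA-AI-IOT/torch2trt, https://github.com/NVIDIA-AI-IOT/torch2trt. The project is simple and easy to follow, of high quality, and not hard to install. My own run gave the following result:

[Figure: torch2trt run result]

For your own PyTorch model, just swap it in for the model in that code. Note that you will often hit the error "output tensor has no attribute _trt" at run time. It means some operation in your model has no converter implemented yet, and you need to implement one yourself.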
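
For reference, the conversion call itself is short. The sketch below wraps it in two helper functions whose names are made up for this example. It assumes torch2trt and a CUDA build of PyTorch are installed; the imports are done lazily inside the functions so the file can still be loaded on a machine without them.

```python
def convert_with_torch2trt(model, example_input, fp16_mode=False):
    """Convert a PyTorch module to a TensorRT-backed module using torch2trt."""
    # Lazy import: torch2trt is an optional dependency of this sketch.
    from torch2trt import torch2trt

    model = model.eval().cuda()
    example_input = example_input.cuda()
    # torch2trt traces the model with the example inputs, given as a list.
    return torch2trt(model, [example_input], fp16_mode=fp16_mode)


def max_abs_diff(model, model_trt, example_input):
    """Largest absolute difference between the PyTorch and TensorRT outputs."""
    import torch

    with torch.no_grad():
        x = example_input.cuda()
        return torch.max(torch.abs(model(x) - model_trt(x))).item()
```

Typical use would be model_trt = convert_with_torch2trt(torchvision.models.resnet50(pretrained=True), torch.randn(1, 3, 224, 224)), followed by max_abs_diff to confirm the two models agree.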

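4. Converting a PyTorch model to TensorRT in C++

In C++, taking sampleOnnxMNIST from TensorRT 5.1.5.0 as the starting point, we read an image with OpenCV and have TensorRT run doInference to output the (1, 1000) feature vector. The code is below; replace the sampleOnnxMNIST source with it, then build and run.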

 
 
#include <algorithm>
#include <assert.h>
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <sys/stat.h>
#include <time.h>
#include <opencv2/opencv.hpp>

#include "NvInfer.h"
#include "NvOnnxParser.h"
#include "argsParser.h"
#include "logger.h"
#include "common.h"
#include "image.hpp"

#define DebugP(x) std::cout << "Line" << __LINE__ << " " << #x << "=" << x << std::endl

using namespace nvinfer1;

static const int INPUT_H = 224;
static const int INPUT_W = 224;
static const int INPUT_C = 3;
static const int OUTPUT_SIZE = 1000;

const char* INPUT_BLOB_NAME = "input";
const char* OUTPUT_BLOB_NAME = "output";

const std::string gSampleName = "TensorRT.sample_onnx_image";

samplesCommon::Args gArgs;

bool onnxToTRTModel(const std::string& modelFile, // name of the onnx model
                    unsigned int maxBatchSize,    // batch size - NB must be at least as large as the batch we want to run with
                    IHostMemory*& trtModelStream) // output buffer for the TensorRT model
{
    // create the builder
    IBuilder* builder = createInferBuilder(gLogger.getTRTLogger());
    assert(builder != nullptr);
    nvinfer1::INetworkDefinition* network = builder->createNetwork();

    auto parser = nvonnxparser::createParser(*network, gLogger.getTRTLogger());

    //Optional - uncomment below lines to view network layer information
    //config->setPrintLayerInfo(true);
    //parser->reportParsingInfo();

    if ( !parser->parseFromFile( locateFile(modelFile, gArgs.dataDirs).c_str(), static_cast<int>(gLogger.getReportableSeverity()) ) )
    {
        gLogError << "Failure while parsing ONNX file" << std::endl;
        return false;
    }

    // Build the engine
    builder->setMaxBatchSize(maxBatchSize);
    //builder->setMaxWorkspaceSize(1 << 20);
    builder->setMaxWorkspaceSize(10 << 20);
    builder->setFp16Mode(gArgs.runInFp16);
    builder->setInt8Mode(gArgs.runInInt8);

    if (gArgs.runInInt8)
    {
        samplesCommon::setAllTensorScales(network, 127.0f, 127.0f);
    }

    samplesCommon::enableDLA(builder, gArgs.useDLACore);

    ICudaEngine* engine = builder->buildCudaEngine(*network);
    assert(engine);

    // we can destroy the parser
    parser->destroy();

    // serialize the engine, then close everything down
    trtModelStream = engine->serialize();
    engine->destroy();
    network->destroy();
    builder->destroy();

    return true;
}

void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
    const ICudaEngine& engine = context.getEngine();
    // input and output buffer pointers that we pass to the engine - the engine requires exactly IEngine::getNbBindings(),
    // of these, but in this case we know that there is exactly one input and one output.
    assert(engine.getNbBindings() == 2);
    void* buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
    DebugP(inputIndex); DebugP(outputIndex);

    // create GPU buffers and a stream
    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_C * INPUT_H * INPUT_W * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // DMA the input to the GPU, execute the batch asynchronously, and DMA it back:
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueue(batchSize, buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);

    // release the stream and the buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}

//!
//! \brief This function prints the help information for running this sample
//!
void printHelpInfo()
{
    std::cout << "Usage: ./sample_onnx_mnist [-h or --help] [-d or --datadir=<path to data directory>] [--useDLACore=<int>]\n";
    std::cout << "--help Display help information\n";
    std::cout << "--datadir Specify path to a data directory, overriding the default. This option can be used multiple times to add multiple directories. If no data directories are given, the default is to use (data/samples/mnist/, data/mnist/)" << std::endl;
    std::cout << "--useDLACore=N Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, where n is the number of DLA engines on the platform." << std::endl;
    std::cout << "--int8 Run in Int8 mode.\n";
    std::cout << "--fp16 Run in FP16 mode." << std::endl;
}

int main(int argc, char** argv)
{
    bool argsOK = samplesCommon::parseArgs(gArgs, argc, argv);
    if (gArgs.help)
    {
        printHelpInfo();
        return EXIT_SUCCESS;
    }
    if (!argsOK)
    {
        gLogError << "Invalid arguments" << std::endl;
        printHelpInfo();
        return EXIT_FAILURE;
    }
    if (gArgs.dataDirs.empty())
    {
        gArgs.dataDirs = std::vector<std::string>{"data/samples/mnist/", "data/mnist/"};
    }

    auto sampleTest = gLogger.defineTest(gSampleName, argc, const_cast<const char**>(argv));

    gLogger.reportTestStart(sampleTest);

    // create a TensorRT model from the onnx model and serialize it to a stream
    IHostMemory* trtModelStream{nullptr};

    if (!onnxToTRTModel("resnet50.onnx", 1, trtModelStream))
    {
        // stop here instead of continuing with a null model stream
        gLogger.reportFail(sampleTest);
        return EXIT_FAILURE;
    }

    assert(trtModelStream != nullptr);
    std::cout << "Successfully parsed ONNX file!!!!" << std::endl;

    std::cout << "Start reading the input image!!!!" << std::endl;

    cv::Mat image = cv::imread(locateFile("test.jpg", gArgs.dataDirs), cv::IMREAD_COLOR);
    if (image.empty())
    {
        std::cout << "The input image is empty!!! Please check....." << std::endl;
        return EXIT_FAILURE;
    }
    DebugP(image.size());
    cv::cvtColor(image, image, cv::COLOR_BGR2RGB);

    cv::Mat dst = cv::Mat::zeros(INPUT_H, INPUT_W, CV_32FC3);
    cv::resize(image, dst, dst.size());
    DebugP(dst.size());

    // normal() returns a calloc'ed CHW float buffer; it is freed after inference
    float* data = normal(dst);

    // deserialize the engine
    IRuntime* runtime = createInferRuntime(gLogger);
    assert(runtime != nullptr);
    if (gArgs.useDLACore >= 0)
    {
        runtime->setDLACore(gArgs.useDLACore);
    }

    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream->data(), trtModelStream->size(), nullptr);
    assert(engine != nullptr);
    trtModelStream->destroy();
    IExecutionContext* context = engine->createExecutionContext();
    assert(context != nullptr);

    float prob[OUTPUT_SIZE];
    typedef std::chrono::high_resolution_clock Time;
    typedef std::chrono::duration<double, std::ratio<1, 1000>> ms;
    typedef std::chrono::duration<float> fsec;
    double total = 0.0;

    // run inference and measure the time
    auto t0 = Time::now();
    doInference(*context, data, prob, 1);
    auto t1 = Time::now();
    fsec fs = t1 - t0;
    ms d = std::chrono::duration_cast<ms>(fs);
    total += d.count();
    free(data);

    // destroy the engine
    context->destroy();
    engine->destroy();
    runtime->destroy();

    std::cout << std::endl << "Running time of one image is:" << total << "ms" << std::endl;

    gLogInfo << "Output:\n";
    for (int i = 0; i < OUTPUT_SIZE; i++)
    {
        gLogInfo << prob[i] << " ";
    }
    gLogInfo << std::endl;

    return gLogger.reportTest(sampleTest, true);
}

The code of image.cpp is:

 
 
#include <cstdlib>
#include <opencv2/opencv.hpp>
#include "image.hpp"

static const float kMean[3] = { 0.485f, 0.456f, 0.406f };
static const float kStdDev[3] = { 0.229f, 0.224f, 0.225f };
// Color map; not used by normal() below.
static const int map_[7][3] = { {0,0,0} ,
                                {128,0,0},
                                {0,128,0},
                                {0,0,128},
                                {128,128,0},
                                {128,0,128},
                                {0,128,0}};

// Convert an 8-bit HWC image to a normalized float buffer in CHW order.
// The caller owns the returned buffer and must release it with free().
float* normal(cv::Mat img) {
    //cv::Mat image(img.rows, img.cols, CV_32FC3);
    float * data;
    data = (float*)calloc(img.rows*img.cols * 3, sizeof(float));

    for (int c = 0; c < 3; ++c)
    {
        for (int i = 0; i < img.rows; ++i)
        { // pointer to the first pixel of row i
            cv::Vec3b *p1 = img.ptr<cv::Vec3b>(i);
            //cv::Vec3b *p2 = image.ptr<cv::Vec3b>(i);
            for (int j = 0; j < img.cols; ++j)
            {
                data[c * img.cols * img.rows + i * img.cols + j] = (p1[j][c] / 255.0f - kMean[c]) / kStdDev[c];
            }
        }
    }
    return data;
}

The contents of image.hpp are:

 
 
#pragma once
#include <opencv2/opencv.hpp>  // for cv::Mat, so the header is usable on its own

typedef struct {
    int w;
    int h;
    int c;
    float *data;
} image;

float* normal(cv::Mat img);

The output is:

[Figure: C++ run output]

For the same test.jpg, the output in the Python environment is:

[Figure: Python run output]

As you can see, the (1, 1000) feature output by ResNet50 in C++ differs very little from feat (TensorRT) and feat_2 (PyTorch) in the Python environment.
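
The two pipelines agree only because both preprocessing routines write the pixels in the same channel-first (CHW) order. The C++ normal() function does this with the offset c * H * W + i * W + j. The sketch below restates that indexing in plain Python on a tiny image; the function names are made up for illustration.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def chw_offset(c, i, j, height, width):
    """Offset of channel c, row i, column j in a flat CHW buffer."""
    return c * height * width + i * width + j


def normalize_to_chw(pixels_hwc):
    """pixels_hwc[i][j] is an (r, g, b) tuple of ints in 0..255.

    Returns a flat list of normalized floats in CHW order.
    """
    height, width = len(pixels_hwc), len(pixels_hwc[0])
    out = [0.0] * (3 * height * width)
    for i, row in enumerate(pixels_hwc):
        for j, pixel in enumerate(row):
            for c in range(3):
                value = (pixel[c] / 255.0 - MEAN[c]) / STD[c]
                out[chw_offset(c, i, j, height, width)] = value
    return out


tiny = [[(255, 0, 0), (0, 255, 0)],
        [(0, 0, 255), (255, 255, 255)]]
flat = normalize_to_chw(tiny)
print(flat[:4])   # the whole red plane comes first
```

In NumPy the same reordering is a transpose(2, 0, 1) on the HWC array, which is effectively what np.array([r, g, b]) does in the Python script above.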

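The approach above first converts PyTorch to ONNX and then has TensorRT parse the ONNX file to build the engine. How can TensorRT load an engine file directly instead? That is, we first convert the ONNX file to a TensorRT trt file, and then have TensorRT in C++ load the trt file directly to create the engine.

Here we use the onnx-tensorrt project, https://github.com/onnx/onnx-tensorrt, to convert resnet50.onnx to resnet50.trt. It is not hard to install: just install protobuf as its instructions require. A successful installation looks like this:

[Figure: successful onnx-tensorrt installation]

Run the following command to produce the resnet50.trt engine file: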

 
 
onnx2trt resnet50.onnx -o resnet50.trt

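Note that onnx-tensorrt has a build option for specifying the GPU compute capability, as shown below:

[Figure: GPU compute capability option when building onnx-tensorrt]

You can look up the compute capability of each GPU at https://developer.nvidia.com/cuda-gpus. For example, a trt file generated for compute capability 7.5 cannot be loaded on a GPU with compute capability 6.1.

[Figure]

The onnx2trt command also has a -b option that sets the batch size of the generated trt file. Set it to whatever batch size you actually use at test time. In my case, as I recall, the trt file had batch size 1 while my real batch size was 8; after running, only one image produced a result and the other seven were all zeros.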
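
If the engine has already been built with a smaller batch size than the number of images at hand, one workaround is to feed the images in chunks that never exceed the engine's batch size. This is a minimal sketch of the chunking only; the function name is made up for this example, and rebuilding the engine with the right -b value is still the cleaner fix.

```python
def iter_batches(items, max_batch_size):
    """Yield consecutive slices of items, each at most max_batch_size long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]


images = [f"img_{k}.jpg" for k in range(8)]
for batch in iter_batches(images, max_batch_size=3):
    print(len(batch), batch)
```

Each yielded chunk can then be copied into the input buffer and run with its own length as the batch size.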

If the trt file is generated successfully, you can add the following function to the code to create the engine; nothing else needs to change.

 
 
bool read_TRT_File(const std::string& engineFile, IHostMemory*& trtModelStream)
{
    std::cout << "loading filename from:" << engineFile << std::endl;
    //nvonnxparser::IPluginFactory* onnxPlugin = createPluginFactory(gLogger.getTRTLogger());
    std::ifstream file(engineFile, std::ios::binary | std::ios::in);
    if (!file.is_open())
    {
        std::cout << "failed to open engine file:" << engineFile << std::endl;
        return false;
    }
    file.seekg(0, std::ios::end);
    const size_t length = static_cast<size_t>(file.tellg());
    std::cout << "length:" << length << std::endl;
    file.seekg(0, std::ios::beg);
    std::unique_ptr<char[]> data(new char[length]);
    file.read(data.get(), length);
    file.close();
    std::cout << "load engine done" << std::endl;

    std::cout << "deserializing" << std::endl;
    nvinfer1::IRuntime* trtRuntime = createInferRuntime(gLogger.getTRTLogger());
    //ICudaEngine* engine = trtRuntime->deserializeCudaEngine(data.get(), length, onnxPlugin);
    ICudaEngine* engine = trtRuntime->deserializeCudaEngine(data.get(), length, nullptr);
    std::cout << "deserialize done" << std::endl;
    assert(engine != nullptr);
    std::cout << "The engine in TensorRT.cpp is not nullptr" << std::endl;

    trtModelStream = engine->serialize();
    // main() deserializes trtModelStream again, so release these objects here
    engine->destroy();
    trtRuntime->destroy();
    return true;
}

To save an engine file, add these lines to your own code to write a trt file that you can load directly next time.

nvinfer1::IHostMemory* data = engine->serialize();
std::ofstream file;
file.open(filename, std::ios::binary | std::ios::out);
std::cout << "writing engine file..." << std::endl;
file.write((const char*)data->data(), data->size());
std::cout << "save engine file done" << std::endl;
file.close();
data->destroy();  // release the serialized buffer once it is on disk
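
The same "build once, then reuse the saved file" pattern can be sketched in Python independently of TensorRT. In this sketch load_or_build and build_fn are made-up names: build_fn stands for any callable that returns the serialized engine as bytes.

```python
import os


def load_or_build(engine_path, build_fn):
    """Return serialized engine bytes, building and caching them on first use."""
    if os.path.exists(engine_path):
        with open(engine_path, "rb") as f:
            return f.read()

    data = bytes(build_fn())
    # Write to a temporary name first so a crash cannot leave a truncated engine file.
    tmp_path = engine_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, engine_path)
    return data
```

The cached file is only valid for the TensorRT version and GPU it was built with, so delete it whenever either changes.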

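5. Summary

Deploying with TensorRT is not hard; the hard part is model conversion. Along the way there are many operations that TensorRT does not support, or the ONNX file exported from PyTorch is itself problematic. Errors about unsupported expand, Gather, or reshape operations come up often. TensorRT seems particularly unfriendly to PyTorch's shape-changing operations: in my own conversions, the vast majority of bugs came from shape changes. If you have questions, please leave a comment below. That is all for now; I will add more later.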
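
One way to spot trouble before handing a model to TensorRT is to list the operator types in the exported ONNX graph and compare them against the ops you know your TensorRT version handles. The sketch below assumes the onnx package is installed for the loading step; the function names and the supported set in the example are purely illustrative and are not an official support list.

```python
from collections import Counter


def unsupported_ops(op_types, supported):
    """Count the op types that are not in the supported set."""
    counts = Counter(op_types)
    return {op: n for op, n in counts.items() if op not in supported}


def onnx_op_types(onnx_path):
    """List the operator type of every node in an ONNX file."""
    import onnx  # lazy import: only needed when reading a real model

    model = onnx.load(onnx_path)
    return [node.op_type for node in model.graph.node]


# Hypothetical supported set, for illustration only.
example_supported = {"Conv", "Relu", "MaxPool", "Gemm"}
print(unsupported_ops(["Conv", "Relu", "Gather", "Conv", "Expand"], example_supported))
```

For the model in this post the call would be unsupported_ops(onnx_op_types('resnet50.onnx'), supported), with supported filled in from the operator list of your TensorRT release.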

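I also came across tiny-tensorrt, which appears to make deployment in both C++ and Python very easy. I am noting it here for reference; if you are interested, take a look: https://github.com/zerollzeng/tiny-tensorrt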
