
[Code Analysis] TensorRT sampleMNIST Explained


Contents

Preface

Code analysis

Main entry point

Network construction (build) phase

Network inference (infer) phase

Releasing resources


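Preface

sampleMNIST, the "hello world" program of TensorRT, is a good starting point for TensorRT beginners. This article walks through the sampleMNIST code in detail and uses it to explain, from a practical angle, the core TensorRT concepts, how TensorRT relates to CUDA, and how the core APIs are used.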

 

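Code analysis

The sampleMNIST source is on GitHub: https://github.com/NVIDIA/TensorRT/blob/release/6.0/samples/opensource/sampleMNIST/sampleMNIST.cpp

The program runs in four stages: main and argument initialization -> network construction -> network inference -> resource release. Each stage is analyzed in turn below.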

 

Main entry point

    void printHelpInfo()
    {
        std::cout
            << "Usage: ./sample_mnist [-h or --help] [-d or --datadir=<path to data directory>] [--useDLACore=<int>]\n";
        std::cout << "--help Display help information\n";
        std::cout << "--datadir Specify path to a data directory, overriding the default. This option can be used "
                     "multiple times to add multiple directories. If no data directories are given, the default is to use "
                     "(data/samples/mnist/, data/mnist/)"
                  << std::endl;
        std::cout << "--useDLACore=N Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
                     "where n is the number of DLA engines on the platform."
                  << std::endl;
        std::cout << "--int8 Run in Int8 mode.\n";
        std::cout << "--fp16 Run in FP16 mode.\n";
    }

    int main(int argc, char** argv)
    {
        samplesCommon::Args args;
        bool argsOK = samplesCommon::parseArgs(args, argc, argv);
        ......
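  • main starts by parsing the command-line arguments. They let the user specify the data directory that holds the Caffe model, the index of the DLA core to use, and INT8 or FP16 mode; see printHelpInfo() for the full list.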
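To make the option handling concrete, here is a minimal sketch of parsing the long options listed by printHelpInfo(). It is not the implementation of samplesCommon::parseArgs: SketchArgs, startsWith, and parseSketchArgs are hypothetical names, and appending a trailing '/' to each data directory is an assumption of this sketch.

```cpp
#include <cassert>
#include <exception>
#include <string>
#include <vector>

// Hypothetical stand-in for samplesCommon::Args; field names follow the article.
struct SketchArgs
{
    bool help{false};
    bool runInInt8{false};
    bool runInFp16{false};
    int useDLACore{-1}; // -1 means "do not use DLA"
    std::vector<std::string> dataDirs;
};

// Returns true if `arg` begins with `prefix`.
inline bool startsWith(const std::string& arg, const std::string& prefix)
{
    return arg.compare(0, prefix.size(), prefix) == 0;
}

// Parses the long options only; returns false on an unknown or malformed option.
inline bool parseSketchArgs(SketchArgs& args, const std::vector<std::string>& argv)
{
    const std::string dataDirOpt = "--datadir=";
    const std::string dlaOpt = "--useDLACore=";
    for (const std::string& arg : argv)
    {
        if (arg == "--help") { args.help = true; }
        else if (arg == "--int8") { args.runInInt8 = true; }
        else if (arg == "--fp16") { args.runInFp16 = true; }
        else if (startsWith(arg, dataDirOpt))
        {
            std::string dir = arg.substr(dataDirOpt.size());
            if (dir.empty()) { return false; }
            if (dir.back() != '/') { dir.push_back('/'); } // assumption: normalize to a trailing slash
            args.dataDirs.push_back(dir); // the option may be repeated
        }
        else if (startsWith(arg, dlaOpt))
        {
            try { args.useDLACore = std::stoi(arg.substr(dlaOpt.size())); }
            catch (const std::exception&) { return false; } // not a number
        }
        else { return false; }
    }
    return true;
}
```

With the arguments `--datadir=my/data --useDLACore=1 --fp16`, this sketch records one data directory, DLA core index 1, and FP16 mode.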
    samplesCommon::CaffeSampleParams initializeSampleParams(const samplesCommon::Args& args)
    {
        samplesCommon::CaffeSampleParams params;
        if (args.dataDirs.empty()) //!< Use default directories if user hasn't provided directory paths
        {
            params.dataDirs.push_back("data/mnist/");
            params.dataDirs.push_back("data/samples/mnist/");
        }
        else //!< Use the data directory provided by the user
        {
            params.dataDirs = args.dataDirs;
        }
        params.prototxtFileName = locateFile("mnist.prototxt", params.dataDirs);
        params.weightsFileName = locateFile("mnist.caffemodel", params.dataDirs);
        params.meanFileName = locateFile("mnist_mean.binaryproto", params.dataDirs);
        params.inputTensorNames.push_back("data");
        params.batchSize = 1;
        params.outputTensorNames.push_back("prob");
        params.dlaCore = args.useDLACore;
        params.int8 = args.runInInt8;
        params.fp16 = args.runInFp16;
        return params;
    }
    ......
    int main(int argc, char** argv)
    {
        ......
        samplesCommon::CaffeSampleParams params = initializeSampleParams(args);
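  • initializeSampleParams builds a CaffeSampleParams instance from the parsed arguments. It sets the default data directories for the Caffe model, locates the MNIST prototxt file, the Caffe model (weights) file, and the mean binaryproto file, names the network's input tensor "data" and its output tensor "prob", sets the batch size to 1, and copies the user's choices for the DLA core and for INT8 / FP16 mode.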
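The directory search can be pictured with the sketch below: try each data directory in order and return the first path that exists. This only illustrates the idea and is not the sample's locateFile, which may behave differently (for example in how it reports a missing file). findInDirs, fileExists, and ExistsFn are hypothetical names; the existence check is injectable so the logic can be exercised without touching the file system.

```cpp
#include <cassert>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Returns true if `path` can be opened for reading.
inline bool fileExists(const std::string& path)
{
    return std::ifstream(path).good();
}

using ExistsFn = std::function<bool(const std::string&)>;

// Hypothetical helper: returns the first "<dir><name>" for which `exists` is true,
// or an empty string if the file is in none of the directories.
inline std::string findInDirs(
    const std::string& name, const std::vector<std::string>& dirs, const ExistsFn& exists = fileExists)
{
    assert(!name.empty());
    for (const std::string& dir : dirs)
    {
        const std::string candidate = dir + name; // directories are expected to end with '/'
        if (exists(candidate))
        {
            return candidate;
        }
    }
    return std::string();
}
```

With the two default directories from the article, a file present only in data/samples/mnist/ is still found, because the second directory is tried after the first one misses.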
    class SampleMNIST
    {
        template <typename T>
        using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;

    public:
        SampleMNIST(const samplesCommon::CaffeSampleParams& params)
            : mParams(params)
        ......
    int main(int argc, char** argv)
    {
        ......
        SampleMNIST sample(params);
        gLogInfo << "Building and running a GPU inference engine for MNIST" << std::endl;
  • The SampleMNIST object is constructed from the CaffeSampleParams, which it stores in mParams.
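The SampleUniquePtr alias in the class above pairs std::unique_ptr with a custom deleter. The sketch below shows that pattern in isolation, assuming the deleter's job is to call destroy() on the object instead of using delete. FakeTrtObject, DestroyDeleter, and SketchUniquePtr are hypothetical stand-ins, not TensorRT or samplesCommon types.

```cpp
#include <cassert>
#include <memory>

// Hypothetical object that must be released through destroy(), not through delete.
struct FakeTrtObject
{
    explicit FakeTrtObject(int& counter)
        : destroyCount(counter)
    {
    }

    void destroy()
    {
        ++destroyCount; // record the call so the effect is observable
        delete this;
    }

    int& destroyCount;

private:
    ~FakeTrtObject() = default; // private: callers are forced to use destroy()
};

// Deleter that calls destroy() on any non-null pointer it is given.
struct DestroyDeleter
{
    template <typename T>
    void operator()(T* obj) const
    {
        if (obj)
        {
            obj->destroy();
        }
    }
};

template <typename T>
using SketchUniquePtr = std::unique_ptr<T, DestroyDeleter>;

// Creates an object in an inner scope and reports how often destroy() ran.
inline int destroyCallsAfterScope()
{
    int destroyed = 0;
    {
        SketchUniquePtr<FakeTrtObject> obj(new FakeTrtObject(destroyed));
        assert(destroyed == 0); // still alive inside the scope
    }
    return destroyed; // destroy() ran exactly once when obj went out of scope
}
```

The benefit of this design is that every early `return false;` in a function such as build() still releases the objects created so far, without explicit cleanup code.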
    int main(int argc, char** argv)
    {
        ......
        if (!sample.build())
        {
            return gLogger.reportFail(sampleTest);
        }
        ......

main then uses the SampleMNIST object to build the MNIST network. The build method is analyzed in detail next.

Network construction (build) phase

    bool SampleMNIST::build()
    {
        auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(gLogger.getTRTLogger()));
        if (!builder)
        {
            return false;
        }
        auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork());
        if (!network)
        {
            return false;
        }
        auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
        if (!config)
        {
            return false;
        }
        auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::createCaffeParser());
        if (!parser)
        {
            return false;
        }
        constructNetwork(parser, network);
        ......
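  • This is the standard TensorRT workflow: create an IBuilder from the logger, use the IBuilder to create both an INetworkDefinition and an IBuilderConfig, and create an ICaffeParser for parsing the Caffe model. constructNetwork then uses the ICaffeParser to parse the Caffe model and populate the INetworkDefinition with a network that TensorRT can optimize and run.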
    void SampleMNIST::constructNetwork(
        SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network)
    {
        const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor = parser->parse(
            mParams.prototxtFileName.c_str(), mParams.weightsFileName.c_str(), *network, nvinfer1::DataType::kFLOAT);

        for (auto& s : mParams.outputTensorNames)
        {
            network->markOutput(*blobNameToTensor->find(s.c_str()));
        }

        // add mean subtraction to the beginning of the network
        nvinfer1::Dims inputDims = network->getInput(0)->getDimensions();
        mMeanBlob
            = SampleUniquePtr<nvcaffeparser1::IBinaryProtoBlob>(parser->parseBinaryProto(mParams.meanFileName.c_str()));
        nvinfer1::Weights meanWeights{nvinfer1::DataType::kFLOAT, mMeanBlob->getData(), inputDims.d[1] * inputDims.d[2]};
        // For this sample, a large range based on the mean data is chosen and applied to the head of the network.
        // After the mean subtraction occurs, the range is expected to be between -127 and 127, so the rest of the network
        // is given a generic range.
        // The preferred method is use scales computed based on a representative data set
        // and apply each one individually based on the tensor. The range here is large enough for the
        // network, but is chosen for example purposes only.
        float maxMean
            = samplesCommon::getMaxValue(static_cast<const float*>(meanWeights.values), samplesCommon::volume(inputDims));

        auto mean = network->addConstant(nvinfer1::Dims3(1, inputDims.d[1], inputDims.d[2]), meanWeights);
        mean->getOutput(0)->setDynamicRange(-maxMean, maxMean);
        network->getInput(0)->setDynamicRange(-maxMean, maxMean);
        auto meanSub = network->addElementWise(*network->getInput(0), *mean->getOutput(0), ElementWiseOperation::kSUB);
        meanSub->getOutput(0)->setDynamicRange(-maxMean, maxMean);
        network->getLayer(0)->setInput(0, *meanSub->getOutput(0));
        samplesCommon::setAllTensorScales(network.get(), 127.0f, 127.0f);
    }
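  • parser->parse parses the Caffe prototxt and weights files, populates network, and returns blobNameToTensor, an object that looks up ITensor objects by name.
  • blobNameToTensor->find locates each output ITensor named in the parameters, and network->markOutput marks it as a network output.
  • network->getInput(0)->getDimensions() fetches the network's input ITensor and returns its dimensions (Dims).
  • parser->parseBinaryProto parses the Caffe mean image file and wraps it in an IBinaryProtoBlob.
  • meanWeights is a Weights object describing the mean image: its data comes from mMeanBlob->getData() and its element count is inputDims.d[1] * inputDims.d[2].
  • Mean subtraction is then inserted in front of the original network, and dynamic ranges are set on the tensors involved (these ranges are what INT8 mode uses). The steps are:
  1. network->addConstant creates an IConstantLayer that holds the mean image; its output has the 3-D shape Dims3(1, H, W).
  2. network->addElementWise creates an IElementWiseLayer that takes the original network input and the IConstantLayer output and subtracts the second from the first (kSUB).
  3. network->getLayer(0)->setInput rewires the first layer of the original network to read the IElementWiseLayer output instead of the raw input, which completes the mean-subtraction preprocessing.
Figure: the original network input replaced by the mean-subtracted tensor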
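The arithmetic that the constant layer and the kSUB element-wise layer perform can be written out on the CPU as a minimal sketch. subtractMean and maxValue are hypothetical helper names; taking the largest mean value as a symmetric bound follows the maxMean variable in the code above, and how getMaxValue is implemented is an assumption here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise input - mean, i.e. what the kSUB element-wise layer computes per pixel.
inline std::vector<float> subtractMean(const std::vector<float>& input, const std::vector<float>& mean)
{
    assert(input.size() == mean.size());
    std::vector<float> out(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        out[i] = input[i] - mean[i];
    }
    return out;
}

// Largest value of the mean image; the sample uses such a value as the bound
// of a symmetric dynamic range [-maxMean, maxMean].
inline float maxValue(const std::vector<float>& values)
{
    assert(!values.empty());
    return *std::max_element(values.begin(), values.end());
}
```

For pixels {255, 0, 128} and a mean of {33, 33, 28}, the subtraction yields {222, -33, 100}, and the bound derived from the mean is 33.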

 

    bool SampleMNIST::build()
    {
        ......
        builder->setMaxBatchSize(mParams.batchSize);
        config->setMaxWorkspaceSize(16_MiB);
        config->setFlag(BuilderFlag::kGPU_FALLBACK);
        config->setFlag(BuilderFlag::kSTRICT_TYPES);
        if (mParams.fp16)
        {
            config->setFlag(BuilderFlag::kFP16);
        }
        if (mParams.int8)
        {
            config->setFlag(BuilderFlag::kINT8);
        }

        samplesCommon::enableDLA(builder.get(), config.get(), mParams.dlaCore);

        mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
            builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());
        if (!mEngine)
            return false;

        assert(network->getNbInputs() == 1);
        mInputDims = network->getInput(0)->getDimensions();
        assert(mInputDims.nbDims == 3);
        return true;
    }
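  • After constructNetwork returns, the builder's maximum batch size is set from batchSize in the sample parameters.
  • config sets the maximum workspace size (16 MiB here), which bounds the temporary memory a layer may use, and the builder flags (GPU fallback, strict types, and optionally FP16 / INT8).
  • enableDLA configures whether NVIDIA's Deep Learning Accelerator (DLA) is used for hardware acceleration.
  • builder->buildEngineWithConfig builds the ICudaEngine from the network and config objects; the engine is used in the inference phase.
  • Finally, the code asserts that the network has exactly one input and that this input has 3 dimensions.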
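The `16_MiB` argument in the code above is a C++ user-defined literal. A minimal sketch of how such a literal can be defined follows; it is an illustration and not necessarily the exact definition in the sample's common header, and workspaceBytes is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// User-defined literal: 16_MiB means 16 * 2^20 bytes.
constexpr unsigned long long operator""_MiB(unsigned long long value)
{
    return value * (1ULL << 20);
}

// The workspace limit used in the article, expressed in bytes.
inline std::size_t workspaceBytes()
{
    return static_cast<std::size_t>(16_MiB);
}
```

Writing the size this way keeps the unit visible at the call site, so a reader does not have to decode a number such as 16777216.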

 

Network inference (infer) phase

    bool SampleMNIST::infer()
    {
        // Create RAII buffer manager object
        samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);

        auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
        if (!context)
        {
            return false;
        }

        // Pick a random digit to try to infer
        srand(time(NULL));
        const int digit = rand() % 10;

        // Read the input data into the managed buffers
        // There should be just 1 input tensor
        assert(mParams.inputTensorNames.size() == 1);
        if (!processInput(buffers, mParams.inputTensorNames[0], digit))
        {
            return false;
        }
        ......
    int main(int argc, char** argv)
    {
        ......
        if (!sample.infer())
        {
            return gLogger.reportFail(sampleTest);
        }
        ......
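  • After build succeeds, main calls infer to run inference.
  • infer first creates a BufferManager, a helper class that allocates and manages host and device memory (see the class diagram caption below).
Figure: main classes of BufferManager
  • The template class GenericBuffer takes the template parameters AllocFunc and FreeFunc to select host or device allocation. As the following code shows, DeviceAllocator / DeviceFree use cudaMalloc / cudaFree to allocate and release GPU device memory, while HostAllocator / HostFree use malloc / free to allocate and release CPU host memory.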
    class DeviceAllocator
    {
    public:
        bool operator()(void** ptr, size_t size) const
        {
            return cudaMalloc(ptr, size) == cudaSuccess;
        }
    };

    class DeviceFree
    {
    public:
        void operator()(void* ptr) const
        {
            cudaFree(ptr);
        }
    };
    ......
    class HostAllocator
    {
    public:
        bool operator()(void** ptr, size_t size) const
        {
            *ptr = malloc(size);
            return *ptr != nullptr;
        }
    };

    class HostFree
    {
    public:
        void operator()(void* ptr) const
        {
            free(ptr);
        }
    };
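How a buffer class can be parameterized by such allocator and free functors is sketched below. This is a simplified illustration and not the sample's GenericBuffer: SketchBuffer, HostAlloc, HostRelease, and SketchHostBuffer are hypothetical names, only the host variant is shown so the code has no CUDA dependency, and throwing std::bad_alloc on failure is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Host allocation functor with the same call shape as the functors in the article.
struct HostAlloc
{
    bool operator()(void** ptr, std::size_t size) const
    {
        *ptr = std::malloc(size);
        return *ptr != nullptr;
    }
};

// Host release functor.
struct HostRelease
{
    void operator()(void* ptr) const
    {
        std::free(ptr);
    }
};

// RAII buffer: allocates in the constructor, frees in the destructor.
// The two template parameters decide where the memory lives.
template <typename AllocFunc, typename FreeFunc>
class SketchBuffer
{
public:
    explicit SketchBuffer(std::size_t nbBytes)
        : mSize(nbBytes)
    {
        assert(nbBytes > 0);
        if (!mAlloc(&mData, mSize))
        {
            throw std::bad_alloc();
        }
    }

    ~SketchBuffer()
    {
        mFree(mData);
    }

    // Non-copyable: exactly one owner per allocation.
    SketchBuffer(const SketchBuffer&) = delete;
    SketchBuffer& operator=(const SketchBuffer&) = delete;

    void* data() const { return mData; }
    std::size_t nbBytes() const { return mSize; }

private:
    std::size_t mSize;
    void* mData{nullptr};
    AllocFunc mAlloc;
    FreeFunc mFree;
};

using SketchHostBuffer = SketchBuffer<HostAlloc, HostRelease>;
```

A device-side variant would instantiate the same template with functors that wrap cudaMalloc and cudaFree, as DeviceAllocator and DeviceFree do in the code above.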
  • A ManagedBuffer object pairs a deviceBuffer with a hostBuffer to manage the device and host memory of one binding.
    BufferManager(std::shared_ptr<nvinfer1::ICudaEngine> engine, const int& batchSize,
        const nvinfer1::IExecutionContext* context = nullptr)
        : mEngine(engine)
        , mBatchSize(batchSize)
    {
        // Create host and device buffers
        for (int i = 0; i < mEngine->getNbBindings(); i++)
        {
            auto dims = context ? context->getBindingDimensions(i) : mEngine->getBindingDimensions(i);
            size_t vol = context ? 1 : static_cast<size_t>(mBatchSize);
            nvinfer1::DataType type = mEngine->getBindingDataType(i);
            int vecDim = mEngine->getBindingVectorizedDim(i);
            if (-1 != vecDim) // i.e., 0 != lgScalarsPerVector
            {
                int scalarsPerVec = mEngine->getBindingComponentsPerElement(i);
                dims.d[vecDim] = divUp(dims.d[vecDim], scalarsPerVec);
                vol *= scalarsPerVec;
            }
            vol *= samplesCommon::volume(dims);
            std::unique_ptr<ManagedBuffer> manBuf{new ManagedBuffer()};
            manBuf->deviceBuffer = DeviceBuffer(vol, type);
            manBuf->hostBuffer = HostBuffer(vol, type);
            mDeviceBindings.emplace_back(manBuf->deviceBuffer.data());
            mManagedBuffers.emplace_back(std::move(manBuf));
        }
    }
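  • BufferManager manages several ManagedBuffer objects and stores the device pointer of each one's deviceBuffer in mDeviceBindings.
  • The constructor iterates over all inputs and outputs of the network with mEngine->getNbBindings(). One detail matters here: the loop index i corresponds one-to-one with a tensor name, that is, the binding index found by looking up a tensor's name equals the index i. For each input or output, the constructor reads its dimensions (dims) and data type (type) and computes the number of elements (vol) the tensor needs. It then constructs the DeviceBuffer and HostBuffer of a ManagedBuffer, which allocates device and host memory (used later to move input data from the CPU host to the GPU device), stores the device pointer in mDeviceBindings, and appends the ManagedBuffer to the mManagedBuffers vector. When the loop ends, BufferManager holds host and device memory for every input and output.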
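The size computation in the constructor can be followed with a sketch that uses plain integers. It mirrors the branch taken when no execution context is passed, so the batch size is the starting factor. sketchVolume, sketchDivUp, and sketchBindingElementCount are hypothetical names, and a std::vector<int> stands in for nvinfer1::Dims.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Product of all dimensions.
inline std::size_t sketchVolume(const std::vector<int>& dims)
{
    std::size_t v = 1;
    for (int d : dims)
    {
        assert(d >= 0);
        v *= static_cast<std::size_t>(d);
    }
    return v;
}

// Integer division rounded up.
inline int sketchDivUp(int a, int b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

// Number of elements to allocate for one binding.
// vecDim == -1 means the binding is not vectorized.
inline std::size_t sketchBindingElementCount(std::vector<int> dims, int batchSize, int vecDim, int scalarsPerVec)
{
    std::size_t vol = static_cast<std::size_t>(batchSize);
    if (vecDim != -1)
    {
        // A vectorized dimension is padded up to a whole number of vectors.
        dims[vecDim] = sketchDivUp(dims[vecDim], scalarsPerVec);
        vol *= static_cast<std::size_t>(scalarsPerVec);
    }
    return vol * sketchVolume(dims);
}
```

For the MNIST input of shape 1 x 28 x 28 with batch size 1 and no vectorization, this gives 784 elements, or 784 * sizeof(float) = 3136 bytes for a 4-byte float.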
    bool SampleMNIST::infer()
    {
        ......
        // Pick a random digit to try to infer
        srand(time(NULL));
        const int digit = rand() % 10;

        // Read the input data into the managed buffers
        // There should be just 1 input tensor
        assert(mParams.inputTensorNames.size() == 1);
        if (!processInput(buffers, mParams.inputTensorNames[0], digit))
        {
            return false;
        }
        ......
    bool SampleMNIST::processInput(
        const samplesCommon::BufferManager& buffers, const std::string& inputTensorName, int inputFileIdx) const
    {
        const int inputH = mInputDims.d[1];
        const int inputW = mInputDims.d[2];

        // Read a random digit file
        srand(unsigned(time(nullptr)));
        std::vector<uint8_t> fileData(inputH * inputW);
        readPGMFile(locateFile(std::to_string(inputFileIdx) + ".pgm", mParams.dataDirs), fileData.data(), inputH, inputW);

        // Print ASCII representation of digit
        gLogInfo << "Input:\n";
        for (int i = 0; i < inputH * inputW; i++)
        {
            gLogInfo << (" .:-=+*#%@"[fileData[i] / 26]) << (((i + 1) % inputW) ? "" : "\n");
        }
        gLogInfo << std::endl;

        float* hostInputBuffer = static_cast<float*>(buffers.getHostBuffer(inputTensorName));
        for (int i = 0; i < inputH * inputW; i++)
        {
            hostInputBuffer[i] = float(fileData[i]);
        }
        return true;
    }
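  • With the BufferManager in place, processInput loads the input data: the randomly chosen digit determines the file name ("<digit>.pgm"), and readPGMFile reads that file.
  • buffers.getHostBuffer(inputTensorName) looks up the binding index for the input tensor's name and returns the pointer to the matching host buffer (the getHostBuffer / getBuffer code is listed further below).
  • inputH * inputW gives the number of input elements; the loop converts each byte read from the file to float and stores it in host memory ( hostInputBuffer[i] = float(fileData[i]); ).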
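The two per-pixel operations in processInput, rendering the ASCII preview and converting bytes to float, are isolated in the sketch below. The function names are hypothetical; the character ramp and the division by 26 are taken from the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Maps a pixel value 0..255 to one of ten characters, from blank (dark) to '@' (bright).
inline char asciiForPixel(std::uint8_t pixel)
{
    static const char kRamp[] = " .:-=+*#%@";
    return kRamp[pixel / 26]; // 255 / 26 == 9, the last character of the ramp
}

// Renders a row-major image as text, one line per image row.
inline std::string renderAscii(const std::vector<std::uint8_t>& pixels, int width)
{
    assert(width > 0);
    std::string out;
    for (std::size_t i = 0; i < pixels.size(); ++i)
    {
        out.push_back(asciiForPixel(pixels[i]));
        if ((i + 1) % static_cast<std::size_t>(width) == 0)
        {
            out.push_back('\n');
        }
    }
    return out;
}

// Converts raw bytes to float values, as done when filling the host input buffer.
inline std::vector<float> toFloat(const std::vector<std::uint8_t>& pixels)
{
    return std::vector<float>(pixels.begin(), pixels.end());
}
```

Note that the values are not scaled to [0, 1]: the host buffer receives 0..255 as float, and the mean subtraction added in constructNetwork is applied inside the network.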
    void* getDeviceBuffer(const std::string& tensorName) const
    {
        return getBuffer(false, tensorName);
    }

    void* getHostBuffer(const std::string& tensorName) const
    {
        return getBuffer(true, tensorName);
    }
    ......
    void* getBuffer(const bool isHost, const std::string& tensorName) const
    {
        int index = mEngine->getBindingIndex(tensorName.c_str());
        if (index == -1)
            return nullptr;
        return (isHost ? mManagedBuffers[index]->hostBuffer.data() : mManagedBuffers[index]->deviceBuffer.data());
    }
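The name-to-index-to-buffer lookup can be modeled without an engine. In the sketch below, SketchBindings is a hypothetical class in which a tensor's binding index is simply its registration order; indexOf plays the role that getBindingIndex plays in the code above, including the -1 result for an unknown name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class SketchBindings
{
public:
    // Registers a tensor; its binding index is its position in registration order.
    int add(const std::string& name, std::size_t elementCount)
    {
        mNames.push_back(name);
        mHost.emplace_back(elementCount, 0.0f);
        return static_cast<int>(mNames.size()) - 1;
    }

    // Returns the binding index for `name`, or -1 if the name is unknown.
    int indexOf(const std::string& name) const
    {
        for (std::size_t i = 0; i < mNames.size(); ++i)
        {
            if (mNames[i] == name)
            {
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Returns the host buffer for `name`, or nullptr if the name is unknown.
    float* hostBuffer(const std::string& name)
    {
        const int index = indexOf(name);
        return index == -1 ? nullptr : mHost[static_cast<std::size_t>(index)].data();
    }

private:
    std::vector<std::string> mNames;
    std::vector<std::vector<float>> mHost;
};
```

Because a misspelled tensor name produces a null pointer rather than an error, callers that write through the returned pointer should check it first.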

 

    bool SampleMNIST::infer()
    {
        ......
        // Create CUDA stream for the execution of this inference.
        cudaStream_t stream;
        CHECK(cudaStreamCreate(&stream));

        // Asynchronously copy data from host input buffers to device input buffers
        buffers.copyInputToDeviceAsync(stream);
        ......
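  • cudaStreamCreate creates a CUDA stream, the queue on which the GPU work for this inference is issued.
  • buffers.copyInputToDeviceAsync asynchronously copies the input data that processInput read from the CPU host to the GPU device. As the following code shows, copyInputToDeviceAsync ends up calling cudaMemcpyAsync, with the copy direction (host -> device or device -> host) selected by the deviceToHost flag.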
    void copyInputToDeviceAsync(const cudaStream_t& stream = 0)
    {
        memcpyBuffers(true, false, true, stream);
    }
    ......
    void memcpyBuffers(const bool copyInput, const bool deviceToHost, const bool async, const cudaStream_t& stream = 0)
    {
        for (int i = 0; i < mEngine->getNbBindings(); i++)
        {
            void* dstPtr
                = deviceToHost ? mManagedBuffers[i]->hostBuffer.data() : mManagedBuffers[i]->deviceBuffer.data();
            const void* srcPtr
                = deviceToHost ? mManagedBuffers[i]->deviceBuffer.data() : mManagedBuffers[i]->hostBuffer.data();
            const size_t byteSize = mManagedBuffers[i]->hostBuffer.nbBytes();
            const cudaMemcpyKind memcpyType = deviceToHost ? cudaMemcpyDeviceToHost : cudaMemcpyHostToDevice;
            if ((copyInput && mEngine->bindingIsInput(i)) || (!copyInput && !mEngine->bindingIsInput(i)))
            {
                if (async)
                    CHECK(cudaMemcpyAsync(dstPtr, srcPtr, byteSize, memcpyType, stream));
                else
                    CHECK(cudaMemcpy(dstPtr, srcPtr, byteSize, memcpyType));
            }
        }
    }
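The selection logic of memcpyBuffers, which bindings are copied and in which direction, can be tried out on the CPU with the sketch below. Ordinary vectors stand in for GPU memory and std::memcpy stands in for cudaMemcpy / cudaMemcpyAsync, so nothing here is asynchronous. SketchBinding and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct SketchBinding
{
    bool isInput;
    std::vector<float> host;
    std::vector<float> device; // plain memory standing in for GPU memory
};

// Copies either all inputs or all outputs, in the requested direction.
inline void copyBindings(std::vector<SketchBinding>& bindings, bool copyInput, bool deviceToHost)
{
    for (SketchBinding& b : bindings)
    {
        // Equivalent to (copyInput && isInput) || (!copyInput && !isInput).
        if (b.isInput != copyInput)
        {
            continue;
        }
        assert(b.host.size() == b.device.size());
        if (b.host.empty())
        {
            continue;
        }
        float* dst = deviceToHost ? b.host.data() : b.device.data();
        const float* src = deviceToHost ? b.device.data() : b.host.data();
        std::memcpy(dst, src, b.host.size() * sizeof(float));
    }
}

inline void copyInputToDevice(std::vector<SketchBinding>& bindings)
{
    copyBindings(bindings, true, false);
}

inline void copyOutputToHost(std::vector<SketchBinding>& bindings)
{
    copyBindings(bindings, false, true);
}
```

One loop therefore serves both transfers of an inference: inputs go host -> device before the network runs, and outputs go device -> host afterwards.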

 

    bool SampleMNIST::infer()
    {
        ......
        // Asynchronously enqueue the inference work
        if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
        {
            return false;
        }

        // Asynchronously copy data from device output buffers to host output buffers
        buffers.copyOutputToHostAsync(stream);

        // Wait for the work in the stream to complete
        cudaStreamSynchronize(stream);

        // Release stream
        cudaStreamDestroy(stream);

        // Check and print the output of the inference
        // There should be just one output tensor
        assert(mParams.outputTensorNames.size() == 1);
        bool outputCorrect = verifyOutput(buffers, mParams.outputTensorNames[0], digit);
        return outputCorrect;
    }
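  • context->enqueue asks TensorRT to run inference asynchronously. Its arguments are the batch size, the device pointers of the input and output bindings (the input data was written to the host buffer by processInput and then copied to the device by copyInputToDeviceAsync), and the CUDA stream on which the work is queued; the last argument, an optional event, is passed as nullptr here.
  • buffers.copyOutputToHostAsync copies the results computed by TensorRT from the device output buffers to the host output buffers.
  • cudaStreamSynchronize blocks until all the work queued above has finished; only after it returns do the host output buffers in buffers hold the inference results.
  • cudaStreamDestroy(stream) releases the CUDA stream.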
    bool SampleMNIST::verifyOutput(
        const samplesCommon::BufferManager& buffers, const std::string& outputTensorName, int groundTruthDigit) const
    {
        const float* prob = static_cast<const float*>(buffers.getHostBuffer(outputTensorName));

        // Print histogram of the output distribution
        gLogInfo << "Output:\n";
        float val{0.0f};
        int idx{0};
        const int kDIGITS = 10;
        for (int i = 0; i < kDIGITS; i++)
        {
            if (val < prob[i])
            {
                val = prob[i];
                idx = i;
            }
            gLogInfo << i << ": " << std::string(int(std::floor(prob[i] * 10 + 0.5f)), '*') << "\n";
        }
        gLogInfo << std::endl;

        return (idx == groundTruthDigit && val > 0.9f);
    }
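  • verifyOutput checks whether the inference result is correct.
  • buffers.getHostBuffer(outputTensorName) looks up the binding index for the output tensor's name and returns the data pointer prob of the matching host buffer.
  • The loop walks over all entries of prob, prints a histogram of them, and records the class with the highest probability.
  • Finally, the output is judged correct only if the most probable class equals the ground-truth digit and its probability is greater than 0.9.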
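The decision rule in verifyOutput, argmax plus a confidence threshold, is sketched below as stand-alone functions, together with the bar length used for the histogram. The names are hypothetical and the 0.9 threshold is taken from the code above. Unlike the original loop, which starts from val = 0, this argmax starts from the first element; the two agree whenever at least one probability is positive.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Index of the largest probability (the first one wins on ties).
inline int argmaxIndex(const std::vector<float>& prob)
{
    assert(!prob.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < prob.size(); ++i)
    {
        if (prob[i] > prob[best])
        {
            best = i;
        }
    }
    return static_cast<int>(best);
}

// True only if the most probable class is the expected one and it is confident enough.
inline bool isConfidentMatch(const std::vector<float>& prob, int groundTruth, float threshold = 0.9f)
{
    const int idx = argmaxIndex(prob);
    return idx == groundTruth && prob[static_cast<std::size_t>(idx)] > threshold;
}

// Histogram bar: the probability rounded to the nearest tenth, one '*' per tenth.
inline std::string histogramBar(float p)
{
    assert(p >= 0.0f);
    return std::string(static_cast<std::size_t>(std::floor(p * 10 + 0.5f)), '*');
}
```

The threshold makes the check stricter than a plain argmax: a correct but hesitant prediction (for example 0.6 for the right digit) is still reported as a failure.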

Releasing resources

    bool SampleMNIST::teardown()
    {
        //! Clean up the libprotobuf files as the parsing is complete
        //! \note It is not safe to use any other part of the protocol buffers library after
        //! ShutdownProtobufLibrary() has been called.
        nvcaffeparser1::shutdownProtobufLibrary();
        return true;
    }
    ......
    int main(int argc, char** argv)
    {
        ......
        if (!sample.teardown())
        {
            return gLogger.reportFail(sampleTest);
        }
        return gLogger.reportPass(sampleTest);
    }
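  • Finally, teardown shuts down the protobuf library used by the Caffe parser, which completes the whole build-and-infer flow. The other resources (builder, network, engine, execution context, buffers) are released by the smart pointers and RAII objects that own them.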

 
