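Theoretical peak = number of GPU chips * GPU Boost clock * number of cores * floating-point operations each core can perform per clock cycle
On a GPU, however, single-precision and double-precision throughput must be computed separately, because the two are executed by different numbers of cores. Taking the Tesla P100, the newest card at the time of writing, as an example (a single chip, so the chip factor is 1; the trailing 2 is the FLOPs per core per cycle, because one fused multiply-add (FMA) counts as two floating-point operations):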
Double-precision theoretical peak = FP64 cores * GPU Boost clock * 2 = 1792 * 1.48 GHz * 2 ≈ 5.3 TFLOPS
Single-precision theoretical peak = FP32 cores * GPU Boost clock * 2 = 3584 * 1.48 GHz * 2 ≈ 10.6 TFLOPS
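A minimal Python sketch of the formula above, assuming the P100 figures quoted in the text (1.48 GHz boost clock, 1792 FP64 cores, 3584 FP32 cores) and 2 FLOPs per core per cycle. The function name is illustrative and not taken from any library.

```python
def theoretical_peak_gflops(num_chips: int, boost_clock_ghz: float,
                            num_cores: int, flops_per_core_per_cycle: int) -> float:
    """Peak = chips * boost clock * cores * FLOPs per core per cycle.

    With the clock given in GHz, the result is in GFLOPS.
    """
    return num_chips * boost_clock_ghz * num_cores * flops_per_core_per_cycle


# Tesla P100 figures from the text; 2 FLOPs/cycle = one FMA per core per cycle.
P100_BOOST_GHZ = 1.48
fp64 = theoretical_peak_gflops(1, P100_BOOST_GHZ, 1792, 2)
fp32 = theoretical_peak_gflops(1, P100_BOOST_GHZ, 3584, 2)

print(f"P100 FP64 peak: {fp64 / 1000:.1f} TFLOPS")  # 5.3
print(f"P100 FP32 peak: {fp32 / 1000:.1f} TFLOPS")  # 10.6
```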
```
# 1080TI
Total amount of global memory: 11172 MBytes (11715084288 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
# 1080
Total amount of global memory: 8111 MBytes (8504868864 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1734 MHz (1.73 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
```
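The same formula can be applied to the two consumer cards listed above. This sketch covers FP32 only and assumes one FMA (2 FLOPs) per CUDA core per cycle at the reported "GPU Max Clock rate"; in practice GPU Boost may run the card above or below that clock, so the result is an estimate. The names are hypothetical.

```python
# Values copied from the 1080 Ti / 1080 device listings above.
# Assumption: each CUDA core retires one FMA (2 FLOPs) per cycle at the
# reported "GPU Max Clock rate"; actual boost clocks can differ.
cards = {
    "GTX 1080 Ti": {"sms": 28, "cores_per_sm": 128, "max_clock_mhz": 1582},
    "GTX 1080": {"sms": 20, "cores_per_sm": 128, "max_clock_mhz": 1734},
}


def fp32_peak_gflops(sms: int, cores_per_sm: int, max_clock_mhz: int,
                     flops_per_cycle: int = 2) -> float:
    """Estimated FP32 peak in GFLOPS: cores * clock (GHz) * FLOPs per cycle."""
    cores = sms * cores_per_sm
    return cores * (max_clock_mhz / 1000.0) * flops_per_cycle


peaks = {name: fp32_peak_gflops(**spec) for name, spec in cards.items()}
for name, gflops in peaks.items():
    print(f"{name}: {gflops:.1f} GFLOPS FP32 peak")
# GTX 1080 Ti: 11339.8 GFLOPS FP32 peak
# GTX 1080: 8878.1 GFLOPS FP32 peak
```

Under these assumptions the best batched sgemm results in the batchCUBLAS logs (10879.6 GFLOPS on the 1080 Ti, 7108.51 GFLOPS on the 1080) correspond to roughly 96% and 80% of the estimated peak.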
```
~/NVIDIA_CUDA-8.0_Samples/7_CUDALibraries/batchCUBLAS$ export CUDA_VISIBLE_DEVICES=0
~/NVIDIA_CUDA-8.0_Samples/7_CUDALibraries/batchCUBLAS$ ./batchCUBLAS -m1024 -n1024 -k1024
batchCUBLAS Starting...
GPU Device 0: "GeForce GTX 1080 Ti" with compute capability 6.1

==== Running single kernels ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbf800000, -1) beta= (0x40000000, 2)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00037980 sec GFLOPS=5654.24
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00894690 sec GFLOPS=240.026
@@@@ dgemm test OK

==== Running N=10 without streams ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbf800000, -1) beta= (0x00000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00294209 sec GFLOPS=7299.19
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.07993412 sec GFLOPS=268.657
@@@@ dgemm test OK

==== Running N=10 with streams ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0x40000000, 2) beta= (0x40000000, 2)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00224590 sec GFLOPS=9561.78
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.05540895 sec GFLOPS=387.57
@@@@ dgemm test OK

==== Running N=10 batched ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0x3f800000, 1) beta= (0xbf800000, -1)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00197387 sec GFLOPS=10879.6
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbff0000000000000, -1) beta= (0x4000000000000000, 2)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.05372214 sec GFLOPS=399.739
@@@@ dgemm test OK

Test Summary
0 error(s)
```
```
liu@iridescent:~/NVIDIA_CUDA-8.0_Samples/7_CUDALibraries/batchCUBLAS$ export CUDA_VISIBLE_DEVICES=1
liu@iridescent:~/NVIDIA_CUDA-8.0_Samples/7_CUDALibraries/batchCUBLAS$ ./batchCUBLAS -m1024 -n1024 -k1024
batchCUBLAS Starting...
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

==== Running single kernels ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbf800000, -1) beta= (0x40000000, 2)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00060892 sec GFLOPS=3526.7
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00993085 sec GFLOPS=216.244
@@@@ dgemm test OK

==== Running N=10 without streams ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbf800000, -1) beta= (0x00000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00369406 sec GFLOPS=5813.35
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.09741306 sec GFLOPS=220.451
@@@@ dgemm test OK

==== Running N=10 with streams ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0x40000000, 2) beta= (0x40000000, 2)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00317717 sec GFLOPS=6759.12
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.07991505 sec GFLOPS=268.721
@@@@ dgemm test OK

==== Running N=10 batched ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0x3f800000, 1) beta= (0xbf800000, -1)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00302100 sec GFLOPS=7108.51
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbff0000000000000, -1) beta= (0x4000000000000000, 2)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.07566714 sec GFLOPS=283.807
@@@@ dgemm test OK

Test Summary
0 error(s)
```
```
$ ./batchCUBLAS -m1024 -n1024 -k1024
batchCUBLAS Starting...
GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2

==== Running single kernels ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbf800000, -1) beta= (0x40000000, 2)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.00372291 sec GFLOPS=576.83
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.10940003 sec GFLOPS=19.6296
@@@@ dgemm test OK

==== Running N=10 without streams ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbf800000, -1) beta= (0x00000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.03462315 sec GFLOPS=620.245
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 1.09212208 sec GFLOPS=19.6634
@@@@ dgemm test OK

==== Running N=10 with streams ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0x40000000, 2) beta= (0x40000000, 2)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.03504515 sec GFLOPS=612.776
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 1.09177494 sec GFLOPS=19.6697
@@@@ dgemm test OK

==== Running N=10 batched ====
Testing sgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0x3f800000, 1) beta= (0xbf800000, -1)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 0.03766394 sec GFLOPS=570.17
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=1024 n=1024 k=1024 alpha = (0xbff0000000000000, -1) beta= (0x4000000000000000, 2)
#### args: lda=1024 ldb=1024 ldc=1024
^^^^ elapsed = 1.09389901 sec GFLOPS=19.6315
@@@@ dgemm test OK

Test Summary
0 error(s)
```
| sgemm GFLOPS (m=n=k=1024) | 1080 Ti | 1080 | Jetson TX2 |
|---|---|---|---|
| Single kernel | 5654.24 | 3526.7 | 576.83 |
| N=10 without streams | 7299.19 | 5813.35 | 620.245 |
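The table above collects figures from the batchCUBLAS logs. As a hypothetical convenience (not part of the CUDA samples), a small regex-based parser could pull the same numbers out of a saved log, assuming the line-oriented format shown in the logs, with one `==== Running ... ====`, `Testing ...`, or `GFLOPS=` item per line.

```python
import re

# Hypothetical helper: maps (section, routine) -> GFLOPS for a batchCUBLAS log.
SECTION_RE = re.compile(r"^==== Running (.+?) ====")
TEST_RE = re.compile(r"^Testing (\w+)")
GFLOPS_RE = re.compile(r"GFLOPS=([0-9.]+)")


def parse_batchcublas_log(text: str) -> dict:
    results = {}
    section = routine = None
    for line in text.splitlines():
        line = line.strip()
        if m := SECTION_RE.match(line):
            section = m.group(1)      # e.g. "single kernels"
        elif m := TEST_RE.match(line):
            routine = m.group(1)      # e.g. "sgemm" or "dgemm"
        elif m := GFLOPS_RE.search(line):
            results[(section, routine)] = float(m.group(1))
    return results


sample = """\
==== Running single kernels ====
Testing sgemm
^^^^ elapsed = 0.00037980 sec GFLOPS=5654.24
Testing dgemm
^^^^ elapsed = 0.00894690 sec GFLOPS=240.026
==== Running N=10 without streams ====
Testing sgemm
^^^^ elapsed = 0.00294209 sec GFLOPS=7299.19
"""
gflops = parse_batchcublas_log(sample)
print(gflops[("single kernels", "sgemm")])  # 5654.24
```

Running it once per card and reading the `("single kernels", "sgemm")` and `("N=10 without streams", "sgemm")` entries gives the two rows of the table. The walrus operator requires Python 3.8 or later.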