The Floyd-Warshall algorithm (also rendered in Chinese as 弗洛伊德算法 or 佛洛依德算法) solves the shortest-path problem between every pair of vertices. It correctly handles directed graphs and negative edge weights (provided there are no negative-weight cycles), and it is also used to compute the transitive closure of a directed graph.
At its core it is dynamic programming. Given a directed graph $G = (V, E)$, where $V$ (vertices) is the vertex set and $E$ (edges) is the edge set, an initial weight matrix $w[i][j]$ is provided, giving the weight of the edge $i \rightarrow j$:
$$
w_{i,j}=\begin{cases}
\text{weight of edge }(i,j) & \text{if }(i,j)\in E;\\
\infty & \text{if }(i,j)\notin E.
\end{cases}
$$
That is, for a pair $i \rightarrow j$ with no connecting edge, the weight is usually set to a very large number $INF$. The dynamic programming formulation defines the state $D_{i,j,k}$: the length of the shortest path from $i$ to $j$ that uses only vertices in the set $\{1..k\}$ as intermediate vertices. Two cases then arise:
If the shortest path passes through vertex $k$: $D_{i,j,k}=D_{i,k,k-1}+D_{k,j,k-1}$.
If the shortest path does not pass through vertex $k$: $D_{i,j,k}=D_{i,j,k-1}$.
If the role of $k-1$ is hard to grasp, think of it this way: the states of layer $k$ must be derived from the states of the previous layer $k-1$, because the intermediate vertices are enumerated one at a time. For example, for $D_{1,3} = D_{1,2} + D_{2,3}$ to make $D_{1,3}$ the shortest distance from vertex 1 to vertex 3, both $D_{1,2}$ and $D_{2,3}$ must already hold the corresponding shortest distances. Since layer $k$ depends on layer $k-1$, the loop over $k$ cannot be parallelized. Combining the two cases gives the state transition equation:
$$D_{i,j,k}=\min\left(D_{i,j,k-1},\ D_{i,k,k-1}+D_{k,j,k-1}\right)$$
In practice, to save space, the iteration can be performed in place on the original array, which reduces the space requirement to two dimensions.
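As a concrete reference, here is a minimal sketch of that in-place two-dimensional iteration in C (the graph size `N`, the array `d`, and the `INF` value are illustrative choices, not part of the original text):

```c
#include <stdio.h>

#define N 4            /* number of vertices (illustrative) */
#define INF (1 << 29)  /* "infinity" chosen so INF + INF does not overflow int */

/* In-place Floyd-Warshall: after round k, d[i][j] is the shortest i -> j
 * distance using only intermediate vertices 0..k. Folding the k dimension
 * of D_{i,j,k} into one 2-D array is safe because round k never changes
 * d[i][k] or d[k][j] (d[k][k] = 0 when there are no negative cycles). */
void floyd_warshall(int d[N][N]) {
    for (int k = 0; k < N; ++k)          /* this loop must stay sequential */
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];
}

int main(void) {
    int d[N][N] = {
        {  0,   3, INF,   7},
        {  8,   0,   2, INF},
        {  5, INF,   0,   1},
        {  2, INF, INF,   0},
    };
    floyd_warshall(d);
    printf("d[0][2] = %d\n", d[0][2]);   /* 0 -> 1 -> 2, length 5 */
    return 0;
}
```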
To recover all the shortest paths themselves, we must compute not only the shortest-distance matrix $D$ but also a path-construction matrix $C$. The matrix $C$ is defined as follows: if at least one shortest path exists between vertices $i$ and $j$, then $C_{i,j}$ is the highest-numbered intermediate vertex on that path; otherwise it is undefined (NULL). Every entry of the construction matrix starts out undefined; in mathematical form:
$$
c_{i,j}^{(k)}=\begin{cases}
\text{NULL} & \text{if }k=0;\\
k & \text{if }k\geq 1\text{ and }d_{i,j}^{(k-1)}>d_{i,k}^{(k-1)}+d_{k,j}^{(k-1)};\\
c_{i,j}^{(k-1)} & \text{otherwise.}
\end{cases}
$$
Here $c_{i,j}^{(k-1)}$ has the same meaning as above; since each layer is constrained by the one before it, this is again a recurrence relation.
To output a path, picture a binary tree: recurse into the left half of the path on one side and the right half on the other, i.e., a left-root-right (in-order) traversal.
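The sketch below, assuming the convention that `c[i][j]` stores the highest-numbered intermediate vertex and `-1` stands for the NULL of the definition above, shows how that in-order traversal prints the intermediate vertices of one shortest path (`print_path` is an illustrative name):

```c
#include <stdio.h>

/* Print the intermediate vertices of a shortest i -> j path.
 * c[i][j] = highest-numbered intermediate vertex, or -1 for NULL.
 * The recursion is exactly the left-root-right (in-order) traversal
 * described above: left subtree = path i -> k, root = k,
 * right subtree = path k -> j. */
void print_path(int n, int c[n][n], int i, int j) {
    int k = c[i][j];
    if (k == -1) return;     /* direct edge: no intermediate vertex */
    print_path(n, c, i, k);  /* left: the i -> k half */
    printf("%d ", k);        /* root: the splitting vertex k */
    print_path(n, c, k, j);  /* right: the k -> j half */
}
```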
In the blocked (tiled) algorithm, the $n \times n$ distance matrix $D_{i,j}$ and construction matrix $C_{i,j}$ are partitioned into $b \times b$ sub-matrices, where $b$ is the blocking factor. To simplify the discussion, assume $n \% b == 0$, i.e., $b$ divides $n$ evenly. Within the partition, $A_{I,J} = a[i,j]$ denotes entry $(i, j)$ of the sub-matrix with block index $(I, J)$; in symbols:

$$1 \leq I, J \leq \frac{n}{b}, \qquad 1 \leq i, j \leq b$$

For example, an $n = 12$ matrix with $b = 3$ is partitioned into a $4 \times 4$ grid of $3 \times 3$ blocks.
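The index arithmetic behind $A_{I,J} = a[i,j]$ is just integer division and remainder; here is a small sketch (0-based indices, whereas the formulas above are 1-based):

```c
/* Map a global index (gi, gj) of the n x n matrix to block coordinates
 * (I, J) and the position (i, j) inside that b x b block. Assumes the
 * n % b == 0 condition above, so no partial blocks exist. */
void to_block(int b, int gi, int gj, int *I, int *J, int *i, int *j) {
    *I = gi / b;  *J = gj / b;   /* which sub-matrix */
    *i = gi % b;  *j = gj % b;   /* offset inside it */
}
/* e.g. n = 12, b = 3: global (7, 5) is entry (1, 2) of block (2, 1). */
```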
Each round of the algorithm is divided into four phases: (1) update the pivot block itself; (2) update the blocks in the pivot block's row; (3) update the blocks in the pivot block's column; (4) update all remaining blocks.
The matrix-matrix "multiply-add" algorithm for sub-blocked APSP
The MMA (multiply-add) algorithm from matrix algebra can be extended to compute the path matrix $D_{i,j}$ and the construction matrix $C_{i,j}$ simultaneously:

$$z_{i,j} \leftarrow \min\Big(z_{i,j},\ \min_{k=1}^{b}\big(x_{i,k}+y_{k,j}\big)\Big)$$

where $z$, $x$, and $y$ are $b \times b$ sub-matrices.
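Spelled out for a single block, this update is a triple loop; a minimal scalar sketch follows (the fixed block factor `B` is an illustrative simplification):

```c
#define B 3  /* block factor b (illustrative) */

/* Min-plus multiply-add: z[i][j] <- min(z[i][j], min_k (x[i][k] + y[k][j])).
 * This is the per-block update applied in phases 2-4 of the blocked
 * algorithm; the "sum" of the min-plus algebra is a minimum over k. */
void minplus_mma(int z[B][B], const int x[B][B], const int y[B][B]) {
    for (int i = 0; i < B; ++i)
        for (int j = 0; j < B; ++j)
            for (int k = 0; k < B; ++k)
                if (x[i][k] + y[k][j] < z[i][j])
                    z[i][j] = x[i][k] + y[k][j];
}
```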
Phases 2 through 4 can be carried out as matrix-multiplication updates, which in this problem means multiplication over the min-plus algebra. The min-plus "multiplication" and the "addition" are executed separately: the min-plus product (MINPLUS) runs on the GPU, while the matrix add (MMA) runs on the CPU. This arrangement reduces the transfer of $Z$ and $C$ from the CPU to the GPU, allowing fast communication between the CPU and the GPU.
A sequential CPU implementation of the blocked Floyd-Warshall algorithm:

```c
#include <stdio.h>
#include <stdlib.h>

const int INF = ((1 << 30) - 1);
const int V = 50010;  /* maximum vertex count; Dist is static, so this needs huge memory */

void input(char *inFileName);
void output(char *outFileName);
void block_FW(int B);
int ceil(int a, int b);
void cal(int B, int Round, int block_start_x, int block_start_y,
         int block_width, int block_height);

int n, m;
static int Dist[V][V];

int main(int argc, char *argv[]) {
    input(argv[1]);
    int B = 512;
    block_FW(B);
    output(argv[2]);
    return 0;
}

void input(char *infile) {
    FILE *file = fopen(infile, "rb");
    fread(&n, sizeof(int), 1, file);
    fread(&m, sizeof(int), 1, file);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) {
                Dist[i][j] = 0;
            } else {
                Dist[i][j] = INF;
            }
        }
    }
    int pair[3];
    for (int i = 0; i < m; ++i) {
        fread(pair, sizeof(int), 3, file);
        Dist[pair[0]][pair[1]] = pair[2];
    }
    fclose(file);
}

void output(char *outFileName) {
    FILE *outfile = fopen(outFileName, "w");
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (Dist[i][j] >= INF) Dist[i][j] = INF;
        }
        fwrite(Dist[i], sizeof(int), n, outfile);
    }
    fclose(outfile);
}

int ceil(int a, int b) { return (a + b - 1) / b; }

void block_FW(int B) {
    int round = ceil(n, B);
    for (int r = 0; r < round; ++r) {
        printf("%d %d\n", r, round);
        fflush(stdout);
        /* Phase 1: the pivot block */
        cal(B, r, r, r, 1, 1);

        /* Phase 2: the pivot row and pivot column */
        cal(B, r, r, 0, r, 1);
        cal(B, r, r, r + 1, round - r - 1, 1);
        cal(B, r, 0, r, 1, r);
        cal(B, r, r + 1, r, 1, round - r - 1);

        /* Phase 3: the remaining blocks */
        cal(B, r, 0, 0, r, r);
        cal(B, r, 0, r + 1, round - r - 1, r);
        cal(B, r, r + 1, 0, r, round - r - 1);
        cal(B, r, r + 1, r + 1, round - r - 1, round - r - 1);
    }
}

void cal(int B, int Round, int block_start_x, int block_start_y,
         int block_width, int block_height) {
    int block_end_x = block_start_x + block_height;
    int block_end_y = block_start_y + block_width;

    for (int b_i = block_start_x; b_i < block_end_x; ++b_i) {
        for (int b_j = block_start_y; b_j < block_end_y; ++b_j) {
            // To calculate B*B elements in the block (b_i, b_j)
            // For each block, it needs to compute B times
            for (int k = Round * B; k < (Round + 1) * B && k < n; ++k) {
                // To calculate the original index of elements in block (b_i, b_j)
                // For instance, the original index of (0,0) in block (1,2) is (2,5) for V=6,B=2
                int block_internal_start_x = b_i * B;
                int block_internal_end_x = (b_i + 1) * B;
                int block_internal_start_y = b_j * B;
                int block_internal_end_y = (b_j + 1) * B;

                if (block_internal_end_x > n) block_internal_end_x = n;
                if (block_internal_end_y > n) block_internal_end_y = n;

                for (int i = block_internal_start_x; i < block_internal_end_x; ++i) {
                    for (int j = block_internal_start_y; j < block_internal_end_y; ++j) {
                        if (Dist[i][k] + Dist[k][j] < Dist[i][j]) {
                            Dist[i][j] = Dist[i][k] + Dist[k][j];
                        }
                    }
                }
            }
        }
    }
}
```
A single-GPU CUDA implementation, with one kernel per phase (the pivot block, the pivot row/column, and the remaining blocks):

```cpp
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

const int INF = (1 << 30) - 1;
int vertex_num, edge_num, matrix_size;
int *dist;

double cal_time(struct timespec start, struct timespec end) {
    struct timespec temp;
    if ((end.tv_nsec - start.tv_nsec) < 0) {
        temp.tv_sec = end.tv_sec - start.tv_sec - 1;
        temp.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec - start.tv_sec;
        temp.tv_nsec = end.tv_nsec - start.tv_nsec;
    }
    return temp.tv_sec + (double)temp.tv_nsec / 1000000000.0;
}

__device__ __host__ size_t index_convert(int i, int j, int row_size) {
    return i * row_size + j;
}

void input(char *input_file_path, int &block_factor) {
    FILE *input_file = fopen(input_file_path, "rb");
    fread(&vertex_num, sizeof(int), 1, input_file);
    fread(&edge_num, sizeof(int), 1, input_file);
    // pad the matrix so its size is a multiple of the block factor
    matrix_size = ceil((double)vertex_num / (double)block_factor) * block_factor;
    cudaMallocHost((void **)&dist, matrix_size * matrix_size * sizeof(int));
    for (int i = 0; i < matrix_size; ++i) {
        for (int j = 0; j < matrix_size; ++j) {
            if (i != j)
                dist[index_convert(i, j, matrix_size)] = INF;
            else if (i < vertex_num)
                dist[index_convert(i, j, matrix_size)] = 0;
            else
                dist[index_convert(i, j, matrix_size)] = INF;
        }
    }
    int data[3];
    for (int i = 0; i < edge_num; ++i) {
        fread(data, sizeof(int), 3, input_file);
        dist[index_convert(data[0], data[1], matrix_size)] = data[2];
    }
    fclose(input_file);
}

void output(char *output_file_path) {
    FILE *output_file = fopen(output_file_path, "w");
    for (int i = 0; i < vertex_num; ++i) {
        fwrite(&dist[index_convert(i, 0, matrix_size)], sizeof(int), vertex_num, output_file);
    }
    fclose(output_file);
}

__constant__ int size[3];  // matrix size, block_factor, grid_size

__global__ void phase1(int *d_dist, int round) {
    __shared__ int share[4 * 1024];
    int i = threadIdx.y;
    int j = threadIdx.x;
    int i_offset = size[1] * round;
    int j_offset = size[1] * round;
    share[index_convert(j, i, size[1])] = d_dist[index_convert(i_offset + i, j_offset + j, size[0])];
#pragma unroll 32
    for (int k = 0; k < size[1]; ++k) {
        __syncthreads();
        if (share[index_convert(j, i, size[1])] > share[index_convert(j, k, size[1])] + share[index_convert(k, i, size[1])])
            share[index_convert(j, i, size[1])] = share[index_convert(j, k, size[1])] + share[index_convert(k, i, size[1])];
    }
    d_dist[index_convert(i_offset + i, j_offset + j, size[0])] = share[index_convert(j, i, size[1])];
}

__global__ void phase2(int *d_dist, int round) {
    __shared__ int share[3 * 4 * 1024];
    int i = threadIdx.y;
    int j = threadIdx.x;
    int i_offset, j_offset;
    if (blockIdx.x == 0) {
        // blocks in the pivot column
        i_offset = size[1] * ((round + blockIdx.y + 1) % size[2]);
        j_offset = size[1] * round;
        share[index_convert(i, j, size[1])] = d_dist[index_convert(i_offset + i, j_offset + j, size[0])];
        share[index_convert(i + size[1], j, size[1])] = share[index_convert(i, j, size[1])];
        share[index_convert(i + 2 * size[1], j, size[1])] = d_dist[index_convert(j_offset + i, j_offset + j, size[0])];
    } else {
        // blocks in the pivot row
        i_offset = size[1] * round;
        j_offset = size[1] * ((round + blockIdx.y + 1) % size[2]);
        share[index_convert(i, j, size[1])] = d_dist[index_convert(i_offset + i, j_offset + j, size[0])];
        share[index_convert(i + size[1], j, size[1])] = d_dist[index_convert(i_offset + i, i_offset + j, size[0])];
        share[index_convert(i + 2 * size[1], j, size[1])] = share[index_convert(i, j, size[1])];
    }
#pragma unroll 32
    for (int k = 0; k < size[1]; ++k) {
        __syncthreads();
        if (share[index_convert(i, j, size[1])] > share[index_convert(i + size[1], k, size[1])] + share[index_convert(k + 2 * size[1], j, size[1])])
            share[index_convert(i, j, size[1])] = share[index_convert(i + size[1], k, size[1])] + share[index_convert(k + 2 * size[1], j, size[1])];
    }
    d_dist[index_convert(i_offset + i, j_offset + j, size[0])] = share[index_convert(i, j, size[1])];
}

__global__ void phase3(int *d_dist, int round) {
    __shared__ int share[3 * 4 * 1024];
    int i = threadIdx.y;
    int j = threadIdx.x;
    int i_offset = size[1] * ((round + blockIdx.y + 1) % size[2]);
    int j_offset = size[1] * ((round + blockIdx.x + 1) % size[2]);
    int r_offset = size[1] * round;
    share[index_convert(i, j, size[1])] = d_dist[index_convert(i_offset + i, j_offset + j, size[0])];
    share[index_convert(i + size[1], j, size[1])] = d_dist[index_convert(i_offset + i, r_offset + j, size[0])];
    share[index_convert(i + 2 * size[1], j, size[1])] = d_dist[index_convert(r_offset + i, j_offset + j, size[0])];
#pragma unroll 32
    for (int k = 0; k < size[1]; ++k) {
        __syncthreads();
        if (share[index_convert(i, j, size[1])] > share[index_convert(i + size[1], k, size[1])] + share[index_convert(k + 2 * size[1], j, size[1])])
            share[index_convert(i, j, size[1])] = share[index_convert(i + size[1], k, size[1])] + share[index_convert(k + 2 * size[1], j, size[1])];
    }
    d_dist[index_convert(i_offset + i, j_offset + j, size[0])] = share[index_convert(i, j, size[1])];
}

int main(int argc, char **argv) {
    double total_time, bfd_time;
    timespec total_time1, total_time2, bfd_time1, bfd_time2;
    clock_gettime(CLOCK_MONOTONIC, &total_time1);

    cudaSetDevice(0);  // run on GPU 0
    int block_factor = 32;
    if (argc == 4) block_factor = atoi(argv[3]);
    input(argv[1], block_factor);  // read the input and initialize dist

    int grid_size = matrix_size / block_factor;  // grid size after partitioning: N = n / b
    int size_info[3] = {matrix_size, block_factor, grid_size};  // n, b, N = n / b
    cudaMemcpyToSymbol(size, size_info, 3 * sizeof(int));  // pass matrix, block and grid sizes to the device

    int *d_dist;
    clock_gettime(CLOCK_MONOTONIC, &bfd_time1);
    cudaMalloc(&d_dist, (size_t)sizeof(int) * matrix_size * matrix_size);  // allocate GPU memory
    // copy the distance matrix dist from host (CPU) memory to device (GPU) memory
    cudaMemcpy(d_dist, dist, (size_t)sizeof(int) * matrix_size * matrix_size, cudaMemcpyHostToDevice);

    // CUDA thread-block and grid dimensions
    dim3 block(block_factor, block_factor);    // (b, b)
    dim3 grid2(2, grid_size - 1);              // (2, N - 1)
    dim3 grid3(grid_size - 1, grid_size - 1);  // (N - 1, N - 1)

    for (int r = 0; r < grid_size; ++r) {
        phase1<<<1, block>>>(d_dist, r);
        phase2<<<grid2, block>>>(d_dist, r);
        phase3<<<grid3, block>>>(d_dist, r);
    }
    cudaMemcpy(dist, d_dist, (size_t)sizeof(int) * matrix_size * matrix_size, cudaMemcpyDeviceToHost);
    clock_gettime(CLOCK_MONOTONIC, &bfd_time2);

    output(argv[2]);
    cudaFree(d_dist);
    cudaFreeHost(dist);  // dist was allocated with cudaMallocHost, so free it with cudaFreeHost

    clock_gettime(CLOCK_MONOTONIC, &total_time2);
    bfd_time = cal_time(bfd_time1, bfd_time2);
    total_time = cal_time(total_time1, total_time2);
    printf(" vertex: %d\n", vertex_num);
    printf(" I/O time: %.5f\n", total_time - bfd_time);
    printf(" cal time: %.5f\n", bfd_time);
    printf(" runtime: %.5f\n", total_time);
    return 0;
}
```
A dual-GPU version that uses OpenMP to drive one GPU per host thread, exchanging the pivot row between devices each round:

```cpp
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

const int INF = (1 << 30) - 1;
int vertex_num, edge_num, matrix_size;
int *dist;

double cal_time(struct timespec start, struct timespec end) {
    struct timespec temp;
    if ((end.tv_nsec - start.tv_nsec) < 0) {
        temp.tv_sec = end.tv_sec - start.tv_sec - 1;
        temp.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec - start.tv_sec;
        temp.tv_nsec = end.tv_nsec - start.tv_nsec;
    }
    return temp.tv_sec + (double)temp.tv_nsec / 1000000000.0;
}

__device__ __host__ size_t index_convert(int i, int j, int row_size) {
    return i * row_size + j;
}

void input(char *input_file_path, int block_factor) {
    FILE *input_file = fopen(input_file_path, "rb");
    fread(&vertex_num, sizeof(int), 1, input_file);
    fread(&edge_num, sizeof(int), 1, input_file);
    matrix_size = ceil((double)vertex_num / (double)block_factor) * block_factor;
    cudaMallocHost((void **)&dist, matrix_size * matrix_size * sizeof(int));
    for (int i = 0; i < matrix_size; ++i) {
        for (int j = 0; j < matrix_size; ++j) {
            if (i != j)
                dist[index_convert(i, j, matrix_size)] = INF;
            else if (i < vertex_num)
                dist[index_convert(i, j, matrix_size)] = 0;
            else
                dist[index_convert(i, j, matrix_size)] = INF;
        }
    }
    int data[3];
    for (int i = 0; i < edge_num; ++i) {
        fread(data, sizeof(int), 3, input_file);
        dist[index_convert(data[0], data[1], matrix_size)] = data[2];
    }
    fclose(input_file);
}

void output(char *output_file_path) {
    FILE *output_file = fopen(output_file_path, "w");
    for (int i = 0; i < vertex_num; ++i) {
        fwrite(&dist[index_convert(i, 0, matrix_size)], sizeof(int), vertex_num, output_file);
    }
    fclose(output_file);
}

__constant__ int size[3];  // matrix size, block_factor, grid_size

__global__ void phase1(int *d_dist, int round) {
    __shared__ int pivot[1024];  // one 32 x 32 pivot block
    int i = threadIdx.y;
    int j = threadIdx.x;
    int i_offset = 32 * round;
    int j_offset = 32 * round;
    pivot[index_convert(i, j, 32)] = d_dist[index_convert(i_offset + i, j_offset + j, size[0])];
#pragma unroll 32
    for (int k = 0; k < 32; ++k) {
        __syncthreads();
        if (pivot[index_convert(i, j, 32)] > pivot[index_convert(i, k, 32)] + pivot[index_convert(k, j, 32)])
            pivot[index_convert(i, j, 32)] = pivot[index_convert(i, k, 32)] + pivot[index_convert(k, j, 32)];
    }
    d_dist[index_convert(i_offset + i, j_offset + j, size[0])] = pivot[index_convert(i, j, 32)];
}

__global__ void phase2(int *d_dist, int round) {
    __shared__ int self[1024], pivot[1024];
    int i = threadIdx.y;
    int j = threadIdx.x;
    int i_offset, j_offset;
    if (blockIdx.x == 0 && blockIdx.y != round) {
        // blocks in the pivot column
        i_offset = 32 * blockIdx.y;
        j_offset = 32 * round;
        self[index_convert(i, j, 32)] = d_dist[index_convert(i_offset + i, j_offset + j, size[0])];
        pivot[index_convert(i, j, 32)] = d_dist[index_convert(j_offset + i, j_offset + j, size[0])];
#pragma unroll 32
        for (int k = 0; k < 32; ++k) {
            __syncthreads();
            if (self[index_convert(i, j, 32)] > self[index_convert(i, k, 32)] + pivot[index_convert(k, j, 32)])
                self[index_convert(i, j, 32)] = self[index_convert(i, k, 32)] + pivot[index_convert(k, j, 32)];
        }
        d_dist[index_convert(i_offset + i, j_offset + j, size[0])] = self[index_convert(i, j, 32)];
    } else if (blockIdx.y != round) {
        // blocks in the pivot row
        i_offset = 32 * round;
        j_offset = 32 * blockIdx.y;
        self[index_convert(i, j, 32)] = d_dist[index_convert(i_offset + i, j_offset + j, size[0])];
        pivot[index_convert(i, j, 32)] = d_dist[index_convert(i_offset + i, i_offset + j, size[0])];
#pragma unroll 32
        for (int k = 0; k < 32; ++k) {
            __syncthreads();
            if (self[index_convert(i, j, 32)] > pivot[index_convert(i, k, 32)] + self[index_convert(k, j, 32)])
                self[index_convert(i, j, 32)] = pivot[index_convert(i, k, 32)] + self[index_convert(k, j, 32)];
        }
        d_dist[index_convert(i_offset + i, j_offset + j, size[0])] = self[index_convert(i, j, 32)];
    }
}

__global__ void phase3(int *d_dist, int round, int grid_offset) {
    __shared__ int col[1024], row[1024];
    int self;
    int block_i = grid_offset + blockIdx.y;  // this device only owns a slice of block rows
    int block_j = blockIdx.x;
    if (block_i == round || block_j == round) return;
    int i = threadIdx.y;
    int j = threadIdx.x;
    int i_offset = 32 * block_i;
    int j_offset = 32 * block_j;
    int r_offset = 32 * round;
    self = d_dist[index_convert(i_offset + i, j_offset + j, size[0])];
    col[index_convert(i, j, 32)] = d_dist[index_convert(i_offset + i, r_offset + j, size[0])];
    row[index_convert(i, j, 32)] = d_dist[index_convert(r_offset + i, j_offset + j, size[0])];
#pragma unroll 32
    for (int k = 0; k < 32; ++k) {
        __syncthreads();
        if (self > col[index_convert(i, k, 32)] + row[index_convert(k, j, 32)])
            self = col[index_convert(i, k, 32)] + row[index_convert(k, j, 32)];
    }
    d_dist[index_convert(i_offset + i, j_offset + j, size[0])] = self;
}

int main(int argc, char **argv) {
    const int block_factor = 32, device_num = 2;
    input(argv[1], block_factor);
    int grid_size = matrix_size / block_factor;
    int *d_dist[2];

#pragma omp parallel num_threads(device_num)  // one host thread per GPU
    {
        int device_id = omp_get_thread_num();
        cudaSetDevice(device_id);
        int size_info[3] = {matrix_size, block_factor, grid_size};
        cudaMemcpyToSymbol(size, size_info, 3 * sizeof(int));

        // split the block rows of the grid between the two devices
        int grid_partition = grid_size / device_num;
        int grid_offset = device_id * grid_partition;
        int grid_count = grid_partition;
        if (device_id == device_num - 1) grid_count += grid_size % device_num;
        size_t grid_start = grid_offset * block_factor * matrix_size;

        cudaMalloc(&(d_dist[device_id]), (size_t)sizeof(int) * matrix_size * matrix_size);
#pragma omp barrier
        cudaMemcpy(&(d_dist[device_id][grid_start]), &(dist[grid_start]),
                   (size_t)sizeof(int) * block_factor * grid_count * matrix_size,
                   cudaMemcpyHostToDevice);

        dim3 block(block_factor, block_factor);
        dim3 grid2(2, grid_size);
        dim3 grid3(grid_size, grid_count);

        for (int r = 0; r < grid_size; ++r) {
            // the device that owns the pivot row sends it to the other device
            if (grid_offset <= r && r < grid_offset + grid_count) {
                size_t copy_start = r * block_factor * matrix_size;
                if (device_id == 0)
                    cudaMemcpy(&(d_dist[1][copy_start]), &(d_dist[0][copy_start]),
                               (size_t)sizeof(int) * block_factor * matrix_size,
                               cudaMemcpyDeviceToDevice);
                else
                    cudaMemcpy(&(d_dist[0][copy_start]), &(d_dist[1][copy_start]),
                               (size_t)sizeof(int) * block_factor * matrix_size,
                               cudaMemcpyDeviceToDevice);
            }
#pragma omp barrier
            phase1<<<1, block>>>(d_dist[device_id], r);
            phase2<<<grid2, block>>>(d_dist[device_id], r);
            phase3<<<grid3, block>>>(d_dist[device_id], r, grid_offset);
        }
        // each device writes back only the slice of rows it owns
        cudaMemcpy(&(dist[grid_start]), &(d_dist[device_id][grid_start]),
                   (size_t)sizeof(int) * block_factor * grid_count * matrix_size,
                   cudaMemcpyDeviceToHost);
        cudaFree(d_dist[omp_get_thread_num()]);
#pragma omp barrier
    }
    output(argv[2]);
    cudaFreeHost(dist);  // dist was allocated with cudaMallocHost, so free it with cudaFreeHost
    return 0;
}
```