Test setup
Test matrices: two 10000×10000 all-ones matrices (every element is 1)
CPU: i7-7700HQ @ 2.8 GHz (single-core turbo 3.5 GHz), a laptop part
GPU: GTX 1060 6 GB
NumPy
Code:
import numpy as np
import time
s=10000
# float32 (single precision)
a=np.ones((s,s),dtype=np.float32)
b=np.ones((s,s),dtype=np.float32)
t=time.time()
c=np.dot(a,b)
print(time.time()-t)
# float64 (double precision)
a=np.ones((s,s),dtype=np.float64)
b=np.ones((s,s),dtype=np.float64)
t=time.time()
c=np.dot(a,b)
print(time.time()-t)
During the computation NumPy appeared to saturate all four cores / eight threads.
Results:
float32 (float): 14.6 s
float64 (double): 23.2 s
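As a sanity check on these numbers: a dense multiply of two s×s matrices costs 2·s³ floating-point operations, so the measured times translate into achieved throughput as follows (a quick back-of-the-envelope calculation from the timings above, nothing measured separately):

```python
# Convert the measured wall-clock times into achieved GFLOP/s.
# A dense s x s by s x s matrix multiply costs 2*s^3 floating-point ops.
s = 10000
flops = 2 * s ** 3  # 2e12 operations for this test

results = {"float32": 14.6, "float64": 23.2}  # seconds, measured above
for dtype, seconds in results.items():
    print(f"{dtype}: {flops / seconds / 1e9:.0f} GFLOP/s")
# float32: 137 GFLOP/s
# float64: 86 GFLOP/s
```

Roughly a 1.6× gap between single and double precision, which is what you would expect from a CPU BLAS that is limited by SIMD width and memory bandwidth.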
cuBLAS
Code:
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <cuda_runtime.h>
#include <cublas_v2.h>
using std::cout;
using std::endl;

template <typename T = float>
void testMatrixTime2(int M = 10000, int N = 10000, int K = 10000) {
    T *A, *B, *C;
    T *dev_A, *dev_B, *dev_C;
    T alpha = 1, beta = 0;
    // Create the cuBLAS handle
    cublasHandle_t handle;
    cublasCreate(&handle);
    A = (T*)malloc(M * K * sizeof(T));
    B = (T*)malloc(K * N * sizeof(T));
    C = (T*)malloc(M * N * sizeof(T));
    cudaMalloc(&dev_A, M * K * sizeof(T));
    cudaMalloc(&dev_B, K * N * sizeof(T));
    cudaMalloc(&dev_C, M * N * sizeof(T));
    for (int i = 0; i < M * K; ++i) {
        A[i] = 1;
    }
    for (int i = 0; i < K * N; ++i) {
        B[i] = 1;
    }
    // Note: clock() measures wall time on Windows but CPU time on Linux.
    clock_t t = clock();
    cudaMemcpy(dev_A, A, M * K * sizeof(T), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, B, K * N * sizeof(T), cudaMemcpyHostToDevice);
    // For double precision, call cublasDgemm instead.
    // cuBLAS assumes column-major storage, hence the leading dimensions
    // M, K, M; with all-ones inputs the layout does not affect the result.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K,
                &alpha,
                dev_A, M,
                dev_B, K,
                &beta,
                dev_C, M);
    cudaMemcpy(C, dev_C, M * N * sizeof(T), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();  // cudaThreadSynchronize() is deprecated
    cout << "time " << (double)(clock() - t) / CLOCKS_PER_SEC << endl;
    // Release resources
    cublasDestroy(handle);
    cudaFree(dev_A); cudaFree(dev_B); cudaFree(dev_C);
    free(A); free(B); free(C);
}
Results:
float32 (float): 0.908 s
float64 (double): 16.5 s
This confirms that double-precision throughput on gaming GPUs is quite poor.
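Applying the same 2·s³ operation count to the GPU timings quantifies the gap. Keep in mind that the measured times include the host↔device copies, so these figures are lower bounds on the raw kernel throughput:

```python
# Achieved GFLOP/s for the cuBLAS runs, from the timings above.
# The times include host<->device copies, so kernel throughput is higher.
s = 10000
flops = 2 * s ** 3

t_f32, t_f64 = 0.908, 16.5  # seconds, measured above
g32 = flops / t_f32 / 1e9
g64 = flops / t_f64 / 1e9
print(f"float32: {g32:.0f} GFLOP/s")          # ~2203
print(f"float64: {g64:.0f} GFLOP/s")          # ~121
print(f"fp32:fp64 ratio ~ {g32 / g64:.0f}:1")  # ~18:1
```

An ~18:1 measured ratio is consistent with consumer Pascal parts, whose FP64 units run at 1/32 of the FP32 rate; the copy overhead pads the float32 time, which is why the observed ratio comes out below 32.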
MATLAB
Code:
s=10000;
% double precision
a=ones(s,s);
b=ones(s,s);
tic;
c=a*b;
toc;
% single precision
a=single(a);
b=single(b);
tic;
c=a*b;
toc;
CPU usage ran at 60-odd percent, presumably four cores computing in parallel.
Results:
float32 (float): 11.308408 s
float64 (double): 23.026492 s
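Putting all six timings side by side, the headline result is the GPU's single-precision advantage; the speedups below are computed directly from the numbers reported above:

```python
# Relative speedups for the 10000x10000 GEMM, from the timings above.
times = {
    ("numpy",  "float32"): 14.6,
    ("numpy",  "float64"): 23.2,
    ("cuBLAS", "float32"): 0.908,
    ("cuBLAS", "float64"): 16.5,
    ("MATLAB", "float32"): 11.308408,
    ("MATLAB", "float64"): 23.026492,
}
for dtype in ("float32", "float64"):
    cpu = times[("numpy", dtype)]
    gpu = times[("cuBLAS", dtype)]
    print(f"{dtype}: cuBLAS is {cpu / gpu:.1f}x faster than numpy")
# float32: cuBLAS is 16.1x faster than numpy
# float64: cuBLAS is 1.4x faster than numpy
```

In single precision the GPU wins by an order of magnitude, while in double precision the gaming GPU barely beats the laptop CPU at all.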