wangzyon/NVIDIA_SGEMM_PRACTICE

Step-by-step optimization of CUDA SGEMM

Languages: Cuda, Python, CMake, Shell. Topics: cuda, sgemm.
Star and fork statistics for the wangzyon/NVIDIA_SGEMM_PRACTICE repository: as of 28 Apr 2024, the repository has 85 stars and 20 forks.

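Overview

Step-by-step optimization of matrix multiplication performance on NVIDIA GPUs using CUDA programming:

| Kernel | Description | GFLOPS | Custom kernel / cuBLAS (%) |
| --- | --- | --- | --- |
| CUBLAS | Official library function | 14448.69 | Baseline |
| kernel_1 | Naive implementation | 2262.168 | 15.65657 |
| kernel_2 | Shared-memory caching | 4216.536 | 29.18283 |
| kernel_3 | 1D thread-tile parallel optimization | 7809.629 | 54.05078 |
| kernel_4 | 2D thread-tile parallel optimization | 12251.3 | 84.79179 |
| kernel_5 | Register caching | 12177.95 | 84.28412 |
| kernel_6 | FLOAT4 vectorized memory access | 13161.49 | 91.09125 |
| kernel_7 | Double-buffered prefetching | 13634.98 | 94.36832 |

Measured on an NVIDIA GeForce RTX 3090 with matrix size 5120.

Configuration

Compiled with gcc 7.5.0 under Ubuntu 18.04.5 LTS; NVIDIA CUDA version: CUDA 10.2.

Directory

NVIDIA_SGEMM_PRACTICE ...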
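To make the starting point of the table concrete, here is a minimal sketch of what a naive SGEMM kernel of the kernel_1 kind might look like. It is not the repository's code: the kernel name `sgemm_naive`, the row-major layout, the 16x16 thread block, and the small 128x128 test size are all assumptions chosen for illustration.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical naive kernel (illustrative, not taken from the repository):
// each thread computes one element of C = alpha * A * B + beta * C,
// with A (M x K), B (K x N), C (M x N) stored row-major.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];   // every operand read from global memory
        }
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}

int main() {
    const int M = 128, N = 128, K = 128;              // small assumed size for a quick check
    const float alpha = 1.0f, beta = 0.0f;
    std::vector<float> hA(M * K), hB(K * N), hC(M * N, 0.0f), ref(M * N, 0.0f);
    for (auto &x : hA) x = rand() / (float)RAND_MAX;
    for (auto &x : hB) x = rand() / (float)RAND_MAX;

    // CPU reference result
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += hA[i * K + k] * hB[k * N + j];
            ref[i * N + j] = alpha * acc;
        }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), hC.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    sgemm_naive<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);

    float max_err = 0.0f;
    for (int i = 0; i < M * N; ++i) max_err = fmaxf(max_err, fabsf(hC[i] - ref[i]));
    printf("max abs error vs CPU reference: %g\n", max_err);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return max_err < 1e-3f ? 0 : 1;
}
```

In a kernel of this shape every multiply-add reads both operands from global memory. The later rows of the table (shared-memory caching, thread tiles, register caching, vectorized access, prefetching) are the kinds of techniques typically used to reduce that traffic.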