siboehm/SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

Cuda, Shell, Python, CMake, Makefile
Star and fork statistics for the siboehm/SGEMM_CUDA repository: as of 29 Apr 2024, it has 126 stars and 11 forks.

Fast CUDA SGEMM from Scratch

Step-by-step optimization of matrix multiplication, implemented in CUDA. For an explanation of each kernel, see siboehm.com/CUDA-MMM.

Overview

Running the kernels on an NVIDIA A6000 (Ampere):

GFLOPs at matrix size 4096x4096:

Kernel                                GFLOPs/s   Performance relative to cuBLAS
1: Naive                                 309.0    1.3%
2: GMEM Coalescing                      1986.5    8.5%
3: SMEM Caching                         2980.3   12.8%
4: 1D Blocktiling                       8474.7   36.5%
5: 2D Blocktiling                      15971.7   68.7%
7: Avoid Bank Conflicts (Linearize)    16213.4   69.7%
8: Avoid...