Cjkkkk/CUDA_gemm

A simple high performance CUDA GEMM implementation.

Languages: Cuda, Python, C++, Makefile, Shell
Star and fork statistics for the Cjkkkk/CUDA_gemm repository: as of 26 Apr 2024, it has 187 stars and 23 forks.

Introduction

A simple high performance CUDA GEMM, Block Sparse GEMM and Non-uniform Quantized GEMM implementation.

C = alpha * A * B + beta * C

Algorithm (located in src/cuda/)

MatrixMulCUDA: one element of C is assigned one thread; global memory coalesce of B
MatrixMulCUDA1: texture load
MatrixMulCUDA2: one 4 * 4 grid of C is assigned one thread
MatrixMulCUDA3: vectorized A B load
MatrixMulCUDA4: vectorized C store
MatrixMulCUDA5: block sparse version
MatrixMulCUDA6: vectorized A B load coalesce
MatrixMulCUDA7: warp...
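As a rough illustration of the first variant in the list (one thread per element of C, with coalesced reads of B), a kernel might look like the sketch below. This is not the repository's code: the names `gemm_naive` and `run_gemm`, the row-major layout, and the 32 x 8 thread block are all assumptions made for the example, and error checking is left out for brevity.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel, not taken from the repository.
// Computes C = alpha * A * B + beta * C with one thread per element of C.
// All matrices are assumed row-major: A is M x K, B is K x N, C is M x N.
__global__ void gemm_naive(int M, int N, int K, float alpha,
                           const float* A, const float* B,
                           float beta, float* C) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= M || col >= N) return;  // guard the ragged edge of the grid

    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        // With blockDim.x == 32, the threads of a warp share `row` and differ
        // only in `col`: B[k * N + col] is a coalesced read across the warp,
        // while A[row * K + k] is the same address for the whole warp.
        acc += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}

// Hypothetical host helper: copies the inputs to the device, launches the
// kernel, and copies C back. CUDA error checking is omitted in this sketch.
void run_gemm(int M, int N, int K, float alpha,
              const std::vector<float>& A, const std::vector<float>& B,
              float beta, std::vector<float>& C) {
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(32, 8);  // assumed shape: one warp spans one row of the block
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    gemm_naive<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);

    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

The later variants in the list (a 4 * 4 grid of C per thread, vectorized loads and stores) trade this simple mapping for more work per thread and wider memory transactions; the sketch only covers the baseline mapping.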