Maybe consider putting cutlass in your CUDA/Triton kernels (by maknee) — discussion

#ai