- Peaks above 1350 FP8 TFLOPS on Hopper GPUs
- No heavy dependencies; as clean as a tutorial
- Fully Just-In-Time compiled
- Core logic is ~300 lines, yet it outperforms expert-tuned kernels across most matrix sizes
- Supports dense layout and two MoE layouts
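
As a point of reference for the throughput figure in the first bullet, GEMM throughput is conventionally computed from the problem shape and the measured kernel time, counting `2 * M * N * K` floating-point operations per multiplication. The sketch below shows that arithmetic only; the helper name `gemm_tflops`, the shape, and the timing are illustrative assumptions, not measurements or APIs from this project.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOPS for one (m x k) @ (k x n) GEMM."""
    # Each of the m * n outputs needs k multiplies and k adds.
    flops = 2 * m * n * k
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical shape and timing, chosen only to illustrate the formula.
    print(f"{gemm_tflops(4096, 4096, 4096, 1.0e-4):.1f} TFLOPS")
```

When comparing kernels, the same formula should be applied to the same shapes, since throughput varies considerably with matrix size.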