
"When quantized, Mistral Small 3 can be run privately on a single RTX 4090 or a MacBook with 32GB RAM."



The trouble now is finding an RTX 4090.


RTX 3090s are easy to find and work just as well.


Running the Q4 quant (~14GB on disk) at 46 tok/sec on a 3090 Ti right now, if anyone's curious about performance. I want the headroom to try to max out the context.


Interesting - _q4 on a pair of 12GB 3060s runs at 20 tok/sec. _q8 (25GB) on the same pair is about 4 tok/sec.


~360GB/s memory bandwidth on the 3060, versus ~1008GB/s on the 3090 Ti, probably accounts for that.

Given that, I'd expect a single 3060 (if one with enough VRAM existed) to run at about 16 tok/s, so 20 tok/s across two cards, without NVLink, isn't bad.
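The back-of-envelope reasoning here is that single-stream decode is memory-bandwidth bound: each generated token streams the entire quantized weight file through the GPU, so tok/s ≈ effective bandwidth / model size. A rough sketch of that estimate, using the numbers from the comments above (the ~64% efficiency factor is backed out of the 3090 Ti measurement, not a spec):

```python
# Bandwidth-bound decode estimate: every generated token reads all
# weights once, so tok/s ~= (memory bandwidth * efficiency) / model size.

def decode_tok_per_s(bandwidth_gbs: float, model_gb: float, efficiency: float) -> float:
    return bandwidth_gbs * efficiency / model_gb

MODEL_GB = 14.0  # size of the Q4 quant, per the comment above

# Back out effective efficiency from the 3090 Ti data point
# (46 tok/s at ~1008 GB/s): roughly 0.64.
eff = 46 * MODEL_GB / 1008

# Predicted speed for a hypothetical single 3060 at ~360 GB/s.
est = decode_tok_per_s(360, MODEL_GB, eff)
print(round(est, 1))  # ~16.4 tok/s, in line with the ~16 tok/s guess
```

This ignores compute overhead and any cost of splitting layers across two cards, which is why the measured 20 tok/s on the 3060 pair landing above the single-card estimate is a reasonable result.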


Runs on an AMD 7900 XTX at roughly 20 tokens per second using LM Studio + Vulkan.



