
"When quantized, Mistral Small 3 can be run privately on a single RTX 4090 or a MacBook with 32GB RAM."



The trouble now is finding an RTX 4090.


RTX 3090s are easy to find and work just as well.


Running the Q4 quant (~14GB on disk) at 46 tok/sec on a 3090 Ti right now, if anyone's curious about performance. I want the headroom to try to max out the context.


Interesting - _q4 on a pair of 12GB 3060s runs at 20 tok/sec. _q8 (25GB) on the same pair is about 4 tok/sec.


~360GB/s memory bandwidth on the 3060, versus ~1008GB/s on the 3090 Ti, probably accounts for that.

Given that, I'd expect a single 3060 (if one with enough VRAM existed) to run at about 16 tok/s, so 20 tok/s across two cards, without NVLink, isn't bad.
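The back-of-envelope reasoning here is that single-stream decode is memory-bandwidth bound: each generated token streams the entire quantized weight file through the GPU, so tok/s ≈ effective bandwidth / model size. A rough sketch of that estimate, using the numbers from the comments above (the ~64% efficiency factor is backed out of the 3090 Ti measurement, not a spec):

```python
# Bandwidth-bound decode estimate: every generated token reads all
# weights once, so tok/s ~= (memory bandwidth * efficiency) / model size.

def decode_tok_per_s(bandwidth_gbs: float, model_gb: float, efficiency: float) -> float:
    return bandwidth_gbs * efficiency / model_gb

MODEL_GB = 14.0  # size of the Q4 quant, per the comment above

# Back out effective efficiency from the 3090 Ti data point
# (46 tok/s at ~1008 GB/s): roughly 0.64.
eff = 46 * MODEL_GB / 1008

# Predicted speed for a hypothetical single 3060 at ~360 GB/s.
est = decode_tok_per_s(360, MODEL_GB, eff)
print(round(est, 1))  # ~16.4 tok/s, in line with the ~16 tok/s guess
```

This ignores compute overhead and any cost of splitting layers across two cards, which is why the measured 20 tok/s on the 3060 pair landing above the single-card estimate is a reasonable result.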


Runs on an AMD 7900 XTX at roughly 20 tokens per second using LM Studio + Vulkan.



