When you click on the Stripe link to preorder the tinybox, it is advertised as a box running LLaMA 65B FP16 for $15,000.
To be fair, the previous page has a bit more detail on the hardware.
I can run LLaMA 65B GPTQ 4-bit on my $2300 PC (built from used parts, 128GB RAM, dual RTX 3090 @ PCIe 4.0 x8 + NVLink), and according to the GPTQ paper(§), model quality suffers very little from the quantization.
Just saying, open source is squeezing an amazing amount of LLM goodness out of commodity hardware.
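Rough weight-only math (a back-of-the-envelope sketch that ignores activations and KV cache) shows why the 4-bit model fits on two 24 GB cards while FP16 doesn't:

    # Rough weight-only memory for LLaMA 65B (ignores activations / KV cache)
    params = 65e9
    fp16_gb  = params * 2   / 1e9   # ~130 GB -> far beyond 2x 24 GB
    gptq4_gb = params * 0.5 / 1e9   # ~32.5 GB -> fits across two RTX 3090s
    print(f"FP16: {fp16_gb:.0f} GB, GPTQ 4-bit: {gptq4_gb:.1f} GB")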
Are you able to memory-pool two 3090s for 48 GB, and if so, what's your setup?
I looked into this previously[1] but wasn't super confident it's possible, or what hardware is required (2x PCIe x8 and official SLI support?).
AFAICT it would still look like two GPUs to the system.
You can memory-pool with the right software, but you don't really need to: PyTorch supports spreading large models over multiple GPUs out of the box.
Just pass the --gpu-memory parameter with two values (one per GPU) to oobabooga's text-generation-webui for example.
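If you're loading the model yourself instead of through the webui, the equivalent with Hugging Face transformers + accelerate looks roughly like this (a sketch; the model path and the 20 GiB per-card caps are illustrative placeholders, analogous to --gpu-memory 20 20):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/llama-65b"  # placeholder path
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Let accelerate shard the layers across both 3090s, capping each card
    # and allowing spillover to system RAM (values are illustrative).
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        max_memory={0: "20GiB", 1: "20GiB", "cpu": "96GiB"},
    )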
I'm using air cooling. Gigabyte X570 Pro, RTX 3090 FE, be quiet! Pure Base 500DX mesh case with four 140mm fans currently. It's not quiet under heavy load!
The GPUs got upgraded thermal pads. The 8-core Ryzen 3700X is a 65W model; it appears to be fast enough not to be a bottleneck for this purpose. 1200W PSU.
I may swap the fan in front of the GPUs for a high-rpm model.
Also, for longer runs I throttle the GPU power draw. It doesn't cost much performance.
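If anyone wants to do the same, something along these lines should cap both cards (the 280 W figure is just an example value, and setting the limit usually needs root):

    import subprocess

    # Lower the power limit on both GPUs via nvidia-smi (illustrative wattage)
    for gpu_index in (0, 1):
        subprocess.run(
            ["nvidia-smi", "-i", str(gpu_index), "-pl", "280"],
            check=True,
        )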
(§) https://arxiv.org/abs/2210.17323