The model I'm running here is Llama 3.2 1B, the smallest on-device model I've tried that has given me good results.
The fact that a 1.2GB download can do as well as this is honestly astonishing to me - but it's going to be laughably poor in comparison to something like GPT-4o - which I'm guessing is measured in the 100s of GBs.
You can try out Llama 3.2 1B yourself directly in your browser (it will fetch about 1GB of data) at https://chat.webllm.ai/
I sort of do, especially against OG GPT-4 (before turbo)
4o is a bit too lobotomized for my taste. If you try to engage in conversation, nearly every answer after the first starts with "You're absolutely right". Bro, I don't know if I'm right, that's why I'm asking a question!
It's somehow better in _some_ scenarios but I feel like it's also objectively worse in others so it ends up being a wash. It paradoxically looks bad relative to GPT-4 but also makes GPT-4 feel worse when you go back to it...
o1-preview has been growing on me despite its answers also being very formulaic (relative to the OG GPT-3.5 and GPT-4 models which had more "freedom" in how they answered)
Yes, I use 4o for customer support in multiple languages, and sometimes I have to tell it to reply in the customer's language, while GPT-4 could easily infer it.
It can answer basic questions ("what is the capital of France"), write terrible poetry ("write a poem about a pelican and a walrus who are friends"), perform basic summarization and even generate code that might work 50% of the time.
For a 1.2GB file that runs on my laptop, those are all impressive to me.
Could it be used for actual useful work? I can't answer that yet because I haven't tried. The problem there is that I use GPT-4o and Claude 3.5 Sonnet dozens of times a day already, and downgrading to a lesser model is hard to justify for anything other than curiosity.
The implementation has no control over “how smart” the model is, and when it comes to Llama 1B, it's not very smart by current standards (but it would still have blown everyone's mind just a few years back).
The implementation absolutely can influence the outputs.
If you have a sloppy implementation that somehow accumulates a lot of error in its floating point math, you will get worse results.
It's rarely talked about, but it's a real thing. Floating point addition and multiplication are non-associative, and the order of operations affects both correctness and performance. Developers might (unknowingly) trade correctness for performance. And it matters a lot more in the low-precision modes we operate in today. Just try different methods of summing a vector containing 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0, and you won't get close to the best approximation if you do it in a naive loop.
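To make that concrete, here's a minimal Python/NumPy sketch (my own illustration, not taken from any particular inference engine) comparing a naive left-to-right fp16 loop against a pairwise (tree) summation of 9,999 fp16 ones, with every addition rounded back to fp16 in both cases:

```python
import numpy as np

def naive_sum_fp16(xs):
    # Left-to-right accumulation; every partial sum is rounded back to fp16.
    acc = np.float16(0.0)
    for x in xs:
        acc = np.float16(acc + np.float16(x))
    return acc

def pairwise_sum_fp16(xs):
    # Recursive pairwise (tree) summation, still rounding every add to fp16.
    if len(xs) == 1:
        return np.float16(xs[0])
    mid = len(xs) // 2
    return np.float16(pairwise_sum_fp16(xs[:mid]) + pairwise_sum_fp16(xs[mid:]))

ones = [1.0] * 9999
print(naive_sum_fp16(ones))     # 2048.0 -- once the running total reaches 2^11,
                                # fp16 spacing is 2.0 and each "+1" rounds away
print(pairwise_sum_fp16(ones))  # 10000.0 -- partial sums stay small, so only
                                # the last few additions lose anything
```

Same inputs, same precision, different order of operations, wildly different answers.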
We (gemma.cpp) recently started accumulating softmax terms into f64. There is at least one known case of this causing differing output, but after 200 tokens, hence unlikely to be detected in many benchmarks.
Does anyone have experience with higher-precision matmul and whether it is worthwhile?
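For anyone wondering what the higher-precision accumulation buys, here's a rough Python/NumPy sketch of the general idea (an illustration of mine, not gemma.cpp's actual code): the exponentials are computed in fp16 either way, but one version accumulates the softmax denominator in fp16 and the other in float64.

```python
import numpy as np

def softmax_fp16_accum(logits):
    # Exponentials and the running denominator all stay in fp16.
    x = np.asarray(logits, dtype=np.float16)
    e = np.exp(x - x.max())
    denom = np.float16(0.0)
    for v in e:
        denom = np.float16(denom + v)
    return e / denom

def softmax_f64_accum(logits):
    # Same fp16 exponentials, but the denominator is accumulated in float64,
    # in the spirit of accumulating the softmax terms in higher precision.
    x = np.asarray(logits, dtype=np.float16)
    e = np.exp(x - x.max())
    denom = e.astype(np.float64).sum()
    return (e.astype(np.float64) / denom).astype(np.float16)

# With many near-equal logits the fp16 accumulator starts dropping terms:
logits = np.zeros(4096)
print(softmax_fp16_accum(logits)[0])  # ~0.000488 (denominator saturates at 2048)
print(softmax_f64_accum(logits)[0])   # ~0.000244 (the correct 1/4096)
```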
Even worse, I'd say, since it has fewer bits for the fraction. At least in the example I was mentioning, where you run into precision limits, not range limits.
I believe bf16 was primarily designed as a storage format, since it just needs 16 zero bits added to be a valid fp32.
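A quick Python/NumPy illustration of that point (my own sketch, using simple truncation rather than the round-to-nearest that real converters typically use): a bf16 value is just the top 16 bits of an fp32, and padding those bits with 16 zeros gives back a valid fp32.

```python
import numpy as np

def fp32_to_bf16_bits(values):
    # View the fp32 bit patterns as uint32 and keep only the top 16 bits:
    # same sign bit and 8-bit exponent as fp32, with 7 mantissa bits left.
    bits = np.asarray(values, dtype=np.float32).reshape(-1).view(np.uint32)
    return (bits >> 16).astype(np.uint16)   # plain truncation toward zero

def bf16_bits_to_fp32(bits16):
    # Appending 16 zero bits turns each bf16 pattern back into a valid fp32.
    return (bits16.astype(np.uint32) << 16).view(np.float32)

x = np.float32(3.14159265)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))  # [3.140625] -- coarser mantissa,
                                                # same exponent range as fp32
```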