Not sure how to formulate this, but what does this mean in terms of how "smart" it is compared to the latest ChatGPT version?



The model I'm running here is Llama 3.2 1B, the smallest on-device model I've tried that has given me good results.

The fact that a 1.2GB download can do as well as this is honestly astonishing to me - but it's going to be laughably poor in comparison to something like GPT-4o - which I'm guessing is measured in the 100s of GBs.

You can try out Llama 3.2 1B yourself directly in your browser (it will fetch about 1GB of data) at https://chat.webllm.ai/


Anyone else think 4o is kinda garbage compared to the older GPT-4? As well as o1-preview and probably o1-mini.

GPT-4 tends to be more accurate than 4o for me.


I sort of do, especially against OG GPT-4 (before turbo)

4o is a bit too lobotomized for my taste. If you try to engage in conversation, nearly every answer after the first starts with "You're absolutely right". Bro, I don't know if I'm right, that's why I'm asking a question!

It's somehow better in _some_ scenarios but I feel like it's also objectively worse in others so it ends up being a wash. It paradoxically looks bad relative to GPT-4 but also makes GPT-4 feel worse when you go back to it...

o1-preview has been growing on me despite its answers also being very formulaic (relative to the OG GPT-3.5 and GPT-4 models which had more "freedom" in how they answered)


Yes, I use 4o for customer support in multiple languages, and sometimes I have to tell it to reply in the customer's language, while GPT-4 could easily infer it.


GPT-4o is a weak version of GPT-4 with "steps-instructions". GPT-4 is just too expensive, which is why OpenAI is releasing all these mini versions.


> that has given me good results.

Can you help somebody who's out of the loop frame/judge/measure 'good results'?

Can you give an example of something it can do that's impressive/worthwhile? Can you give an example of where it falls short / gets tripped up?

Is it just a hallucination machine? What good does that do for anybody? Genuinely trying to understand.


It can answer basic questions ("what is the capital of France"), write terrible poetry ("write a poem about a pelican and a walrus who are friends"), perform basic summarization and even generate code that might work 50% of the time.

For a 1.2GB file that runs on my laptop those are all impressive to me.

Could it be used for actual useful work? I can't answer that yet because I haven't tried. The problem there is that I use GPT-4o and Claude 3.5 Sonnet dozens of times a day already, and downgrading to a lesser model is hard to justify for anything other than curiosity.


The implementation has no control over "how smart" the model is, and when it comes to Llama 1B, it's not very smart by current standards (but it would still have blown everyone's mind just a few years back).


The implementation absolutely can influence the outputs.

If you have a sloppy implementation which somehow accumulates a lot of error in its floating-point math, you will get worse results.

It's rarely talked about, but it's a real thing. Floating-point addition and multiplication are non-associative, and the order of operations affects both correctness and performance. Developers might (unknowingly) trade performance for correctness. And it matters a lot more in the low-precision modes we operate in today. Just try different methods of summing a vector containing 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0, and you won't get close to the best approximation if you do it in a naive loop.
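
As a concrete illustration, here's my own minimal NumPy sketch (not any particular runtime's code): a naive fp16 accumulator stalls at 2048.0, because once it reaches 2048 the spacing between adjacent fp16 values is 2 and adding 1.0 rounds back down, while accumulating in fp32 and rounding once at the end gives 10000.0, the closest fp16 value to 9,999.

    import numpy as np

    ones = np.ones(9_999, dtype=np.float16)

    # Naive fp16 loop: stalls at 2048.0, because once the accumulator
    # reaches 2048 the gap between adjacent fp16 values is 2, so
    # 2048 + 1 rounds back down to 2048.
    acc = np.float16(0.0)
    for x in ones:
        acc = np.float16(acc + x)
    print(acc)  # 2048.0

    # Accumulate in fp32 and round to fp16 once at the end: 10000.0,
    # the closest fp16 value to 9,999 (which itself is not representable).
    print(np.float16(ones.sum(dtype=np.float32)))  # 10000.0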


I thought all current implementations accumulate into fp32 instead of accumulating in fp16.


We (gemma.cpp) recently started accumulating softmax terms into f64. There is at least one known case of this causing differing output, but after 200 tokens, hence unlikely to be detected in many benchmarks.
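
For anyone wondering what that looks like concretely, here is a rough NumPy sketch of the idea (not gemma.cpp's actual code): activations stay in low precision, but the softmax normalizer is accumulated in float64.

    import numpy as np

    def softmax_f64_denominator(logits):
        # Sketch only: inputs/outputs stay in low precision, but the
        # normalizing sum of exp() terms is accumulated in float64.
        x = logits.astype(np.float32)
        x -= x.max()                    # usual max-subtraction for stability
        exps = np.exp(x)
        denom = np.float64(0.0)
        for e in exps:                  # f64 accumulation of the denominator
            denom += np.float64(e)
        return (exps / denom).astype(logits.dtype)

    print(softmax_f64_denominator(np.array([1.0, 2.0, 3.0], dtype=np.float16)))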

Does anyone have experience with higher-precision matmul and whether it is worthwhile?


Isn’t 200 tokens basically nothing? Did you mean to say 2000?


That's indeed short for some actual uses such as summarization, but AFAIK many/most? evals involve generating less than 200.


I haven't looked at all implementations, but the hardware (tensor cores as well as cuda cores) allows you to accumulate at fp16 precision.


How well does bf16 work in comparison?


Even worse, I'd say, since it has fewer bits for the fraction. At least in the example I was mentioning, where you run into precision limits, not range limits.

I believe bf16 was primarily designed as a storage format, since it just needs 16 zero bits added to be a valid fp32.
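
A tiny Python illustration of that point (my own sketch): widening a bf16 bit pattern with 16 zero bits gives an fp32 with the same value.

    import struct

    def bf16_bits_to_f32(bits16: int) -> float:
        # bf16 is the top half of an fp32: appending 16 zero bits
        # yields a valid fp32 bit pattern with the same value.
        return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

    print(bf16_bits_to_f32(0x3F80))  # 1.0
    print(bf16_bits_to_f32(0x4049))  # 3.140625, the bf16 value closest to pi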


TIL, thanks.



