The model I'm running here is Llama 3.2 1B, the smallest on-device model I've tried that has given me good results.
The fact that a 1.2GB download can do as well as this is honestly astonishing to me - but it's going to be laughably poor in comparison to something like GPT-4o - which I'm guessing is measured in the 100s of GBs.
You can try out Llama 3.2 1B yourself directly in your browser (it will fetch about 1GB of data) at https://chat.webllm.ai/
I sort of do, especially against OG GPT-4 (before turbo)
4o is a bit too lobotomized for my taste. If you try to engage in conversation, nearly every answer after the first starts with "You're absolutely right". Bro, I don't know if I'm right, that's why I'm asking a question!
It's somehow better in _some_ scenarios but I feel like it's also objectively worse in others so it ends up being a wash. It paradoxically looks bad relative to GPT-4 but also makes GPT-4 feel worse when you go back to it...
o1-preview has been growing on me despite its answers also being very formulaic (relative to the OG GPT-3.5 and GPT-4 models which had more "freedom" in how they answered)
Yes, I use 4o for customer support in multiple languages, and sometimes I have to tell it to reply in the customer's language, while GPT-4 could easily infer it.
It can answer basic questions ("what is the capital of France"), write terrible poetry ("write a poem about a pelican and a walrus who are friends"), perform basic summarization and even generate code that might work 50% of the time.
For a 1.2GB file that runs on my laptop, those are all impressive to me.
Could it be used for actual useful work? I can't answer that yet because I haven't tried. The problem there is that I use GPT-4o and Claude 3.5 Sonnet dozens of times a day already, and downgrading to a lesser model is hard to justify for anything other than curiosity.
The implementation has no control over “how smart” the model is, and when it comes to Llama 1B, it's not very smart by current standards (but it would still have blown everyone's mind just a few years back).
The implementation absolutely can influence the outputs.
If you have a sloppy implementation that somehow accumulates a lot of error in its floating point math, you will get worse results.
It's rarely talked about, but it's a real thing. Floating point addition and multiplication are non-associative, and the order of operations affects both correctness and performance. Developers might (unknowingly) trade correctness for performance. And it matters a lot more in the low-precision modes we operate in today. Just try different methods of summing a vector containing 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0, and you won't get close to the best approximation if you do it in a naive loop.
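To make that concrete, here's a minimal Python/NumPy sketch (my own illustration, not taken from any particular inference engine) comparing a naive left-to-right fp16 loop against a pairwise (tree) summation of 9,999 fp16 ones, with every addition rounded back to fp16 in both cases:

```python
import numpy as np

def naive_sum_fp16(xs):
    # Left-to-right accumulation; every partial sum is rounded back to fp16.
    acc = np.float16(0.0)
    for x in xs:
        acc = np.float16(acc + np.float16(x))
    return acc

def pairwise_sum_fp16(xs):
    # Recursive pairwise (tree) summation, still rounding every add to fp16.
    if len(xs) == 1:
        return np.float16(xs[0])
    mid = len(xs) // 2
    return np.float16(pairwise_sum_fp16(xs[:mid]) + pairwise_sum_fp16(xs[mid:]))

ones = [1.0] * 9999
print(naive_sum_fp16(ones))     # 2048.0 -- once the running total reaches 2^11,
                                # fp16 spacing is 2.0 and each "+1" rounds away
print(pairwise_sum_fp16(ones))  # 10000.0 -- partial sums stay small, so only
                                # the last few additions lose anything
```

Same inputs, same precision, different order of operations, wildly different answers.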
We (gemma.cpp) recently started accumulating softmax terms into f64. There is at least one known case of this causing differing output, but after 200 tokens, hence unlikely to be detected in many benchmarks.
Does anyone have experience with higher-precision matmul and whether it is worthwhile?
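For anyone wondering what the higher-precision accumulation buys, here's a rough Python/NumPy sketch of the general idea (an illustration of mine, not gemma.cpp's actual code): the exponentials are computed in fp16 either way, but one version accumulates the softmax denominator in fp16 and the other in float64.

```python
import numpy as np

def softmax_fp16_accum(logits):
    # Exponentials and the running denominator all stay in fp16.
    x = np.asarray(logits, dtype=np.float16)
    e = np.exp(x - x.max())
    denom = np.float16(0.0)
    for v in e:
        denom = np.float16(denom + v)
    return e / denom

def softmax_f64_accum(logits):
    # Same fp16 exponentials, but the denominator is accumulated in float64,
    # in the spirit of accumulating the softmax terms in higher precision.
    x = np.asarray(logits, dtype=np.float16)
    e = np.exp(x - x.max())
    denom = e.astype(np.float64).sum()
    return (e.astype(np.float64) / denom).astype(np.float16)

# With many near-equal logits the fp16 accumulator starts dropping terms:
logits = np.zeros(4096)
print(softmax_fp16_accum(logits)[0])  # ~0.000488 (denominator saturates at 2048)
print(softmax_f64_accum(logits)[0])   # ~0.000244 (the correct 1/4096)
```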
Even worse, I'd say, since it has fewer bits for the fraction. At least in the example I was mentioning, where you run into precision limits, not range limits.
I believe bf16 was primarily designed as a storage format, since it just needs 16 zero bits added to be a valid fp32.
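A quick Python/NumPy illustration of that point (my own sketch, using simple truncation rather than the round-to-nearest that real converters typically use): a bf16 value is just the top 16 bits of an fp32, and padding those bits with 16 zeros gives back a valid fp32.

```python
import numpy as np

def fp32_to_bf16_bits(values):
    # View the fp32 bit patterns as uint32 and keep only the top 16 bits:
    # same sign bit and 8-bit exponent as fp32, with 7 mantissa bits left.
    bits = np.asarray(values, dtype=np.float32).reshape(-1).view(np.uint32)
    return (bits >> 16).astype(np.uint16)   # plain truncation toward zero

def bf16_bits_to_fp32(bits16):
    # Appending 16 zero bits turns each bf16 pattern back into a valid fp32.
    return (bits16.astype(np.uint32) << 16).view(np.float32)

x = np.float32(3.14159265)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))  # [3.140625] -- coarser mantissa,
                                                # same exponent range as fp32
```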