
Been in the Mac ecosystem since 2008, love it, but there is, and always has been, a tendency to extrapolate inevitabilities from scaling bespoke, extremely expensive configurations. And with LLMs, there's heavy eliding of what the user experience actually is, beyond noting response generation speed in tokens/sec.

They run on a laptop, yes - you might squeeze up to 10 tokens/sec out of a kinda-sorta GPT-4, if you paid $5K+ for an Apple laptop in the last 18 months.

And that's after you've spent about 2 minutes watching a 1,000-token* prompt prefill at 10 tokens/sec.
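Back-of-the-envelope on those numbers (the 500-token response length below is my own illustrative assumption):

  # Rough latency estimate for one local-LLM coding query.
  # All numbers are illustrative assumptions, not measurements.
  prompt_tokens = 1000     # ~3 pages of context (see footnote)
  response_tokens = 500    # a medium-length answer
  prefill_tps = 10         # tokens/sec while ingesting the prompt
  generate_tps = 10        # tokens/sec while streaming the answer

  prefill_s = prompt_tokens / prefill_tps       # 100 s before the first token appears
  generate_s = response_tokens / generate_tps   # another 50 s of streaming
  print(f"time to first token: {prefill_s:.0f}s, total: {prefill_s + generate_s:.0f}s")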

Usually it'd be obvious this'd trickle down, things always do, right?

But... Apple has infamously been stuck at 8 GB of RAM, even in $1,500 base models, for years. I have no idea why, but my intuition is that RAM was roughly doubling in capacity at the same cost every 3 years until the early 2010s, then mostly stalled out after 2015.

And regardless of any of the above, this absolutely melts your battery. Like, your 16 hr battery life becomes 40 minutes, no exaggeration.

I don't know why prefill (ingesting your prompt) is so slow for local LLMs, but it is. I assume that if you have a bunch of servers, there's some caching you can do that works across prompts.
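A minimal sketch of what that cross-prompt caching could look like, in plain Python with a dict standing in for reusable KV state; the names here are hypothetical, though real servers (e.g. vLLM's prefix caching, llama.cpp's prompt cache) do something along these lines:

  # Hypothetical prefix cache: if many requests share the same system prompt /
  # few-shot prefix, the expensive prefill for that prefix is computed once and
  # reused, so only the new suffix of each prompt needs prefilling.
  kv_cache = {}  # prefix text -> precomputed KV state (placeholder values here)

  def split_for_prefill(prompt: str):
      """Return (reusable cached state or None, suffix that still needs prefill)."""
      for prefix in sorted(kv_cache, key=len, reverse=True):
          if prompt.startswith(prefix):
              return kv_cache[prefix], prompt[len(prefix):]
      return None, prompt  # cache miss: the whole prompt must be prefilled

  kv_cache["You are a helpful coding assistant.\n"] = "KV state for system prompt"
  state, suffix = split_for_prefill("You are a helpful coding assistant.\nFix this bug: ...")
  # On a single laptop with one user, consecutive prompts rarely share long
  # prefixes, so there is little to reuse and every request pays full prefill.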

I expect the local LLM community to be roughly the same size it is today 5 years from now.

* ~3 pages / ~750 words; what I'd expect as a conservative average prompt size when coding




I have a 2023 MBP, and I get about 100-150 tok/sec locally with LM Studio.


Which models?


For context, I've got an M2 Max MBP with 64 GB shared RAM, bought in March 2023 for $5-6K.

  Llama 3.2  1.0B - 650 t/s
  Phi 3.5    3.8B -  60 t/s
  Llama 3.1  8.0B -  37 t/s
  Mixtral   14.0B -  24 t/s
Full GPU acceleration, using llama.cpp, just like LM Studio.
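For anyone wanting to reproduce numbers like these, a rough sketch using llama-cpp-python; the model path and prompt are placeholders, and timing prefill and generation together will understate pure generation speed:

  # Crude tokens/sec measurement with llama-cpp-python (pip install llama-cpp-python).
  # n_gpu_layers=-1 offloads all layers to the GPU (Metal on Apple Silicon).
  import time
  from llama_cpp import Llama

  llm = Llama(model_path="llama-3.1-8b-instruct-q4_k_m.gguf", n_gpu_layers=-1)

  start = time.time()
  out = llm("Explain unified memory in one paragraph.", max_tokens=128)
  elapsed = time.time() - start

  generated = out["usage"]["completion_tokens"]
  print(f"{generated / elapsed:.1f} tok/sec (prefill + generation combined)")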


hugging-quants/llama-3.2-1b-instruct-q8_0-gguf - 100-150 tok/sec

second-state/llama-2-7b-chat-gguf nets me around 35 tok/sec

lmstudio-community/granite-3.1-8b-instruct-GGUF - ~50 tok/sec

MBP M3 Max, 64 GB - $3K


I'm not sure if you're pointing out any / all of these:

#1. It is possible to get an arbitrarily fast tokens/second number, given you can pick model size.

#2. Llama 1B is roughly GPT-4.

#3. Given Llama 1B runs at 100 tokens/sec, and given performance at a given model size has continued to improve over the past 2 years, we can assume there will eventually be a GPT-4 quality model at 1B.

On my end:

#1. Agreed.

#2. Vehemently disagree.

#3. TL;DR: I don't expect that; at the least, the trend line isn't steep enough for me to expect it in the next decade.


I specifically missed the GPT-4 part of "up to 10 tokens/sec out of a kinda-sorta GPT-4". I was just looking at tokens/sec.



