Sad to see everyone so focused on compute expense during this massive breakthrou...

stocknoob · 2024-12-21T03:46:20 1734752780

It’s wild, are people purposefully overlooking that inference costs are dropping 10-100x each year?

https://a16z.com/llmflation-llm-inference-cost/

Look at the log scale slope, especially the orange MMLU > 83 data points.

menaerus · 2024-12-21T13:10:19 1734786619

Those are the (subsidized) prices that end clients are paying for the service so that's not something that is representative of what the actual inference costs are. Somebody still needs to pay that (actual) price in the end. For inference, as well as for training, you need actual (NVidia) hardware and that hardware didn't become any cheaper. OTOH models are only becoming increasingly more complex and bigger and with more and more demand I don't see those costs exactly dropping down.

atleastoptimal · 2024-12-21T15:25:22 1734794722

Actual inference costs without considering subsidies and loss leaders are going down, due to algorithmic improvements, hardware improvements, and quantized/smaller models getting the same performance as larger ones. Companies are making huge breakthroughs making chips specifically for LLM inference

menaerus · 2024-12-22T10:19:27 1734862767

In August 2023, llama2 34B was released and at that time, without employing model quantization, in order to fit this model one needed to have a GPU, or set of GPUs, with total of ~34x2.5=85G of VRAM.

That said, can you be more specific what are those "algorithmic" and "hardware" improvements that has driven this cost and hardware requirements down? AFAIK I still need the same hardware to run this very same model.

stocknoob · 2024-12-24T03:13:54 1735010034

Take a look at the latest Llama and Phi models. They get comparable MMLU performance for ~10% of the parameters. Not to mention the cost/flop and cost/gb for GPUs has dropped.

You aren’t trying to run an old 2023 model as is, you’re trying to match its capabilities. The old models just show what capabilities are possible.

menaerus · 2024-12-24T12:26:08 1735043168

Sure, let's say that 8B llama3.1 gets comparable performance of it's 70B llama2 predecessor. Not quite true but let's say that hypothetically it is. That still leave us with 70B llama3.1.

How much VRAM and inference compute is required to run 3.1-70B vs 2-70B?

The argument is that the inference cost is dropping down significantly each year but how exactly if those two models require about the ~same, give or take, amount of VRAM and compute?

One way to drive the cost down is to innovate in inference algorithms such that the HW requirements are loosened up.

In the context of inference optimizations one such is flash-decode, similar to its training counter-part flash-attention, from the same authors. However, that particular optimization concerns only by improving the inference runtime by dropping down the number of memory accesses needed to compute the self-attention. Amount of total VRAM you need in order to just load the model still remains the same so although it is true that you might get a tad more from the same HW, the initial requirement of total HW you need remains to be the same. Flash-decode is also nowhere near the impact of flash-attention. Latter enabled much faster training iteration runtimes while the former has had quite limited impact, mostly because scale of inference is so much smaller than the training so the improvements do not always see the large gains.

> Not to mention the cost/flop and cost/gb for GPUs has dropped.

For training. Not for inference. GPU prices remained about the same, give or take.

stocknoob · 2024-12-25T16:15:13 1735143313

> How much VRAM and inference compute is required to run 3.1-70B vs 2-70B?

We aren’t trying to mindlessly consume the same VRAM as last year and hope costs magically drop. We are noticing that we can get last year’s mid-level performance on this year’s low-end model, leading to cost savings at that perf level. The same thing happens next year, leading to a drop in cost at any given perf level over time.

> For training. Not for inference. GPU prices remained about the same, give or take.

See:

https://epoch.ai/blog/trends-in-gpu-price-performance

We don’t care about the absolute price, is the cost per flop or cost per GB decreasing over time with each new GPU?

—-

If it isn’t clear why inference costs at any given performance level will drop given the points above, unfortunately I can’t help you further.

menaerus · 2024-12-27T11:48:10 1735300090

We absolutely care about absolute costs. 70B model this year will cost as much as it will next year, unless Nvidia decides to lose their profits. The question is whether an inference cost is dropping down. And the answer is obviously no. I see that you're out of your depth so let's just stop here.

croes · 2024-12-21T10:17:37 1734776257

A bit early for a every year claim not to mention what all these AI is used for.

In some parts of the internet it’s you hardly find real content only AI spam.

It will get worse the cheaper it gets.

Think of email spam.

yawnxyz · 2024-12-21T01:43:01 1734745381

I think the question everyone has in their minds isn't "when will AGI get here" or even "how soon will it get here" — it's "how soon will AGI get so cheap that everyone will get their hands on it"

that's why everyone's thinking about compute expense. but I guess in terms of a "lifetime expense of a person" even someone who costs $10/hr isn't actually all that cheap, considering what it takes to grow a human into a fully functioning person that's able to just do stuff

croes · 2024-12-21T10:18:34 1734776314

We are nowhere near AGI.