That's flatly wrong. Each successive token costs progressively more: the deeper a token sits in the sequence, the more past states it has to attend to. As evidence, just recall how slow generation gets when the context is large, and how snappy it is when you first start a chat.
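A rough sketch of that point, with toy dimensions I'm assuming for illustration: at decode step t the new query attends over all t cached key/value vectors, so per-token attention work grows with position.

```python
import numpy as np

d_model = 64                              # toy head dimension (assumption)
max_len = 8

keys, values = [], []                     # the growing KV cache
for t in range(1, max_len + 1):
    q = np.random.randn(d_model)          # query for the newest token
    keys.append(np.random.randn(d_model)) # cache the new key/value
    values.append(np.random.randn(d_model))
    K = np.stack(keys)                    # shape (t, d_model)
    V = np.stack(values)
    scores = K @ q / np.sqrt(d_model)     # t dot products: work grows with t
    w = np.exp(scores - scores.max())
    w /= w.sum()
    out = w @ V                           # weighted sum over t cached values
    print(f"token {t}: attended over {t} past states")
```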
The way I worded it, it might seem wrong, and I agree with you there. When I said "constant" I meant without any optimizations that speed up shorter contexts: with the full designed context, the computation is architecturally constant. You can pad shorter active contexts with zeros and skip attending to the empty slots, but that is just an optimization, not an architectural property. If you want "more computation" you fill the context with relevant data (chain of thought, or n-shot examples), which is the "trick" Karpathy alluded to, since it provides more context to attend to, and I agree with that analysis.
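A minimal sketch of the "pad to the designed context" view, assuming toy sizes: the attention matmuls always run at the full context length, and a mask zeroes out the empty slots, so the shape of the computation is the same however much of the context is actually used.

```python
import numpy as np

ctx_len, d_model = 16, 32                  # designed context length (assumption)
n_active = 5                               # positions actually filled with data

Q = np.random.randn(ctx_len, d_model)
K = np.random.randn(ctx_len, d_model)
V = np.random.randn(ctx_len, d_model)

scores = Q @ K.T / np.sqrt(d_model)        # always a (ctx_len, ctx_len) matmul
mask = np.full((ctx_len, ctx_len), -np.inf)
mask[:, :n_active] = 0.0                   # only the active positions are visible
w = np.exp(scores + mask)                  # masked slots contribute zero weight
w /= w.sum(axis=-1, keepdims=True)
out = w @ V                                # same-shaped computation either way
print(out.shape)                           # (16, 32) regardless of n_active
```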
You're both kinda right. The computation in the attention step you're referring to happens in parallel across positions. What I would call "constant" is the computation graph depth (the number of sequential computations), which actually matters for what functions the model can compute.
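A rough sketch of that depth argument, with an assumed toy model: in a single forward pass the work across positions is done in parallel, so the number of sequential steps is set by the layer count, not by how long the context is.

```python
import numpy as np

n_layers, d_model = 4, 32                 # assumed toy sizes

def layer(x):
    # stand-in for attention + MLP: every position is processed in one parallel step
    W = np.random.randn(d_model, d_model) / np.sqrt(d_model)
    return np.tanh(x @ W)

for seq_len in (4, 64, 1024):
    x = np.random.randn(seq_len, d_model)
    depth = 0
    for _ in range(n_layers):             # only the layers are sequential
        x = layer(x)
        depth += 1
    print(f"seq_len={seq_len}: sequential depth={depth}")  # always n_layers
```

Total work still grows with sequence length, of course; it's the sequential depth that stays fixed, which is the sense in which the per-pass computation is "constant".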