Ask HN: Do LLMs need a context window?
1 point by kleene_op on Dec 26, 2023 | 5 comments
Excuse me for this potentially dumb question, but...

Why don't we train LLMs with user inputs at each step, instead of keeping the model static and feeding it the whole damn history every time?

I think I may have a clue as to why, actually: is it because this would force us to duplicate the model for every user (since their weights would diverge), and companies like OpenAI deem it too costly?

If so, will the growing affordability of local models downloaded by individuals enable a switch to this continuous-training approach soon enough, or am I forgetting something?




You can't feasibly train an LLM in real time. Training and inference are two different things.

OpenAI has a separate service for fine-tuning ChatGPT, and it is not speedy, and that's likely with shenanigans behind the scenes.
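
To make the gap concrete, here's a minimal PyTorch-style sketch (the tiny linear model and shapes are just stand-ins, not anything OpenAI actually runs): inference is a single forward pass, while one training step also needs a backward pass and an optimizer update that carries extra state per parameter.

    # Toy contrast between inference and a training step (assumes PyTorch).
    import torch
    import torch.nn as nn

    model = nn.Linear(1024, 1024)            # stand-in for an LLM
    x = torch.randn(8, 1024)
    target = torch.randn(8, 1024)

    # Inference: a single forward pass, no gradient bookkeeping.
    with torch.no_grad():
        y = model(x)

    # Training step: forward + backward + optimizer update.
    optimizer = torch.optim.AdamW(model.parameters())
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()                          # extra backward pass
    optimizer.step()                         # updates extra optimizer state per parameter
    optimizer.zero_grad()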


Even if the user input at each step is just normal conversational sentences? By that I mean very short, not megabytes of text.


Yes, it would still take several times as much memory (e.g. the AdamW optimizer keeps 8 bytes of state per parameter). Plus, your latest sentences would just get mixed into the model, and it wouldn't have a concept of the current conversation.
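
A rough back-of-the-envelope (illustrative numbers, assuming mixed-precision training with fp16 weights/gradients plus fp32 AdamW state; real setups shard this across GPUs):

    # Approximate memory to hold a 7B-parameter model for training vs. inference.
    params = 7e9

    inference_gb = params * 2 / 1e9                  # fp16 weights only: ~14 GB

    weights = params * 2                             # fp16 weights
    grads = params * 2                               # fp16 gradients
    adamw_state = params * 8                         # two fp32 moments = 8 bytes/param
    master_weights = params * 4                      # fp32 copy for mixed precision
    training_gb = (weights + grads + adamw_state + master_weights) / 1e9

    print(f"inference ~{inference_gb:.0f} GB, training ~{training_gb:.0f} GB")  # ~14 vs ~112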

There are very different approaches that might behave more like what you want, maybe RWKV.


Thanks for your suggestion about RWKV. I'll read up on that.


https://github.com/BlinkDL/RWKV-LM#rwkv-discord-httpsdiscord... lists a number of implementations of various versions of RWKV.

https://github.com/BlinkDL/RWKV-LM#rwkv-parallelizable-rnn-w... :

> RWKV: Parallelizable RNN with Transformer-level LLM Performance (pronounced as "RwaKuv", from 4 major params: R W K V)

> RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode.

> So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).

> Our latest version is RWKV-6 [...]
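
To illustrate the "you only need the hidden state at position t" point, here's a toy recurrent cell (a generic RNN sketch, not RWKV's actual formulation): the conversation so far is compressed into a fixed-size state, so each new token only needs that state rather than a growing context window.

    # Toy recurrence (NOT real RWKV): step t+1 depends only on the state at step t.
    import torch
    import torch.nn as nn

    class ToyRecurrentCell(nn.Module):
        def __init__(self, dim: int = 64):
            super().__init__()
            self.inp = nn.Linear(dim, dim)
            self.rec = nn.Linear(dim, dim)

        def forward(self, x_t, h_prev):
            # New state is a function of the current input and the previous state only.
            return torch.tanh(self.inp(x_t) + self.rec(h_prev))

    cell = ToyRecurrentCell()
    h = torch.zeros(64)                      # fixed-size "memory" of the conversation
    for x_t in torch.randn(10, 64):          # stream of token embeddings
        h = cell(x_t, h)                     # constant-size state, no growing history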



