What is the role of LLM memory/RAG in the short to medium term, given that context windows might become extremely large in the long term?

I got some insights from discussing LLM context windows with someone on the DeepMind team. Working on Gemma, they found they could stretch context length but hit quality issues: tokens at the beginning and end of the context are retrieved well, but tokens in the middle get lost in the attention mechanism. Their interesting take: they initially assumed longer context would simply solve everything, but deeper analysis showed that with a fixed parameter count, quality doesn't come for free. Despite pushing context lengths further, they still see RAG approaches as necessary for the near future (6-12 months) because of these attention quality challenges. For now, the retrieval problem isn't solved just by making contexts longer. Besides, filling the entire context window of a high-quality, long-context model costs roughly $1 per call today (see the sketch below).
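A quick back-of-envelope sketch of where a "~$1 per full-context call" figure comes from and why retrieving a handful of chunks stays much cheaper. The window size, per-token price, chunk count, and chunk size below are illustrative assumptions, not quotes for any particular model.

```python
# Back-of-envelope: cost of stuffing a full context window vs. retrieving a few chunks.
# All constants are assumptions for illustration only.

FULL_CONTEXT_TOKENS = 1_000_000      # assumed long-context window size
PRICE_PER_MILLION_INPUT = 1.00       # assumed $ per 1M input tokens

RAG_CHUNKS = 10                      # assumed number of retrieved chunks
TOKENS_PER_CHUNK = 500               # assumed tokens per chunk
PROMPT_OVERHEAD_TOKENS = 1_000       # assumed instructions + user question

def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT) -> float:
    """Input-token cost in dollars for a single call."""
    return tokens / 1_000_000 * price_per_million

full_context_cost = input_cost(FULL_CONTEXT_TOKENS)
rag_cost = input_cost(RAG_CHUNKS * TOKENS_PER_CHUNK + PROMPT_OVERHEAD_TOKENS)

print(f"Full-context call: ~${full_context_cost:.2f} per call")   # ~$1.00
print(f"RAG-style call:    ~${rag_cost:.4f} per call")            # ~$0.0060
```

Under these assumptions the retrieval-based call is two orders of magnitude cheaper per request, which is why RAG remains attractive even as windows grow.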