What is LLM Context Rot?


Unknown Author
6/3/2026

"LLM context rot" is a phenomenon where the performance of a large language model degrades as the length of its input context increases.

A recent study by Chroma evaluated 18 large language models, including state-of-the-art models such as GPT-4.1, Claude 4, Gemini 2.5, and Qwen3.

The Chroma researchers used a set of controlled experiments to isolate the effect of context length:

1. Extended Needle in a Haystack (NIAH): To go beyond simple lexical matching, they created variations of the NIAH task. These included semantic matches (where the "needle" was semantically similar to the question but not an exact lexical match) and "haystack" content altered with different distractors.

2. LongMemEval: This evaluation used long conversational chat histories to test the models' ability to retrieve information.

3. Repeated Words Task: A simple synthetic task was used to measure how well models handled basic text replication as the context length increased.
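To make the synthetic tasks more concrete, here is a minimal sketch of how NIAH and repeated-words prompts could be built. It is not Chroma's actual harness: the function names, prompt wording, and the positional scoring metric are all assumptions made for illustration, and the exact layout of the repeated-words input is a guess at what "basic text replication" might look like.

```python
def build_niah_prompt(filler_sentences, needle, question, depth=0.5):
    """Insert `needle` into the haystack at a relative depth (0.0 = start, 1.0 = end)."""
    position = round(depth * len(filler_sentences))
    haystack = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(haystack) + "\n\nQuestion: " + question


def build_repeated_words_prompt(common_word, unique_word, total_words, unique_index):
    """Build a replication prompt: one repeated word with a single different word inside."""
    words = [common_word] * total_words
    words[unique_index] = unique_word
    return "Replicate the following text exactly:\n" + " ".join(words)


def exact_replication_score(expected_text, model_output):
    """Fraction of word positions reproduced correctly (a simple stand-in metric)."""
    expected = expected_text.split()
    produced = model_output.split()
    matches = sum(1 for e, p in zip(expected, produced) if e == p)
    return matches / max(len(expected), len(produced), 1)
```

Growing `filler_sentences` or `total_words` while holding the needle and question fixed is what lets an experiment like this vary input length without changing the difficulty of the task itself.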

The research showed that the assumption of uniform context processing is incorrect: model performance degrades in surprising and non-uniform ways as the input length increases.

Other key findings:

• Degradation with Length: Performance consistently declined across all experiments as the input length grew.

• Semantic vs. Lexical Matching: Models struggled more with tasks that required semantic matching than with tasks that relied on direct lexical retrieval.

• Impact of Distractors: Distractor content had a significant and non-uniform impact on performance, with the effect becoming more pronounced at longer context lengths.

• Haystack Structure: Surprisingly, models performed better when the haystack's sentences were randomly shuffled than when they were presented in a logically coherent order. This suggests that a model's attention can be misled by the surface coherence of the input.

• Needle-Question Similarity: Performance degraded faster with input length when the similarity between the "needle" (the target information) and the question was lower.
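The experimental conditions behind these findings can be sketched in a few lines. The helpers below are hypothetical and only illustrate the idea: `lexical_overlap` uses Jaccard token overlap as a crude stand-in for needle-question similarity (this post does not describe the measure the study used), and `accuracy_by_length` assumes you have already collected per-example results from your own evaluation runs.

```python
import random
import re
from collections import defaultdict


def shuffle_haystack(sentences, seed=0):
    """Return a shuffled copy of the haystack, leaving the coherent original untouched."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def lexical_overlap(needle, question):
    """Jaccard overlap of lowercase word sets; an assumed proxy for similarity."""
    def tokenize(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    a, b = tokenize(needle), tokenize(question)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def accuracy_by_length(results):
    """Aggregate (input_length, is_correct) pairs into accuracy per input length."""
    totals = defaultdict(lambda: [0, 0])  # length -> [correct, total]
    for length, correct in results:
        totals[length][0] += int(correct)
        totals[length][1] += 1
    return {length: c / n for length, (c, n) in sorted(totals.items())}
```

Running the same needle and question against both the coherent and the shuffled haystack, then comparing `accuracy_by_length` for each condition, is one simple way to check whether a given model shows the same pattern.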

This is yet another result that underlines the importance of context engineering.

To learn more about context engineering, see https://lnkd.in/egmhgHsa

Chroma Research: https://lnkd.in/eBwv_v_h
