Powered by China's open-source memory push, AI finally gains human-level long-term memory!


With around 100 million tokens of context, a small 4-billion-parameter model outperforms a 235-billion-parameter RAG stack! EverMind's open-source MSA has caused a stir.
Have you ever wondered: a human accumulates roughly 200-300 million tokens of memory over a lifetime, yet today's GPT and Claude can barely handle 200K-1M tokens of context and fall apart beyond that. No matter how many vector databases you stack on top, they can't save it: retrieval remains an external plugin; multi-hop reasoning forgets everything the moment it's interrupted; training long-context models devours GPU memory, and inference is painfully slow.
EverMind-AI hits hard, directly smashing through the ceiling. They open-sourced MSA (Memory Sparse Attention), a truly native, built-in, end-to-end trainable long-term memory architecture, pushing LLM’s memory capacity directly to 100 million tokens, with less than 9% accuracy decay!
This isn’t just another long-context trick; it’s a revolutionary design that directly welds the hippocampus into the Transformer.
What makes MSA so powerful? Three tricks that leave its predecessors in the dust
1. Sparse Attention + Document-wise RoPE
Traditional RoPE suffers positional drift on ultra-long sequences. MSA resets the position count independently for each document, enabling seamless extrapolation from 64K training contexts to 100M tokens. Complexity drops from O(n²) to approximately O(n), so both training and inference scale linearly.
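The per-document position reset can be sketched in a few lines. This is an illustrative assumption about the mechanism, not EverMind's actual code; the function name and input format are made up for the example:

```python
# Sketch of document-wise position IDs: each document's RoPE positions
# restart at 0, so adding more documents never produces position values
# beyond those seen during training.

def document_wise_positions(doc_lengths):
    """Return per-token position IDs that reset at each document boundary."""
    positions = []
    for length in doc_lengths:
        positions.extend(range(length))  # 0..length-1 for every document
    return positions

# Three documents of lengths 4, 3, 5: positions restart at each boundary.
print(document_wise_positions([4, 3, 5]))
# → [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4]
```

With standard RoPE the same 12 tokens would receive positions 0..11, so the maximum position grows with total context; here it grows only with the longest single document, which is what makes 64K-to-100M extrapolation plausible.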
2. Hierarchical KV Caching + Memory Parallelism
Routing keys (a highly compressed representation) reside permanently on the GPU, while the complete KV pairs are stored in CPU memory. During inference, only the KV pairs of the top-k relevant documents are fetched back: just two A800 GPUs can handle 100M tokens! Official tests show throughput skyrocketing.
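The two-tier cache logic looks roughly like this. A minimal sketch, assuming mean-pooled routing keys and dot-product scoring; the class and method names are invented for illustration, and plain arrays stand in for GPU/CPU tensors:

```python
# Sketch of hierarchical KV caching: a small "routing key" per document
# stays hot (standing in for GPU memory), while the full KV tensors live
# in a cold store (standing in for CPU RAM). At query time only the
# top-k best-matching documents' KV pairs are pulled back for attention.
import numpy as np

class HierarchicalKVCache:
    def __init__(self, top_k=2):
        self.top_k = top_k
        self.routing_keys = []   # compressed per-document keys ("GPU"-resident)
        self.cold_store = []     # full (keys, values) pairs ("CPU"-resident)

    def add_document(self, keys, values):
        # Compress the document to one routing key (mean of its keys here).
        self.routing_keys.append(keys.mean(axis=0))
        self.cold_store.append((keys, values))

    def fetch(self, query):
        # Score every document cheaply against its routing key only...
        scores = np.array([rk @ query for rk in self.routing_keys])
        top = np.argsort(scores)[::-1][:self.top_k]
        # ...then transfer just the top-k full KV pairs.
        return [self.cold_store[i] for i in sorted(top)]
```

The point of the design: the per-document routing keys are tiny, so scoring the whole 100M-token memory stays cheap, and only a handful of full KV blocks ever cross the CPU-GPU boundary per step.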
3. Memory Interleave Mechanism
No longer a one-shot retrieval: the model iterates its thinking (generate → retrieve → generate again → retrieve again), dynamically deciding how many documents to consult. Multi-hop reasoning (HotpotQA, 2Wiki, etc.) comes back to life, and ablation experiments show that removing this mechanism causes a 19%+ drop in accuracy.
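The interleave loop described above can be sketched as follows. The `Draft` structure, `generate_step`, and `retrieve` are stand-in names invented for this example, not EverMind's interfaces:

```python
# Sketch of the memory-interleave loop: instead of one retrieval before
# generation, the model alternates generation with retrieval until it
# signals that it has gathered enough evidence to answer.
from dataclasses import dataclass

@dataclass
class Draft:
    needs_more: bool   # does the model want another retrieval round?
    query: str         # sub-question to retrieve for, if so
    text: str          # current answer draft

def interleaved_answer(question, generate_step, retrieve, max_rounds=4):
    """Alternate generation and retrieval until the model stops asking."""
    context = []
    draft = generate_step(question, context)
    for _ in range(max_rounds):
        if not draft.needs_more:                   # model decides it can answer
            break
        context += retrieve(draft.query)           # fetch evidence for the sub-question
        draft = generate_step(question, context)   # reason again over the new evidence
    return draft.text
```

This is exactly the shape a two-hop HotpotQA question needs: the first round retrieves the bridge entity, the second round retrieves the document that actually contains the answer.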
In one sentence: MSA folds memory and reasoning into a single differentiable closed loop, turning "look up info, then answer" into "think while recalling." This is the memory approach AGI should have.
The data doesn't lie: the 4B model blows away everything else.
The official backbone is Qwen3-4B-Instruct. Compared against similarly scaled RAG baselines, top RAG stacks, HippoRAG2, and others:
• Average long-context QA score: MSA leads the same backbone RAG by 16%, top RAG stacks by 11.5%.
• MS MARCO (over 70 million tokens): MSA scores 4.141, far surpassing RAG series.
• Multi-hop datasets (HotpotQA, 2Wiki): even more impressive advantage.
• NIAH (needle in a haystack) 1M token: traditional models drop below 25%, MSA maintains over 94% accuracy.
• From 16K to 100M tokens: accuracy decay is less than 9%, while other methods have long since plummeted.
Even more astonishing: a 4B MSA model outperforms RAG systems with 60 times as many parameters. This means future agents won't need 200B+ monster models; just add MSA and they'll have memory approaching a human lifetime's worth.
The EverMind team clearly regards enabling agents to have personal memory as their core mission, and MSA is their first gift to the world.
GitHub open-source: