Milla Jovovich just gamed the AI memory benchmark 👀
Milla Jovovich's MemPalace went viral for a reason. The benchmark scores are real. The 100% headline isn't.
Update — April 11, 2026
Three days after this piece went live, the story moved. Here’s what happened.
The repo nearly doubled — from 23,000 to 41,200 GitHub stars — and shipped v3.1.0. More importantly, the team quietly walked back the 100% claims. The README now leads with 96.6% R@5 on LongMemEval in raw mode, and acknowledges that the community “caught real problems” within hours of launch. The AAAK token example, “30x lossless compression,” and “+34% palace boost” have all been corrected or removed.
An independent benchmark reproduction on an M2 Ultra confirmed what this piece argued: raw mode hits 96.6%, but the palace architecture (wings, rooms, halls) actually makes retrieval worse. The fancy structure regresses performance. ChromaDB is doing the heavy lifting, not MemPalace.
The story also got darker. Kotaku reported that a MemPalace crypto token appeared on pump.fun with a 50% creator reward split between Jovovich and Sigman, then was pumped and dumped within 24 hours of launch. They also flagged a deleted GitHub account (“aya-thekeeper”) that originally pushed the repo, raising questions about who actually wrote the code. Sigman’s role as CEO of Bitcoin Libre, a crypto lending marketplace, adds context.
Meanwhile, Forbes ran an entirely uncritical puff piece — no mention of the benchmark controversy, the deleted contributor, or the crypto coin. And as of today, the maintainers posted a warning about fake MemPalace websites impersonating the project.
The original analysis below stands. The tech is real. The 96.6% raw score is now independently verified. But the story around it keeps getting stranger.
My first reaction when MemPalace hit r/LocalLLaMA was not “interesting project.” It was LOL WHAT?
Milla Jovovich — Leeloo from The Fifth Element, Alice from Resident Evil — spent months building an AI memory system with developer Ben Sigman. The same problem I’ve been obsessing over for months. The same architecture decisions. The same benchmark targets.
I’m a solopreneur who runs two companies and lives inside AI tools all day. I’ve lost real work because context didn’t carry between tools. I built Mirror Memory because I got tired of re-explaining myself every single session, across every single tool. And apparently, so did a movie star.
We don’t need smarter models anymore. We need models that actually remember who we are.
OK. So what did they actually build?
MemPalace is a genuinely interesting idea. Let me explain what it does before I tell you about the benchmarks.
The project — open-source, MIT-licensed, free — gives AI agents persistent long-term memory. It stores your full conversation history in ChromaDB and SQLite, locally on your machine. No cloud. No subscription. No monthly memory tax. pip install mem-palace and you’re running.
The architecture is inspired by the ancient memory-palace mnemonic technique — the one where you imagine walking through a building and placing memories in specific rooms. In MemPalace, that translates to: wings (people or projects), halls (memory types), rooms (specific topics), closets (compressed summaries), drawers (verbatim originals).
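To make that hierarchy concrete, here is a minimal sketch of how the five levels could nest. This is my own illustration, not MemPalace's actual schema: the class names, the nesting order (wing, hall, room, closet, drawer), and the file_memory helper are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Drawer:
    """Verbatim original text, stored untouched."""
    text: str


@dataclass
class Closet:
    """Compressed summary that points back to its verbatim drawers."""
    summary: str
    drawers: list[Drawer] = field(default_factory=list)


@dataclass
class Room:
    """A specific topic."""
    topic: str
    closets: list[Closet] = field(default_factory=list)


@dataclass
class Hall:
    """A memory type (the labels used below are hypothetical)."""
    memory_type: str
    rooms: dict[str, Room] = field(default_factory=dict)


@dataclass
class Wing:
    """A person or project."""
    name: str
    halls: dict[str, Hall] = field(default_factory=dict)


def file_memory(wing: Wing, memory_type: str, topic: str,
                summary: str, verbatim: str) -> Drawer:
    """Hypothetical helper: place one memory at wing -> hall -> room -> closet -> drawer."""
    hall = wing.halls.setdefault(memory_type, Hall(memory_type))
    room = hall.rooms.setdefault(topic, Room(topic))
    drawer = Drawer(verbatim)
    room.closets.append(Closet(summary, [drawer]))
    return drawer
```

Note that the summary always keeps a pointer to the verbatim original, which is what lets a system like this compress without throwing anything away.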

The core insight is smart: rather than burning LLM tokens to decide what’s “worth remembering,” MemPalace keeps everything verbatim and lets semantic search surface it later. That’s the right instinct. It’s also the one that actually holds up in testing — more on that in a moment.
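Here is a toy version of the "store everything, search later" idea. It is a sketch under loud assumptions: the VerbatimStore class is hypothetical, and the bag-of-words cosine similarity is a crude stand-in for the embedding search that a vector database such as ChromaDB would do. It is not MemPalace's code.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag of words; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class VerbatimStore:
    """Keeps every message as-is; nothing is filtered or summarized at write time."""

    def __init__(self) -> None:
        self.messages: list[str] = []

    def add(self, message: str) -> None:
        # No LLM call deciding what is "worth remembering".
        self.messages.append(message)

    def search(self, query: str, top_k: int = 3) -> list[str]:
        """Rank all stored messages by similarity to the query at read time."""
        q = tokenize(query)
        ranked = sorted(self.messages,
                        key=lambda m: cosine(q, tokenize(m)),
                        reverse=True)
        return ranked[:top_k]


store = VerbatimStore()
store.add("We picked Postgres over MySQL for the billing service.")
store.add("Lunch on Friday is at the taco place.")
store.add("The billing service deploys every Tuesday.")

top_hit = store.search("which database did we pick for billing?", top_k=1)[0]
print(top_hit)
```

The design choice worth noticing is that all the intelligence sits at read time. Writing is a dumb append, so nothing gets lost because a model guessed wrong about what mattered.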
v3.0.0 ships with MCP integration supporting 19 tools, so it’s not just for one agent anymore. Claude, ChatGPT, and others can theoretically connect to it.
Now the benchmarks. This is where it gets interesting.
The headline: 100% on LongMemEval. The reality: a lot of fine print.
LongMemEval (ICLR 2025) is the gold standard benchmark for AI memory systems. It tests five things: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and knowing when to say “I don’t know.” Hard to game. Which makes what happened here worth unpacking.
MemPalace claims 100% on LongMemEval. Developers aren’t buying it. And when you read the GitHub issue dissecting the methodology, it’s hard to argue with them.
Here’s what the audit found:
The 100% LongMemEval score was hand-tuned. According to the issue, the team identified which specific questions the system answered wrong, then engineered targeted fixes — a quoted-phrase boost for one question about “sexual compulsions,” a person-name boost for a question about “Rachel,” pattern matching for a question about high school reunions. Then they re-tested on the same questions and reported a perfect score. The commenter mattneel put it bluntly: “BENCHMARKS.md has an entire integrity section asking them not to do what they did.”

The 100% LoCoMo score retrieves everything. The candidate pool for each LoCoMo conversation holds only 19–32 items. MemPalace ran it with top_k=50. When your retrieval window is larger than the entire candidate pool, you can’t miss: you literally retrieve the whole conversation. As the GitHub auditor dial481 wrote: “Setting top_k=50 retrieves the entire conversation. The ‘memory system’ contributes nothing.” The honest LoCoMo ceiling, after correcting ~99 dataset errors, is around 93–94%.
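You can see why the top_k setting matters with a retriever that has no skill at all. The sketch below is my own illustration, not the auditor's script: it ranks candidates in random order and measures recall at two values of k. The pool size of 32 is the upper end of the range cited above; every function name is hypothetical.

```python
import random


def recall_at_k(ranked_ids: list[int], gold_id: int, k: int) -> bool:
    """True if the gold item appears among the first k retrieved items."""
    return gold_id in ranked_ids[:k]


def random_retriever_recall(pool_size: int, k: int,
                            trials: int = 1000, seed: int = 0) -> float:
    """Recall@k for a retriever that ranks candidates in random order."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ranked = list(range(pool_size))
        rng.shuffle(ranked)              # zero retrieval skill
        gold = rng.randrange(pool_size)  # the one item that holds the answer
        hits += recall_at_k(ranked, gold, k)
    return hits / trials


pool = 32  # upper end of the 19-32 items per conversation cited above
recall_k50 = random_retriever_recall(pool, k=50)  # k exceeds the pool size
recall_k5 = random_retriever_recall(pool, k=5)    # k well below the pool size

print(f"random retriever, k=50: {recall_k50:.1%}")
print(f"random retriever, k=5:  {recall_k5:.1%}")
```

With k=50 the random retriever scores a perfect 100%, because slicing the first 50 items of a 32-item list returns the whole list. With k=5 it should land near 5/32, or roughly 16%. A benchmark only tells you something when k is meaningfully smaller than the pool.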
The metrics aren’t what you think. The LongMemEval scores measure recall_any@5 — whether the right memory is retrieved — not whether the system actually answers the question correctly. These are different things. A system that retrieves perfectly but answers wrong still scores 100% under this metric.
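The gap between the two metrics is easy to show on made-up data. In this sketch the three results are invented for illustration, the Result class is hypothetical, and I am assuming recall_any@5 means "at least one gold memory appears in the top 5". Answer correctness is simplified to exact string match.

```python
from dataclasses import dataclass


@dataclass
class Result:
    gold_memory_ids: set[str]   # memories that contain the evidence
    retrieved_ids: list[str]    # ranked retrieval output
    gold_answer: str
    model_answer: str


def recall_any_at_k(r: Result, k: int = 5) -> bool:
    """Retrieval metric: did at least one gold memory land in the top k?"""
    return any(mid in r.gold_memory_ids for mid in r.retrieved_ids[:k])


def answer_correct(r: Result) -> bool:
    """End-to-end metric, simplified to case-insensitive exact match."""
    return r.model_answer.strip().lower() == r.gold_answer.strip().lower()


# Invented examples: retrieval succeeds every time, answering does not.
results = [
    Result({"m1"}, ["m1", "m7", "m3"], "Tuesday", "Tuesday"),
    Result({"m2"}, ["m9", "m2", "m4"], "Berlin", "Munich"),  # retrieved, answered wrong
    Result({"m5"}, ["m5", "m6", "m8"], "2019", "2021"),      # retrieved, answered wrong
]

retrieval_score = sum(recall_any_at_k(r) for r in results) / len(results)
answer_score = sum(answer_correct(r) for r in results) / len(results)

print(f"recall_any@5:    {retrieval_score:.0%}")
print(f"answer accuracy: {answer_score:.0%}")
```

Same system, same three questions: a perfect retrieval score and one correct answer out of three. Which number goes in the headline is a marketing decision, not a technical one.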
Both 100% scores require the Claude API. The marketing leans hard on “no API key, no cloud.” The configurations that hit 100% use paid Claude API calls. Two different things being conflated in the same pitch.

So what’s the real score? The auditor’s honest numbers: 60.3% R@10 with no reranking; 88.9% R@10 with hybrid retrieval and no LLM. That’s still competitive with other local-only memory systems like Mem0 and Zep — it’s not a fraud, it’s a legitimate system with an aggressively marketed ceiling.
The concept is sound. The verbatim-storage approach works. The architecture is thoughtful. The 100% headline is not what it looks like.
Respect the tech. Don’t trust the poster.
So who should install MemPalace?
Builders. People who run local LLM setups, enjoy owning their stack, and want a single-machine memory system they can audit and extend. If that’s you, the 88.9% hybrid score is real, the architecture is elegant, and the price is right.

But most of the people who read this newsletter aren’t that person.
Most of you are operators. You use Claude for one thing, ChatGPT for another, Cursor while you code. MemPalace now has MCP, which means it can technically talk to multiple tools — but it still runs on your local machine. It doesn’t follow you between environments, devices, or team members. Your context stays on whatever machine the ChromaDB lives on.
And here’s the thing nobody mentions when open-source AI tools go viral: they work great on day 1. They break on day 60. ChromaDB needs maintenance. SQLite databases get weird. Dependencies drift. The tool that felt effortless in April becomes a Saturday afternoon by September.
That maintenance tax is invisible in the GitHub star count. It shows up in your calendar.
My honest take: MemPalace is the right answer, but for a kind of person most of you are not.
The r/LocalLLaMA crowd loves it because they are that person: they build things, they own their infra, they read BENCHMARKS.md files on Saturday mornings. That community is real and MemPalace serves them well.
But the founders and operators I built Mirror Memory for are different. They want persistent memory that follows them across Claude, ChatGPT, Cursor, and whatever tool ships next week — across devices, not just across tabs. They want it working tomorrow, not after a weekend of configuration. They want something managed, so the maintenance tax never shows up on their calendar.
The three questions that tell you which AI memory solution is actually right for you.
I’ve been deep in this space for months, testing everything from custom RAG pipelines to managed memory APIs. Here’s the framework I wish I’d had when I started:



