The AI memory system MemPalace, developed with the participation of actress Milla Jovovich, claimed a perfect score on a long-term memory benchmark and quickly went viral, only to be accused by the community of benchmark cheating and misleading data. Independent testing found the results exaggerated and the code riddled with errors. The team has acknowledged the flaws and is working on fixes.
Yesterday (April 7), big news hit the AI community: Hollywood actress Milla Jovovich (known for Resident Evil and The Fifth Element) and developer Ben Sigman used Claude Code to help develop the open-source AI memory system “MemPalace.”
For a time, the story of a “Hollywood superstar crossing over to deliver a perfect-score project” spread widely, and MemPalace has earned more than 20,000 stars on GitHub so far. But it didn’t take long for the developer community to start asking: is there real substance here, or is it just hype?
First, the motivation behind MemPalace. According to the official documentation, it aims to solve a limitation of today’s AI systems: user-AI conversations, decision-making processes, and architecture discussions typically disappear once a work session ends, so months of effort can effectively drop to zero.
To address this, MemPalace stores memories in a spatial architecture, categorizing information into wings representing people or projects and nested layers such as corridors, rooms, and drawers, while preserving the original conversation text for later semantic search.
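To make the spatial layout concrete, here is a minimal sketch of such a hierarchy. All names (`Palace`, `Wing`, `Room`, `Drawer`, `store`, `search`) are illustrative assumptions, not MemPalace's actual code, and the keyword lookup is a stand-in for real semantic search:

```python
from dataclasses import dataclass, field

# Hypothetical "memory palace" hierarchy: wing -> room -> drawer -> memories.
@dataclass
class Drawer:
    memories: list = field(default_factory=list)

@dataclass
class Room:
    drawers: dict = field(default_factory=dict)

@dataclass
class Wing:  # a wing represents a person or a project
    rooms: dict = field(default_factory=dict)

class Palace:
    def __init__(self):
        self.wings = {}

    def store(self, wing: str, room: str, drawer: str, text: str) -> None:
        w = self.wings.setdefault(wing, Wing())
        r = w.rooms.setdefault(room, Room())
        d = r.drawers.setdefault(drawer, Drawer())
        d.memories.append(text)  # keep the original conversation text verbatim

    def search(self, keyword: str) -> list:
        # Stand-in for semantic search: plain keyword match over all drawers.
        hits = []
        for w in self.wings.values():
            for r in w.rooms.values():
                for d in r.drawers.values():
                    hits += [m for m in d.memories if keyword.lower() in m.lower()]
        return hits

palace = Palace()
palace.store("project-x", "architecture", "decisions", "We chose SQLite for storage.")
print(palace.search("sqlite"))
```

The point of the hierarchy is that retrieval can in principle be scoped to one wing or room rather than the whole store; whether that helps in practice is exactly what the community testing below disputes.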
The development team claims that MemPalace achieved a perfect 100% on the long-term memory benchmark LongMemEval and reached 96.6% accuracy without calling any external APIs. It can run fully locally with no cloud subscription, and ships with an “AAAK dialect” system said to achieve 30x lossless compression.
(Image source: GitHub) American film star Milla Jovovich builds an AI memory palace, drawing widespread attention
However, MemPalace’s claimed perfect score on LongMemEval quickly drew skepticism from peers.
PenfieldLabs, which is also developing an AI memory system, pointed out that MemPalace’s claimed perfect score on the LoCoMo dataset is mathematically impossible, because the dataset’s gold answers themselves contain 99 incorrect entries.
PenfieldLabs’ analysis found that MemPalace’s 100% score comes from setting the retrieval count to 50, while conversations in the test dataset span at most 32 sessions. The system therefore effectively bypasses the retrieval stage and hands the entire conversation history directly to the AI model to read.
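The arithmetic behind this critique fits in a few lines. The numbers (top-50 retrieval, at most 32 sessions) come from the article; the code is a toy sketch, not MemPalace's implementation:

```python
# If retrieval returns the top k = 50 candidates but the corpus holds at most
# 32 sessions, "retrieval" returns everything and filters nothing.
corpus = [f"session {i}" for i in range(32)]  # the whole conversation history

def retrieve(query: str, k: int = 50) -> list:
    # With k >= len(corpus), any ranking whatsoever returns the full corpus.
    return corpus[:k]

retrieved = retrieve("when did we discuss the schema?")
print(len(retrieved), "of", len(corpus), "sessions returned")
```

In other words, the benchmark was measuring the language model's ability to read the whole history, not the retrieval system's ability to find the right part of it.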
As for the 100% score on LongMemEval, the team was found to have written dedicated patch code targeting three specific questions it had gotten wrong during development, raising suspicions of overfitting to the test set.
(Image source: Reddit) Competitor PenfieldLabs points out that MemPalace’s claimed perfect score on the LoCoMo dataset is mathematically impossible
GitHub user hugooconnor weighed in after hands-on testing: MemPalace claims retrieval accuracy as high as 96.6%, yet in practice it doesn’t actually use the memory-palace architecture it promotes. According to hugooconnor, its tests simply call the default features of the underlying database ChromaDB and have nothing to do with the categorization logic the project emphasizes, such as wings, rooms, or drawers.
hugooconnor also found that when the dedicated memory-palace categorization logic is actually enabled, retrieval performance declines: in room mode, accuracy drops to 89.4%, and with AAAK compression enabled it falls further to 84.2%, both below the default database’s performance.
hugooconnor criticized the testing methodology as well. In MemPalace’s test environment, the retrieval scope for each question is deliberately narrowed to about 50 conversation sessions, making it far too easy to find answers in such a tiny sample.
If the scope is expanded to more than 19,000 real-world conversation sessions, the accuracy of traditional keyword search plummets to around 30%, showing that MemPalace’s current testing approach conceals the real search challenge.
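A toy probability model shows why a small candidate pool inflates keyword-search accuracy. The per-session match probability `p` below is an arbitrary illustrative assumption (not a measured figure from the article); only the pool sizes, 50 versus 19,000, come from the critique:

```python
# Assume each distractor session happens to match the query keyword
# independently with probability p. A keyword lookup is "unambiguous" only
# when no distractor matches, and that chance shrinks as the pool grows.
def unambiguous_match_rate(pool_size: int, p: float = 0.002) -> float:
    # probability that none of the (pool_size - 1) distractors also match
    return (1 - p) ** (pool_size - 1)

print(unambiguous_match_rate(50))     # small pool: lookups are almost always clean
print(unambiguous_match_rate(19000))  # large pool: keyword collisions dominate
```

Under this assumption, a 50-session pool leaves keyword search nearly collision-free, while a 19,000-session pool makes collisions almost certain, which is consistent with the sharp accuracy drop hugooconnor describes.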
(Image source: GitHub) A GitHub user’s hands-on testing shows MemPalace’s benchmark contains misleading elements
Meanwhile, the development team has released a correction acknowledging that AAAK was in fact shown to be lossy compression, and has promised to revise the documentation and system design in response to the community’s harsh criticism. Even so, the project’s main README still retains multiple uncorrected exaggerations, including the claims of 30x lossless compression and a 34% retrieval improvement, and its comparison charts against competitors still cite no sources at all.
As more developers downloaded and tested it, a flood of bug reports about MemPalace’s source code appeared on GitHub.
User cktang88 listed multiple serious issues: the compression command fails to run and crashes the system, the logic that counts summary word counts is wrong, the room-excavation statistics are inaccurate, and the server reloads all data into memory on every call, causing severe resource consumption.
Other reported issues include the developers’ own family members’ names being hard-coded into the default config files, and a forced display limit of 10,000 records when checking query status.
The open-source community has already begun fixing these problems. User adv3nt3 submitted multiple fix pull requests, including correcting the excavation statistics, removing the default family-member names, and deferring knowledge-graph initialization. The development team has since acknowledged these errors and is working with the community to resolve the code issues step by step.
Hacker News user darkhanakh summed up the MemPalace saga: it gives off the same vibe as OpenClaw, artificially manipulating benchmark results to look flawless and then packaging the result as some kind of major breakthrough for marketing.
He believes MemPalace’s underlying technology might genuinely be interesting, but given the flaws in its testing methodology, promoting it as “the highest publicly available score of all time” is simply inappropriate. “However, as for Milla Jovovich trying vibe coding, I still think it’s pretty cool.”
Further reading:
AI-written code goes wrong! A security flaw blows up in “Leftover Food Hunter,” a convenience-store app for expiring items, fully exposing users’ home GPS locations