The world's top large models can't beat "Pokémon": These games are AI's nightmare
Author | Guo Xiaojing, Tencent Technology
Editor | Xu Qingyang
The world’s top AI models can pass medical licensing exams, write complex code, and even beat human experts in math competitions, but they repeatedly struggle in a children’s game called “Pokémon.”
This eye-catching experiment began in February 2025, when an Anthropic researcher launched a Twitch live stream, “Claude Plays Pokémon,” to coincide with the release of Claude 3.7 Sonnet.
Two thousand viewers flooded in. In the public chat they brainstormed and cheered for Claude, gradually turning the broadcast into a live public examination of AI capability.
Claude 3.7 Sonnet can be said to “play” Pokémon, but playing is not the same as winning. It gets stuck at critical points for dozens of hours and makes basic mistakes even a young child wouldn’t.
Nor was this Claude’s first attempt.
Earlier versions fared even worse: some wandered the map aimlessly, some fell into infinite loops, and many never made it out of the starting town.
Even Claude Opus 4.5, a significant improvement, still made baffling errors. It once circled outside a gym for four days without entering, simply because it never realized it had to cut down the tree blocking the path.
Why did a children’s game become AI’s Waterloo?
Because Pokémon demands exactly the abilities that current AI most lack: continuous reasoning in open worlds without explicit instructions, recalling decisions made hours earlier, understanding implicit causal relationships, and making long-term plans among hundreds of possible actions.
These tasks are easy for an 8-year-old child but form an insurmountable gap for AI models claiming to “surpass humans.”
01 Does the Toolset Gap Decide Success or Failure?
By comparison, Google’s Gemini 2.5 Pro finished Pokémon Blue, a game of similar difficulty, in May 2025. Google CEO Sundar Pichai even joked publicly that the company had taken a step toward “Artificial Pokémon Intelligence.”
This result, however, cannot simply be attributed to the Gemini model being “smarter.”
The key difference lies in the toolset each model is given. Joel Zhang, the independent developer who runs the Gemini Pokémon live stream, likens the toolset to an “Iron Man suit”: the AI does not enter the game empty-handed but is embedded in a system that can call on a range of external capabilities.
Gemini’s toolset offers more support: it transcribes game visuals into text to compensate for weak visual understanding, and it supplies custom puzzle-solving and path-planning tools. Claude’s toolset, by contrast, is more minimalist, so its runs more directly reflect the model’s own perception, reasoning, and execution.
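To make the idea concrete, here is a minimal sketch of what such a harness looks like. Every name in it, from the tool functions to the model callable, is a hypothetical stand-in for illustration; neither team has published this exact interface.

```python
# A minimal sketch of a tool-calling harness around a game-playing model.
# All names here (read_screen_text, find_path, the model callable) are
# hypothetical stand-ins, not either team's actual API.

TOOLS = {
    # Pre-digest the raw frame into text, compensating for weak vision.
    "read_screen_text": lambda state: f"You are in {state['location']}.",
    # A path planner that returns button presses, so the model need not
    # navigate tile by tile on its own.
    "find_path": lambda state: ["UP", "UP", "RIGHT", "A"],
}

def run_step(model, state):
    observation = TOOLS["read_screen_text"](state)
    decision = model(observation, available_tools=list(TOOLS))
    if decision == "find_path":          # the model delegates navigation
        return TOOLS["find_path"](state)
    return [decision]                    # otherwise, a single button press
```

A richer harness like Gemini’s adds more entries to that table; a minimalist one like Claude’s leaves the model to do the same work from raw pixels.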
In daily tasks, such differences are not obvious.
When a user asks a chatbot to search the web, the model automatically invokes a search tool. But in a long-horizon task like Pokémon, toolset differences get amplified to the point of deciding success or failure.
02 Turn-Based Play Exposes AI’s “Long-Term Memory” Shortcomings
Because Pokémon is strictly turn-based and demands no real-time reactions, it makes an excellent proving ground for AI. At each step, the model only has to reason over the current screen, its goal prompt, and the available actions, then output an explicit command such as “Press A.”
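Conceptually, each step of such a run reduces to a short observe-reason-act loop. The sketch below is a hypothetical Python rendering of that loop; `emulator` and `model` are assumed interfaces, not any lab’s published code.

```python
# Hypothetical rendering of the per-step loop; `emulator` and `model`
# are assumed interfaces, not any lab's published code.

BUTTONS = ["A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT"]

def play(emulator, model, goal, max_steps=170_000):
    for _ in range(max_steps):
        screen = emulator.screenshot()        # observe: the current frame
        prompt = (f"Goal: {goal}. "
                  f"Reply with exactly one button from {BUTTONS}.")
        button = model(prompt, image=screen)  # reason: pick one action
        emulator.press(button)                # act: e.g. "Press A"
```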
This interaction format seems to be what large language models excel at.
The core problem is disconnection across time. Claude Opus 4.5 logged more than 500 hours and roughly 170,000 steps, but its context is effectively rebuilt after each step, forcing the model to hunt for clues inside a very narrow context window. The setup makes it less like a player and more like a forgetful person living off sticky notes: it cycles through fragments of information and never makes the experiential leap, from quantity to quality, that a human player does.
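The sticky-note analogy maps onto a concrete pattern: because the context is rebuilt at every step, anything worth remembering must be written to an external scratchpad and fed back in later. A minimal sketch, again with hypothetical names:

```python
# The sticky-note pattern (hypothetical names): the model's context is
# rebuilt from scratch each step, so durable memory lives in a plain file.
# We assume the model returns a dict with a note to save and a button.

def step_with_notes(model, screen, notes_path="notes.txt"):
    with open(notes_path, "a+", encoding="utf-8") as f:
        f.seek(0)
        notes = f.read()[-2000:]           # only the tail fits in context
        reply = model(f"Your notes so far:\n{notes}", image=screen)
        f.write(reply["new_note"] + "\n")  # persist this step's takeaway
    return reply["button"]                 # the action to execute
```

Whatever falls off the end of that file is, to the model, as if it never happened, which is why hours-old decisions keep getting relitigated.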
In chess and Go, AI systems have long surpassed humans, but these systems are highly specialized for specific tasks. In contrast, general models like Gemini, Claude, and GPT often beat humans in exams and programming contests but repeatedly fail in children’s games.
This contrast itself is highly instructive.
Joel Zhang believes the core challenge for AI is the inability to sustain execution of a single clear goal over long periods. “If you want an intelligent agent to do real work, it can’t forget what it did five minutes ago,” he points out.
And this capability is an essential prerequisite for automating cognitive labor.
Independent researcher Peter Whidden, who open-sourced a Pokémon-playing agent built on traditional, non-LLM techniques, offers a more intuitive description. “AI knows almost everything about Pokémon,” he said. “It trains on vast amounts of human data and knows the correct answers. But when it comes to execution, it becomes clumsy.”
In the game, this knowing-but-not-doing gap is magnified constantly: the model may know it needs a certain item but cannot reliably locate it on a 2D map; it may know it should talk to an NPC but repeatedly botches the simple movements needed to reach them.
03 Behind the Capability Evolution: The Uncrossed “Instinct” Gap
Nevertheless, AI’s progress is plainly visible. Claude Opus 4.5 is far better than its predecessors at keeping its own notes and at visual understanding, which lets it get further in the game. Gemini 3 Pro completed Pokémon Blue and then took on the harder Pokémon Crystal, winning every battle along the way, something Gemini 2.5 Pro never achieved.
Meanwhile, Anthropic’s Claude Code tooling lets the model write and run its own code; pointed at retro games such as RollerCoaster Tycoon, it has reportedly managed virtual theme parks successfully.
These cases reveal a counterintuitive reality: a model equipped with the right toolset may prove highly effective at software development, accounting, or legal analysis, even while it still struggles with tasks that demand real-time responses.
The Pokémon experiments also reveal another intriguing phenomenon: models trained on human data tend to exhibit behaviors similar to humans.
Google’s technical report on Gemini 2.5 Pro noted that when the model slips into a simulated “panic” state, for instance when its Pokémon is about to faint, its reasoning quality drops sharply.
When Gemini 3 Pro finally completed Pokémon Blue, it left a wholly unnecessary note: “To end poetically, I want to return to my original home, have a final conversation with my mother, and retire the character.”
Joel Zhang found the behavior surprising: a strikingly human bit of emotional projection.
04 AI’s Uncrossed “Long March” Goes Far Beyond Pokémon
Pokémon is not an isolated case. On the road to artificial general intelligence (AGI), developers have found that even an AI that aces legal exams still meets its Waterloo in a number of complex games.
NetHack: The Abyss of Rules
This 1980s dungeon crawler is a nightmare for AI research. Its heavy randomness and permadeath mechanic make it brutally hard. Facebook AI Research found that even models that can write code perform far worse than novice humans in NetHack, which demands common-sense logic and long-term planning.
Minecraft: The Vanishing Sense of Purpose
AI can craft wooden pickaxes and even mine diamonds, but defeating the Ender Dragon remains out of reach. In an open world, AI often forgets its original goal during resource runs that stretch for dozens of hours, or gets hopelessly lost navigating complex terrain.
StarCraft II: The Gap Between Generality and Specialization
Purpose-built systems such as DeepMind’s AlphaStar have defeated professional players, but if Claude or Gemini control the game directly from visual input, they collapse almost instantly. Handling the uncertainty of the “fog of war” and balancing micromanagement against macro-level building remain beyond them.
RollerCoaster Tycoon: The Imbalance of Micro and Macro
Managing a theme park means tracking the state of thousands of visitors. Even Claude Code, which can handle basic management, is quickly overwhelmed by large-scale financial trouble or emergencies; a single lapse in reasoning can bankrupt the park.
Elden Ring and Sekiro: The Chasm of Physical Feedback
These fast, action-heavy games are brutally unfriendly to AI. Given current visual-analysis latency, by the time the AI is still “thinking” about a boss’s move, its character may already be dead. Millisecond reaction requirements put a hard ceiling on the model’s interaction loop.
05 Why Has Pokémon Become an AI Benchmark?
Today, Pokémon is gradually becoming an informal yet highly convincing benchmark for AI evaluation.
Twitch streams of models from Anthropic, OpenAI, and Google playing the game have drawn hundreds of thousands of comments. Google detailed Gemini’s progress in its technical reports, Pichai cited the achievement at the I/O developer conference, and Anthropic even set up a “Claude Plays Pokémon” display at industry events.
“We are a group of super tech enthusiasts,” admits David Hershey, head of AI applications at Anthropic. But he emphasizes that this is more than just entertainment.
Unlike traditional one-shot Q&A benchmarks, Pokémon tracks a model’s reasoning, decision-making, and progress toward a goal continuously over long stretches, which far more closely resembles the complex tasks humans actually want AI to do in the real world.
The AI runs at Pokémon go on. But these recurring failures clearly trace the capability boundaries that general-purpose AI has yet to cross.
Special Contributor Wu Ji also contributed to this article