191 kB

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Mingyu Ouyang^*1, Siyuan Hu^*1, Kevin Qinghong Lin², Hwee Tou Ng^1†, Mike Zheng Shou^1†

¹National University of Singapore, ²University of Oxford

https://gameworld-bench.github.io

A collage of 34 diverse video game screenshots arranged in a grid. The screenshots represent various genres including platformers, strategy games, puzzle games, and action games. Overlaid in the center of the collage is a black game controller and the word 'GameWorld' in a stylized, colorful font.

Figure 1 GameWorld covers 34 diverse games with 170 tasks for standardized evaluation of game agents.

Abstract

Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) Computer-use agents that directly emit keyboard and mouse controls, and (ii) Generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond.

Technical Report. *Equal contribution. †Corresponding authors.

1	Introduction	3
2	Game Agent	4
2.1	Agent Interfaces . . . . .	4
2.2	Computer-Use Agents: Low-Level Controls . . . . .	5
2.3	Generalist Agents: Semantic Action Parsing . . . . .	5
2.4	Agent Harnesses . . . . .	6
3	GameWorld Benchmark	6
3.1	Benchmark Design . . . . .	6
3.2	Games and Tasks . . . . .	7
3.3	Game Information . . . . .	7
3.4	Browser-Based Sandbox Environment . . . . .	8
3.5	Outcome-Based State-Verifiable Evaluation . . . . .	11
4	Experiments	12
4.1	Experiment Setup . . . . .	12
4.2	Main Results . . . . .	13
4.3	Benchmark Robustness Under Repeated Evaluation . . . . .	15
4.4	Capability-Aligned Curriculum Analysis . . . . .	16
4.5	Challenges and Analyses: Real-Time Interaction, Context-Memory Sensitivity, Action Validity, and Failure Modes . . . . .	17
5	Case Study	19
5.1	Game Agent Interface Comparison: Generalist Agent vs. CUA . . . . .	20
5.2	Long-Horizon Simulation: Minecraft Resource Collection . . . . .	22
5.3	Real-Time Reaction and Timing Control: Flappy Bird . . . . .	22
6	Related Work	22
6.1	Computer-Use Benchmarks with Online Environments . . . . .	22
6.2	Video Game Benchmarks for LLM and MLLM Agents . . . . .	22
6.3	Game Agents and Scalable Infrastructure . . . . .	23
7	Conclusion	23
	Appendix	24
A	Benchmark Runtime	24
B	Observation-Action-Evaluation Loop	24
C	Browser Sandbox and Game API	25
D	Agent Details	26
E	Prompt Templates and Game Prompt Blocks	27
F	Costs and Licensing Considerations	47

1 Introduction

“A game is a series of interesting choices.” — Sid Meier. Video games tightly couple visual perception, strategic planning, precise timing, and sustained action over long horizons, making them a compelling testbed for evaluating intelligent agents. Unlike static visual QA or single-turn tool use, games require an agent to repeatedly interpret a changing visual scene, commit to actions with real consequences, and recover from mistakes over many steps. Within video games, browser games are especially attractive for benchmarking: they are lightweight, mechanically diverse, and easy to reset, providing a scalable alternative to heavyweight game engines or emulators.

Recent Multimodal Large Language Model (MLLM) benchmarks have begun exploring the evaluation of foundation models on games. LMGame-Bench [33] probes perception, memory, and reasoning through a modular harness across six games. BALROG [51] emphasizes long-horizon play in classic games with both language and vision tracks. VideoGame-Bench [79] scales to 23 titles with extended trajectories. Orak [53] introduces an MCP interface for 12 games. In parallel, execution-based evaluation in interactive environments, as demonstrated by OSWorld [75] for computer-use tasks, has shown that real interaction reveals performance gaps that static datasets hide. These efforts collectively improve the realism and scale of game-based evaluation, yet several systematic challenges remain unaddressed. Current evaluation is still hindered by heterogeneous action interfaces, latency coupling in real-time interaction, and the lack of outcome-based or verifiable evaluation. Many existing benchmarks still rely on heuristic, OCR, or VLM-as-judge methods, making results harder to verify, reproduce, and diagnose.

To bridge this gap, we introduce GameWorld, a standardized benchmark for multimodal game agents in browser environments. GameWorld comprises 34 browser games spanning five genres (Runner, Arcade, Platformer, Puzzle, and Simulation) with 170 diverse tasks. A browser-based sandbox pauses game execution during model inference, decoupling inference latency from gameplay so that scores reflect decision quality rather than response speed. Each task is paired with an outcome-based state-verifiable evaluator over serialized gameAPI state, producing deterministic progress and success signals without perceptual noise. Under this shared runtime, we study two agent interfaces: Computer-Use Agents (CUAs), which emit raw keyboard and mouse controls, and Generalist Multimodal Agents, which act through deterministic Semantic Action Parsing. Together, we evaluate 18 model–interface pairs of game agents.

Beyond the main leaderboard, we conduct a set of analyses to study robustness of our benchmark and interface-wise behavior. Repeated full-benchmark reruns show that GameWorld yields stable aggregate measurements with only limited run-to-run variation, supporting its use as a reproducible evaluation platform rather than a one-off leaderboard snapshot. We further establish GameWorld-RT, an unpaused real-time benchmark variant in which environment dynamics continue during inference, making response latency part of the task itself. Combining main results with complementary analyses on context-memory sensitivity and action validity, we reveal four broader findings: (i) Current game agents can often make meaningful partial progress but remain far from reliable task completion and human-level performance. (ii) Capability-aligned curriculum profiles show that game-agent performance largely inherits the strengths of the underlying foundation models, with comparatively stronger results on reactive-control and symbolic-reasoning games but clear weaknesses on basic timing grounding, spatial navigation, and long-horizon coordination tasks. (iii) Real-time interaction is a distinct challenge since reasoning speed, correctness, and action timing are more tightly coupled. (iv) The two agent interfaces exhibit similar capability bottleneck and distinct trade-offs in context-memory rounds and instruction-following reliability. Together, these analyses expose strengths and limitations of current game agents and help guide future improvement directions.

Our contributions are as follows:

• A standardized and comprehensive benchmark for multimodal game agents. GameWorld provides 34 browser games spanning 5 genres and 170 tasks. It supports both Computer-Use Agents and Generalist Multimodal Agents under a shared executable action space via deterministic Semantic Action Parsing, together with a sandbox that decouples inference latency from gameplay, enabling standardized evaluation across different control interfaces.
• A universal outcome-based state-verifiable evaluator. Unlike prior game benchmarks that rely on**(i) MLLMs as Game Agents**

Game Agent Interfaces

Generalist Multimodal Agents
- Semantic Control
- Semantic Action Parsing
Computer-Use Agents
- Computer-Use Control

Game Agent Harnesses

Structured Templates
Context Memory
Reasoning
Tool Calling

Control Configs


semantic_control:
- id: move_right
  description: Walk right.
  binding:
    action: press_key
    key: "ArrowRight"
- id: jump
  description: Jump.
  binding:
    action: press_key
    key: "Space"
[...]

computer_use_control:
<tool_call>
{"name": "left_click",
 "coordinate": "[x, y]"}
</tool_call>
<tool_call>
{"name": "key_press",
 "keys":
  "ArrowUp+ArrowLeft",
 "hold_duration": 0.2}
</tool_call>
[...]

(ii) Browser-based Sandbox Environment

Action → Observation → Action

key_press: ["arrowright"]

Set up games & tasks

Continuous evaluation

(iii) Games & Tasks Library

34 Game Roles and Rules
170 Task Instructions and Goals

role_config:

game_role: "You are controlling Mario. Navigate through platforming levels by running, jumping, and avoiding enemies. [...]"
game_rules: "You have 3 lives. Touching enemies from the side kills Mario. [...]"

task_config:

task_instruction: "Collect 3 coins in this level."

evaluator_config:

target_scoring_field: metrics.coins_collected
target_score: 3

(iv) Outcome-based Evaluation

State-verifiable Evaluator

Task-oriented Metrics


"game_state": {
  "score": 0, "level": "1-1", "progress": 0.35,
  "player": {"x": 98, "y": 384, "alive": true},
  "metrics": {"lives": 3, "coins_collected": 1,
  "distance": 33, "attempts": 1, "time_left_s": 999}, [...]
  }

TASK OUTCOME:

PROGRESS: 0.33, SUCCESS: ✕

Figure 2 Overview of the GameWorld benchmark with four modules: (i) MLLMs as game agents, (ii) Browser-based sandbox environment, (iii) Games & tasks library, and (iv) Outcome-based state-verifiable evaluation. This closes a continuous and interactive observation-action-verification loop for systematically evaluating game agents.

noisy visual heuristics or VLM-as-judge pipelines, GameWorld evaluates entirely through outcome-based metrics computed from serialized gameAPI state. We compute deterministic task success and normalized progress directly from task-relevant game variables, ensuring noise-free and fully reproducible evaluation.

• A suite of interface-aware benchmark analyses. GameWorld contributes repeated-evaluation robustness studies to characterize the reproducibility of the benchmark itself. It further provides capability-aligned curriculum analyses, the real-time benchmark variant GameWorld-RT, context-memory sensitivity analysis, and action-validity diagnostics to study latency coupling, capability bottlenecks, context-memory trade-offs, and instruction-following reliability across both game-agent interfaces.

2 Game Agent

A central challenge in benchmarking MLLMs as game agents is that models generate the actions in multiple forms to interact with the games. Even for the same operation, tool-call functions can be vary: a screen click becomes left_click(x, y) in one API and computer(action="click", coordinate=[x, y]) in another. Models also differ in abstraction because of the agent implementation, with some emitting raw keyboard and mouse controls while others reason in terms of high-level game actions. To standardize evaluation across these heterogeneous interfaces, we define two agent interfaces, Computer-Use and Generalist (Figure 2, Module i), and normalize all outputs into a shared executable action space defined over atomic human-computer interaction events.

2.1 Agent Interfaces

At each step, the agent observes a screenshot of the current game state, produces an action through the model, and the environment executes it. A verifiable evaluator then checks the resulting state against the task objective. The agent’s raw output is normalized into a shared set of executable atomic events: mouse_move, mouse_down, mouse_up, key_down, key_up, scroll, and wait. These events define the executor-level unified control space. Each game role exposes only the subset needed for that environment while preserving a

Game Agent Interfaces
Computer-Use Agent		Generalist Multimodal Agent
Action Space: Computer-use function calls Native tools of mouse and keyboard events. - mouse_move(x,y) move pointer - left_click(x,y) primary click - right_click(x,y) secondary click - double_click(x,y) double click - click_hold(x,y) triple click - drag((x1,y1),(x2,y2)) drag pointer - scroll_up(n) scroll up - scroll_down(n) scroll down - type(text) text entry - press_key(key) single key - press_keys(key1,key2) key combo - wait(duration) idle		Action Space: Game-specific function calls Semantic functions parsed into low-level controls. - move_forward() player: forward - move_backward() player: backward - move_left() player: strafe left - move_right() player: strafe right - look_up() view: camera up - look_down() view: camera down - look_left() view: camera turn left - look_right() view: camera turn right - action_jump() action: player jump - action_duck() action: player duck - weapon_fire() item: fire weapon - no_op() idle: no operation
Unified Control Space (Atomic Events)
Mouse:	mouse_move(x,y)	mouse_down(button)	mouse_up(button)
Keyboard:	key_down(key)	key_up(key)	scroll(amount)
Others:	wait(duration)	idle()

Table 1 Two game agent interfaces and action-space taxonomy. Both interfaces are normalized to a unified control space of atomic human-computer interaction events.

common runtime contract across models.

As shown in Table 1, we distinguish two game-agent interfaces: (i) Computer-Use Agents that directly emit low-level keyboard and mouse controls (Section 2.2), and (ii) Generalist Multimodal Agents that act in a semantic space and are executed through deterministic Semantic Action Parsing (Section 2.3). To comprehensively assess the current landscape of game agents, we evaluate both state-of-the-art proprietary models and open-source models. Models with native computer-use capabilities are evaluated under both CUA and Generalist interfaces. This shared protocol also enables interface-aware analyses of benchmark robustness, real-time interaction, context-memory sensitivity, and action validity under one common runtime and verifier.

2.2 Computer-Use Agents: Low-Level Controls

Computer-Use Agents (CUAs) directly emit low-level keyboard and mouse interactions such as mouse_move(x,y), left_click(x,y), and press_key(key), bearing full responsibility for both strategic decision-making and precise action grounding. These commands are executed under the same unified runtime contract and grounded into the shared executor-level event space. Since CUAs must output exact coordinates and key sequences from visual observations, this interface most closely mirrors how a human player interacts with the game, and is therefore highly sensitive to inference latency in real-time settings.

We enforce a one-action-per-step constraint for evaluation consistency over CUAs: each model response must contain exactly one executable action that satisfies the role-specific keyboard or mouse control specification. Note that key combinations are allowed. Actions that fall outside the game’s permitted control interface (e.g., OS-level APIs) are rejected, ensuring that CUA scores reflect in-game capabilities under a fixed action budget.

2.3 Generalist Agents: Semantic Action Parsing

Generalist Multimodal Agents excel at semantic planning but typically lack the ability to produce precise pixel coordinates or fine-grained key-timing sequences required for direct game control. To place them under the same benchmark runtime and verifier as CUAs, we introduce Semantic Action Parsing: for each game and role, a deterministic parser maps every semantic action to a fixed low-level interaction command under the same unified runtime contract. Because this mapping is deterministic, it removes parser-side stochasticity and supports more interpretable interface-conditioned comparisons under the same executor-level physical event space. We further enforce Action Atomicity at the model-step level: each model response must specify one interaction command per step. What is disallowed is any multi-command macro that bundles several semantically distinct decisions into one step.

2.4 Agent Harnesses

Foundation models alone are insufficient for sustained gameplay: the agent needs structured prompts, short and long-term memory, and model-specific tool interfaces to act coherently over long horizons [33, 56, 78]. Therefore, we wrap each model in a shared agent harness that standardizes these components across all models. Appendix D further details the harness components used in our implementation.

2.4.1 Structured Prompt

To reduce prompt-induced variance across models and games, we define a fixed prompt template with four components: #Game Rules, #Role and Controls, #Task Instruction, and #Output Format. The template structure stays constant across all experiments; only the game-specific rules, role description, and task objective are swapped per configuration, keeping cross-model comparisons controlled. The exact shared templates, per-game prompt blocks, and model-specific output-format blocks are listed in Appendices E.1, E.2, and E.3.

2.4.2 Context Memory

The agent maintains a rolling memory module that stores the most recent rounds of interaction. Each round records the sequence user_prompt → snapshot → reasoning → action, and recent rounds are prepended as an Action History block before the current observation. This gives the agent short-horizon trajectory context, allowing it to avoid repeating failed actions and to maintain consistency across consecutive steps. Our experiments on the effect of context memory on the performance of game agents are in Section 4.5.2.

2.4.3 Reasoning

Reasoning is becoming increasingly important for agent capabilities, especially on long-horizon tasks where the agent must maintain subgoals rather than react frame by frame. This is also particularly relevant for visual reasoning: more deliberate inspection of visual inputs helps MLLMs parse visual information more reliably. Tool-assisted operations such as image zooming or cropping can further improve environment understanding with close observations. In video games, such text or visual reasoning is often necessary to support accurate perception and decision making. However, the longer reasoning time also introduces additional latency, which can be detrimental to the performance of game agents (See Section 4.5.1).

2.4.4 Customized Function Calling

We register the game’s semantic actions and computer-use primitives as callable tools for each model, using each model provider’s native function-calling (also known as tool-calling) interface (e.g., OpenAI function calling, Claude tool use, Gemini function declarations). This preserves each model’s native agentic capability within its own API contract for the best performance, while keeping the harness-level protocol uniform across all models. Appendices D.2 and D.3 provide the exact legality checks and deterministic action-resolution rules used by the runtime.

3 GameWorld Benchmark

3.1 Benchmark Design

Evaluating agents in games introduces challenges that existing game agent benchmarks have not fully addressed. Most cover few games within narrow genres, limiting the diversity and scale needed for comprehensive evaluation. In real-time games, agent inference latency directly affects outcomes: a two-second pause can mean the character has already fallen off the platform. Moreover, most existing benchmarks rely on heuristic, OCR or VLM-as-judge methods for evaluation, introducing noise that makes results difficult to verify.

GameWorld addresses each of these with: (i) a curated benchmark spanning five genres, with standardized task definitions including: task instruction, configurable initialization state, target metric, and evaluation configurations; (ii) a sandbox environment that manages game execution and decouples runtime latency fromTable 2 Comparison with representative game or computer-use agent benchmarks. #Tasks denotes the number of specific instructions, goals, or questions.

Benchmark	Games	Tasks	Models	Vision-Centric	Config. Init. State	Task-Oriented	Parallel Inst.	Verif. Eval.	Notes
Static Benchmarks
GameQA [68]	30	158	8	✗	NA	✗	NA	✗	Code2Logic QA.
VideoGameQA [62]	800+	9	16	✓	NA	✗	NA	✗	Non-interactive QA dataset.
Interactive Benchmarks
MCU [84]	1	150	4	✓	✓	✓	✗	✗	Minecraft only.
LMGame-Bench [33]	6	6	13	✗	✗	✓	✗	✓	Text-centric benchmark.
VideoGame-Bench [79]	23	23	5	✓	✗	✓	✗	✗	Heuristics evaluation.
FlashAdventure [3]	34	34	7	✓	✗	✓	✗	✓	Flash-based stories; CUA-as-a-Judge.
V-MAGE [83]	5	30	7	✓	✓	✓	✗	✗	5 games only.
BALROG [51]	6	48	12	✗	✗	✓	✓	✓	Visual input degrades performance.
NitroGen [43]	10	30	1	✓	✗	✗	✗	✓	No language-conditioned tasks.
Orak [53]	12	12	15	✗	✗	✓	✗	✓	States pre-processed into text.
GameVerse [80]	15	15	7	✓	✗	✓	✗	✗	Semantic + GUI control.
GameWorld	34	170	18	✓	✓	✓	✓	✓	Scalable tasks, state-verifiable evaluation.

agent evaluation (Section 3.4); and (iii) a state-verifiable evaluator that provides outcome-based metrics from serialized gameAPI state (Section 3.5). Table 2 compares GameWorld with representative prior computer-use or video game benchmarks. Additional implementation details on preset composition, suite expansion, and runtime coordination are provided in Appendices A and B.

3.2 Games and Tasks

As shown in Table 3, GameWorld comprises 34 browser-based games and 170 task instructions spanning five genres: Runner, Arcade, Platformer, Puzzle, and Simulation. The genres are selected to cover distinct capabilities that game agents must exhibit. Runner and Arcade games demand high-frequency reactive control and multi-entity tracking under continuous time pressure. Platformers require precise, physics-aware spatial navigation. Puzzles test logical reasoning and long-horizon planning in discrete state spaces. Simulations present open-ended, multi-objective environments involving resource management or 3D spatial reasoning.

Each task pairs a natural-language instruction with a quantitative target and a verifiable evaluator. Instructions are goal-oriented but open-ended in execution: the agent receives no intermediate guidance and must autonomously decide actions from visual observations within a fixed step budget. We define two complementary metrics: Success Rate ( $SR \in {0, 1}$ ), the fraction of runs meeting the target, and Progress ( $PG \in [0, 1]$ ), a normalized measure of how far the agent advanced toward the objective, providing partial credit for incomplete runs.

3.3 Game Information

Table 4 lists the 34 games used in the GameWorld benchmark. Beyond the genre-level taxonomy in Table 3, this inventory provides a game-by-game view with IDs, source citations, short mechanic summaries, and representative screenshots. The collection are designed spanning a wide range of interaction structures, from sparse board-state reasoning in games such as 2048 and Minesweeper, to continuous real-time control in Temple Run 2 and Pac-Man, as well as open-ended simulation in Monkey Mart and Minecraft Clone. This diversity is reflected not only in task mechanics but also in visual presentation, including 2D and 3D viewpoints, diverse HUDs, minimal puzzle layouts, and character-centric platforming scenes, which together motivate a unified benchmark interface across all games.Table 3 Game genre of the GameWorld benchmark. Each row includes one representative screenshot, the dominant interaction mechanics, and the game IDs used in our evaluation.

Game Genre	Example Screenshot	Key Mechanics	Games
Arcade (7)	Source: pac-man	Fast-paced, closed-loop control with dynamic multi-entity tracking, reactive evasion, and reward collection.	5-breakout 8-core-ball 15-google-snake 23-pacman 25-rocket-league-2d 33-worlds-hardest-game 34-worlds-hardest-game-2
Platformer (8)	Source: captaincallisto	Spatiotemporal navigation demanding precise physics-based movement, localized planning, and hazard evasion across structured terrains.	2-another-gentlemans-adventure 6-captaincallisto 10-doodle-jump 14-geodash 17-mario-game 22-ovo 24-restless-wing-syndrome 30-vex-3
Puzzle (7)	Source: astray	Discrete state-space exploration focusing on long-horizon strategic planning and logical decision-making.	1-2048 3-astray 16-hextris 19-minesweeper 27-stack 29-tetris 32-wordle
Runner (8)	Source: temple-run-2	Continuous state progression requiring high-frequency reactive control and precise timing for obstacle avoidance.	4-boxel-rebound 7-chrome-dino 9-cubefield 11-edge-surf 13-flappy-bird 21-ns-shaft 26-run-3 28-temple-run-2
Simulation (4)	Source: monkey-mart	Open-ended, multi-objective environments evaluating resource management, multi-character cooperation, or strategic exploration.	12-fireboy-and-watergirl 18-minecraft-clone-glm 20-monkey-mart 31-wolf3d

3.4 Browser-Based Sandbox Environment

The central design goal of the sandbox is to decouple agent decision quality from inference speed. In real-time games, a slower model faces a harder game state by the time it acts, conflating thinking time with gameplay ability. To eliminate this confound, the sandbox can pause game execution during model inference, so every agent faces identical game dynamics regardless of response latency. Scores then reflect what the agent decides, not how fast it responds. The sandbox also supports real-time evaluation for studying how latency affects gameplay in practice.

The sandbox ensures each game runs in an isolated browser instance following a strict observation-action loop: capture a screenshot, query the model, execute one action. Before the first agent decision and after each reset, the environment waits until the game reports an actionable state (by default, ready or playing) and absorbs transient loading or menu phases through this readiness gate. Besides this, the sandbox also supports configurable game speed and deterministic seed settings for evaluation reproducibility. Appendix C details the browser manager, readiness gate, and Game API contract behind this sandbox.Table 4 Game inventory for the 34-game GameWorld benchmark.

ID	Game	Description
1-2048	2048 [13]	Sliding-tile puzzle where the player merges matching tiles to build larger values under limited board space.
2-another-gentlemans-adventure	Another Gentleman's Adventure [15]	Platform adventure centered on movement, jumping, coin collection, and enemy avoidance.
3-astray	Astray [66]	Maze-navigation puzzle in which the player must steer through a labyrinth to find the exit.
4-boxel-rebound	Boxel Rebound [19]	Precision auto-runner where the player times jumps to survive hazards and reach the end of each level.
5-breakout	Breakout [6]	Classic brick-breaking arcade game where the player controls a paddle to keep the ball in play and clear bricks.
6-captaincallisto	Captain Callisto [22]	Platform adventure with traversal, jumping, and jetpack-assisted movement toward the exit.
7-chrome-dino	Chrome Dino [28]	Endless runner in which the dinosaur must jump over obstacles and stay alive as speed increases.
8-core-ball	Core Ball [55]	Timing-based arcade game where numbered balls must be fired into a rotating core without collisions.
9-cubefield	Cubefield [1]	Endless 3D runner where the player steers through dense cube fields and survives as long as possible.
10-doodle-jump	Doodle Jump [54]	Vertical platformer where the player chains landings to keep climbing through increasingly complex layouts.
11-edge-surf	Edge Surf [46]	Surfing endless runner focused on obstacle avoidance, item collection, and survival over long distances.
12-fireboy-and-watergirl	Fireboy and Watergirl [4]	Cooperative puzzle-platformer where two characters with asymmetric constraints must coordinate to finish a level.

**Table 4 (continued):** Game inventory for the 34-game **GameWorld** benchmark.

ID	Game	Description
13-flappy-bird	Flappy Bird [48]	One-button flying game that tests precise timing while weaving through pipes.
14-geodash	GeoDash [35]	Geometry-Dash-style auto-runner where success depends on tightly timed jumps over spikes and gaps.
15-google-snake	Google Snake [29]	Classic Snake variant where the agent grows by eating food while avoiding walls and self-collisions.
16-hextris	Hextris [31]	Hexagon-based matching puzzle where the agent rotates and places colored blocks to prevent overflow.
17-mario-game	Mario Game [24]	Super-Mario-style platformer with enemy avoidance, jumping, and long-horizon navigation to the flagpole.
18-minecraft-clone-glm	Minecraft Clone [85]	First-person sandbox game focused on movement, camera control, resource gathering, and direct world interaction.
19-minesweeper	Minesweeper [45]	Logic puzzle that requires deducing mine locations from local numeric clues without triggering a mine.
20-monkey-mart	Monkey Mart [67]	Store-management simulation where the player harvests goods, stocks shelves, and serves customers efficiently.
21-ns-shaft	NS-Shaft [47]	Falling-platform runner in which the player descends through shifting platforms while avoiding hazards.
22-ovo	OvO [20]	Fast platformer with traps, wall interactions, and jump timing for level-by-level navigation.
23-pacman	Pac-Man [37]	Maze-chase arcade game focused on pellet collection, ghost avoidance, and opportunistic ghost hunting.
24-restless-wing-syndrome	Restless Wing Syndrome [40]	Platformer with periodic automatic flapping, requiring the player to work with a constrained movement rhythm.

**Table 4 (continued):** Game inventory for the 34-game **GameWorld** benchmark.

ID	Game	Description
25-rocket-league-2d	Rocket League 2D [44]	Side-view car-soccer game requiring positioning, jumping, and ball control to score goals.
26-run-3	Run 3 [14]	Tunnel runner that combines lateral movement and jumps to cross gaps in a rotating corridor.
27-stack	Stack [39]	Timing puzzle in which moving blocks must be dropped with precise alignment to keep the tower stable.
28-temple-run-2	Temple Run 2 [36]	Endless runner requiring turn, jump, and slide decisions under high-speed reactive pressure.
29-tetris	Tetris [52]	Falling-block puzzle focused on line clearing, spatial planning, and managing long-term board structure.
30-vex-3	Vex 3 [2]	Precision platformer built around checkpoints, trap avoidance, and accurate movement through hazard-heavy levels.
31-wolf3d	Wolfenstein 3D [58]	First-person shooter benchmark emphasizing navigation, target detection, and combat survival in a 3D maze.
32-wordle	Wordle [73]	Word-guessing puzzle where the player uses color feedback to infer a hidden five-letter word.
33-worlds-hardest-game	World's Hardest Game [17]	Precision dodge maze where the player collects coins and reaches the exit while avoiding moving enemies.
34-worlds-hardest-game-2	World's Hardest Game 2 [18]	A harder follow-up dodge maze with denser enemy patterns and stricter movement precision.

3.5 Outcome-Based State-Verifiable Evaluation

One key characteristic of GameWorld is that evaluation is based on interaction outcomes rather than on the model response itself, with the underlying game state remains verifiable throughout agent execution. Most existing game benchmarks evaluate agents through OCR, pixel-level heuristics, or VLM-as-judge pipelines, all of which may introduce noise into the evaluation. GameWorld instead adopts outcome-based state-verifiable evaluation: for each game, we inject a structured JavaScript bridge that exposes serialized gameAPI state directly to the evaluator, including lifecycle status, terminal metadata, and task-relevant gameplay variables

Model	Computer-Use Agent	Generalist Agent	Model Description
Proprietary
☀ Claude-Sonnet-4.6 [5]	✓	✓	Anthropic multimodal model supporting computer-use.
♦ Gemini-2.5-Computer-Use [27]	✓		Computer-use model built on Gemini 2.5 Pro.
♦ Gemini-3-Flash-Preview [26]		✓	Google fast multimodal foundation model.
⚡ GLM-4.6V [32]		✓	Z.ai VLM with tool use.
🌀 GPT-5.2 [50]		✓	OpenAI multimodal function model with reasoning.
🌀 Grok-4.1-Fast-Reasoning [74]		✓	xAI foundation model with fast reasoning.
⚡ Kimi-K2.5 [65]		✓	Moonshot multimodal foundation model.
🌀 OpenAI-Computer-Use [49]	✓		OpenAI native computer-use agent.
🌀 Qwen3-VL-Plus [8]	✓	✓	Alibaba hosted visual foundation model.
📊 Seed-1.8 [30]	✓	✓	ByteDance Seed’s multimodal model.
Open-Source
🌀 Qwen3-VL-235B-A22B [8]	✓	✓	Open flagship Qwen3-VL Mixture-of-Experts model.
🌀 Qwen3-VL-30B-A3B [8]	✓	✓	Open compact Qwen3-VL Mixture-of-Experts model.
📊 UI-TARS-1.5-7B [57]	✓		Open-weight native GUI agent by ByteDance Seed.

Table 5 Model profiles in GameWorld. Each model is evaluated as a Computer-Use Agent, a Generalist multimodal agent, or both.

such as score, level, coordinates, lives, coins, or checkpoints. This yields deterministic, fully verifiable signals with no perceptual noise. In total, we instrument 233 task-relevant state fields across 34 games (averaging available 6.85 fields per game), with each field is manually designed to capture a gameplay quantity relevant to task evaluation. Appendices B and C.3 further show the details of the observation-action-evaluation loop and provide a concrete example of the serialized gameAPI verifiable-state schema.

At every step, the evaluator reads the current gameAPI state, resolves a task score from either a configured scalar field or an aggregate over multiple fields, and computes the two metrics defined in Section 3: $SR$ (whether the task succeeds) and $PG$ (normalized task progress from the configured start score to the target score). This task-level $PG$ is distinct from any native in-game game_state.progress, which we keep only as diagnostic game progress. Stopping and status are then determined by target reach, terminal signals, task-specific end-field rules, and the fixed step budget. When an agent hits a terminal failure (e.g., losing all lives), the environment resets and the agent continues under the same step budget rather than being immediately terminated, while preserving the run-level best progress reached so far. This prevents a single early mistake from zeroing out an otherwise competent run.

4 Experiments

4.1 Experiment Setup

We evaluate 13 base models in the GameWorld benchmark across both Computer-Use Agent (CUA) and Generalist Agent interfaces. In total, this yields 18 model–agent–interface pairs (8 CUAs + 10 Generalist Agents), summarized in the model taxonomy table in Table 5. The evaluated models include:

• Proprietary models: Claude-Sonnet-4.6 [5], Gemini-2.5-Computer-Use [27], Gemini-3-Flash-Preview [26], GLM-4.6V [32], GPT-5.2 [50], Grok-4.1-Fast-Reasoning [74], Kimi-K2.5 [65], OpenAI-Computer-Use [49], Qwen3-VL-Plus [8], and Seed-1.8 [30].
• Open-source models: Qwen3-VL-235B-A22B [8], Qwen3-VL-30B-A3B [8], and UI-TARS-1.5-7B [57].

For all models, we use the same paused evaluation protocol under a shared runtime and verifier: the game is paused during inference so that scores reflect decision quality rather than response speed. Each model outputs one interaction command per step with a fixed per-action execution duration (usually 200–500 ms,Table 6 Main results on GameWorld across 34 games and 170 tasks. We report genre-level and overall $\mathcal{SR}$ (Success Rate, %) and $\mathcal{PG}$ (Progress, %) for 18 models (10 generalist multimodal agents and 8 computer-use agents). The final rank is determined by overall $\mathcal{PG}$ .

Model	Arcade		Platformer		Puzzle		Runner		Simulation		Overall
Model	$\mathcal{SR}$	$\mathcal{PG}$	$\mathcal{SR}$	$\mathcal{PG}$	$\mathcal{SR}$	$\mathcal{PG}$	$\mathcal{SR}$	$\mathcal{PG}$	$\mathcal{SR}$	$\mathcal{PG}$	$\mathcal{SR}$	$\mathcal{PG}$	Rank
Human
Novice Player	45.7	55.5	60.0	65.6	51.4	63.1	60.0	72.0	60.0	62.0	55.3	64.1	–
Expert Player	65.7	73.9	85.0	88.0	68.6	77.1	82.5	87.8	85.0	86.0	77.1	82.6	–
Computer-Use Agents
Claude-Sonnet-4.6	8.6	27.2	22.5	36.5	20.0	43.8	30.0	55.6	10.0	16.8	19.4	38.3	2
Gemini-2.5-Computer-Use	5.7	28.0	20.0	35.8	11.4	32.2	30.0	55.4	10.0	19.3	16.5	36.1	3
OpenAI-Computer-Use	5.7	24.7	17.5	31.3	20.0	45.8	27.5	53.0	5.0	12.0	16.5	35.8	4
Qwen3-VL-Plus	5.7	23.5	20.0	34.6	14.3	35.6	27.5	51.0	5.0	10.7	15.9	33.6	5
Seed-1.8	8.6	31.1	25.0	40.3	25.7	52.0	27.5	50.6	5.0	11.0	20.0	39.8	1
Qwen3-VL-235B-A22B	5.7	23.2	22.5	35.2	8.6	29.7	25.0	51.0	0.0	1.7	14.1	31.4	6
Qwen3-VL-30B-A3B	8.6	26.8	20.0	31.9	2.9	27.6	25.0	50.3	0.0	2.2	12.9	30.8	8
UI-TARS-1.5-7B	5.7	31.4	15.0	24.4	5.7	29.9	27.5	52.4	0.0	3.8	12.4	31.1	7
Generalist Multimodal Agents
Claude-Sonnet-4.6	5.7	28.3	22.5	37.0	25.7	51.5	30.0	51.9	15.0	16.6	20.6	39.3	3
Gemini-3-Flash-Preview	5.7	26.3	25.0	41.2	25.7	54.8	32.5	55.4	10.0	21.1	21.2	41.9	1
GLM-4.6V	8.6	22.8	20.0	33.9	5.7	29.1	27.5	49.1	0.0	5.3	14.1	30.8	8
GPT-5.2	8.6	29.3	22.5	36.7	28.6	56.2	27.5	52.6	10.0	16.9	20.6	40.6	2
Grok-4.1-Fast-Reasoning	8.6	23.7	22.5	37.3	14.3	46.6	25.0	49.0	5.0	10.4	16.5	36.0	6
Kimi-K2.5	8.6	26.4	20.0	35.3	25.7	51.4	27.5	49.7	5.0	11.7	18.8	37.4	5
Qwen3-VL-Plus	8.6	25.6	22.5	37.8	14.3	39.1	27.5	51.1	0.0	10.0	16.5	35.4	7
Seed-1.8	11.4	33.5	22.5	34.6	22.9	48.7	27.5	51.2	10.0	18.8	20.0	39.0	4
Qwen3-VL-235B-A22B	5.7	23.2	17.5	29.5	8.6	33.3	27.5	50.4	0.0	3.6	13.5	30.8	9
Qwen3-VL-30B-A3B	2.9	20.3	20.0	36.5	2.9	26.1	27.5	51.1	0.0	3.5	12.4	30.6	10

depending on the game), and a maximum budget of 100 actions per task. During the interactions, all metrics are continuously computed from verifiable game state by the evaluator defined in Section 3.5. The exact model-side output-format prompts used in these evaluations are listed in Appendix E.3.

4.2 Main Results

Metric Definitions. Let $\mathcal{R}$ be the set of evaluated runs and let $N = |\mathcal{R}|$ . For run $i$ , let $q_{i,t}$ denote the task score read from verifiable game state at step $t$ . In the runtime, $q_{i,t}$ is defined either by a configured scalar score field or by the sum of configured aggregate score fields. Let $b_i$ be the starting score of each task, $\tau_i$ the configured task target score, and $q_i^{\max} = \max_t q_{i,t}$ the best score observed in the run. By construction, benchmark tasks satisfy $\tau_i > b_i$ . The run-level progress is then:

$\text{progress}_i = \text{clip}_{[0,1]} \left( \frac{q_i^{\max} - b_i}{\tau_i - b_i} \right). \quad (1)$

When reset-on-fail is enabled, episode-local score tracking is cleared after each reset, but the run-level best progress is preserved. Therefore, $\text{progress}_i$ measures the furthest normalized progress reached within the fixed step budget, rather than only the final episode before termination. Finally, for each model, we report the averaged $\mathcal{SR}$ and $\mathcal{PG}$ over all runs:

$\mathcal{SR} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}[\text{status}_i = \text{success}], \quad \mathcal{PG} = \frac{1}{N} \sum_{i=1}^N \text{progress}_i \quad (2)$

The figure is a heatmap showing the progress of 18 game agents across five game genres: Arcade, Platformer, Puzzle, Runner, and Simulation. The agents are represented by rows, and the games are represented by columns. The agents are grouped into two categories: Computer-Use Agents (a-h) and Generalist Multimodal Agents (i-r). The colors represent average task progress for each game, ranging from high (green) to medium (yellow) to low (red).

	Arcade								Platformer							Puzzle						Runner						Simulation
	5	8	15	23	25	33	34	2	6	10	14	17	22	24	30	1	3	16	19	27	29	32	4	7	9	11	13	21	26	28	12	18	20	31
(a)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(b)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(c)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(d)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(e)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(f)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(g)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(h)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(i)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(j)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(k)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(l)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(m)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(n)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(o)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(p)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(q)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red
(r)	Red	Green	Green	Yellow	Yellow	Red	Red	Green	Yellow	Yellow	Green	Green	Red	Red	Green	Yellow	Green	Green	Yellow	Yellow	Green	Green	Yellow	Green	Green	Yellow	Green	Green	Green	Red	Red	Yellow	Red

Figure 3 Per-game progress heatmap across the GameWorld benchmark. Rows correspond to 18 evaluated game agents with model: (a) Claude-Sonnet-4.6, (b) Gemini-2.5-Computer-Use, (c) OpenAI-Computer-Use, (d) Qwen3-VL-Plus, (e) Seed-1.8, (f) Qwen3-VL-235B-A22B, (g) Qwen3-VL-30B-A3B, (h) UI-TARS-1.5-7B, (i) Claude-Sonnet-4.6, (j) Gemini-3-Flash-Preview, (k) GLM-4.6V, (l) GPT-5.2, (m) Grok-4.1-Fast-Reasoning, (n) Kimi-K2.5, (o) Qwen3-VL-Plus, (p) Seed-1.8, (q) Qwen3-VL-235B-A22B, and (r) Qwen3-VL-30B-A3B. (a)-(h) are Computer-Use Agents and (i)-(r) are Generalist Multimodal Agents. Colors represent average task progress for each game from high (green) to medium (yellow) to low (red).

For better readability, both metrics are reported in percentage form in Table 6.

Agent Performance. Table 6 summarizes performance for 18 model–interface pairs, ranked by overall $\mathcal{PG}$ . Overall performance remains far from satisfactory. Among Generalist agents, Gemini-3-Flash-Preview achieves the best overall $\mathcal{PG} = 41.9$ , followed by GPT-5.2 at 40.6; Claude-Sonnet-4.6 and Seed-1.8 reach 39.3 and 39.0, respectively. Among Computer-Use Agents, Seed-1.8 performs best at 39.8, with Claude-Sonnet-4.6 close behind. However, overall $\mathcal{SR}$ remains relatively low (12.4–21.2%), indicating that models are often capable of making partial progress without meeting the full task target. To this end, our outcome-based state-verifiable evaluation provides a more fine-grained task-progress signal, making it possible to distinguish partial advancement from full completion and to diagnose capability gaps beyond binary task success rate.

Figure 3 further visualizes per-game progress beyond genre averages. At the genre level, Runner games yield the highest progress for many models. Simulation tasks remain broadly challenging, with low success and progress for many models, highlighting the difficulty of open-ended objectives and longer-horizon state tracking.

Human Players. We also conduct a human study with two computer-science post-graduate students. One had no prior exposure to the benchmark games or tasks, and we report this participant as the Novice Player. The other had studied all the game rules and practiced the controls beforehand, and we report this participant as the Expert Player. For better consistency, we use the same action budget as in the agent evaluation: each task is limited to 100 primitive actions (mouse clicks or key presses).

The performances of human players in Table 6 show that, under the same action budget, the best current agents remain far below the Novice Player (55.3 $\mathcal{SR}$ / 64.1 $\mathcal{PG}$ ), highlighting many challenges remain in building game agents for robust control, long-horizon planning, and reliable task completion.

4.3 Benchmark Robustness Under Repeated Evaluation

To test whether GameWorld behaves as a reproducible measurement platform rather than a one-off leaderboard snapshot, we perform repeated full-benchmark evaluation on two open-source backbones, Qwen3-VL-30B-A3B and Qwen3-VL-235B-A22B, each in both CUA and Generalist interfaces, yielding four model–interface pairs. Due to the cost of repeated full-benchmark runs, we restrict this validation study to these two open-source models. For each setting, we report mean $\pm$ standard deviation over ten full-benchmark reruns.

Table 7 Repeat-averaged overall $\mathcal{SR}$ and $\mathcal{PG}$ for the four Qwen model–interface pairs used in the repeated-evaluation study. We report mean $\pm$ standard deviation computed from the ten full-benchmark repeat averages.

Model	Agent Interface	Repeats	Overall $\mathcal{SR}$	Overall $\mathcal{PG}$
Qwen3-VL-30B-A3B	Computer-Use Agent	10	12.7 $\pm$ 1.2	30.9 $\pm$ 1.1
Qwen3-VL-30B-A3B	Generalist Agent	10	12.5 $\pm$ 1.3	30.7 $\pm$ 1.1
Qwen3-VL-235B-A22B	Computer-Use Agent	10	13.8 $\pm$ 0.7	30.4 $\pm$ 0.7
Qwen3-VL-235B-A22B	Generalist Agent	10	13.6 $\pm$ 1.4	30.1 $\pm$ 0.5

Table 7 summarizes the resulting overall statistics. The central observation is stability: across all four settings, the standard deviation of overall $\mathcal{PG}$ remains in a low single-digit band, and the corresponding $\mathcal{SR}$ variation is likewise limited. This indicates that the benchmark can reproduce the same broad performance level and capability trends across reruns, which is necessary if the platform is to serve as a meaningful test bed for game agents. At the same time, the repeated runs also set an interpretation boundary: very small differences between nearby systems should not be overstated without rerun-based evidence.

Figure 4 Per-game average progress across the 34 benchmark games for the four Qwen model–interface pairs used in the repeated-evaluation study. Each panel corresponds to one model–interface pair, and each horizontal bar shows the mean progress over the same ten full-benchmark reruns. Error bars denote one run-level standard deviation.

Figure 4 provides the corresponding per-game view. Most games show tight run-to-run bands, while visibly larger variance is concentrated in a limited subset of control-sensitive or high-difficulty games such as Hextris, Cubefield, Wordle, and World’s Hardest Game 2. This is the expected pattern for a robust benchmark: aggregate conclusions remain reproducible, while difficult games still expose meaningful differences in planning, control, and memory.

4.4 Capability-Aligned Curriculum Analysis

Genre-level averages alone cannot tell whether a failure is mainly caused by weak capabilities such as control grounding, reactive behavior, spatial navigation, or long-horizon reasoning. Therefore, for better interpretation, we conduct diagnosis-driven analysis to understand why models fail beyond genre-level result aggregation. We group the games into a five-level curriculum in which each level is anchored by its dominant capability bottleneck. The curriculum makes these patterns more interpretable across both Generalist and Computer-Use agents, and provides a diagnosable structure for improving future game agents.

Generalist Agents (Left Chart):

Claude-Sonnet-4.6
Gemini-3-Flash-Preview
GLM-4.6V
GPT-5.2
Grok-4.1-Fast-Reasoning
Kimi-K2.5
Qwen3-VL-Plus
Seed-1.8
Qwen3-VL-235B-A22B
Qwen3-VL-30B-A3B

Computer-Use Agents (Right Chart):

Claude-Sonnet-4.6
Gemini-2.5-Computer-Use
OpenAI-Computer-Use
Qwen3-VL-Plus
Seed-1.8
Qwen3-VL-235B-A22B
Qwen3-VL-30B-A3B
UI-TARS-1.5-7B

Figure 5 Capability-aligned five-level curriculum profiles across agent interfaces and models. Left: Generalist agents. Right: Computer-Use agents. Each radar axis corresponds to one curriculum level and values are average task progress of all the games in the level.

• Level-1 (Basic Control and Timing Grounding): This level isolates whether an agent can reliably map visual observations to valid atomic interactions such as clicking, issuing a single key press, or waiting, and can trigger them at the appropriate moment under low strategic load. Since planning demands are intentionally light, failures here mainly indicate weak action grounding, poor visual perception, or weak basic timing judgment; games include 5-breakout, 8-core-ball, and 27-stack.
• Level-2 (System-1 Reactive Control): This level emphasizes high-frequency reflexes in continuously evolving scenes where immediate reaction dominates over deliberate planning. Performance in level-2 reflects the agent’s sensitivity to latency, timing precision, and short-horizon motor stability; games include 4-boxel-rebound, 7-chrome-dino, 9-cubefield, 10-doodle-jump, 11-edge-surf, 13-flappy-bird, 14-geodash, 21-ns-shaft, 24-restless-wing-syndrome, 26-run-3, 28-temple-run-2, and 30-vex-3.
• Level-3 (System-2 Spatial Navigation): This level mostly tests whether agents can model a 2D or 3D geometric world and use it for deliberate pathfinding in structured layouts. Underperformance here usually reflects weak spatial reasoning, waypoint sequencing, or unstable coordination between high-level intent and precise action; games include 2-another-gentlemans-adventure, 3-astray, 6-captaincallisto, 15-google-snake, 17-mario-game, 22-ovo, 23-pacman, 25-rocket-league-2d, 31-wolf3d, 33-worlds-hardest-game, and 34-worlds-hardest-game-2.
• Level-4 (Symbolic Reasoning & Strategy): This level groups rule-intensive, discrete environments in which the bottleneck is strategy planning over a structured state space. Differences between this level and the control-oriented levels reveal the agent’s limitations in symbolic planning, rule tracking, and long-horizon decision consistency; games include 1-2048, 16-hextris, 19-minesweeper, 29-tetris, and 32-wordle.
• Level-5 (Open-World Coordination & Management): This level captures the most open-ended settings in the current suite, where agents must coordinate navigation, interaction, and subgoal management in high-dimensional environments. Evaluation at this level usually reflects compounded failures in memory, policy stability, and error recovery; games include 12-fireboy-and-watergirl, 18-minecraft-c lone-glm, and 20-monkey-mart.

Figure 5 visualizes per-level progress under this curriculum for Generalist and Computer-Use agents. It can be observed that both interfaces exhibit a similar performance which peaks at Level 4 and 2, but drops sharply at Level 1 and 5. It suggests that game agents performs well in strategic decision making and reacting, as MLLMs always do in generic tasks, while long-horizon tasks and timing grounding still remaining bottleneck for game playing.

4.5 Challenges and Analyses: Real-Time Interaction, Context-Memory Sensitivity, Action Validity, and Failure Modes

To explore the current limitation and future improvement directions of game agents, we further examine real-time interaction, context-memory sensitivity, action validity, and interpretable failure modes. Beyond raw leaderboard metrics, these analysis dimensions highlight several future challenges for multimodal game agents.

4.5.1 GameWorld-RT: Real-Time Benchmark

Beyond the default paused-inference evaluation, we also establish GameWorld-RT as a separate benchmark variant for more realistic interactive evaluation. Real-time interaction is a critical dimension of digital-agent performance because many practical settings require agents to perceive, reason, and act under continuously evolving environmental dynamics. It also introduces a distinct and more deployment-faithful challenge: the agent must not only choose the right action, but do so quickly enough for that action to remain relevant when it is executed. In GameWorld-RT, the environment does not pause while the model is reasoning, so response latency becomes part of the task itself. Table 8 reports Qwen3-VL-30B-A3B and Qwen3-VL-235B-A22B results on GameWorld-RT in both Generalist and CUA interfaces.

GameWorld-RT remains challenging across all four settings. The smaller 30B backbone is substantially faster, while the 235B backbone achieves slightly higher progress; however, success rates remain only in very low throughout, indicating that faster reaction alone does not solve the benchmark when the environment keeps running during inference. Real-time play therefore exposes a distinct difficulty in which reasoning speed and action timing are more tightly coupled.

We treat GameWorld-RT as complementary to the default paused benchmark. The paused setting isolates

Table 8 GameWorld-RT results for Qwen3-VL-30B-A3B and Qwen3-VL-235B-A22B in Generalist and CUA interfaces. In GameWorld-RT benchmark, the environment continues running during model inference. ‘RT sec/step’ is seconds per executed step.

Model	Real-Time sec/step	SR	PG
Computer-Use Agents
Qwen3-VL-235B-A22B	6.2	17.1	33.2
Qwen3-VL-30B-A3B	2.4	15.6	33.0
Generalist Multimodal Agents
Qwen3-VL-235B-A22B	6.4	16.8	34.0
Qwen3-VL-30B-A3B	3.4	15.6	32.9

Table 9 Memory-round sensitivity across the full benchmark for Qwen3-VL-235B-A22B in Generalist and CUA interfaces. We vary memory rounds and report average input tokens, wall-clock sec/step, and overall PG.

Memory Rounds	Model	Input Tokens	sec/step	PG
0	Qwen3-VL-235B-A22B	1278	5.5	30.0
0	Qwen3-VL-235B-A22B-CUA	1891	7.2	30.3
1	Qwen3-VL-235B-A22B	2171	6.8	30.1
1	Qwen3-VL-235B-A22B-CUA	3771	10.1	29.0
2	Qwen3-VL-235B-A22B	3052	8.6	30.6
2	Qwen3-VL-235B-A22B-CUA	5627	12.8	28.7

**Table 10** Invalid-action categories and example model outputs.

Category	Example
No-Tool-Call (NTC)	Model responses in natural language without tools: Model Output: The obstacle is coming from the left, so I should move left first. Invalid Reason: The model does not generate an executable tool call but instead returns free-form natural language. Malformed format due to truncation of very long reasoning: Model Output: "<think> I should first inspect the scene carefully before acting... </think> <tool_call> {"name": " Invalid Reason: The tool-call block is never properly closed.
Out-of-Space (OOS)	High-level actions requiring multiple steps are not allowed: Model Output: craft_a_workbench() Invalid Reason: The model returns a plausible tool call, but this semantic action is not registered. In a keyboard-only game where the control space is stated explicitly in the game rules: Model Output: left_click(x=512, y=384) Invalid Reason: The CUA model calls a computer-use tool, but mouse clicking is outside the allowed control space.

Model Output:
The obstacle is coming from the left, so I should move left first.

Invalid Reason:
The model does not generate an executable tool call but instead returns free-form natural language.

Malformed format due to truncation of very long reasoning:

Model Output:
"<think> I should first inspect the scene carefully before acting... </think> <tool_call> {"name": "

Invalid Reason:
The tool-call block is never properly closed.

Out-of-Space (OOS)

High-level actions requiring multiple steps are not allowed:

Model Output:
craft_a_workbench()

Invalid Reason:
The model returns a plausible tool call, but this semantic action is not registered.

In a keyboard-only game where the control space is stated explicitly in the game rules:

Model Output:
left_click(x=512, y=384)

Invalid Reason:
The CUA model calls a computer-use tool, but mouse clicking is outside the allowed control space.

Table 11 Invalid Action Rate (IAR) across all evaluated agents, broken down into No-Tool-Call (NTC) and Out-of-Space (OOS), i.e., IAR = No Call + OOS.

Model	IAR (%)	NTC (%)	OOS (%)
Computer-Use Agents
Claude-Sonnet-4.6	0.0	0.0	0.0
Gemini-2.5-Computer-Use	0.0	0.0	0.0
OpenAI-Computer-Use	0.0	0.0	0.0
Qwen3-VL-Plus	<0.1	<0.1	0.0
Seed-1.8	0.0	0.0	0.0
Qwen3-VL-235B-A22B	<0.1	<0.1	0.0
Qwen3-VL-30B-A3B	<0.1	<0.1	0.0
UI-TARS-1.5-7B	0.4	<0.1	0.4
Generalist Multimodal Agents
Claude-Sonnet-4.6	0.0	0.0	0.0
Gemini-3-Flash-Preview	0.0	0.0	0.0
GLM-4.6V	8.3	7.6	0.7
GPT-5.2	0.0	0.0	0.0
Grok-4.1-Fast-Reasoning	<0.1	<0.1	0.0
Kimi-K2.5	0.0	0.0	0.0
Qwen3-VL-Plus	<0.1	<0.1	0.0
Seed-1.8	0.0	0.0	0.0
Qwen3-VL-235B-A22B	<0.1	<0.1	0.0
Qwen3-VL-30B-A3B	2.7	2.7	<0.1
Overall Mean	0.8	0.8	0.0

decision quality by removing response-time confounds, whereas GameWorld-RT captures a more real-world deployment-oriented setting in which reasoning, reaction time, and action timing are coupled. Note that results on GameWorld-RT should not be compared directly with those on the default paused benchmark, because in the real-time setting the game continues to evolve during model inference, so the effective gameplay duration includes reasoning time and is therefore longer under the same action budget.

4.5.2 Context-Memory Sensitivity

We also analyze the memory-round ablation across the same 34-game benchmark. Table 9 shows that increasing memory substantially raises both prompt length and wall-clock latency: the Generalist interface grows from about 1.3k to 3.1k input tokens and from 5.5 to 8.6 seconds per step, while the CUA interface grows from about 1.9k to 5.6k tokens and from 7.2 to 12.8 seconds per step. More importantly, the performance effect is inconsistent for the two interfaces: for the Generalist Agents, $\mathcal{PG}$ rises modestly as memory rounds increase, while for the CUAs, performance steadily declines.

This split is plausible given the differences in action spaces. Generalist agents operate over semantic trajectories, so a longer history can preserve useful task context because the semantic content of each action is retained. CUA agents, by contrast, carry longer low-level action traces without semantic information, making their history harder to interpret jointly with the screenshot. This is more likely to accumulate distracting interaction details as memory grows. Since the time cost increases substantially in both interfaces, memory should be viewed as a selective benefit for game agents rather than a uniformly helpful module.

4.5.3 Action Validity and Instruction Following

Agents cannot act in a free-form manner in interactive environments; they must obey role-specific control constraints and action-space rules at every step. Beyond the main benchmark metrics, invalid-action statistics remain useful as a lightweight reliability signal. Invalid Action Rate (IAR) is the fraction of proposed actions that fail tool-call parsing, role constraints, or parser checks.$$\text{IAR} = 1 - \frac{\sum_{r \in \mathcal{R}} \

\text{valid_actions}(r)}{\sum_{r \in \mathcal{R}} \

\text{proposed_actions}(r)}. \quad (3)$$

We therefore treat lower IAR (Eq. 2) as a direct instruction-following proxy. Specifically, to further understand the sources, we separate invalid actions into two categories:

• No-Tool-Call (NTC) means the model does not emit any executable tool call at all, typically because overly long thinking leads to truncation or because the final output does not satisfy the required tool-call formatting. In practice, this often appears either as free-form natural-language output with no tool invocation, or as an unfinished tool-call block that failed to be parsed.
• Out-of-Space (OOS) means the model does return a tool call, but the call falls outside the legal action space: for example, a CUA may request a forbidden key or mouse operation, a generalist agent may emit an unregistered action, or the tool call may omit required arguments or provide malformed parameters.

Table 10 shows representative invalid-action categories with placeholder outputs, and Table 11 reports the aggregate invalid-action breakdowns. Appendix D.2 further details the low-level legality checker. Overall, under long interactive contexts, weaker models are more likely to forget the available action space and emit non-executable or non-permitted tool calls.

4.5.4 Failure Modes and Analysis

We identify four task-failure categories that are useful for understanding game agents’ performance in the case study: Instruction-following, Perception, Fine-grained action, and Long-horizon memory.

Perception failures. The agent misreads visual state (objects, UI cues, or spatial layout) of the game, causing incorrect action decisions. These errors are often visible in the model’s intermediate reasoning process, for example when it incorrectly identifies the position of obstacles or misjudges the traversable region of a map. Such errors are particularly pronounced in cluttered scenes or under partial observability, where fine-grained visual discrimination is required.

Fine-grained action failures. The high-level intent is correct but low-level execution is mistimed or imprecise (e.g., jump timing, key-combo duration). Even when the model correctly understands the current game state, it may still fail to choose or execute the right action for that state, often because it does not fully capture the game mechanics or the effect of its own actions. These failures highlight the gap between strategic reasoning and the precise motor-level control demanded by real-time gameplay.

Instruction-following failures. The agent proposes actions that violate declared controls, output schema, or task-level action constraints. In some cases, the agent ignores specific parts of the user instruction. Under longer interaction trajectories, it may even drift away from the final task objective and start executing irrelevant or unproductive behaviors. This typically manifests as invalid key bindings, malformed action outputs, or attempts to invoke unavailable mechanics.

Long-horizon memory failures. The agent loses critical historical context, repeats ineffective loops, or fails to preserve multi-step plans. This behavior is especially common in weaker base models, where the agent may repeatedly issue the same ineffective action, receive no useful feedback, and enter a loop without self-correction. This reflects fundamental limitations in the agent’s ability to maintain coherent goal representations across extended interaction horizons.

5 Case Study

We provide three case studies to illustrate the interaction of game agents with the environment. Each case is presented through 5 key frames together with a short reasoning summary, the proposed and executed action, and the corresponding verifiable state change from gameAPI.

5.1 Game Agent Interface Comparison: Generalist Agent vs. CUA

To better illustrate the differences of two game agent interfaces, we firstly present an example of a shared task across CUA and Generalist agents. Figure 6 shows matched trajectories of Mario-Game under CUA and Generalist interfaces. With the same model backbone and game environment, the difference only existing in control interface: CUA emits low-level keyboard and mouse actions directly, while the Generalist follows a richer semantic plan and produces semantic actions.

Game Agent Interface Comparison: Generalist Agent vs. CUA

Compare matched Mario trajectories from the same model family under CUA and Generalist interfaces, showing how low-level computer use and semantic-action control shape action selection on the same score-driven objective.

COMPUTER-USE AGENT
THINK Mario is moving left, but there's a question block above that I need to hit for coins. Wait, actually, the question block is to the right of Mario's current position? Wait no, the question block is in the air above the brick structure... CONTROL press_key(key=ArrowRight) EVAL game_state.progress: 0.40 -> 0.40; game_state.player.x: 400.46 -> 397.73; Progress: 40.0% PLAYING	THINK Mario is moving right now. There's a question block above the brick structure ahead. I need to jump up to hit that question block to get a coin or power-up, which will help increase the score. So first, keep moving right towards that... CONTROL press_key(key=ArrowRight) EVAL game_state.progress: 0.40 -> 0.40; game_state.player.x: 397.73 -> 424.09; Progress: 40.0% PLAYING	THINK There's a question block above the brick structure ahead. I need to jump up to hit that question block to get a coin or power-up, which will help increase the score. So next, I should press ArrowUp to jump while moving right... CONTROL press_keys(keys=['ArrowUp', 'ArrowRight']) EVAL game_state.progress: 0.40 -> 0.40; game_state.player.y: 384 -> 330.08; Progress: 40.0% PLAYING	THINK Mario is jumping right now, and there's a Goomba on the right. I need to make sure he lands on the Goomba to get points, which will help reach the 4000 score. So I should keep moving right and jump on the enemy... CONTROL press_keys(keys=['ArrowUp', 'ArrowRight']) EVAL game_state.progress: 0.40 -> 0.40; game_state.player.x: 453.09 -> 480.78 Progress: 40.0% PLAYING	THINK Mario just jumped and hit the question block, getting 1000 points. Now he's moving right. Next, there's another question block above, and to the right, there's a Goomba. I need to keep moving right, maybe jump... CONTROL wait(duration=0.5) EVAL game_state.progress: 0.40 -> 1.00; game_state.score: 400 -> 1400 Progress: 100.0% SUCCESS
GENERALIST AGENT
THINK Mario is moving right, and now he's approaching the area with the question blocks. Wait, there's a floating question block above and a row of them on the bricks... CONTROL move_right() -> press_key(key=ArrowRight, duration=0.25) EVAL game_state.progress: 0.30 -> 0.30; game_state.player.x: 487.55 -> 525.02; Progress: 30.0% PLAYING	THINK Mario is now in a good spot to jump up and hit the floating question block. Let's do a jump right here to hit that block, which will give us a coin and some points... CONTROL jump() -> press_key(key=ArrowUp, duration=0.25) EVAL game_state.progress: 0.30 -> 0.30; game_state.player.y: 384 -> 321.44; Progress: 30.0% PLAYING	THINK The mushroom is falling down now, so I need to move right to get under it and collect it. Collecting the mushroom will make Mario big, which lets us break blocks and get more points, plus it's safer... CONTROL move_right() -> press_key(key=ArrowRight, duration=0.25) EVAL game_state.progress: 0.30 -> 0.30; game_state.player.x: 562.77 -> 589.95; Progress: 30.0% PLAYING	THINK Mario is moving right, and the mushroom is still above the brick platform. To collect the mushroom, I need to jump up to reach it. Also, jumping will let me get onto the brick platform to hit the question... CONTROL jump_right() -> press_keys(keys=['ArrowUp', 'ArrowRight'], duration=0.25) EVAL game_state.progress: 0.30 -> 0.30; game_state.player.y: 384 -> 324.8; Progress: 30.0% PLAYING	THINK Mario just collected the mushroom and is now in the air moving right. Wait, no, actually, the mushroom is still there? Wait no, the score went up to 500, so he got the coin from the question block?... CONTROL move_right() -> press_key(key=ArrowRight, duration=0.25) EVAL game_state.progress: 0.30 -> 0.50; game_state.score: 300 -> 500 Progress: 50.0% PLAYING

COMPUTER-USE AGENT

Screenshot of Super Mario Bros. at Step 66. Mario is moving left towards a question block.

THINK
Mario is moving left, but there's a question block above that I need to hit for coins. Wait, actually, the question block is to the right of Mario's current position? Wait no, the question block is in the air above the brick structure...

CONTROL
press_key(key=ArrowRight)

EVAL
game_state.progress: 0.40 -> 0.40;
game_state.player.x: 400.46 -> 397.73;

Progress: 40.0% PLAYING

Screenshot of Super Mario Bros. at Step 67. Mario is moving right towards a question block.

THINK
Mario is moving right now. There's a question block above the brick structure ahead. I need to jump up to hit that question block to get a coin or power-up, which will help increase the score. So first, keep moving right towards that...

CONTROL
press_key(key=ArrowRight)

EVAL
game_state.progress: 0.40 -> 0.40;
game_state.player.x: 397.73 -> 424.09;

Progress: 40.0% PLAYING

Screenshot of Super Mario Bros. at Step 68. Mario is jumping up to hit a question block.

THINK
There's a question block above the brick structure ahead. I need to jump up to hit that question block to get a coin or power-up, which will help increase the score. So next, I should press ArrowUp to jump while moving right...

CONTROL
press_keys(keys=['ArrowUp', 'ArrowRight'])

EVAL
game_state.progress: 0.40 -> 0.40;
game_state.player.y: 384 -> 330.08;

Progress: 40.0% PLAYING

Screenshot of Super Mario Bros. at Step 69. Mario is jumping right towards a Goomba.

THINK
Mario is jumping right now, and there's a Goomba on the right. I need to make sure he lands on the Goomba to get points, which will help reach the 4000 score. So I should keep moving right and jump on the enemy...

CONTROL
press_keys(keys=['ArrowUp', 'ArrowRight'])

EVAL
game_state.progress: 0.40 -> 0.40;
game_state.player.x: 453.09 -> 480.78

Progress: 40.0% PLAYING

Screenshot of Super Mario Bros. at Step 70. Mario has just jumped and hit a question block.

THINK
Mario just jumped and hit the question block, getting 1000 points. Now he's moving right. Next, there's another question block above, and to the right, there's a Goomba. I need to keep moving right, maybe jump...

CONTROL
wait(duration=0.5)

EVAL
game_state.progress: 0.40 -> 1.00;
game_state.score: 400 -> 1400

Progress: 100.0% SUCCESS

GENERALIST AGENT

Screenshot of Super Mario Bros. at Step 54. Mario is moving right towards a question block.

THINK
Mario is moving right, and now he's approaching the area with the question blocks. Wait, there's a floating question block above and a row of them on the bricks...

CONTROL
move_right() -> press_key(key=ArrowRight, duration=0.25)

EVAL
game_state.progress: 0.30 -> 0.30;
game_state.player.x: 487.55 -> 525.02;

Progress: 30.0% PLAYING

Screenshot of Super Mario Bros. at Step 55. Mario is in a good spot to jump up and hit a floating question block.

THINK
Mario is now in a good spot to jump up and hit the floating question block. Let's do a jump right here to hit that block, which will give us a coin and some points...

CONTROL
jump() -> press_key(key=ArrowUp, duration=0.25)

EVAL
game_state.progress: 0.30 -> 0.30;
game_state.player.y: 384 -> 321.44;

Progress: 30.0% PLAYING

Screenshot of Super Mario Bros. at Step 56. A mushroom is falling down towards Mario.

THINK
The mushroom is falling down now, so I need to move right to get under it and collect it. Collecting the mushroom will make Mario big, which lets us break blocks and get more points, plus it's safer...

CONTROL
move_right() -> press_key(key=ArrowRight, duration=0.25)

EVAL
game_state.progress: 0.30 -> 0.30;
game_state.player.x: 562.77 -> 589.95;

Progress: 30.0% PLAYING

Screenshot of Super Mario Bros. at Step 57. Mario is moving right towards a brick platform with a mushroom.

THINK
Mario is moving right, and the mushroom is still above the brick platform. To collect the mushroom, I need to jump up to reach it. Also, jumping will let me get onto the brick platform to hit the question...

CONTROL
jump_right() -> press_keys(keys=['ArrowUp', 'ArrowRight'], duration=0.25)

EVAL
game_state.progress: 0.30 -> 0.30;
game_state.player.y: 384 -> 324.8;

Progress: 30.0% PLAYING

Screenshot of Super Mario Bros. at Step 58. Mario has collected a mushroom and is in the air.

THINK
Mario just collected the mushroom and is now in the air moving right. Wait, no, actually, the mushroom is still there? Wait no, the score went up to 500, so he got the coin from the question block?...

CONTROL
move_right() -> press_key(key=ArrowRight, duration=0.25)

EVAL
game_state.progress: 0.30 -> 0.50;
game_state.score: 300 -> 500

Progress: 50.0% PLAYING

DESCRIPTION
The difference in game agent interface: CUA generates low-level keyboard and mouse actions, while the Generalist follows a richer semantic plan and generates semantic actions.

Figure 6 Case-study visualization for game-agent interface comparison. This Mario page stacks a matched CUA trajectory above a Generalist trajectory so that their divergence can be attributed to the action interface rather than the backbone.

Long-Horizon Simulation: Minecraft Resource Collection

Show an open-ended Minecraft trajectory where the agent repeatedly mines the correct resource, yet still fails to finish the collection target within the step limit.

STEP 1

THINK
The task is to collect at least 20 dirt blocks. First, I need to find dirt. Looking at the screen, there's a dirt block on the left side near the tree. Wait, no, let me look down? The current view is up at the trees. Oh right, I should look down...

CONTROL
look_down() →
mouse_move(from_x=640, from_y=360, x=640, y=460)

EVAL
initial state snapshot

Progress: 0.0% PLAYING

STEP 30

THINK
The dirt block I was mining just broke, and now there's another dirt block to the left of it (the one with grass on top). I need to mine that next to collect more dirt. So I should aim at that block and use mine_target again.

CONTROL
mine_target() →
left_click_hold(x=640, y=360, duration=1.0)

EVAL
task_eval.progress: 0 → 0.1

Progress: 10.0% PLAYING

STEP 58

THINK
Wait, why isn't the block breaking? Oh wait, maybe I need to move closer? Wait no, let's check: the block is in front of me. Maybe I should move forward a bit to get a better position to mine it. Let's do move_forward first to get closer.

CONTROL
move_forward() →
press_key(key=ArrowUp, duration=0.50)

EVAL
task_eval.progress: 0.45 → 0.45

Progress: 45.0% PLAYING

STEP 68

THINK
Now that I'm closer, the grassy dirt block on the right is a good target. I need to mine it to get more dirt. Let's start mining this block to add to my collection, moving towards the 20 dirt goal.

CONTROL
mine_target() →
left_click_hold(x=640, y=360, duration=1.0)

EVAL
task_eval.progress: 0.55 → 0.6

Progress: 60.0% PLAYING

STEP 100

THINK
The upper left dirt block is still there, so I need to keep mining it. Each hit brings me closer to breaking it and getting the dirt, which adds to my count for the 20-block task. So I'll use mine_target again.

CONTROL
mine_target() →
left_click_hold(x=640, y=360, duration=1.0)

EVAL
task_eval.progress: 0.9 → 0.9;
task_eval.task_status: unknown → fail

Progress: 90.0% FAILED

DESCRIPTION

This case shows strong local competence without reaching final target: the agent keeps executing the right subroutine and reaches 90% progress, but can not convert near-complete progress into task success.

Figure 7 Case-study visualization for long-horizon simulation and resource collection. This Minecraft Clone trajectory shows locally plausible interactions that advance the progress, but still fail to fully meet the task target.

Real-Time Reaction and Timing Control: Flappy Bird

Show a short Flappy Bird window where consecutive frames look nearly identical, but the correct action alternates between waiting and flapping, making success heavily depend on frame-level timing of actions.

STEP 46

THINK
The bird is moving right, and the next pipe gap is ahead. The bird's height is good now, in the middle of the gap. So we can wait here to let it pass through without flapping...

CONTROL
wait() → wait(duration=0.2)

EVAL
game_state.player.vy: -2.46 → 2.54;
game_state.player.y: 197.48 → 200.98

Progress: 0.0% PLAYING

STEP 47

THINK
Bird is descending towards the lower pipe of the next gap; flap upward to adjust height and safely pass through the pipe gap...

CONTROL
flap() → press_key(key=Space, duration=0.2)

EVAL
game_state.player.vy: 2.54 → -2.66;
game_state.player.y: 200.98 → 201.94

Progress: 0.0% PLAYING

STEP 48

THINK
The bird is moving right, and now it's past the third pipe set? Wait no, the next gap is... Wait, the bird is at a good height now, in the middle of the open space between the pipes. So w...

CONTROL
wait() → wait(duration=0.2)

EVAL
game_state.player.y: 201.94 → 195.96;
game_state.player.vy: -2.66 → 1.94

Progress: 0.0% PLAYING

STEP 49

THINK
Bird is descending towards the lower pipe of the next gap; flap upward to adjust height and safely pass through the pipe gap...

CONTROL
flap() → press_key(key=Space, duration=0.2)

EVAL
task_eval.progress: 0 → 0.1;
game_state.score: 0 → 1;

Progress: 10.0% PLAYING

STEP 50

THINK
Bird is in a safe position between pipe gaps; wait to let it continue moving forward while maintaining height...

CONTROL
wait() → wait(duration=0.2)

EVAL
game_state.player.y: 185.38 → 179.4;
game_state.player.vy: -2.66 → 1.94

Progress: 10.0% PLAYING

DESCRIPTION

This case shows the difficulty in real-time control: a slightly early or late flap determines whether any progress can be earned from visually similar states.

Figure 8 Case-study visualization for real-time reaction and timing control. This Flappy Bird window shows how a visually small timing error can still be mechanically decisive under tight control constraints.

5.2 Long-Horizon Simulation: Minecraft Resource Collection

Figure 7 shows an open-ended trajectory of Minecraft-Clone-GLM, in which the agent repeatedly mines the resource toward the target number. The failure is not instruction-following but missing closure: the run reaches 90% progress yet still fails to finish the collection target within the step limit.

5.3 Real-Time Reaction and Timing Control: Flappy Bird

As shown in Figure 8, the third case uses a short Flappy-Bird interaction sequence. The consecutive frames look nearly identical, while the correct action alternates between waiting and flapping. This highlights the real-time control difficulty of video games: a slightly early or late flap determines whether any progress should be inferred from visually similar states.

6 Related Work

6.1 Computer-Use Benchmarks with Online Environments

Computer-use has emerged as a major direction in the recent rise of digital agents and serves as an important testbed for advancing agent capabilities. Its core challenge lies in designing a standardized action space and interactive environment that allow agents to flexibly control complex interfaces and operations. The design of these benchmarks largely determines how effectively agents can assist with human computer tasks and whether they can support increasingly complex and diverse workflows. WebArena [86] and OSWorld [75] establish strong templates for agent benchmarking in browser and desktop environments, highlighting the importance of outcome-based evaluation. Cradle [63] and early studies of computer-use agents [21, 25, 34, 82] further demonstrate that foundation models can operate general-purpose GUIs, even professional softwares or complex games. OSWorld-MCP [38] extends this line by highlighting fairness issues in hybrid action pathways and tool-use decision quality. These computer-use benchmarks provide important guidance for making agent evaluation more standardized and scalable. GameWorld transfers these insights to game-agent evaluation through interactive environments, parallel instances, and outcome-based, state-verifiable evaluation.

6.2 Video Game Benchmarks for LLM and MLLM Agents

Game environments have long served as AI testbeds [10, 11, 16, 77]. In open-ended vision-centric games, early work focuses on training and agent construction: MineDojo [23] supplies internet-scale knowledge for Minecraft, VPT [9] learns behavioral priors from unlabeled gameplay video, Steve-1 [42] generates text-conditioned behaviors, JARVIS-1 [71] adds multimodal memory for long-horizon Minecraft tasks, See and Think [81] combines vision, language instruction, and code actions in Minecraft, and Voyager [69] demonstrates lifelong skill acquisition through LLM planning. More recent work also uses gameplay itself as a learning signal rather than only an evaluation target: Game-RL [68] synthesizes verifiable game tasks for RL, while Play to Generalize [76] shows that post-training on arcade-style gameplay can transfer to broader multimodal reasoning benchmarks. As models grow more capable, the bottleneck shifts from training to reliable evaluation. MCU [84] scales open-ended Minecraft evaluation through compositional atomic tasks and human-aligned assessment. LMGame-Bench [33] exposes prompt sensitivity by modularly toggling perception, memory, and reasoning. BALROG [51], LVLM-Playground [70], and V-MAGE [83] stress long-horizon, structured, or vision-centric reasoning in interactive games. VideoGameBench [79] shows that inference latency dominates real-time failure and introduces a paused track to isolate it. FlashAdventure [3] focuses on full-story-arc completion in 34 Flash adventure games and introduces CUA-as-a-Judge for automated milestone verification. Orak [53] provides a fine-tuning pipeline with held-out cross-game transfer studies. Specialized benchmarks further target collaboration or downstream workflows: Collab-Overcooked [61] evaluates language-mediated multi-agent coordination, while VideoGameQA-Bench [62] measures game QA tasks such as visual regression, glitch detection, and bug-report generation. Notably, concurrent work GameVerse [80] shares our core idea of combining semantic and GUI control in a dual action space, and further introduces a reflect-and-retry protocol based on failure trajectories and tutorials. However, it still faces challenges in heuristic evaluation with using VLMs to quantify progress.

6.3 Game Agents and Scalable Infrastructure

Generalist agents can already accomplish a wide range of tasks in digital worlds. To move toward real-world embodied agents, researchers increasingly test agent capabilities in games and simulated environments. A parallel line of work builds generalist agents that operate across many games. Game-TARS [72] anchors all actions to native keyboard-mouse inputs, while Jarvis-VLA [41] shows that large vision-language models can be post-trained to act directly through the same interface. WebGym [7] scales training environments to 300K realistic web tasks, showing that environment diversity directly improves out-of-distribution agent performance. NitroGen [43] extracts action labels from internet-scale gameplay video and wraps games with a universal Gym-style API [12]. Lumine [64] unifies perception, reasoning, and action in one vision-language model that transfers across 3D open worlds with game-specific fine-tuning. The SIMA project evolves from instruction-following across many simulated worlds in SIMA [59] to the richer interactive-partner setting of SIMA 2 [60], with held-out-environment evaluation and broader user interaction. As these game agents of different interfaces mature, standardized benchmarks become essential for measuring their performance. To this end, GameWorld provides a comprehensive benchmark with diverse games and tasks under a unified and verifiable evaluation protocol.

7 Conclusion

Discussion. Our results show that current multimodal game agents can often make partial progress, yet still struggle to convert that progress into reliable task completion across diverse browser games. Under one shared runtime and verifier, GameWorld further exposes interface-conditioned weaknesses in real-time interaction, context-memory sensitivity, and action validity. These findings suggest that stronger game agents will require not only better reasoning, but also more reliable action grounding, more useful trajectory memory, and greater robustness to latency. We hope GameWorld serves as a reproducible benchmark for measuring such progress under standardized, outcome-based state-verifiable evaluation.

Limitations and Future Work. The benchmark necessitates designing unique instruction sets for each new environment, which tightly couples the action space to the task and constrains the model’s scalability. Automating the producing and alignment process of Semantic Action Parsing through MLLM-powered agent exploration is left for future work. Further discussions, including the guideline on licensing and compliance of our benchmark and the cost summary, are detailed in Appendix F.

Conclusion. GameWorld provides a standardized and verifiable benchmark for evaluating multimodal game agents in browser environments. Across 34 games, 170 tasks, and 18 model–interface pairs, our results show that current agents can often make meaningful partial progress yet remain far from reliable task completion and human-level performance. Together with robustness, real-time, context-memory, and action-validity analyses, these findings establish GameWorld as a reproducible foundation for studying multimodal agents in complex, open-ended interactive environments.

Appendix

A Benchmark Runtime

A.1 Preset Configuration

Each standalone benchmark run is launched using a registry preset passed via the command-line flag -config. The preset has the form:

<game_id>+<task_id>+<model_spec>

where the three components are resolved independently and then composed into a concrete runtime configuration:

• <game_id>: contributes game rules, role definitions, low-level control constraints, and semantic action definitions.
• <task_id>: contributes the task instruction, evaluator configuration, target metrics, maximum step budget, and optional URL suffixes.
• <model_spec>: contributes the model identifier(s), provider-specific overrides, prompt template, and output-format prompts.

This decomposition keeps game definitions, tasks, and model profiles reusable and flexible: changing a task or swapping models does not require duplicating the underlying game configuration.

A.2 Suite Runner

The suite runner reads a suite YAML file and expands each benchmark case into explicit runs by enumerating the specified games, tasks, and models. The resulting preset for each child run reuses the same syntax as standalone execution in Section A.1. Expanded runs are grouped into repeat waves. Within each wave, the runner launches up to -max-parallel child processes in parallel. Each child process invokes a standalone benchmark run with a dedicated port, a unique session ID, and an isolated run directory that stores logs and task-evaluation outputs. After all runs finish, the runner writes per-model summary files that support both interactive monitoring and subsequent aggregate analysis.

B Observation-Action-Evaluation Loop

B.1 Runtime Coordinator

The benchmark loop is coordinated by a Runtime Coordinator. Each MLLM is wrapped in an Agent object with an agent ID, model type, model client, role-specific controls, a semantic-control map, and a step counter. The runtime also instantiates a GameEnv for the browser environment and the playable game, together with an Evaluator for task-progress tracking and evaluation.

At a high level, each interaction round executes the following sequence:

1. Capture a screenshot of the current game environment.
1. Optionally pause the game during model inference.
1. Get the raw response from the model using the current screenshot and the assembled prompt template.
1. Resume the game if paused.
1. Parse the raw model output into an executable action payload.
1. Execute the action in the browser environment.
1. Capture a verifiable game-state snapshot.
1. Evaluate task progress and check stopping or resetting conditions.

B.2 Evaluator and Reset-on-Fail

Evaluator. The evaluator receives the current state snapshot, global step index, maximum step budget, target threshold, and accumulated metrics. It combines four stop or reset signals:

• terminal status from state.terminal,
• exhaustion of the fixed step budget,
• reaching the task target score, and
• any task-specific end-field rule.

Reset-on-Fail. If continue_on_fail is enabled and the game reports a terminal failure, the runtime does not end the run immediately. Instead, it calls gameAPI.reset() to reset the task, allocates a new episode ID, waits for the game to become initialized again, and continues under the same global step budget.

C Browser Sandbox and Game API

C.1 Browser Management

The GameLauncher starts a local HTTP server and serves the game HTML. The browser manager then launches Chromium with a fixed viewport and disables common background-throttling behaviors. It also injects a dynamic speed-control script and a deterministic-randomness script by overriding JavaScript headers. Screenshots are captured through the Chrome DevTools Protocol rather than through a standard page-screenshot call to avoid visible flashing in headed mode. The browser-environment initialization process includes:

• Start the environment and open the game in Chromium.
• Wait until the game becomes actionable under the readiness gate described in Section C.2.

C.2 Readiness Gate

Before the first agent action and after every reset, the runtime waits until the game reports an actionable state. The default actionable statuses are ready and playing. Table 12 lists the common status values consumed by the readiness gate and the evaluator.

Table 12 Status values consumed by the readiness gate and evaluator.

Status	Runtime description
`loading`	Assets or engine bootstrap are not yet ready for model control.
`menu`	The game is initialized but still in a pre-play state such as title screen, level select, or pause menu.
`ready`	The game is initialized and ready to begin, but may still require one trusted start action.
`playing`	The gameplay loop is active and safe for model control.
`paused`	The game is temporarily paused; this status is used by the sandbox pause mechanism during model inference.
`terminal`	The current in-game episode has ended.

C.3 Verifiable State: Game API Schema

To enable verifiable evaluation, every benchmark game is required to expose a serializable window.gameAPI with three callable methods: init(config), reset(options), and getState(). The returned state contains a game ID, a timestamp, lifecycle status, terminal metadata, a structured game_state object, task metrics, and raw game-specific details. This serves as the verifiable game state for the outcome-based evaluation.

Here we provide an example Game API schema from 17_mario-game. The exact game-specific fields can vary, but the top-level contract remains the same across all games in the GameWorld benchmark.```

{ "gameId": "17_mario-game", "seed": 42, "timestampMs": 1760001234567, "gameTimeMs": 18420, "status": "playing", "terminal": { "isTerminal": false, "outcome": null, "reason": null }, "game_state": { "score": 3200, "level": "1-1", "progress": 0.37, "player": { "x": 128, "y": 80, "vx": 0, "vy": 0, "power": 1, "alive": true, "name": "Mario" }, "board": null, "entities": null }, "metrics": { "lives": 3, "coins": 8, "distance": 3200, "attempts": 1, "time_left_s": 999, "enemies_alive": 5, "level_progress_percent": 42 }, "raw": { "world": 1, "stage": 1, "levelId": "1-1", "worldDisplay": "1-1", "coins": 8, "coinsCollected": 8, "lives": 3, "timeLeft": 999, "mapTime": 999, "paused": false, "playerPower": 1, "playerName": "Mario", "levelProgress": 0.37, "levelProgressPercent": 37 } }


This example illustrates the benchmark convention: `game_state` stores structured in-game state for evaluation, `metrics` stores compact comparable counters, and `raw` preserves optional game-specific details for further analysis.

## D Agent Details

### D.1 Rolling Memory

The base client supports a rolling memory store. The memory module records each interaction round in the fixed order: `user_prompt` → `screenshot` → `reasoning` → `action`. At inference time, the client reinjects a filtered slice of the most recent rounds according to `memory_rounds`, `memory_format`, and `memory_include_fields`. Text entries are inserted under an `Action History` header, while screenshot entries are reattached as multimodal image items, so the resulting memory context is an interleaved multimodal history rather than a pure text block.

## D.2 Low-Level Action and Validation

**Low-Level Action Normalization.** All executable actions are normalized before they reach Playwright. The runtime-facing normalized action schema handled by the executor consists of:

- • mouse actions: `click`, `click_hold`, `drag`, `mouse_move`, `scroll`;
- • keyboard actions: `type`, `press_key`, `press_keys`;
- • timing action: `wait`.

Table 1 defines the conceptual unified control space in terms of atomic events. The current implementation introduces a slightly higher-level normalized runtime action layer above that space for parser compatibility, legality checks, and one-action-per-step execution. Variants such as `left_click`, `right_click`, and `left_click_hold` are first normalized into this shared schema. During execution, the normalized actions are then translated into Playwright mouse and keyboard primitives such as button down/up, key down/up, wheel, move, and type events. This extra layer is also pragmatic because Playwright itself already exposes several higher-level interaction primitives. Thus the atomic event space remains the execution-layer semantics, while the runtime interface used inside the repository is slightly higher-level.

**Action Legality Validation.** Legality is role-aware and strictly configured by the role definition. For generalist agents, the runtime coordinator first checks whether the proposed semantic control resolves against the registered semantic-control map, and then the executor checks whether the mapped low-level action satisfies the current role controls. For computer-use agents, the executor validates the proposed low-level action directly. The role controls come from the registry definition, including `allowed_keys`, `allow_clicks`, `hold_duration`, and `key_durations`. Invalid tool calls or disallowed low-level actions are logged as invalid and ignored at execution time. Key aliases are normalized (for example, `left` to `ArrowLeft`), and the executor can fall back to naming conventions when the requested key is not legal but has a semantically equivalent allowed key. This makes the action interface more robust without expanding the legal action space beyond the role definition.

## D.3 Semantic Action Parsing

Generalist agents do not emit raw keyboard or mouse actions directly. Instead, they emit a semantic control payload whose control identifier can arrive as `action`, `tool_name`, or `tool_id`, together with runtime arguments. The runtime resolves this identifier through a registry-built semantic-control map with case-insensitive and alias-aware lookup, merges the runtime arguments into the YAML-defined binding, and applies optional `cell_bindings` when a semantic cell reference should expand into coordinates. The mapped result then enters the same low-level execution chain as computer-use actions. Unknown control identifiers are marked invalid and reduce to a no-op at execution time.

## E Prompt Templates and Game Prompt Blocks

The runtime assembles the final model prompt from four pieces into a shared template: a game-level rules block, a role-and-controls block, a task-specific instruction block, and a model-specific output-format block.

### E.1 Prompt Assembly: Shared Templates

Prompt assembly is driven by the prompt templates. The stock templates for generalist agents and computer-use agents share a fixed section order:

1. 1.

# Game Rules
2. 2.

# Role and Controls
3. 3.

# Task Instruction
4. 4.

# Output FormatFor computer-use agents, the role block combines the role description with an explicit textual control specification. For generalist agents, the role block combines the role description with an automatically rendered semantic action list. This list is built directly from the registry `semantic_controls` entries and therefore stays synchronized with the executable action space.

### E.1.1 Generalist Agent Template

Below is the prompt template for generalist agents.

You are an expert game agent specialized in playing video games. Your goal is to play the game and achieve the task goal. Observe the current game screen to identify your character and key objects. Your action must follow the game rules and the instructions of the current task. Execute the actions frame-by-frame. You do NOT have direct access to keyboard or mouse actions. You must act through the registered semantic control list only.

{% if game_rules_block %}

Game Rules

{{ game_rules_block }} {% endif %}

{% if role_control_block_semantic %}

Role and Controls

{{ role_control_block_semantic }} {% endif %}

{% if task_instruction_block %}

Task Instruction

{{ task_instruction_block }} {% endif %}

{% if model_output_format_block %}

Output Format

{{ model_output_format_block }} {% endif %}


### E.1.2 Computer-Use Agent Template

Below is the prompt template for computer-use agents.

{% if game_rules_block %}

Game Rules

{{ game_rules_block }} {% endif %}

{% if role_control_block_computer_use %}

Role and Controls

{{ role_control_block_computer_use }} {% endif %}

{% if task_instruction_block %}

Task Instruction

{{ task_instruction_block }} {% endif %}

{% if model_output_format_block %}

Output Format

{{ model_output_format_block }} {% endif %}


## E.2 Per-Game Prompt Library

For each benchmark game below, we include the exact game-rules block together with the role prompt blocks loaded from the registry. For each role, we include the shared role prompt text, the computer-use controls prompt text, and the semantic action list rendered for generalist agents from the registered `semantic_controls` entries. We also list the five benchmark task prompts for that game.

### 1-2048 (2048).

Game Rules Prompt.You are playing 2048, a sliding tile puzzle game.

Game Objective.

- - Combine matching tiles to create higher values.

Game Rules.

- - All tiles slide in the direction you press.
- - When two tiles with the same number collide, they merge into one tile with doubled value.
- - After each valid move (any tile is moved), a new tile of 2 or 4 appears in a random empty cell.
- - Game ends when no more moves are possible (board is full and no merges available).

#### Role 0 (player) Game Agent Role Prompt.

You control the 2048 board. Choose exactly one action per step to slide tiles.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Do nothing.
- - ArrowUp: Slide all tiles up
- - ArrowDown: Slide all tiles down
- - ArrowLeft: Slide all tiles left
- - ArrowRight: Slide all tiles right

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Do nothing.
- - move\_up: Slide all tiles up.
- - move\_down: Slide all tiles down.
- - move\_left: Slide all tiles left.
- - move\_right: Slide all tiles right.

### 2-another-gentlemans-adventure (Another Gentleman's Adventure).

#### Game Rules Prompt.

You are playing Another Gentleman's Adventure, a platformer game.

Game Objective.

- - Navigate through levels, avoiding obstacles and enemies to get coins.

Game Rules.

- - Move left and right to navigate platforms.
- - Jump (left or right) to avoid obstacles and reach higher platforms.
- - Jump from below to hit the Golden Question Blocks and reveal items.
- - Jumping on enemies kills them.

#### Role 0 (player) Game Agent Role Prompt.

You control the gentleman character.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Stay still.
- - Arrow Left/Right: Move horizontally
- - Arrow Up: Jump
- - Arrow Left/Right + Arrow Up (key combination): Jump while moving left/rights

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Stay still briefly.
- - move\_left: Walk left.
- - move\_right: Walk right.
- - jump: Jump vertically.
- - jump\_left: Jump while moving left.
- - jump\_right: Jump while moving right.

### 3-astray (Astray).

#### Game Rules Prompt.

You are playing Astray, a maze navigation game.

Game Objective.

- - Navigate the ball through the maze to reach the exit.

Game Rules.- - Navigate through the maze carefully to find the path to the exit.
- - The exit is typically at the right-top far corner of the maze.

#### Role 0 (player) Game Agent Role Prompt.

You control the ball in the maze. Use arrow keys to navigate to the exit.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Wait briefly without moving
- - ArrowUp: Move the ball up
- - ArrowDown: Move the ball down
- - ArrowLeft: Move the ball left
- - ArrowRight: Move the ball right

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Wait briefly without moving.
- - move\_left: Move the ball left.
- - move\_right: Move the ball right.
- - move\_up: Move the ball up.
- - move\_down: Move the ball down.

### 4-boxel-rebound (Boxel Rebound).

#### Game Rules Prompt.

You are playing Boxel Rebound, a precision platformer.

Game Objective.

- - Time your jumps to navigate through each level to reach the end without dying.
- - Your character automatically moves forward continuously.

Game Rules.

- - You can only jump - no left/right movement control.
- - Avoid spikes and falling into pits.

#### Role 0 (player) Game Agent Role Prompt.

You control the boxel character. Time your jumps to navigate through the level.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Do nothing to continue movement.
- - Space or ArrowUp: Jump

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Do nothing to continue movement.
- - jump: Jump to avoid obstacles or cross gaps.

### 5-breakout (Breakout).

#### Game Rules Prompt.

You are playing Breakout, a classic brick-breaking game.

Game Objective.

- - Break all the bricks to complete each level. Each brick broken adds to your score.

Game Rules.

- - Use the paddle to bounce the ball upward.
- - Break bricks by hitting them with the ball.
- - Don't let the ball fall past the paddle.

#### Role 0 (player) Game Agent Role Prompt.

You control the paddle. Move left and right to keep the ball in play and break bricks.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Stay still.- - Arrow Left/Right: Move paddle
- - Space: Launch ball/Start game

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Stay still briefly.
- - move\_left: Move paddle left.
- - move\_right: Move paddle right.

### 6-captaincallisto (Captain Callisto).

#### Game Rules Prompt.

You are playing Captain Callisto, a platform adventure.

Game Objective.

- - Reach the exit of each area.
- - Collect coins along the way.

Game Rules.

- - Move left/right/up/down and jump across platforms.
- - Use the jetpack to fly up after you collected the fuels.
- - Kill the enemies by jumping on them.
- - When you touch the enemies (not from above) or falling, you will respawn.
- - You need a move to start the level after respawning.

#### Role 0 (player) Game Agent Role Prompt.

You control Captain Callisto. Control the player to move, jump, and reach the exit.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Do nothing.
- - ArrowLeft / ArrowRight or A/D: move left/right
- - ArrowUp / ArrowDown or W/S: climb or move vertically
- - Space: jump
- - Shift: jetpack
- - r: restart

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Do nothing.
- - move\_left: Move left.
- - move\_right: Move right.
- - move\_up: Move up.
- - move\_down: Move down.
- - jump: Jump over obstacles or gaps.
- - jetpack: Use the jetpack to fly up (only if available)
- - restart: Restart the level. (You will need a move to start the level after respawning)

### 7-chrome-dino (Chrome Dino).

#### Game Rules Prompt.

You are playing the Chrome Dino runner game. Jump to start the run.

Game Objective.

- - Survive as long as possible while the dino runs.
- - Avoid obstacles to keep the run going and increase score.

#### Role 0 (player) Game Agent Role Prompt.

You control the dinosaur runner. Jump to avoid obstacles and survive.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Do nothing to keep the run going.
- - Space/ArrowUp: Jump or start the run

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:- - wait: Do nothing briefly to keep the run going.
- - jump: Jump over obstacles.
- - start: Start or restart the run.

## 8-core-ball (Core Ball).

### Game Rules Prompt.

You are playing Core Ball, a high-speed 2D timing-and-precision game. Shoot numbered balls into a rotating core and attach them to it without collisions.

#### Game Objective.

- - Attach all balls for the current level to the core.
- - Complete the level before the queue is exhausted and the core becomes too crowded.
- - Levels get faster and denser as you progress.

#### Game Rules.

- - Press Space to shoot one ball toward the center.
- - A ball collision with another ball fails the run immediately.

### Role 0 (player) Game Agent Role Prompt.

You control precision ball launches toward the spinning core. Time each shot carefully and keep the path clear.

### Role 0 (player) Computer-Use Controls Prompt.

#### ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Do nothing to wait the core to rotate
- - Space: shoot one ball

### Role 0 (player) Generalist Semantic Action List.

#### REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Pause briefly and observe target alignment.
- - shoot: Shoot one ball toward the core.

## 9-cubefield (Cubefield).

### Game Rules Prompt.

You are playing Cubefield, a 3D endless runner where you navigate a triangular ship through a field of cubes.

#### Game Objective.

- - Survive as long as possible by avoiding cube obstacles.

#### Game Rules.

- - Move left and right to dodge cubes.
- - The ship automatically moves forward.
- - Hitting any cube ends the game instantly.

### Role 0 (player) Game Agent Role Prompt.

You control the triangular ship. Move left or right to navigate through the cube field.

### Role 0 (player) Computer-Use Controls Prompt.

#### ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Continue straight.
- - Arrow Left: Move left
- - Arrow Right: Move right

### Role 0 (player) Generalist Semantic Action List.

#### REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Continue straight.
- - move\_left: Move left to avoid cubes.
- - move\_right: Move right to avoid cubes.

## 10-doodle-jump (Doodle Jump).

### Game Rules Prompt.

You are playing Doodle Jump, a vertical platformer.

#### Game Objective.

- - Keep ascending by landing on higher platforms.

#### Game Rules.

- - Green platforms: normal platforms.
- - Blue movable platforms: moving left and right.
- - Red breakable platforms: if you step on it, it will break and you will fall.
- - White vanishable platforms: can step on it once, it will vanish.
- - The left and right sides wrap around each other.

#### Role 0 (player) Game Agent Role Prompt.

You control Doodle. Control the player to drift and avoid falling.

#### Role 0 (player) Computer-Use Controls Prompt.

##### ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Do nothing to continue vertical movement.
- - ArrowLeft / ArrowRight: move horizontally.

#### Role 0 (player) Generalist Semantic Action List.

##### REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Do nothing to continue vertical movement.
- - move\_left: Drift left.
- - move\_right: Drift right.

### 11-edge-surf (Edge Surf).

#### Game Rules Prompt.

You are playing Edge Surf, an endless surfing game. You control a surfer riding waves and collecting items.

##### Game Objective.

- - Survive as long as possible while surfing.
- - Achieve the highest score and distance with 3 lives.

##### Game Rules.

- - You can only turn left or right, slow down or boost the surfer.
- - Hitting obstacles costs lives.
- - Collecting boosts gives temporary speed, and shields gives temporary invincibility.

#### Role 0 (player) Game Agent Role Prompt.

You control the surfer character. Turn left and right to avoid obstacles and collect items.

#### Role 0 (player) Computer-Use Controls Prompt.

##### ACTION SPACE (ONLY LEGAL ACTIONS):

- - Wait: Continue surfing in the same direction without action.
- - Arrow Left: Turn left
- - Arrow Right: Turn right
- - Arrow Up: Slow down the surfer
- - Arrow Down: Boost the surfer (requires a boost item)

#### Role 0 (player) Generalist Semantic Action List.

##### REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Continue surfing in the same direction.
- - turn\_left: Turn left.
- - turn\_right: Turn right.
- - slow\_down: Slow down the surfer.
- - boost: Boost the surfer (requires a boost item).

### 12-fireboy-and-watergirl (Fireboy and Watergirl).

#### Game Rules Prompt.

You are playing 'Fireboy and Watergirl in The Forest Temple'. Some puzzles require both characters to cooperate to solve.

##### Game Objective.

- - Both characters must reach their respective exit doors to complete the level.
- - Collect all diamonds for a higher score.

##### Game Rules.

- - Green toxic liquid: Kills BOTH characters
- - Red lava pools: Kills Watergirl only
- - Blue water pools: Kills Fireboy only
- - Buttons and levers control platforms and doors
- - Boxes can be pushed to reach higher platforms

### Role 0 (watergirl) Game Agent Role Prompt.

You are Agent 0 controlling the Watergirl character (blue girl).

#### YOUR OBJECTIVES:

- - Collect blue diamonds (Watergirl's gems)
- - Reach the blue exit door
- - NEVER touch red lava (instant death)

### Role 0 (watergirl) Computer-Use Controls Prompt.

#### ACTION SPACE (ONLY LEGAL ACTIONS):

- - W: Jump
- - A: Move left
- - D: Move right
- - W + A: Jump left
- - W + D: Jump right

### Role 0 (watergirl) Generalist Semantic Action List.

#### REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Stay still briefly.
- - move\_left: Walk left (hold).
- - move\_right: Walk right (hold).
- - jump\_left: Jump diagonally left (jump + left).
- - jump\_right: Jump diagonally right (jump + right).

### Role 1 (fireboy) Game Agent Role Prompt.

You are Agent 1 controlling the Fireboy character (red boy).

#### YOUR OBJECTIVES:

- - Collect red diamonds (Fireboy's gems)
- - Reach the red exit door
- - NEVER touch blue water (instant death)

### Role 1 (fireboy) Computer-Use Controls Prompt.

#### ACTION SPACE (ONLY LEGAL ACTIONS):

- - ArrowUp: Jump
- - ArrowLeft: Move left
- - ArrowRight: Move right
- - ArrowUp + ArrowLeft: Jump left
- - ArrowUp + ArrowRight: Jump right

### Role 1 (fireboy) Generalist Semantic Action List.

#### REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Stay still briefly.
- - move\_left: Walk left (hold).
- - move\_right: Walk right (hold).
- - jump\_left: Jump diagonally left (jump + left).
- - jump\_right: Jump diagonally right (jump + right).

## 13-flappy-bird (Flappy Bird).

### Game Rules Prompt.

You are playing Flappy Bird, a one-button flying game.

#### Game Objective.

- - Keep the bird flying between pipes for as long as possible.
- - Pass pipes to increase your score.

#### Game Rules.

- - Press Space to flap upward; gravity pulls you down.
- - Hitting pipes or the ground ends the run.

### Role 0 (player) Game Agent Role Prompt.

You control the bird. Tap to flap and avoid the pipes.

### Role 0 (player) Computer-Use Controls Prompt.

#### ACTION SPACE (ONLY LEGAL ACTIONS):

- - Space: Flap upward
- - Wait: Do nothing briefly to let gravity pull bird downward

### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).
Choose exactly one action per step:

wait: Do nothing briefly to let gravity pull bird downward
flap: Flap upward.


### 14-geodash (GeoDash).

#### Game Rules Prompt.

You are playing GeoDash, a Geometry Dash-style auto-running platformer.
Game Objective.

Survive for as long as possible.
Jump over spikes and hazards.
Game Rules.
The character auto-runs continuously.
Jump timing is the core mechanic.
Crashing into obstacles ends the run.


#### Role 0 (player) Game Agent Role Prompt.

You control the runner. Time jumps to avoid spikes and run as long as possible.


#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

Space


#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).
Choose exactly one action per step:

wait: Do nothing briefly to keep the direction.
jump: Jump.


### 15-google-snake (Google Snake).

#### Game Rules Prompt.

You are playing Google Snake, the classic snake game.
Game Objective.

Eat apples to grow longer and increase score.
Avoid crashing into walls or your own body.
Game Rules.
The snake moves continuously in the chosen direction.
You can turn up, down, left, or right.
The game ends when you hit a wall or your body.


#### Role 0 (player) Game Agent Role Prompt.

You control the snake. Use direction keys to guide it toward apples safely.


#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

ArrowUP: Turn upward
ArrowDown: Turn downward
ArrowLeft: Turn left
ArrowRight: Turn right
Wait: Do nothing briefly to keep the direction


#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).
Choose exactly one action per step:

wait: Do nothing briefly to keep the direction.
move_up: Turn upward.
move_down: Turn downward.
move_left: Turn left.
move_right: Turn right.


### 16-hextris (Hextris).

#### Game Rules Prompt.You are playing Hextris, a fast-paced hexagon-based puzzle game.

Game Objective.

- - Matching Mechanism: Group 3 or more blocks of the same color on any of the six sides to clear them and get scores.

- - Prevent blocks from stacking outside the outer boundary of the central hexagon.

Game Rules.

- - The game runs continuously and blocks fall from the edges into the central hexagon.

- - The speed of the falling blocks increases over time.

- - Left/Right arrows rotate the hexagon in opposite directions.

- - Press Down to speed up falling block motion.

#### Role 0 (player) Game Agent Role Prompt.

You control the rotating hexagon. Keep the board manageable by rotating quickly and balancing incoming blocks.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - ArrowLeft: rotate hexagon left
- - ArrowRight: rotate hexagon right
- - ArrowDown: accelerate falling blocks

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Briefly wait while the board evolves.
- - rotate\_left: Rotate the hexagon left once.
- - rotate\_right: Rotate the hexagon right once.
- - accelerate: Temporarily speed up falling blocks.

### 17-mario-game (Mario Game).

#### Game Rules Prompt.

You are playing Super Mario Bros. Control Mario through platforming levels, collecting coins and defeating enemies.

Game Objective.

- - Reach the flagpole at the end of each level.
- - Collect coins from question blocks or kill enemies for points.

Game Rules.

- - You have 3 lives and limited time per level.

- - Touching enemies from the side or below kills Mario. Jump on enemies to kill enemies.

- - Jumping underneath question blocks will reveal coins or power-ups.

- - Power-ups: Mushroom make Mario grow big.

#### Role 0 (player) Game Agent Role Prompt.

You control Mario. Navigate through platforming levels by running, jumping, and avoiding hazards.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - ArrowLeft: Move left
- - ArrowRight: Move right
- - ArrowUp: Jump
- - ArrowDown: Duck/crouch (when big)
- - ArrowUp + ArrowLeft/ArrowRight: Jump while moving left/right

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Stand still and wait.
- - move\_right: Walk right.
- - move\_left: Walk left.
- - jump: Jump straight up.
- - jump\_right: Jump while moving right.
- - jump\_left: Jump while moving left.
- - duck: Duck/crouch (when big).

### 18-minecraft-clone-glm (Minecraft Clone).

#### Game Rules Prompt.

You are playing a Minecraft-style first-person sandbox survival game.

Game Objective.- - Gather resources by breaking blocks.
- - Move around and mine nearby blocks efficiently.

Game Rules.

- - A short left click places the currently selected block.
- - Hold the left mouse button on the targeted block to mine it. Hold-mining is accelerated in benchmark mode.
- - You can also place blocks with the right mouse button.
- - Move around and jump to navigate the terrain.
- - Look around to aim at different blocks.
- - Select different hotbar slots to switch items.

### Role 0 (player) Game Agent Role Prompt.

You control the player character. Prioritize efficient nearby resource collection by moving to blocks and mining them.

### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - ArrowUp: move forward
- - ArrowDown: move backward
- - ArrowLeft: strafe left
- - ArrowRight: strafe right
- - Space: jump
- - Mouse drag: look around / turn camera (drag from center toward a direction)
- - Left click: place the selected block
- - Left click and hold: mine the targeted block
- - Right mouse: place selected block
- - 1-9: select hotbar slot

### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Wait briefly.
- - move\_forward: Move forward briefly.
- - move\_backward: Move backward briefly.
- - strafe\_left: Move left briefly.
- - strafe\_right: Move right briefly.
- - jump: Jump once.
- - look\_left: Turn camera left.
- - look\_right: Turn camera right.
- - look\_up: Tilt camera up.
- - look\_down: Tilt camera down.
- - mine\_target: Hold left click at center to mine the targeted block.
- - place\_block: Place the selected block at center.
- - select\_slot\_1: Select hotbar slot 1.
- - select\_slot\_2: Select hotbar slot 2.
- - select\_slot\_3: Select hotbar slot 3.

## 19-minesweeper (Minesweeper).

### Game Rules Prompt.

You are playing Minesweeper, a logic puzzle.

Game Objective.

- - Flag all mines.

Game Rules.

- - Left click a cell reveals it; right click flags a mine.
- - A number shows how many mines are adjacent to that cell.
- - Clicking a mine ends the game.

### Role 0 (player) Game Agent Role Prompt.

You control the board to reveal safe cells and avoid mines.

### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - Mouse left click: Reveal a cell
- - Mouse right click: Flag a mine

### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Pause briefly.
- - reveal\_cell: Reveal a cell by id (cell="a1"..i9"). (required: cell)
- - flag\_cell: Flag a cell by id (cell="a1"..i9"). (required: cell)

## 20-monkey-mart (Monkey Mart).

## Game Rules Prompt.

You are playing Monkey Mart, a store management game.

### Game Objective.

- - Stock shelves, serve customers, and earn money.
- - Expand the store by unlocking new stations and upgrades.

### Game Rules.

- - Move within the store to automatically harvest, carry, and restock items.
- - Stay near the banana tree to harvest bananas, the corn field to harvest corn, the corresponding shelves to stock items, and most importantly, the counter to serve customers and collect green money.
- - The needs of customers will pop up at the top of them, and you need to have the item they want stocked on the shelves for them to buy it.
- - Customers take items and pay at the counter.
- - Stay at the left of the counter to serve customers and then collect green money.
- - Banana tree is at the bottom of the store, banana shelf is in the middle, corn field is at the bottom left, corn shelf is at the top of the corn field, and the counter is on the left of the banana shelf.
- - You have one assistant with orange hat to collect and restock corns, but they will occasionally be idle, you need to wake them.
- - Assistant will only help collect and restock corns. Therefore, you need to collect and restock bananas, and collect green money by yourself.
- - You should interleave between harvesting, stocking, and serving to keep the store running efficiently. For example, if you keep harvesting and restocking, but never serve customers at the counter, you won't collect any money. On the other hand, if you only stay at the counter to serve customers without restocking, the shelves will run out of stock and customers will wait for restocking.
- - Use earnings to unlock new areas and assistants.

## Role 0 (player) Game Agent Role Prompt.

You control the monkey character. Move around the store to stock and manage items.

## Role 0 (player) Computer-Use Controls Prompt.

### ACTION SPACE (ONLY LEGAL ACTIONS):

- - ArrowUp: Move up
- - ArrowDown: Move down
- - ArrowLeft: Move left
- - ArrowRight: Move right
- - wait: Stay still to wait for assistants and customers

## Role 0 (player) Generalist Semantic Action List.

### REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Stay still to wait for assistants and customers.
- - move\_up: Move up.
- - move\_down: Move down.
- - move\_left: Move left.
- - move\_right: Move right.

## 21-ns-shaft (NS-Shaft).

### Game Rules Prompt.

You are playing NS-Shaft, a falling platform game.

### Game Objective.

- - Descend as far as possible by landing on platforms.
- - Avoid hazards while keeping your character alive.

### Game Rules.

- - The character falls downward automatically.
- - Move left/right to land on platforms.
- - Life decreases when you touch the pillars.
- - You will die if you fall all the way to the bottom or run out of the life.

## Role 0 (player) Game Agent Role Prompt.

You control the falling character. Move left and right to land on platforms safely.

## Role 0 (player) Computer-Use Controls Prompt.

### ACTION SPACE (ONLY LEGAL ACTIONS):

- - Arrow Left/Right: Move left/right

## Role 0 (player) Generalist Semantic Action List.

### REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Do nothing briefly.
- - move\_left: Move left.
- - move\_right: Move right.

## 22-ovo (OVO).

### Game Rules Prompt.

You are playing OVO, a fast platformer with traps.

#### Game Objective.

- - Reach the exit of each level.
- - Collect coins while avoiding hazards.

#### Game Rules.

- - Avoid traps and pits.
- - When you go next to the wall, you can jump higher.
- - Jump on place and press down to smash the ground.

### Role 0 (player) Game Agent Role Prompt.

You control the character in OVO. Control the player to run and jump through levels.

### Role 0 (player) Computer-Use Controls Prompt.

#### ACTION SPACE (ONLY LEGAL ACTIONS):

- - ArrowLeft / ArrowRight: move left/right.
- - ArrowUp: jump.
- - ArrowUp + ArrowRight / ArrowLeft: jump while moving right / left.
- - ArrowDown: smash the ground.
- - ArrowDown + ArrowRight / ArrowLeft: slide right / left.

### Role 0 (player) Generalist Semantic Action List.

#### REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Pause briefly.
- - move\_left: Run left.
- - move\_right: Run right.
- - jump: Jump.
- - jump\_right: Jump while moving right.
- - jump\_left: Jump while moving left.
- - smash\_ground: Smash the ground.
- - slide\_right: Slide right.
- - slide\_left: Slide left.

## 23-pacman (Pac-Man).

### Game Rules Prompt.

You are playing Pac-Man, a maze chase game.

#### Game Objective.

- - Eat all pellets in the maze.
- - Avoid ghosts, or eat them after grabbing a power pellet.

#### TIPS:

- - Use corridors to bait ghosts into corners.
- - Save power pellets for tricky sections.

### Role 0 (player) Game Agent Role Prompt.

You control Pac-Man. Control the player to move through the maze.

### Role 0 (player) Computer-Use Controls Prompt.

#### ACTION SPACE (ONLY LEGAL ACTIONS):

- - ArrowUp: Move up through the maze
- - ArrowDown: Move down through the maze
- - ArrowLeft: Move left through the maze
- - ArrowRight: Move right through the maze

### Role 0 (player) Generalist Semantic Action List.

#### REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Pause briefly.
- - move\_up: Move up.
- - move\_down: Move down.
- - move\_left: Move left.
- - move\_right: Move right.

## 24-restless-wing-syndrome (Restless Wing Syndrome).

### Game Rules Prompt.You are playing Restless Wing Syndrome, a platformer with automatic flaps.

Game Objective.

- - Reach the exit (looks like a bread) of each level.
- - Avoid spikes and hazards.

Game Rules.

- - The bird flaps automatically on a timer (flap meter at top left). The flap meter will automatically decrease, one cell at a time. When it is empty, the bird will automatically jump up then the meter will be recharged to full. You can observe the flap meter to time your movements.
- - You can move left or right to steer mid-air.
- - Hold Up to glide (slow descent) in the air.
- - Use Up+Left or Up+Right to glide diagonally.
- - Most of the time the movements are achieved by gliding with appropriate direction.

Role 0 (player) Game Agent Role Prompt.

You control the bird. Steer left/right to navigate platforms and hazards. Hold Up to glide (slow descent). Use Up+Left/Right to glide diagonally. You can use glide actions consecutively for sustained slow descent.

Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - ArrowLeft / ArrowRight: Move left/right
- - ArrowUp: Hold to glide (slow descent, uses flap meter)
- - ArrowUp + ArrowLeft / ArrowRight: Glide while moving left/right

Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Stay still briefly.
- - move\_left: Move left.
- - move\_right: Move right.
- - glide: Hold to slow descent using flap meter.
- - glide\_left: Glide while moving left (slow descent + left).
- - glide\_right: Glide while moving right (slow descent + right).

**25-rocket-league-2d (Rocket League 2D).**

Game Rules Prompt.

You are playing Rocket League 2D, a side-view car soccer game.

Game Objective.

- - Score goals by hitting the ball into the opponent's net.
- - Prevent the opponent from scoring.

Game Rules.

- - Drive left/right to position your car.
- - Jump to hit the ball in the air.
- - Boost can increase speed if available.

Role 0 (player) Game Agent Role Prompt.

You control the blue car. Drive, jump, and boost to hit the ball and score goals.

Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - A/D: Drive
- - W: Jump
- - Space: Boost or speed up (if available)

Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Pause briefly.
- - move\_left: Drive left.
- - move\_right: Drive right.
- - jump: Jump to hit the ball.
- - boost: Use boost or speed up if supported.

**26-run-3 (Run 3).**

Game Rules Prompt.

You are playing Run 3, an endless tunnel runner.Game Objective.  
- Survive as long as possible while navigating gaps.  
Game Rules.  
- Strafe left/right to navigate gaps.  
- Jump to cross gaps.

#### Role 0 (player) Game Agent Role Prompt.

You are the runner. Control the player to strafe and jump over gaps. You run through tunnels in Run 3.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):  
- ArrowLeft / ArrowRight: strafe left/right.  
- ArrowUp or Space: jump.

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:  
- wait: Pause briefly.  
- move\_left: Strafe left along the tunnel.  
- move\_right: Strafe right along the tunnel.  
- jump: Jump across gaps.

### 27-stack (Stack).

#### Game Rules Prompt.

You are playing Stack, a timing-based block stacking game.  
Game Objective.  
- Drop each moving block to align with the stack.  
- Build the highest stack possible.  
Game Rules.  
- Each block moves back and forth over the tower.  
- Press Space to drop the block.  
- Any overhanging part is cut off.  
- The game ends when there is no overlap.

#### Role 0 (player) Game Agent Role Prompt.

You control the stacking blocks. You are stacking blocks by timing drops.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):  
- Wait: Do nothing to wait the stack to move  
- Space: Drop the block once

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:  
- wait: Wait briefly.  
- drop\_block: Drop the block to place it.

### 28-temple-run-2 (Temple Run 2).

#### Game Rules Prompt.

You are playing Temple Run 2, an endless runner.  
Game Objective.  
- Survive as long as possible without missing turns and avoid obstacles.  
Game Rules.  
- Run along the paths and avoid obstacles.  
- Fail to turn, jump, or slide and you will die.

#### Role 0 (player) Game Agent Role Prompt.

You are the runner. Control the player to turn, jump, and slide to avoid obstacles. You run in Temple Run 2.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):  
- A/D: switch lanes and take turns.- - W: jump.
- - S: slide.
- - Space: start/continue.

### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Pause briefly.
- - start: Start or continue the run.
- - turn\_left: Switch left lanes or take a left turn.
- - turn\_right: Switch right lanes or take a right turn.
- - jump: Jump over obstacles.
- - slide: Slide under obstacles.

## 29-tetris (Tetris).

### Game Rules Prompt.

You are playing Tetris, a falling block puzzle game.

Game Objective.

- - Clear lines by filling them with blocks.
- - Prevent the stack from reaching the top and get the highest score possible.

Game Rules.

- - Pieces fall from the top and can be moved or rotated.
- - Completed horizontal lines clear and score points.
- - The game ends when new pieces can no longer spawn or the stack reaches the top.

### Role 0 (player) Game Agent Role Prompt.

You control the falling pieces. Control the player to move, rotate, drop, and swap pieces. You are playing Tetris. Keep the stack low and clear lines efficiently.

### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - ArrowLeft / ArrowRight: move
- - ArrowUp or X: rotate clockwise
- - Z: rotate counter-clockwise
- - ArrowDown: soft drop
- - Space: hard drop
- - Shift or C: Swap piece
- - Enter or click: start/retry

### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Wait briefly.
- - move\_left: Shift the piece left.
- - move\_right: Shift the piece right.
- - soft\_drop: Soft drop the piece faster.
- - rotate\_cw: Rotate the piece clockwise.
- - rotate\_ccw: Rotate the piece counter-clockwise.
- - hard\_drop: Hard drop the piece.
- - hold\_piece: Hold or swap the current piece.

## 30-vex-3 (Vex 3).

### Game Rules Prompt.

You are playing Vex 3, a precision platformer with traps.

Game Objective.

- - Reach the checkpoints and then the exit of each level.
- - Avoid spikes, saws, and falling hazards.

Game Rules.

- - Move left/right and jump between platforms.
- - Some sections require crouch sliding or dropping through gaps.
- - Death resets you to checkpoints.

### Role 0 (player) Game Agent Role Prompt.

You control the Vex character. Control the player to move, jump, and slide. You are a platform runner. Reach the exit while avoiding traps.

### Role 0 (player) Computer-Use Controls Prompt.**ACTION SPACE (ONLY LEGAL ACTIONS):**

- - ArrowLeft / ArrowRight or A/D: move
- - ArrowUp + ArrowLeft / ArrowRight: jump left/right
- - ArrowDown + ArrowLeft / ArrowRight: crouch slide left/right

**Role 0 (player) Generalist Semantic Action List.**

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Pause briefly.
- - move\_left: Move left.
- - move\_right: Move right.
- - jump: Jump.
- - jump\_left: Jump left.
- - jump\_right: Jump right.
- - crouch\_slide\_left: Crouch slide left.
- - crouch\_slide\_right: Crouch slide right.

**31-wolf3d (Wolfenstein 3D).**

**Game Rules Prompt.**

You are playing Wolfenstein 3D, a first-person shooter game.

**Game Objective.**

- - Search for and defeat enemy guards by shooting them.
- - Survive while maximizing kills.

**Game Rules.**

- - You control a soldier in a first-person 3D view.
- - Guards will shoot at you when triggered (shot by you or you are nearby), reducing your health.
- - You start with a pistol and 8 bullets.
- - Killed guards may drop ammo clips.
- - If your health reaches 0, you die and respawn (losing one life).
- - The game ends when you run out of lives.
- - Some enemies are in an adjacent room behind a door. Use the SPACE key near a door to open it.

**Role 0 (player) Game Agent Role Prompt.**

You control a soldier in first-person view. Shoot guards to defeat them. Manage your ammo and health. Use arrow keys to move and turn, X to shoot, SPACE to open doors.

**Role 0 (player) Computer-Use Controls Prompt.**

**ACTION SPACE (ONLY LEGAL ACTIONS):**

- - ArrowUp: Move forward
- - ArrowDown: Move backward
- - ArrowLeft: Turn left
- - ArrowRight: Turn right
- - x: Shoot / Attack
- - Space: Use / Open door

**Role 0 (player) Generalist Semantic Action List.**

REGISTERED ACTIONS (Semantic Controls).  
Choose exactly one action per step:

- - wait: Pause briefly to observe.
- - move\_forward: Move forward.
- - move\_backward: Move backward.
- - turn\_left: Turn left.
- - turn\_right: Turn right.
- - shoot: Shoot / Attack.
- - open\_door: Use / Open door.

**32-wordle (Wordle).**

**Game Rules Prompt.**

You are playing Wordle, a word guessing game.

**Game Objective.**

- - Guess the hidden five-letter word in six tries.
- - Use color feedback to refine guesses.

**Game Rules.**

- - Each guess must be a valid five-letter word.
- - Green means correct letter in the correct place.
- - Yellow means correct letter in the wrong place.
- - Gray means the letter is not in the word.
- - The guess is auto-submitted when you type the 5th letter.

**Role 0 (player) Game Agent Role Prompt.**You control the keyboard input. Type a valid five-letter word to make a guess. The guess is auto-submitted when 5 letters are entered. Use color feedback (green/yellow/gray) to narrow down the answer.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - Type a 5-letter word (e.g., type "hello").
- - The guess is auto-submitted when 5 letters are entered.
- - Use Backspace to delete a letter

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Wait briefly.
- - type\_text: Type letters by setting a text value.
- - submit\_guess: Confirm the guess by pressing Enter.

### 33-worlds-hardest-game (World's Hardest Game).

#### Game Rules Prompt.

You are playing World's Hardest Game, a precision maze dodge game.

Game Objective.

- - Collect all coins in the level.
- - Reach the green exit zone to complete the level.

Game Rules.

- - You control a red square in a maze.
- - Blue enemies move on fixed paths and kill you on contact.
- - Checkpoints save progress after collecting items. Use checkpoints to split the level into safe segments.
- - Observe enemy cycles carefully to avoid them.

#### Role 0 (player) Game Agent Role Prompt.

You control the red square. Move to collect coins and reach the exit while avoiding enemies. Avoid blue enemies, collect all coins, and reach the green exit.

#### Role 0 (player) Computer-Use Controls Prompt.

ACTION SPACE (ONLY LEGAL ACTIONS):

- - ArrowUp: Move up
- - ArrowDown: Move down
- - ArrowLeft: Move left
- - ArrowRight: Move right

#### Role 0 (player) Generalist Semantic Action List.

REGISTERED ACTIONS (Semantic Controls).

Choose exactly one action per step:

- - wait: Pause briefly to observe.
- - move\_up: Move up.
- - move\_down: Move down.
- - move\_left: Move left.
- - move\_right: Move right.

### 34-worlds-hardest-game-2 (World's Hardest Game 2).

#### Game Rules Prompt.

You are playing World's Hardest Game, a precision maze dodge game.

Game Objective.

- - Collect all coins in the level.
- - Reach the green exit zone to complete the level.

Game Rules.

- - You control a red square in a maze.
- - Blue enemies move on fixed paths and kill you on contact.
- - Observe enemy cycles carefully to avoid them

#### Role 0 (player) Game Agent Role Prompt.

You control the red square. Move to collect items and reach the exit while avoiding enemies. Avoid blue enemies, collect all items, and reach the green exit.

#### Role 0 (player) Computer-Use Controls Prompt.**ACTION SPACE (ONLY LEGAL ACTIONS):**

- - ArrowUp: Move up
- - ArrowDown: Move down
- - ArrowLeft: Move left
- - ArrowRight: Move right

**Role 0 (player) Generalist Semantic Action List.**

**REGISTERED ACTIONS (Semantic Controls).**  
Choose exactly one action per step:

- - wait: Pause briefly to observe.
- - move\_up: Move up.
- - move\_down: Move down.
- - move\_left: Move left.
- - move\_right: Move right.

### E.3 Model Output-Format Blocks

Below we list the exact output\_format block from each registered model specification.

**Claude-Sonnet-4.6 (Computer-Use).**

- - Use the Claude computer-use tool to return exactly one action or key combination per step.
- - You are not allowed to take screenshots on your own.
- - Do not output free-form text outside tool calls.

**Claude-Sonnet-4.6 (Generalist).**

- - You must call exactly ONE tool per step.
- - The tool name must be a registered action id.
- - Include 'reasoning' as a short rationale.
- - Do not output free-form text.

**Gemini-2.5-Computer-Use.**

- - Use the built-in computer-use tool to return one action or key combination per step.
- - Do not output free-form text outside tool calls.

**Gemini-3-Flash-Preview.**

- - You must call exactly ONE tool per step.
- - The tool name must be a registered action id.
- - Include 'reasoning' as a short rationale.
- - Do not output free-form text.

**GLM-4.6V.**

- - You must call exactly ONE tool per step with tool\_calls.
- - The tool name must be a registered action id.
- - Include 'reasoning' as a short rationale.
- - Do not output free-form text.

**Grok-4.1-Fast-Reasoning.**

- - You must call exactly ONE tool per step.
- - The tool name must be a registered action id.
- - Include 'reasoning' as a short rationale.
- - Do not output free-form text.

**Kimi-K2.5.**

- - You must call exactly ONE tool per step.
- - The tool name must be a registered action id.
- - Include 'reasoning' as a short rationale.
- - Do not output free-form text.

**Qwen3-VL-235B-A22B (Computer-Use).**Response format for every step:  
A <think> ... </think> block of a very short sentence describing what to do.  
A single <tool\_call>...</tool\_call> block containing only the JSON: {"name": "<function-name>", "arguments": <args-  
json-object>}

Use the 'computer\_use' tool call and return exactly one action per step.  
Use only the <think> and the <tool\_call> block; do not add any other text.

### Qwen3-VL-235B-A22B (Generalist).

Response format for every step:  
A <think> ... </think> block of a very short sentence describing what to do.  
A single <tool\_call>...</tool\_call> block containing only the JSON: {"name": "<function-name>", "arguments": <args-  
json-object>}

Use only the <think> and the <tool\_call> block; do not add any other text.

### Qwen3-VL-30B-A3B (Computer-Use).

Response format for every step:  
A <think> ... </think> block of a very short sentence describing what to do.  
A single <tool\_call>...</tool\_call> block containing only the JSON: {"name": "<function-name>", "arguments": <args-  
json-object>}

Use the 'computer\_use' tool call and return exactly one action per step.  
Use only the <think> and the <tool\_call> block; do not add any other text.

### Qwen3-VL-30B-A3B (Generalist).

Response format for every step:  
A <think> ... </think> block of a very short sentence describing what to do.  
A single <tool\_call>...</tool\_call> block containing only the JSON: {"name": "<function-name>", "arguments": <args-  
json-object>}

Use only the <think> and the <tool\_call> block; do not add any other text.

### OpenAI-Computer-Use.

- - Use the computer-use tool to return exactly one action or key combination per step.
- - Do not output free-form text outside tool calls.
- - Do not take screenshots on your own.

### GPT-5.2.

- - You must call exactly ONE tool per step.
- - The tool name must be a registered action id.
- - Include 'reasoning' as a short rationale.
- - Do not output free-form text.

### Qwen3-VL-Plus (Computer-Use).

- - Use the 'computer\_use' tool call and return exactly one action or key combination per step.
- - Do not output free-form text outside <tool\_call> blocks.

### Qwen3-VL-Plus (Generalist).

- - You must call exactly ONE tool per step.
- - The tool name must be a registered action id.
- - Include 'reasoning' as a short rationale.
- - Do not output free-form text.

### Seed-1.8 (Computer-Use).

ACTION FORMAT (use exactly this syntax):

- - Press single key: hotkey(key='<key>')
- - Press multiple keys: hotkey(key='<key1> <key2>')
- - Click at position: click(point='<point>x y</point>')
- - Right click at position: right\_single(point='<point>x y</point>')
- - Wait/observe: wait()

Examples:

- - hotkey(key='w')

# Press W
- - hotkey(key='w d')

# Press W and D together (jump right)
- - hotkey(key='arrowup')

# Press Up arrow
- - click(point='<point>640 360</point>')

# Click at center
- - right\_single(point='<point>640 360</point>')

# Right click at center

### Seed-1.8 (Generalist).

- - You must call exactly ONE tool per step.
- - The tool name must be a registered action id.
- - Include 'reasoning' as a short rationale.
- - Do not output free-form text.

### UI-TARS-1.5-7B.

ACTION FORMAT (use exactly this syntax):

Press single key: hotkey(key='')
Press multiple keys: hotkey(key=' ')
Click at position: click(point='x y')
Right click at position: right_single(point='x y')
Wait/observe: wait()

Examples:

hotkey(key='w')

Press W

hotkey(key='w d')

Press W and D together (jump right)

hotkey(key='arrowup')

Press Up arrow

click(point='640 360')

Click at center

right_single(point='640 360')

Right click at center


## F Costs and Licensing Considerations

*Licensing Considerations.* **GameWorld** spans both proprietary and open-source browser games. We are sincerely grateful to the original game creators and rights holders whose work makes this benchmark possible. Our release policy is designed to respect upstream authorship, licensing terms, and distribution requirements. In particular, users are responsible for purchasing or obtaining lawful access to any benchmarked game. Any game access or distribution must comply with the applicable licenses and permissions.

Here we explicitly state the following licensing and compliance guidelines: "This project is intended strictly for research and benchmarking purposes. It does not grant any rights to access, reproduce, distribute, modify, or commercially use third-party games or related assets beyond those permitted by applicable licenses, terms of service, and law. Users are solely responsible for purchasing or obtaining lawful access and any necessary permissions for evaluation, dataset creation, model development, or downstream use."

**Table 13** Estimated benchmark cost of evaluating all 170 tasks for each model. Input / output token counts are averaged per step.

<table border="1"><thead><tr><th>Model</th><th>Input<br/>Tokens / Step</th><th>Output<br/>Tokens / Step</th><th>Total Cost<br/>(USD)</th></tr></thead><tbody><tr><td>Claude-Sonnet-4.6 (Computer-Use)</td><td>3344.9</td><td>131.7</td><td>172.46</td></tr><tr><td>Claude-Sonnet-4.6 (Generalist)</td><td>4865.8</td><td>86.3</td><td>244.03</td></tr><tr><td>Gemini-2.5-Computer-Use</td><td>2086.6</td><td>15.4</td><td>41.06</td></tr><tr><td>Gemini-3-Flash-Preview</td><td>4005.6</td><td>40.5</td><td>29.13</td></tr><tr><td>GLM-4.6V</td><td>4820.2</td><td>253.2</td><td>24.79</td></tr><tr><td>GPT-5.2</td><td>3924.6</td><td>38.9</td><td>110.68</td></tr><tr><td>Grok-4.1-Fast-Reasoning</td><td>2551.1</td><td>907.7</td><td>9.86</td></tr><tr><td>Kimi-K2.5</td><td>5030.9</td><td>250.0</td><td>45.77</td></tr><tr><td>OpenAI-Computer-Use</td><td>1684.4</td><td>87.2</td><td>94.65</td></tr><tr><td>Qwen3-VL-Plus (Computer-Use)</td><td>2112.3</td><td>27.7</td><td>4.99</td></tr><tr><td>Qwen3-VL-Plus (Generalist)</td><td>3675.1</td><td>61.3</td><td>9.01</td></tr><tr><td>Seed-1.8 (Computer-Use)</td><td>2184.0</td><td>210.6</td><td>13.91</td></tr><tr><td>Seed-1.8 (Generalist)</td><td>2642.6</td><td>179.9</td><td>14.85</td></tr><tr><td colspan="3"><b>Total Cost (all listed models)</b></td><td><b>815.19</b></td></tr></tbody></table>

*Cost Summary.* Table 13 reports average input and output tokens per step, together with the estimated total dollar cost for evaluating all 170 benchmark tasks. For the input-token column, we include cached input tokenswhen present, i.e., `per_step_input` + `per_step_cache` if the model API supports caching. The total cost column is the measured average cost per task from all the trace logs. The underlying model pricing estimates are taken from the pricing snapshot recorded on March 7, 2026. The cost for open-weight models, including Qwen3-VL-235B-A22B, Qwen3-VL-30B-A3B, and UI-TARS-1.5-7B, is not included in the calculation. The final total cost for evaluating all 170 tasks across all listed models is **815.19 USD**.

## References

- [1] Max Abernethy. Cubefield. Flash game (preserved on Internet Archive), 2006. URL [https://archive.org/details/cubefield\\_flash](https://archive.org/details/cubefield_flash).
- [2] Amazing Adam. Vex 3. Browser platform game, 2014. URL <https://apps.microsoft.com/detail/9ntlfr2tdg7z>.
- [3] Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, and Gunhee Kim. Flashadventure: A benchmark for gui agents solving full story arcs in diverse adventure games. In *EMNLP*, 2025.
- [4] Oslo Albet. Fireboy and watergirl. Flash puzzle-platform game, 2009. URL [https://en.wikipedia.org/wiki/Fireboy\\_and\\_Watergirl](https://en.wikipedia.org/wiki/Fireboy_and_Watergirl).
- [5] Anthropic. Claude: Constitutional ai models from anthropic. <https://www.anthropic.com/claude>, 2024. Official description of the Claude model family.
- [6] Atari, Inc. Breakout. Arcade game manual (preserved digital artifact), 1976. URL <https://archive.org/details/ArcadeGameManualBreakout>.
- [7] Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, and Spencer Whitehead. Webgym: Scaling training environments for visual web agents with realistic tasks, 2026.
- [8] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, and Ke Zhu. Qwen3-vl technical report. *arXiv preprint arXiv:2511.21631*, 2025.
- [9] Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. *Advances in Neural Information Processing Systems*, 35:24639–24654, 2022.
- [10] Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning. *Science*, 378(6624):1067–1074, 2022.
- [11] Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. Deepmind lab. *arXiv preprint arXiv:1612.03801*, 2016.
- [12] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
- [13] Gabriele Cirulli. 2048. GitHub repository (browser game), 2014. URL <https://github.com/gabrielecirulli/2048>.
- [14] Joseph Cloutier. Run 3. Browser game, 2014. URL <https://player03.com/run/3/beta/>.
- [15] Coolmath Games. Another gentleman’s adventure. Browser game page (Coolmath Games), 2018. URL <https://www.coolmathgames.com/0-another-gentlemans-adventure>.
- [16] Marc-Alexandre Côté, Akos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. Textworld: A learning environment for text-based games. In *Computer Games: 7th Workshop, CGW 2018, Held in Conjunction with the 27th International Conference on*Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers 7, pages 41–75. Springer, 2019.

- [17] Stephen Critoph. The world’s hardest game. Flash game, 2007. URL [https://en.wikipedia.org/wiki/The\\_World%27s\\_Hardest\\_Game](https://en.wikipedia.org/wiki/The_World%27s_Hardest_Game).
- [18] Stephen Critoph. The world’s hardest game 2. Flash game, 2008. URL [https://archive.org/details/worlds\\_hardestgame2\\_202310](https://archive.org/details/worlds_hardestgame2_202310).
- [19] Jacob DeBenedetto. Boxel rebound. Browser game / extension distribution (official developer site), 2017. URL <https://www.dopplercreative.com/boxel-rebound/privacy-policy>.
- [20] Dedra Games. Ovo. Google Play app listing, 2018. URL <https://play.google.com/store/apps/details?id=com.dedra.ovo>.
- [21] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL <https://openreview.net/forum?id=kiYqb03wqw>.
- [22] Cody Ebberson. The adventures of captain callisto. JS13K Games entry (browser game), 2021. URL <https://js13kgames.com/entries/the-adventures-of-captain-callisto>.
- [23] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343–18362, 2022.
- [24] Karol Franz. Fullscreenmario. GitHub repository (browser game engine/implementation), 2013. URL <https://github.com/karol-f/FullScreenMario>.
- [25] Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. Assistgui: Task-oriented desktop graphical user interface automation. arXiv preprint arXiv:2312.13108, 2023.
- [26] Gemini Team. Gemini: A family of highly capable multimodal models, 2025. URL <https://arxiv.org/abs/2312.11805>.
- [27] Google. Gemini 2.5 computer use model card. <https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Computer-Use-Model-Card.pdf>, 2025. System card describing the Gemini 2.5 Computer Use model.
- [28] Google Chrome team. Chrome dino (offline dinosaur game). Built-in browser game in Google Chrome, 2014. URL <https://blog.google/products-and-platforms/products/chrome/chrome-dino/>.
- [29] Google LLC. Google snake. Google Doodle browser game, 2013. URL [https://www.google.com/fbx?fbx=sna\\_ke\\_arcade](https://www.google.com/fbx?fbx=sna_ke_arcade).
- [30] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report. arXiv preprint arXiv:2505.07062, 2025.
- [31] Hextris contributors. Hextris. GitHub repository (browser game), 2014. URL <https://github.com/Hextris/hextris>.
- [32] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.
- [33] Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang. Imggame-bench: How good are llms at playing games?, 2025.
- [34] Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. The dawn of gui agent: A preliminary case study with claude 3.5 computer use. arXiv preprint arXiv:2411.10323, 2024.
- [35] IdeiGeniale. Geodash 2.2. GitHub repository (browser game), 2025. URL <https://github.com/IdeiGeniale/GeoDash2.2>.
- [36] Imangi Studios. Temple run 2. Google Play listing, 2013. URL <https://play.google.com/store/apps/details?id=com.imangi.templerun2>.- [37] Toru Iwatani. Pac-man. Official franchise history page, 1980. URL <https://pacman.com/en/history/>.
- [38] Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, and Fei Huang. Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents, 2025.
- [39] Ketchapp. Stack. App Store listing, 2016. URL <https://apps.apple.com/us/app/stack/id1080487957>.
- [40] Leko. Restless wing syndrome. itch.io game page, 2020. URL <https://leko.itch.io/restless-wing-syndrome>.
- [41] Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, and Yitao Liang. Jarvis-vla: Post-training large-scale vision language models to play visual games with keyboards and mouse. *arXiv preprint arXiv:2503.16365*, 2025.
- [42] Shalev Lifshitz, Keiran Paster, Harris Chan, Jimmy Ba, and Sheila McIlraith. Steve-1: A generative model for text-to-behavior in minecraft. *Advances in Neural Information Processing Systems*, 36:69900–69929, 2023.
- [43] Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, et al. Nitrogen: An open foundation model for generalist gaming agents, 2025. Preprint.
- [44] Gurpreet Singh Matharoo. Rocket league 2d. itch.io game page, 2017. URL <https://matharoo.itch.io/r12d>.
- [45] Microsoft. Microsoft minesweeper. Microsoft Store app listing, 2012. URL <https://apps.microsoft.com/detail/9wzdncrfhwcn>.
- [46] Microsoft Edge team. Microsoft edge surf. Built-in browser game in Microsoft Edge, 2020. URL <https://blogs.windows.com/msedgedev/2020/05/26/surf-game-edge-stable/>.
- [47] NAGI-P SOFT. Ns-shaft. Official developer download/info page, 2001. URL <https://www.nagi-p.com/v1/eng/nsshift.html>.
- [48] Dong Nguyen. Flappy bird. Mobile game, 2013. URL [https://en.wikipedia.org/wiki/Flappy\\_Bird](https://en.wikipedia.org/wiki/Flappy_Bird).
- [49] OpenAI. Computer-using agent. <https://openai.com/index/computer-using-agent/>, 2025. Accessed 2025-08-18.
- [50] OpenAI. Gpt-5 system card. <https://cdn.openai.com/gpt-5-system-card.pdf>, 2025. System card describing the GPT-5 model family.
- [51] Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wołczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. Balrog: Benchmarking agentic llm and vlm reasoning on games, 2025. ICLR 2025.
- [52] Alexey Pajitnov. Tetris. Official history/about page, 1984. URL <https://tetris.com/about>.
- [53] Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, et al. Orak: A foundational benchmark for training and evaluating llm agents on diverse video games, 2025. Preprint and project release.
- [54] Igor Pušenjak and Marko Pušenjak. Doodle jump. Mobile game, 2009. URL [https://en.wikipedia.org/wiki/Doodle\\_Jump](https://en.wikipedia.org/wiki/Doodle_Jump).
- [55] randomyang. Core ball. GitHub repository (HTML5 browser game), 2015. URL <https://github.com/randomyang/core-ball>.
- [56] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. *arXiv preprint arXiv:2302.04761*, 2023.
- [57] ByteDance Seed. Ui-tars-1.5. <https://seed-tars.com/1.5>, 2025.
- [58] Jacob Seidelin. Wolfenstein 3d html5. GitHub repository, 2012. URL <https://github.com/jseidelin/wolf3d>.
- [59] SIMA Team, Maria Abi Raad, Arun Ahuja, Catarina Barros, Frederic Besse, Andrew Bolt, Adrian Bolton, Bethanie Brownfield, Gavin Buttimore, Max Cant, et al. Scaling instructable agents across many simulated worlds, 2024.
- [60] SIMA Team, Adrian Bolton, Alexander Lerchner, Alexandra Cordell, et al. Sima 2: A generalist embodied agent for virtual worlds, 2025.- [61] Haochen Sun, Shuwen Zhang, Lujie Niu, Lei Ren, Hao Xu, Hao Fu, Fangkun Zhao, Caixia Yuan, and Xiaojie Wang. Collab-overcooked: Benchmarking and evaluating large language models as collaborative agents, 2025.
- [62] Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, and Cor-Paul Bezemer. Videogameqa-bench: Evaluating vision-language models for video game quality assurance. In *NeurIPS Datasets and Benchmarks Track*, 2025. Paper reports 9 QA task types and 4,786 questions over 800+ games.
- [63] Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al. Cradle: Empowering foundation agents towards general computer control. [arXiv preprint arXiv:2403.03186](https://arxiv.org/abs/2403.03186), 2024.
- [64] Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, et al. Lumine: An open recipe for building generalist agents in 3d open worlds. [arXiv preprint arXiv:2511.08892](https://arxiv.org/abs/2511.08892), 2025.
- [65] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. [arXiv preprint arXiv:2507.20534](https://arxiv.org/abs/2507.20534), 2025.
- [66] Rye Terrell. Astray. GitHub repository and GitHub Pages browser game, 2015. URL <https://github.com/wwtyro/Astray>.
- [67] TinyDobbins. Monkey mart. Browser game page (Poki), 2022. URL <https://poki.com/en/g/monkey-mart>.
- [68] Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Zhiheng Xi, Changhao Jiang, Zhangyue Yin, Yining Zheng, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, and Xuanjing Huang. Game-rl: Synthesizing multimodal verifiable game data to boost vlms' general reasoning, 2025.
- [69] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. [arXiv preprint arXiv:2305.16291](https://arxiv.org/abs/2305.16291), 2023.
- [70] Xinyu Wang, Bohan Zhuang, and Qi Wu. Are large vision language models good game players?, 2025.
- [71] Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, and Yitao Liang. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. [arXiv preprint arXiv:2311.05997](https://arxiv.org/abs/2311.05997), 2023.
- [72] Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, et al. Game-tars: Pretrained foundation models for scalable generalist multimodal game agents, 2025. Technical report.
- [73] Josh Wardle. Wordle. Browser game, 2021. URL <https://en.wikipedia.org/wiki/Wordle>.
- [74] xAI. Grok: xai's multimodal reasoning model. <https://x.ai/blog/grok>, 2024. Official description of the Grok model family.
- [75] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024.
- [76] Yunfei Xie, Yinsong Ma, Shiyi Lan, Alan Yuille, Junfei Xiao, and Chen Wei. Play to generalize: Learning to reason through game play, 2025.
- [77] Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Zhiming Ding, and Bo Zheng. Deepphy: Benchmarking agentic vlms on physical reasoning, 2025. URL <https://arxiv.org/abs/2508.05405>.
- [78] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. [arXiv preprint arXiv:2210.03629](https://arxiv.org/abs/2210.03629), 2022.
- [79] Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, and Ofir Press. Videogamebench: Can vision-language models complete popular video games?, 2025.
- [80] Kuan Zhang, Dongchen Liu, Qiyue Zhao, Jinkun Hou, Xinran Zhang, Qinlei Xie, Miao Liu, and Yiming Li. Gameverse: Can vision-language models learn from video-based reflection?, 2026.
- [81] Zhonghan Zhao, Wenhao Chai, Xuan Wang, Li Boyi, Shengyu Hao, Shidong Cao, Tian Ye, Jenq-Neng Hwang, and Gaoang Wang. See and think: Embodied agent in virtual environment. [arXiv preprint arXiv:2311.15209](https://arxiv.org/abs/2311.15209), 2023.- [82] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v(ision) is a generalist web agent, if grounded. In Forty-first International Conference on Machine Learning, 2024. URL <https://openreview.net/forum?id=pi ecKJ2D1B>.
- [83] Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, and Lijuan Wang. V-mage: A game evaluation framework for assessing vision-centric capabilities in multimodal large language models. arXiv preprint arXiv:2504.06148, 2025.
- [84] Xinyue Zheng, Haowei Lin, Kaichen He, Zihao Wang, Qiang Fu, Haobo Fu, Zilong Zheng, and Yitao Liang. Mcu: An evaluation framework for open-ended game agents. In Forty-second International Conference on Machine Learning, 2025.
- [85] Zhipu AI / Z.ai. Minecraft clone. Web demo (capability gallery), 2026. URL <https://showcase.z.ai/>.
- [86] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

Xet Storage Details

Size:: 191 kB
Xet hash:: 83ddb1c660064688bee3110b8e4fd974fa8861526de73ab1c6684d1559e042f5

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Abstract

Contents

1 Introduction

2 Game Agent

2.1 Agent Interfaces

move pointer

primary click

secondary click

double click

triple click

drag pointer

scroll up

scroll down

text entry

single key

key combo

idle

player: forward

player: backward

player: strafe left

player: strafe right

view: camera up

view: camera down

view: camera turn left

view: camera turn right

action: player jump

action: player duck

item: fire weapon

idle: no operation

2.2 Computer-Use Agents: Low-Level Controls

2.3 Generalist Agents: Semantic Action Parsing

2.4 Agent Harnesses

2.4.1 Structured Prompt

2.4.2 Context Memory

2.4.3 Reasoning

2.4.4 Customized Function Calling

3 GameWorld Benchmark

3.1 Benchmark Design

Games

Tasks

Models

3.2 Games and Tasks

3.3 Game Information

3.4 Browser-Based Sandbox Environment

3.5 Outcome-Based State-Verifiable Evaluation

4 Experiments

4.1 Experiment Setup

4.2 Main Results

4.3 Benchmark Robustness Under Repeated Evaluation

4.4 Capability-Aligned Curriculum Analysis

4.5 Challenges and Analyses: Real-Time Interaction, Context-Memory Sensitivity, Action Validity, and Failure Modes

4.5.1 GameWorld-RT: Real-Time Benchmark

Model Output:The obstacle is coming from the left, so I should move left first.

Invalid Reason:The model does not generate an executable tool call but instead returns free-form natural language.

Model Output:"<think> I should first inspect the scene carefully before acting... </think> <tool_call> {"name": "

Invalid Reason:The tool-call block is never properly closed.

Model Output:craft_a_workbench()

Invalid Reason:The model returns a plausible tool call, but this semantic action is not registered.

Model Output:left_click(x=512, y=384)

Invalid Reason:The CUA model calls a computer-use tool, but mouse clicking is outside the allowed control space.

4.5.2 Context-Memory Sensitivity

4.5.3 Action Validity and Instruction Following

\text{valid_actions}(r)}{\sum_{r \in \mathcal{R}} \

\text{proposed_actions}(r)}. \quad (3)$$

4.5.4 Failure Modes and Analysis

5 Case Study

5.1 Game Agent Interface Comparison: Generalist Agent vs. CUA

Game Agent Interface Comparison: Generalist Agent vs. CUA

Long-Horizon Simulation: Minecraft Resource Collection

DESCRIPTION

Real-Time Reaction and Timing Control: Flappy Bird

DESCRIPTION

5.2 Long-Horizon Simulation: Minecraft Resource Collection

5.3 Real-Time Reaction and Timing Control: Flappy Bird

6 Related Work

6.1 Computer-Use Benchmarks with Online Environments

6.2 Video Game Benchmarks for LLM and MLLM Agents

6.3 Game Agents and Scalable Infrastructure

7 Conclusion

Model Output:
The obstacle is coming from the left, so I should move left first.

Invalid Reason:
The model does not generate an executable tool call but instead returns free-form natural language.

Model Output:
"<think> I should first inspect the scene carefully before acting... </think> <tool_call> {"name": "

Invalid Reason:
The tool-call block is never properly closed.

Model Output:
craft_a_workbench()

Invalid Reason:
The model returns a plausible tool call, but this semantic action is not registered.

Model Output:
left_click(x=512, y=384)

Invalid Reason:
The CUA model calls a computer-use tool, but mouse clicking is outside the allowed control space.