PhysTool-Bench: Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

🐱 GitHub ｜ 📄 Paper ｜ 🏠 Project Page ｜ 🤗 HuggingFace Papers

📊 Dataset Summary

PhysTool-Bench is a multimodal benchmark designed to evaluate how well Multimodal Large Language Models (MLLMs) perceive, select, and sequence physical tools in real-world scenes. Unlike traditional tool-use benchmarks that focus on digital APIs, this dataset probes an MLLM's ability to ground functional reasoning in cluttered, physical environments.

The benchmark features 2,510 high-quality queries covering 2,678 unique physical tools across 57 distinct categories (e.g., manufacturing, healthcare, farming).

Key Features

Two‑Task Evaluation: Decouples pure visual recognition from functional planning and sequencing.
Real‑World Clutter: Each scene contains an average of 8.6 tools (3.1 required targets and 5.5 visually/functionally similar distractors).
Sequential Logic: 86.9% of the tasks require a strict execution order, rigorously testing the model's physical commonsense.

🎯 Supported Tasks

The dataset separates evaluation into two distinct tracks to pinpoint whether model failures stem from visual bottlenecks or poor physical reasoning.

Task I: Tool Recognition

Input: A real-world scenario image.
Objective: Enumerate all visible tools in the cluttered scene.
Purpose: Measures pure visual enumeration and recognition capabilities.
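One plausible way to score Task I is to compare the set of tool names a model lists against the set of tools annotated for the scene. The sketch below is an illustration only: this card does not state the benchmark's official metric, the function name `recognition_scores` is hypothetical, and the tool names are invented for the example.

```python
def recognition_scores(predicted, reference):
    """Set-based precision, recall, and F1 over lowercased tool names.

    Assumption: names match after trimming and lowercasing. The official
    evaluation scripts may use a more tolerant matching scheme.
    """
    pred = {name.strip().lower() for name in predicted}
    ref = {name.strip().lower() for name in reference}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical model output and annotation for one scene.
p, r, f = recognition_scores(
    ["Hammer", "wrench", "pliers"],
    ["hammer", "wrench", "screwdriver", "tape measure"],
)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of the three predicted names are correct and two of the four annotated tools are found, so precision is 2/3 and recall is 1/2.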

Task II: Tool Selection & Planning

Input: A real-world scenario image paired with a brief task instruction.
Objective: Output the exact, ordered sequence of tools required to complete the specified task.
Purpose: Measures functional mapping, physical commonsense, and multi-step planning capabilities.
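Because Task II asks for an ordered sequence, it can help to separate "picked the right tools" from "put them in the right order". The following is a minimal sketch of that distinction, not the benchmark's official scoring: `sequence_scores` is a hypothetical helper and the tool names are made up.

```python
def sequence_scores(predicted, reference):
    """Return (selection_ok, order_ok) for a predicted tool sequence.

    selection_ok: the same tools were chosen, ignoring order.
    order_ok:     the same tools were chosen in exactly the reference order.
    """
    pred = [name.strip().lower() for name in predicted]
    ref = [name.strip().lower() for name in reference]
    selection_ok = sorted(pred) == sorted(ref)
    order_ok = pred == ref
    return selection_ok, order_ok


# Hypothetical case: the right tools, but in the wrong order.
selection_ok, order_ok = sequence_scores(
    ["screwdriver", "drill"],
    ["drill", "screwdriver"],
)
print(selection_ok, order_ok)  # True False
```

Reporting the two flags separately shows whether a failure comes from tool selection or from sequencing.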

📁 Dataset Structure

Unlike standard text-to-text datasets, PhysTool-Bench keeps images, model inputs, and ground truth in separate files so that inference and evaluation can run independently. The repository contains the raw images and three metadata files:

images/: Directory containing all high-resolution physical scenario images.
generation_checkpoint.json: The input file used for model inference. It contains the image paths and task_instruct prompts for Task II.
corrected_tools.json: The ground truth file used for evaluation. It contains the refined taxonomy, required tools (target_tools), target_steps for ordered tasks, and negative_tools (distractors).
final_matching_info.json: The alignment and mapping metadata file utilized by the offline evaluation pipeline to support tool taxonomy normalization and rule-based verification.
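Since the exact schema of these JSON files is not documented here, a reasonable first step is to inspect their top-level shape before writing any parsing code. The sketch below assumes only that the three file names above exist in a local copy of the dataset; the helpers `describe_json` and `describe_metadata` are hypothetical.

```python
import json
from pathlib import Path


def describe_json(path):
    """Summarize the top-level shape of a JSON file without assuming a schema."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return {"type": "dict", "size": len(data), "sample_keys": sorted(data)[:5]}
    if isinstance(data, list):
        first = data[0] if data else None
        keys = sorted(first)[:5] if isinstance(first, dict) else []
        return {"type": "list", "size": len(data), "sample_keys": keys}
    return {"type": type(data).__name__, "size": 1, "sample_keys": []}


def describe_metadata(dataset_path):
    """Describe whichever of the three metadata files are present locally."""
    names = [
        "generation_checkpoint.json",
        "corrected_tools.json",
        "final_matching_info.json",
    ]
    root = Path(dataset_path)
    return {name: describe_json(root / name) for name in names if (root / name).exists()}
```

Calling `describe_metadata(dataset_path)` on the downloaded folder prints each file's container type, entry count, and a few keys, which is enough to decide how to pair inputs with ground truth.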

Example: Loading the Raw Data

You can download and explore the raw dataset with huggingface_hub, Pillow, and the Python standard library:

import json
import os
from huggingface_hub import snapshot_download
from PIL import Image

# 1. Download the dataset folder
dataset_path = snapshot_download(repo_id="ModalityDance/PhysTool-Bench", repo_type="dataset")

# 2. Load the input metadata
with open(os.path.join(dataset_path, "generation_checkpoint.json"), "r", encoding="utf-8") as f:
    inputs = json.load(f)

# 3. Explore a sample
sample = inputs[0]
print(f"Task Instruction: {sample['task_instruct']}")

# Load corresponding image
img_path = os.path.join(dataset_path, sample['image_path'])
Image.open(img_path).show()

⚠️ Inference & Evaluation (Important)

Standard Hugging Face pipelines such as pipeline("visual-question-answering") are not sufficient for evaluating this benchmark, because the task requires open-ended, ordered tool sequences rather than a single short answer. To run PhysTool-Bench properly, please use our official GitHub repository.

Why use the official codebase?

Environment Isolation: Different MLLMs require conflicting dependency versions (e.g., PyTorch, Transformers, Accelerate). Our repo provides standalone inference scripts for major models.
Dual Evaluation Pipelines: Simple exact string matching fails on open-ended generation due to synonyms and morphological variations. We provide two robust alternatives:
- Offline Evaluation (eval_offline.py): Fast, local rule-based matching using final_matching_info.json for API-free evaluation.
- LLM-as-a-Judge (eval_gemini.py): Deep semantic one-to-one mapping via the Gemini API to resolve complex synonyms and functional equivalents.

Head over to ModalityDance/PhysTool-Bench for the complete quickstart guide, environment setups, and automated evaluation scripts.
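To see why exact string matching is too strict for open-ended output, consider a toy normalizer that lowercases names, strips punctuation and a trailing plural "s", and applies a synonym table. This is only an illustration of the problem: it is not the logic of eval_offline.py, the `SYNONYMS` table is invented, and the real mapping data in final_matching_info.json may look quite different.

```python
import re

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"spanner": "wrench", "box cutter": "utility knife"}


def normalize(name):
    """Lowercase, drop punctuation, strip a crude plural 's', map synonyms."""
    text = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # Crude plural handling; applied to both sides, so it stays consistent.
    if text.endswith("s") and not text.endswith("ss"):
        text = text[:-1]
    return SYNONYMS.get(text, text)


predicted, reference = "Spanners", "wrench"
print(predicted == reference)                        # False
print(normalize(predicted) == normalize(reference))  # True
```

Rule-based normalization of this kind is fast but brittle, which is presumably why an LLM-as-a-judge option is offered for harder synonym and functional-equivalence cases.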

📚 Citation

If you use PhysTool-Bench in your research or applications, please consider citing:

@article{PhysTool-Bench2026,
  title        = {Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use},
  author       = {Zhixin Ma and Yutong Zhou and Yongqi Li and Chong-Wah Ngo and Wenjie Li},
  journal      = {arXiv preprint arXiv:2606.10803},
  year         = {2026}
}

📜 License

The dataset is released under the MIT license.
