PhysTool-Bench: Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

🐱 GitHub | 📄 Paper | 🏠 Project Page | 🤗 HuggingFace Papers


📊 Dataset Summary

PhysTool-Bench is a multimodal benchmark designed to evaluate how well Multimodal Large Language Models (MLLMs) perceive, select, and sequence physical tools in real-world scenes. Unlike traditional tool-use benchmarks that focus on digital APIs, this dataset probes an MLLM's ability to ground functional reasoning in cluttered, physical environments.

The benchmark features 2,510 high-quality queries covering 2,678 unique physical tools across 57 distinct categories (e.g., manufacturing, healthcare, farming).

Key Features

  • Two‑Task Evaluation: Decouples pure visual recognition from functional planning and sequencing.
  • Real‑World Clutter: Each scene contains an average of 8.6 tools (3.1 required targets and 5.5 visually/functionally similar distractors).
  • Sequential Logic: 86.9% of the tasks require a strict execution order, rigorously testing the model's physical commonsense.

🎯 Supported Tasks

The dataset separates evaluation into two distinct tracks to pinpoint whether model failures stem from visual bottlenecks or poor physical reasoning.

Task I: Tool Recognition

  • Input: A real-world scenario image.
  • Objective: Enumerate all visible tools in the cluttered scene.
  • Purpose: Measures pure visual enumeration and recognition capabilities.

Task II: Tool Selection & Planning

  • Input: A real-world scenario image paired with a brief task instruction.
  • Objective: Output the exact, ordered sequence of tools required to complete the specified task.
  • Purpose: Measures functional mapping, physical commonsense, and multi-step planning capabilities.
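The difference between the two tracks can be illustrated with a small scoring sketch. This is not the official metric: the function names, the set-based recall for Task I, and the strict order check for Task II are assumptions chosen only to contrast enumeration with ordered planning, and the tool names are invented.

```python
def recognition_recall(predicted, visible):
    """Fraction of visible tools that the model enumerated (Task I style)."""
    visible_set = set(visible)
    if not visible_set:
        return 0.0
    return len(visible_set & set(predicted)) / len(visible_set)


def plan_is_correct(predicted, target_steps):
    """True only if the predicted tools match the target sequence in order (Task II style)."""
    return list(predicted) == list(target_steps)


# Hypothetical scene: these tool names are invented for illustration.
visible = ["hammer", "nail", "screwdriver", "wrench"]
print(recognition_recall(["hammer", "wrench", "pliers"], visible))  # 0.5
print(plan_is_correct(["nail", "hammer"], ["nail", "hammer"]))      # True
print(plan_is_correct(["hammer", "nail"], ["nail", "hammer"]))      # False
```

A model can score well on the first function and still fail the second, which is the kind of gap the two-track design is meant to expose.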

πŸ“ Dataset Structure

Unlike standard text-to-text datasets, PhysTool-Bench relies on a decoupled structure to support complex visual reasoning evaluations. The repository contains the raw images and three metadata files:

  • images/: Directory containing all high-resolution physical scenario images.
  • generation_checkpoint.json: The input file used for model inference. It contains the image paths and task_instruct prompts for Task II.
  • corrected_tools.json: The ground truth file used for evaluation. It contains the refined taxonomy, required tools (target_tools), target_steps for ordered tasks, and negative_tools (distractors).
  • final_matching_info.json: The alignment and mapping metadata file utilized by the offline evaluation pipeline to support tool taxonomy normalization and rule-based verification.
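This card does not spell out the exact schema of the three JSON files, so it can help to inspect them before writing any processing code. The helper below is a minimal sketch; `describe_json` is a hypothetical name, and it deliberately makes no assumption about whether a file holds a list of records or a dictionary.

```python
import json


def describe_json(path):
    """Return a short summary of a JSON file's top-level structure."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    summary = {
        "type": type(data).__name__,
        "size": len(data) if isinstance(data, (list, dict)) else None,
    }
    if isinstance(data, dict):
        # Top-level mapping: report up to ten of its keys.
        summary["keys"] = sorted(data)[:10]
    elif isinstance(data, list) and data and isinstance(data[0], dict):
        # List of records: report up to ten keys of the first record.
        summary["keys"] = sorted(data[0])[:10]
    return summary
```

Calling it on each metadata file in the downloaded dataset folder shows whether fields such as task_instruct or target_tools sit at the top level or inside per-sample records.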

Example: Loading the Raw Data

You can download and explore the raw dataset with huggingface_hub and standard Python tools:

import json
import os
from huggingface_hub import snapshot_download
from PIL import Image

# 1. Download the dataset folder
dataset_path = snapshot_download(repo_id="ModalityDance/PhysTool-Bench", repo_type="dataset")

# 2. Load the input metadata
with open(os.path.join(dataset_path, "generation_checkpoint.json"), "r") as f:
    inputs = json.load(f)

# 3. Explore a sample
sample = inputs[0]
print(f"Task Instruction: {sample['task_instruct']}")

# Load corresponding image
img_path = os.path.join(dataset_path, sample['image_path'])
Image.open(img_path).show()

⚠️ Inference & Evaluation (Important)

Due to the complex nature of physical tool planning, standard HuggingFace pipelines (pipeline("visual-question-answering")) are not sufficient for evaluating this benchmark. To properly run PhysTool-Bench, please use our Official GitHub Repository.

Why use the official codebase?

  • Environment Isolation: Different MLLMs require conflicting dependency versions (e.g., PyTorch, Transformers, Accelerate). Our repo provides standalone inference scripts for major models.
  • Dual Evaluation Pipelines: Simple exact string matching fails on open-ended generation due to synonyms and morphological variations. We provide two robust alternatives:
    • Offline Evaluation (eval_offline.py): Fast, local rule-based matching using final_matching_info.json for API-free evaluation.
    • LLM-as-a-Judge (eval_gemini.py): Deep semantic one-to-one mapping via the Gemini API to resolve complex synonyms and functional equivalents.
Head over to ModalityDance/PhysTool-Bench for the complete quickstart guide, environment setup, and automated evaluation scripts.
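To see why exact string matching undercounts correct answers, consider the sketch below. It is not the logic of eval_offline.py: the `SYNONYMS` table, the `normalize` function, and the naive plural handling are all assumptions made for illustration, and the real mapping lives in final_matching_info.json, whose schema is not assumed here.

```python
import re

# Hypothetical synonym table, invented for this example only.
SYNONYMS = {"spanner": "wrench", "box cutter": "utility knife"}


def normalize(name):
    """Lowercase, drop punctuation, strip a naive plural 's', then apply synonyms."""
    text = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    text = " ".join(text.split())
    if text.endswith("s") and not text.endswith("ss"):
        text = text[:-1]
    return SYNONYMS.get(text, text)


def match_tools(predicted, targets):
    """Return (exact_hits, normalized_hits) counted against the target tool set."""
    exact = len(set(predicted) & set(targets))
    normalized = len({normalize(p) for p in predicted} & {normalize(t) for t in targets})
    return exact, normalized


print(match_tools(["Spanners", "utility-knife"], ["wrench", "utility knife"]))  # (0, 2)
```

Both predictions are correct, yet exact matching credits neither. A rule-based normalizer like this one remains brittle for functional equivalents that no table anticipates, which is the case the LLM-as-a-Judge pipeline is described as covering.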

📚 Citation

If you use PhysTool-Bench in your research or applications, please consider citing:

@article{PhysTool-Bench2026,
  title        = {Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use},
  author       = {Zhixin Ma and Yutong Zhou and Yongqi Li and Chong-Wah Ngo and Wenjie Li},
  journal      = {arXiv preprint arXiv:2606.10803},
  year         = {2026}
}

📜 License

The dataset is released under the MIT license.
