# UI-TapBench
## Summary & Intention
UI-TapBench is an open-source benchmark created to evaluate the spatial precision of Large Multimodal Models (LMMs) in mobile environments.
As AI agents move toward "Actionable AI," the ability to translate a natural language instruction into exact screen coordinates is the most common point of failure. This dataset provides a standardized way to measure and improve how models handle dense UI layouts and list-based navigation, ensuring tap reliability in autonomous agents.
## About Drizz
Reimagining Mobile App Testing with Vision AI.
At Drizz, we're building the world's fastest AI-powered testing agent for mobile apps: no locators, no scripting, just plain English. Mobile teams today move fast, but testing tools haven't kept up. Drizz replaces brittle, locator-based frameworks with a vision-based AI engine that understands your app like a human.
With Drizz, teams achieve:
- 10x Faster Test Cycles
- 97%+ Test Accuracy
- Zero Flaky Tests via our vision-based engine
We are releasing UI-TapBench to help the community move toward a world where UI automation is as simple, reliable, and "human-like" as possible.
## Dataset Structure
Each entry in `metadata.jsonl` follows this schema:

| Key | Description |
|---|---|
| `id` | Unique identifier for the sample. |
| `image` | Relative path to the screenshot (e.g., `images/841.png`). |
| `task` | The natural language command (e.g., "Tap on second option"). |
| `bbox` | Ground truth coordinates: `[xmin, ymin, xmax, ymax]`. |
| `app_name` | The package name of the app being tested. |
| `function` | The targeted action type (default: `tap_call_llm`). |
### Example Entry

```json
{
  "id": 841,
  "image": "images/841.png",
  "task": "Tap on second option in the list.",
  "bbox": [42, 733, 1038, 901],
  "app_name": "com.duolingo",
  "function": "tap_call_llm"
}
```
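As a minimal sketch of how the metadata file could be consumed, assuming the schema above (the function name and the bbox sanity check are illustrative, not part of the dataset):

```python
import json

def load_metadata(path="metadata.jsonl"):
    """Load UI-TapBench samples from a JSON Lines metadata file,
    checking that each bbox is a well-formed [xmin, ymin, xmax, ymax]."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            xmin, ymin, xmax, ymax = record["bbox"]
            assert xmin < xmax and ymin < ymax, f"bad bbox in sample {record['id']}"
            samples.append(record)
    return samples
```

Validating bboxes at load time catches type drift early (e.g., a `bbox` serialized as a string instead of an array would fail the unpack rather than silently corrupting evaluation).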
## Benchmark Results
We evaluated UI-TapBench across leading Large Multimodal Models (LMMs) to measure tap accuracy, spatial precision, and reliability for mobile UI interactions.
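The card does not spell out the scoring protocol; a common convention for tap grounding (an assumption here, not confirmed by the benchmark) is to count a predicted tap point as a hit when it falls inside the ground-truth box:

```python
def tap_hit(pred_xy, bbox):
    """Return True if the predicted tap point lands inside the
    ground-truth box [xmin, ymin, xmax, ymax] (inclusive edges)."""
    x, y = pred_xy
    xmin, ymin, xmax, ymax = bbox
    return xmin <= x <= xmax and ymin <= y <= ymax

def tap_accuracy(predictions, ground_truth):
    """Fraction of samples whose predicted tap hits the target box.

    predictions:  dict mapping sample id -> (x, y)
    ground_truth: dict mapping sample id -> bbox
    """
    hits = sum(tap_hit(predictions[i], ground_truth[i]) for i in ground_truth)
    return hits / len(ground_truth)
```

Under this convention a model is rewarded only for landing inside the tappable region, which is stricter than center-distance metrics on dense list layouts where adjacent rows sit a few pixels apart.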
### Competitor Comparison
| Model | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|
| π Drizz (ours) | 94.51 | 96.22 | 98.16 | 97.18 |
| gpt-5.1 | 21.72 | 23.35 | 75.61 | 35.68 |
| gpt-5.2 | 44.83 | 45.71 | 95.88 | 61.91 |
| gemini-pro | 89.84 | 91.28 | 98.28 | 94.65 |
| gemini-flash | 81.44 | 83.78 | 96.67 | 89.77 |
| qwen3.5-27b | 92.98 | 94.98 | 97.61 | 96.28 |
### Key Takeaway
While several models perform well on general UI grounding tasks, Drizz achieves the highest scores on UI-TapBench, maintaining strong spatial precision and reliable tap execution even in dense mobile UI layouts.
## License
Released under the Apache 2.0 License.