# UI-TapBench
## Summary & Intention
UI-TapBench is an open-source benchmark created to evaluate the spatial precision of Large Multimodal Models (LMMs) in mobile environments.
As AI agents move toward "Actionable AI," the ability to translate a natural language instruction into exact screen coordinates is the most common point of failure. This dataset provides a standardized way to measure and improve how models handle dense UI layouts and list-based navigation, ensuring tap reliability in autonomous agents.
## About Drizz
Reimagining Mobile App Testing with Vision AI.
At Drizz, we're building the world's fastest AI-powered testing agent for mobile apps: no locators, no scripting, just plain English. Mobile teams today move fast, but testing tools haven't kept up. Drizz replaces brittle, locator-based frameworks with a vision-based AI engine that understands your app like a human.
With Drizz, teams achieve:
- 10x Faster Test Cycles
- 97%+ Test Accuracy
- Zero Flaky Tests via our vision-based engine
We are releasing UI-TapBench to help the community move toward a world where UI automation is as simple, reliable, and "human-like" as possible.
## Dataset Structure
Each entry in `metadata.jsonl` follows this schema:

| Key | Description |
|---|---|
| `id` | Unique identifier for the sample. |
| `image` | Relative path to the screenshot (e.g., `images/841.png`). |
| `task` | The natural language command (e.g., "Tap on second option"). |
| `bbox` | Ground truth coordinates: `[xmin, ymin, xmax, ymax]`. |
| `app_name` | The package name of the app being tested. |
| `function` | The targeted action type (default: `tap_call_llm`). |
### Example Entry

```json
{
  "id": 841,
  "image": "images/841.png",
  "task": "Tap on second option in the list.",
  "bbox": [42, 733, 1038, 901],
  "app_name": "com.duolingo",
  "function": "tap_call_llm"
}
```
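As a minimal sketch of how the metadata file could be consumed, assuming the schema above (the function name and the bbox sanity check are illustrative, not part of the dataset):

```python
import json

def load_metadata(path="metadata.jsonl"):
    """Load UI-TapBench samples from a JSON Lines metadata file,
    checking that each bbox is a well-formed [xmin, ymin, xmax, ymax]."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            xmin, ymin, xmax, ymax = record["bbox"]
            assert xmin < xmax and ymin < ymax, f"bad bbox in sample {record['id']}"
            samples.append(record)
    return samples
```

Validating bboxes at load time catches type drift early (e.g., a `bbox` serialized as a string instead of an array would fail the unpack rather than silently corrupting evaluation).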
## Benchmark Results
We evaluated UI-TapBench across leading Large Multimodal Models (LMMs) to measure tap accuracy, spatial precision, and reliability for mobile UI interactions.
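The card does not spell out the scoring protocol; a common convention for tap grounding (an assumption here, not confirmed by the benchmark) is to count a predicted tap point as a hit when it falls inside the ground-truth box:

```python
def tap_hit(pred_xy, bbox):
    """Return True if the predicted tap point lands inside the
    ground-truth box [xmin, ymin, xmax, ymax] (inclusive edges)."""
    x, y = pred_xy
    xmin, ymin, xmax, ymax = bbox
    return xmin <= x <= xmax and ymin <= y <= ymax

def tap_accuracy(predictions, ground_truth):
    """Fraction of samples whose predicted tap hits the target box.

    predictions:  dict mapping sample id -> (x, y)
    ground_truth: dict mapping sample id -> bbox
    """
    hits = sum(tap_hit(predictions[i], ground_truth[i]) for i in ground_truth)
    return hits / len(ground_truth)
```

Under this convention a model is rewarded only for landing inside the tappable region, which is stricter than center-distance metrics on dense list layouts where adjacent rows sit a few pixels apart.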
### Competitor Comparison
| Model | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|
| π Drizz (ours) | 94.51 | 96.22 | 98.16 | 97.18 |
| gpt-5.1 | 21.72 | 23.35 | 75.61 | 35.68 |
| gpt-5.2 | 44.83 | 45.71 | 95.88 | 61.91 |
| gemini-pro | 89.84 | 91.28 | 98.28 | 94.65 |
| gemini-flash | 81.44 | 83.78 | 96.67 | 89.77 |
| qwen3.5-27b | 92.98 | 94.98 | 97.61 | 96.28 |
### Key Takeaway
While several models perform well on general UI grounding tasks, Drizz achieves the highest scores on UI-TapBench, maintaining strong spatial precision and reliable tap execution even in dense mobile UI layouts.
## License
Released under the Apache 2.0 License.