ST-BiBench
Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

1Tsinghua University, 2The University of Hong Kong, 3HKUST, 4Beijing Innovation Center of Humanoid Robotics
*Equal Contribution · Corresponding Authors

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced the landscape of embodied AI, yet transitioning to synchronized bimanual coordination introduces formidable challenges in multi-stream multimodal integration. We introduce ST-BiBench, a comprehensive multi-tier framework for evaluating spatio-temporal multimodal coordination. Our approach centers on Strategic Coordination Planning, assessing high-level cross-modal reasoning over multiple action and perception streams. To investigate the "proximity paradox"—where semantically coherent plans fail to align with spatially grounded visual inputs—we incorporate Foundational Spatial Grounding to verify workspace awareness and arm-selection logic. Furthermore, we probe model frontiers through Fine-Grained Action Control, investigating whether MLLMs can directly synthesize high-dimensional continuous action modalities (16-dimensional) from complex multimodal metadata. Evaluating 30+ state-of-the-art MLLMs, we uncover a persistent and pervasive "coordination paradox"—a significant gap between high-level strategic reasoning and fine-grained physical execution. Results reveal that while frontier MLLMs excel at logic-driven strategy, they frequently suffer from perception-logic disconnection and multi-stream interference during multimodal fusion. ST-BiBench provides a platform for identifying critical bottlenecks in multi-stream multimodal fusion and cross-modal alignment for complex embodied tasks.

Overview

ST-BiBench is the first hierarchical benchmark specifically designed to systematically evaluate the bimanual coordination capabilities of Multimodal Large Language Models (MLLMs). While current research in embodied AI has made significant strides in single-arm manipulation, bimanual coordination remains a formidable challenge. It requires more than just parallel execution; it demands rigorous spatiotemporal synchronization and dynamic role assignment to navigate complex kinematic constraints and prevent self-collisions. ST-BiBench addresses this critical gap by providing a dedicated platform to analyze how foundation models manage the unique complexities of dual-arm physical interaction.

ST-BiBench Framework Overview

Figure 1: The hierarchical evaluation framework of ST-BiBench, deconstructing bimanual coordination into three tiers of abstraction.

As illustrated in Figure 1, our benchmark features a comprehensive three-tier evaluation framework that deconstructs bimanual tasks into different levels of abstraction. Tier 1 (Strategic Coordination Planning) serves as the core reasoning engine to evaluate high-level cross-modal alignment, assessing the model's ability to decompose long-horizon instructions into independent parallel or sequential collaborative primitives. Tier 2 (Foundational Spatial Grounding) addresses the "proximity paradox" by verifying the model’s awareness of bilateral workspace constraints and its ability to resolve spatial-semantic conflicts during arm allocation. Tier 3 (Fine-Grained Action Control) tests the frontiers of dense modality fusion, requiring the model to directly synthesize 16-dimensional continuous action modalities for precise end-to-end actuation. This hierarchical design allows researchers to isolate specific failure modes and distinguish between high-level logical deficiencies, spatial hallucinations, and fine-grained execution bottlenecks.

ST-BiBench Agent Pipeline

Figure 2: The vision-driven agent pipeline designed for structured multimodal perception and reasoning.

Our evaluation is built on a vision-driven agent pipeline designed for structured multimodal perception and reasoning (Figure 2). The agent processes diverse inputs—including multi-view observations (main and third-person views), language instructions, and task-specific auxiliary information—to bridge the gap between perception and action. Within each planning step, the MLLM functions as a central "brain" that generates a visual state description, performs internal reasoning and reflection, and formulates a language-based plan before outputting a structured, executable plan in JSON format. This iterative closed-loop process ensures that the agent can adapt its coordination strategy based on the evolving environment state.
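A minimal sketch of one iteration of this loop is shown below. The call_mllm helper is hypothetical (any multimodal chat-completion client could stand behind it), and the JSON keys left_arm / right_arm / coordination are illustrative assumptions rather than the benchmark's exact output schema.

import json

def call_mllm(prompt: str, images: list) -> str:
    """Stand-in for a real MLLM call; returns a canned response for illustration."""
    return (
        "State: the red block sits near the right arm; the bin is on the left.\n"
        "Reasoning: the right arm should grasp first, then hand over.\n"
        '{"left_arm": "move_to(handover_pose)", '
        '"right_arm": "grasp(red_block)", "coordination": "sequential"}'
    )

def plan_step(instruction: str, main_view, third_view, aux_info: dict) -> dict:
    """One closed-loop planning step: describe -> reason -> plan -> JSON."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Auxiliary info: {json.dumps(aux_info)}\n"
        "Describe the visual state, reason and reflect, then output the next "
        "bimanual action as JSON with keys left_arm, right_arm, coordination."
    )
    raw = call_mllm(prompt, images=[main_view, third_view])
    start = raw.rfind("{")           # the executable plan is the trailing JSON object
    try:
        return json.loads(raw[start:])
    except ValueError:               # malformed output -> safe no-op fallback
        return {"left_arm": "wait", "right_arm": "wait", "coordination": "parallel"}

print(plan_step("Put the red block in the bin.", None, None, {"step": 3}))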

Through an extensive empirical study of over 30 state-of-the-art models—including proprietary systems like GPT-5, Gemini, and Claude—our results reveal a significant "reasoning-actuation gap." While modern MLLMs demonstrate proficiency in high-level strategic planning, they frequently struggle with fragile spatial grounding and precise dual-arm control. By pinpointing these bottlenecks, ST-BiBench provides a foundational framework and diagnostic tool for the community to develop more robust, versatile robotic agents capable of human-like physical coordination.

Hierarchical Evaluation Examples

Tier 1: Strategic Coordination Planning

Serving as the core reasoning engine, this tier evaluates the agent's ability to decompose long-horizon instructions into a sequence of atomic primitives. It assesses high-level logic over independent parallel manipulation and sequential collaborative tasks to ensure role consistency and temporal synchronization.
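To make the target concrete, below is a toy decomposition of the handover_block task into atomic primitives, together with a minimal role-consistency check. The schema and primitive names are illustrative assumptions, not the benchmark's annotation format.

# Hypothetical decomposition of handover_block into atomic bimanual primitives.
plan = {
    "task": "handover_block",
    "mode": "sequential_collaborative",
    "steps": [
        {"t": 0, "right_arm": "grasp(block)",      "left_arm": "idle"},
        {"t": 1, "right_arm": "move_to(handover)", "left_arm": "move_to(handover)"},
        {"t": 2, "right_arm": "release(block)",    "left_arm": "grasp(block)"},
        {"t": 3, "right_arm": "retract",           "left_arm": "place(block, bin)"},
    ],
}

def role_consistent(steps) -> bool:
    """After each step, the block should be held by at most one arm."""
    holding = {"left_arm": False, "right_arm": False}
    for step in steps:
        for arm in holding:
            if step[arm].startswith("grasp"):
                holding[arm] = True
            elif step[arm].startswith("release"):
                holding[arm] = False
        if holding["left_arm"] and holding["right_arm"]:
            return False
    return True

print(role_consistent(plan["steps"]))  # True for the plan above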

Task Success: handover_block (Strategic Coordination Planning)

Tier 2: Foundational Spatial Grounding

This tier addresses the "proximity paradox" by verifying the model's awareness of bilateral workspace constraints. It evaluates whether the MLLM can resolve spatial-semantic conflicts and assign the optimal arm while navigating overlapping kinematic zones and avoiding singularities.
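As a point of reference, the snippet below sketches the naive, geometry-only arm-allocation rule that Tier 2 stresses: prefer the nearer arm that can actually reach the target. The base positions and reach radius are made-up values for illustration; the benchmark's workspace geometry may differ.

import math
from dataclasses import dataclass

@dataclass
class ArmBase:
    name: str
    x: float      # base position along the table's lateral axis (m)
    y: float      # base position along the depth axis (m)
    reach: float  # maximum comfortable reach (m)

# Made-up geometry: two arms mounted 0.6 m apart, each with 0.75 m reach.
LEFT = ArmBase("left", x=-0.30, y=0.0, reach=0.75)
RIGHT = ArmBase("right", x=+0.30, y=0.0, reach=0.75)

def assign_arm(obj_x: float, obj_y: float) -> str:
    """Prefer the nearer arm whose workspace actually contains the object."""
    candidates = []
    for arm in (LEFT, RIGHT):
        dist = math.hypot(obj_x - arm.x, obj_y - arm.y)
        if dist <= arm.reach:
            candidates.append((dist, arm.name))
    return min(candidates)[1] if candidates else "unreachable"

print(assign_arm(0.25, 0.40))   # 'right': closer and within reach
print(assign_arm(-0.95, 0.40))  # 'unreachable': outside both workspaces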

High Quality Reasoning

High-quality reasoning: Precise grounding and optimal arm allocation.

Medium Quality Reasoning

Medium-quality reasoning: Valid logic but with minor spatial ambiguity.

Low Quality Reasoning

Low-quality reasoning: Significant visual hallucinations and planning failures.

Tier 3: Fine-Grained Action Control

Testing the limits of dense modality fusion, this tier requires the model to directly synthesize 16-dimensional continuous action modalities. It assesses the transition from high-level perception to precise end-to-end bimanual actuation in dynamic environments.
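For concreteness, one plausible layout of the 16-dimensional action is 8 dimensions per arm: a 3-D end-effector position, a 4-D orientation quaternion, and a 1-D gripper command. This split is an assumption made for illustration; the benchmark's actual action convention may differ.

from dataclasses import astuple, dataclass

@dataclass
class ArmAction:
    x: float; y: float; z: float                 # end-effector position (m)
    qx: float; qy: float; qz: float; qw: float   # orientation quaternion
    gripper: float                               # 0.0 = open, 1.0 = closed

def to_action_vector(left: ArmAction, right: ArmAction) -> list:
    """Flatten both arms into the 16-dim continuous vector the model must emit."""
    return list(astuple(left)) + list(astuple(right))

left = ArmAction(0.42, 0.15, 0.08, 0.0, 0.0, 0.0, 1.0, gripper=1.0)
right = ArmAction(0.40, -0.12, 0.20, 0.0, 0.0, 0.0, 1.0, gripper=0.0)
action = to_action_vector(left, right)
assert len(action) == 16
print(action)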

Task Success: stack_blocks_two (Fine-Grained Action Control)

Hierarchical Evaluation Results

Table 1: Strategic Coordination Planning Results

Success rate (%) for independent parallel and sequential collaborative manipulation tasks.
Models  Independent Parallel Manipulation (Avg.  P1  P2  R1  R2  S1  S2)  Sequential Collaborative Manipulation (Avg.  H1  H2  H3  P3  P4  P5  P6  P7)  Total Avg.
Gemini-2.5-Pro 71.33772288996280 69.389460839474356352 70.21
GPT-5 76.676450921008668 59.756924869071175863 67.00
Gemini-2.5-flash 67.17603682936468 59.007828789461354553 62.50
GPT-4.1 78.508140871008677 42.8847468896718140 58.14
Claude-sonnet-4 67.00573168978267 46.63923283973720111 55.36
Claude-sonnet-3.7 69.00683966968659 45.00953284923212112 55.29
Qwen3-VL-235B-A22B-Instruct 58.6736175908367 50.8896638692517246 54.21
InternVL3-38B 57.5071079775860 49.3863679097144033 52.86
Qwen3-VL-32B-Instruct 54.67411688754068 50.88936375961436822 52.50
Qwen2.5-VL-32B-Instruct 52.6762793882442 50.1394558895749013 51.21
Gemini-2.0-flash 62.83724387944338 41.25673381871533311 50.50
GPT-4o 52.33682250743763 45.5088608595516015 48.43
InternVL3-78B 56.3372769813970 33.6395 92696423016 43.36
InternVL2.5-38B 45.331083706256 33.009211843747020 38.29
Ovis2-34B 45.508027782131 31.7596314941027010 37.64
InternVL2.5-78B 47.83562273762733 29.518008433113025 37.36
InternVL3.5-38B 41.507206815238 33.1394111893128029 36.71
Qwen2.5-VL-72B-Instruct 28.60160/88336 37.257442855943202 33.92
Ovis2-16B 27.506702532383 24.8872019742500 26.00
Ovis2.5-9B 17.83410471603 28.757814713372700 24.07
Qwen3-VL-30B-A3B-Instruct 19.837061231513 26.25511545154404 23.50
Gemma-3-27b-it 27.176205126024 19.501024663811106 22.79
Llama-4-Scout-17B-16E-Instruct 10.6720073700 29.75819606441208 21.57
Gemma-3-12b-it 20.3383053400 13.8832283611409 16.64
Llama-3.2-11B-Vision-Instruct 6.5010231500 20.6368586911400 14.57
InternVL3-8B 13.8355016363 10.380171001100 11.86
InternVL2.5-8B 2.672013100 1.2509001000 1.86
Qwen2.5-VL-7B-Instruct 1.67304210 1.2510603000 1.43

Table 2: Foundational Spatial Grounding Results

Success scores across three scenario settings: Sparse, Dense, and Cluttered. "Avg." represents the overall mean performance.
Models  Sparse  Dense  Cluttered  Avg.
Gemini-2.0-flash  95.45  98.69  92.00  95.38
Gemini-2.5-flash  95.77  96.76  92.88  95.13
Gemini-2.5-pro  96.14  96.77  92.12  95.01
Claude-sonnet-4.5  96.12  94.78  92.23  94.38
GPT-5  94.73  95.13  92.97  94.28
GLM-4.5V  91.48  97.77  93.00  94.08
Qwen3-VL-32B-Instruct  94.47  95.77  91.77  94.00
Claude-sonnet-4  94.13  94.46  92.88  93.82
Claude-sonnet-3.7  93.46  95.11  91.94  93.51
InternVL3-78B  92.80  97.07  90.16  93.34
Ovis2-34B  94.78  92.78  90.45  92.67
GPT-4.1  93.43  92.48  91.76  92.55
Ovis2-16B  94.07  91.74  88.00  91.27
Qwen3-VL-235B-A22B-Instruct  86.82  93.50  90.33  90.22
InternVL3.5-38B  89.48  91.45  86.75  89.23
GPT-4o  89.02  91.13  87.10  89.08
Qwen3-VL-30B-A3B-Instruct  85.50  91.13  88.98  88.54
InternVL3-38B  81.82  92.13  89.85  87.94
InternVL2.5-78B  87.21  86.45  89.37  87.68
Llama-4-Scout-17B-16E-Instruct  85.49  87.75  86.16  86.47
Gemma-3-27b-it  92.40  81.12  85.78  86.43
Qwen2.5-VL-32B-Instruct  85.16  86.08  87.38  86.21
InternVL2.5-38B  79.16  85.47  85.99  83.54
InternVL2.5-8B  87.48  78.96  81.81  82.75
InternVL3-8B  79.53  69.79  86.79  78.70
Ovis2.5-9B  72.79  78.12  73.13  74.68
Qwen2.5-VL-7B-Instruct  75.20  65.83  79.34  73.46
Gemma-3-12b-it  80.09  57.17  70.22  69.16
Llama-3.2-11B-Vision-Instruct  54.64  53.62  54.01  54.09

Table 3: Fine-Grained Action Control Performance

Success rate (%) on specific manipulation tasks.
Models  Place8  Place9  Place10  Grab1  Stack3  Avg.
GPT-5  66  83  50  79  56  66.80
Gemini-2.5-Pro  82  61  39  81  38  60.20
Gemini-2.5-flash  74  48  13  84  49  53.60
InternVL3-78B  8  50  0  79  1  27.60
Claude-sonnet-4.5  17  13  6  89  2  25.40
Qwen3-VL-235B-A22B-Instruct  41  28  9  46  2  25.20
Gemma-3-27b-it  8  13  3  7  0  6.20
Llama-4-Scout-17B-16E-Instruct  1  0  0  29  0  6.00

Error Analysis

GPT-5 Error Distribution

(a) GPT-5

Gemini Error Distribution

(b) Gemini-2.5-Pro

Figure 3: Comparison of error type distributions for (a) GPT-5 and (b) Gemini-2.5-Pro. Inner rings represent primary error categories (Perceptual vs. Planning), while outer rings detail specific failure modes such as misjudgment or sequencing errors. Detailed definitions are provided in the Appendix.

We analyzed failure modes for GPT-5 and Gemini-2.5-Pro, excluding environmental noise. As illustrated in Figure 3, the primary bottleneck for GPT-5 is perceptual (54%), largely driven by Task State Estimation Misjudgment (39%). Furthermore, it exhibits a notable inability to strictly adhere to prompt-specified execution parameters, categorized as Action Parameter Inconsistency (23%).

Conversely, while Gemini-2.5-Pro follows prompt constraints more reliably, it is significantly more limited by complex planning logic (56%). Its main hurdles are Action Sequencing (31%) and Bimanual Conflict (24%), indicating deeper struggles with the temporal and spatial synchronization essential for sophisticated dual-arm coordination.

BibTeX

@article{wu2026stbibench,
  title     = {ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs},
  author    = {Wu, Xin and Liang, Zhixuan and Ma, Yue and Hu, Mengkang and Qin, Zhiyuan and Li, Xiu},
  journal   = {arXiv preprint arXiv:2602.08392},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.08392}
}