ST-BiBench
Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

1Tsinghua University, 2The University of Hong Kong, 3HKUST, 4Beijing Innovation Center of Humanoid Robotics
*Equal Contribution · Corresponding Authors

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced the landscape of embodied AI, yet transitioning to synchronized bimanual coordination introduces formidable challenges in multi-stream multimodal integration. We introduce ST-BiBench, a comprehensive multi-tier framework for evaluating spatio-temporal multimodal coordination. Our approach centers on Strategic Coordination Planning, assessing high-level cross-modal reasoning over multiple action and perception streams. To investigate the "proximity paradox"—where semantically coherent plans fail to align with spatially grounded visual inputs—we incorporate Foundational Spatial Grounding to verify workspace awareness and arm-selection logic. Furthermore, we probe model frontiers through Fine-Grained Action Control, investigating whether MLLMs can directly synthesize high-dimensional continuous action modalities (16-dimensional) from complex multimodal metadata. Evaluating 30+ state-of-the-art MLLMs, we uncover a persistent and pervasive "coordination paradox"—a significant gap between high-level strategic reasoning and fine-grained physical execution. Results reveal that while frontier MLLMs excel at logic-driven strategy, they frequently suffer from perception-logic disconnection and multi-stream interference during multimodal fusion. ST-BiBench provides a platform for identifying critical bottlenecks in multi-stream multimodal fusion and cross-modal alignment for complex embodied tasks.

Overview

ST-BiBench is the first hierarchical benchmark specifically designed to systematically evaluate the bimanual coordination capabilities of Multimodal Large Language Models (MLLMs). While current research in embodied AI has made significant strides in single-arm manipulation, bimanual coordination remains a formidable challenge. It requires more than just parallel execution; it demands rigorous spatiotemporal synchronization and dynamic role assignment to navigate complex kinematic constraints and prevent self-collisions. ST-BiBench addresses this critical gap by providing a dedicated platform to analyze how foundation models manage the unique complexities of dual-arm physical interaction.

ST-BiBench Framework Overview

Figure 1: The hierarchical evaluation framework of ST-BiBench, deconstructing bimanual coordination into three tiers of abstraction.

As illustrated in Figure 1, our benchmark features a comprehensive three-tier evaluation framework that deconstructs bimanual tasks into different levels of abstraction. Tier 1 (Strategic Coordination Planning) serves as the core reasoning engine to evaluate high-level cross-modal alignment, assessing the model's ability to decompose long-horizon instructions into independent parallel or sequential collaborative primitives. Tier 2 (Foundational Spatial Grounding) addresses the "proximity paradox" by verifying the model’s awareness of bilateral workspace constraints and its ability to resolve spatial-semantic conflicts during arm allocation. Tier 3 (Fine-Grained Action Control) tests the frontiers of dense modality fusion, requiring the model to directly synthesize 16-dimensional continuous action modalities for precise end-to-end actuation. This hierarchical design allows researchers to isolate specific failure modes and distinguish between high-level logical deficiencies, spatial hallucinations, and fine-grained execution bottlenecks.

ST-BiBench Agent Pipeline

Figure 2: The vision-driven agent pipeline designed for structured multimodal perception and reasoning.

Our evaluation is built on a vision-driven agent pipeline designed for structured multimodal perception and reasoning (Figure 2). The agent processes diverse inputs—including multi-view observations (main and third-person views), language instructions, and task-specific auxiliary information—to bridge the gap between perception and action. Within each planning step, the MLLM functions as a central "brain" that generates a visual state description, performs internal reasoning and reflection, and formulates a language-based plan before outputting a structured, executable plan in JSON format. This iterative closed-loop process ensures that the agent can adapt its coordination strategy based on the evolving environment state.
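A minimal sketch of one iteration of this loop is shown below. The call_mllm helper is hypothetical (any multimodal chat-completion client could stand behind it), and the JSON keys left_arm / right_arm / coordination are illustrative assumptions rather than the benchmark's exact output schema.

import json

def call_mllm(prompt: str, images: list) -> str:
    """Stand-in for a real MLLM call; returns a canned response for illustration."""
    return (
        "State: the red block sits near the right arm; the bin is on the left.\n"
        "Reasoning: the right arm should grasp first, then hand over.\n"
        '{"left_arm": "move_to(handover_pose)", '
        '"right_arm": "grasp(red_block)", "coordination": "sequential"}'
    )

def plan_step(instruction: str, main_view, third_view, aux_info: dict) -> dict:
    """One closed-loop planning step: describe -> reason -> plan -> JSON."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Auxiliary info: {json.dumps(aux_info)}\n"
        "Describe the visual state, reason and reflect, then output the next "
        "bimanual action as JSON with keys left_arm, right_arm, coordination."
    )
    raw = call_mllm(prompt, images=[main_view, third_view])
    start = raw.rfind("{")           # the executable plan is the trailing JSON object
    try:
        return json.loads(raw[start:])
    except ValueError:               # malformed output -> safe no-op fallback
        return {"left_arm": "wait", "right_arm": "wait", "coordination": "parallel"}

print(plan_step("Put the red block in the bin.", None, None, {"step": 3}))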

Through an extensive empirical study of over 30 state-of-the-art models—including proprietary systems like GPT-5, Gemini, and Claude—our results reveal a significant "reasoning-actuation gap." While modern MLLMs demonstrate proficiency in high-level strategic planning, they frequently struggle with fragile spatial grounding and precise dual-arm control. By pinpointing these bottlenecks, ST-BiBench provides a foundational framework and diagnostic tool for the community to develop more robust, versatile robotic agents capable of human-like physical coordination.

Hierarchical Evaluation Examples

Tier 1: Strategic Coordination Planning

Serving as the core reasoning engine, this tier evaluates the agent's ability to decompose long-horizon instructions into a sequence of atomic primitives. It assesses high-level logic over independent parallel manipulation and sequential collaborative tasks to ensure role consistency and temporal synchronization.
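To make the target concrete, below is a toy decomposition of the handover_block task into atomic primitives, together with a minimal role-consistency check. The schema and primitive names are illustrative assumptions, not the benchmark's annotation format.

# Hypothetical decomposition of handover_block into atomic bimanual primitives.
plan = {
    "task": "handover_block",
    "mode": "sequential_collaborative",
    "steps": [
        {"t": 0, "right_arm": "grasp(block)",      "left_arm": "idle"},
        {"t": 1, "right_arm": "move_to(handover)", "left_arm": "move_to(handover)"},
        {"t": 2, "right_arm": "release(block)",    "left_arm": "grasp(block)"},
        {"t": 3, "right_arm": "retract",           "left_arm": "place(block, bin)"},
    ],
}

def role_consistent(steps) -> bool:
    """After each step, the block should be held by at most one arm."""
    holding = {"left_arm": False, "right_arm": False}
    for step in steps:
        for arm in holding:
            if step[arm].startswith("grasp"):
                holding[arm] = True
            elif step[arm].startswith("release"):
                holding[arm] = False
        if holding["left_arm"] and holding["right_arm"]:
            return False
    return True

print(role_consistent(plan["steps"]))  # True for the plan above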

Task Success: handover_block (Strategic Coordination Planning)

Tier 2: Foundational Spatial Grounding

This tier addresses the "proximity paradox" by verifying the model's awareness of bilateral workspace constraints. It evaluates whether the MLLM can resolve spatial-semantic conflicts and assign the optimal arm while navigating overlapping kinematic zones and avoiding singularities.
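As a point of reference, the snippet below sketches the naive, geometry-only arm-allocation rule that Tier 2 stresses: prefer the nearer arm that can actually reach the target. The base positions and reach radius are made-up values for illustration; the benchmark's workspace geometry may differ.

import math
from dataclasses import dataclass

@dataclass
class ArmBase:
    name: str
    x: float      # base position along the table's lateral axis (m)
    y: float      # base position along the depth axis (m)
    reach: float  # maximum comfortable reach (m)

# Made-up geometry: two arms mounted 0.6 m apart, each with 0.75 m reach.
LEFT = ArmBase("left", x=-0.30, y=0.0, reach=0.75)
RIGHT = ArmBase("right", x=+0.30, y=0.0, reach=0.75)

def assign_arm(obj_x: float, obj_y: float) -> str:
    """Prefer the nearer arm whose workspace actually contains the object."""
    candidates = []
    for arm in (LEFT, RIGHT):
        dist = math.hypot(obj_x - arm.x, obj_y - arm.y)
        if dist <= arm.reach:
            candidates.append((dist, arm.name))
    return min(candidates)[1] if candidates else "unreachable"

print(assign_arm(0.25, 0.40))   # 'right': closer and within reach
print(assign_arm(-0.95, 0.40))  # 'unreachable': outside both workspaces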

High Quality Reasoning

High-quality reasoning: Precise grounding and optimal arm allocation.

Medium Quality Reasoning

Medium-quality reasoning: Valid logic but with minor spatial ambiguity.

Low Quality Reasoning

Low-quality reasoning: Significant visual hallucinations and planning failures.

Tier 3: Fine-Grained Action Control

Testing the limits of dense modality fusion, this tier requires the model to directly synthesize 16-dimensional continuous action modalities. It assesses the transition from high-level perception to precise end-to-end bimanual actuation in dynamic environments.
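For concreteness, one plausible layout of the 16-dimensional action is 8 dimensions per arm: a 3-D end-effector position, a 4-D orientation quaternion, and a 1-D gripper command. This split is an assumption made for illustration; the benchmark's actual action convention may differ.

from dataclasses import astuple, dataclass

@dataclass
class ArmAction:
    x: float; y: float; z: float                 # end-effector position (m)
    qx: float; qy: float; qz: float; qw: float   # orientation quaternion
    gripper: float                               # 0.0 = open, 1.0 = closed

def to_action_vector(left: ArmAction, right: ArmAction) -> list:
    """Flatten both arms into the 16-dim continuous vector the model must emit."""
    return list(astuple(left)) + list(astuple(right))

left = ArmAction(0.42, 0.15, 0.08, 0.0, 0.0, 0.0, 1.0, gripper=1.0)
right = ArmAction(0.40, -0.12, 0.20, 0.0, 0.0, 0.0, 1.0, gripper=0.0)
action = to_action_vector(left, right)
assert len(action) == 16
print(action)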

Task Success: stack_blocks_two (Fine-Grained Action Control)

Hierarchical Evaluation Results

Table 1: Strategic Coordination Planning Results

Success rate (%) for independent parallel and sequential collaborative manipulation tasks.
Models  Independent Parallel Manipulation (Avg.  P1  P2  R1  R2  S1  S2)  Sequential Collaborative Manipulation (Avg.  H1  H2  H3  P3  P4  P5  P6  P7)  Total Avg.
Gemini-2.5-Pro 71.33772288996280 69.389460839474356352 70.21
GPT-5 76.676450921008668 59.756924869071175863 67.00
Gemini-2.5-flash 67.17603682936468 59.007828789461354553 62.50
GPT-4.1 78.508140871008677 42.8847468896718140 58.14
Claude-sonnet-4 67.00573168978267 46.63923283973720111 55.36
Claude-sonnet-3.7 69.00683966968659 45.00953284923212112 55.29
Qwen3-VL-235B-A22B-Instruct 58.6736175908367 50.8896638692517246 54.21
InternVL3-38B 57.5071079775860 49.3863679097144033 52.86
Qwen3-VL-32B-Instruct 54.67411688754068 50.88936375961436822 52.50
Qwen2.5-VL-32B-Instruct 52.6762793882442 50.1394558895749013 51.21
Gemini-2.0-flash 62.83724387944338 41.25673381871533311 50.50
GPT-4o 52.33682250743763 45.5088608595516015 48.43
InternVL3-78B 56.3372769813970 33.6395 92696423016 43.36
InternVL2.5-38B 45.331083706256 33.009211843747020 38.29
Ovis2-34B 45.508027782131 31.7596314941027010 37.64
InternVL2.5-78B 47.83562273762733 29.518008433113025 37.36
InternVL3.5-38B 41.507206815238 33.1394111893128029 36.71
Qwen2.5-VL-72B-Instruct 28.60160/88336 37.257442855943202 33.92
Ovis2-16B 27.506702532383 24.8872019742500 26.00
Ovis2.5-9B 17.83410471603 28.757814713372700 24.07
Qwen3-VL-30B-A3B-Instruct 19.837061231513 26.25511545154404 23.50
Gemma-3-27b-it 27.176205126024 19.501024663811106 22.79
Llama-4-Scout-17B-16E-Instruct 10.6720073700 29.75819606441208 21.57
Gemma-3-12b-it 20.3383053400 13.8832283611409 16.64
Llama-3.2-11B-Vision-Instruct 6.5010231500 20.6368586911400 14.57
InternVL3-8B 13.8355016363 10.380171001100 11.86
InternVL2.5-8B 2.672013100 1.2509001000 1.86
Qwen2.5-VL-7B-Instruct 1.67304210 1.2510603000 1.43

Table 2: Foundational Spatial Grounding Results

Success scores across three scenario settings: Sparse, Dense, and Cluttered. "Avg." represents the overall mean performance.
Models  Sparse  Dense  Cluttered  Avg.
Gemini-2.0-flash  95.45  98.69  92.00  95.38
Gemini-2.5-flash  95.77  96.76  92.88  95.13
Gemini-2.5-pro  96.14  96.77  92.12  95.01
Claude-sonnet-4.5  96.12  94.78  92.23  94.38
GPT-5  94.73  95.13  92.97  94.28
GLM-4.5V  91.48  97.77  93.00  94.08
Qwen3-VL-32B-Instruct  94.47  95.77  91.77  94.00
Claude-sonnet-4  94.13  94.46  92.88  93.82
Claude-sonnet-3.7  93.46  95.11  91.94  93.51
InternVL3-78B  92.80  97.07  90.16  93.34
Ovis2-34B  94.78  92.78  90.45  92.67
GPT-4.1  93.43  92.48  91.76  92.55
Ovis2-16B  94.07  91.74  88.00  91.27
Qwen3-VL-235B-A22B-Instruct  86.82  93.50  90.33  90.22
InternVL3.5-38B  89.48  91.45  86.75  89.23
GPT-4o  89.02  91.13  87.10  89.08
Qwen3-VL-30B-A3B-Instruct  85.50  91.13  88.98  88.54
InternVL3-38B  81.82  92.13  89.85  87.94
InternVL2.5-78B  87.21  86.45  89.37  87.68
Llama-4-Scout-17B-16E-Instruct  85.49  87.75  86.16  86.47
Gemma-3-27b-it  92.40  81.12  85.78  86.43
Qwen2.5-VL-32B-Instruct  85.16  86.08  87.38  86.21
InternVL2.5-38B  79.16  85.47  85.99  83.54
InternVL2.5-8B  87.48  78.96  81.81  82.75
InternVL3-8B  79.53  69.79  86.79  78.70
Ovis2.5-9B  72.79  78.12  73.13  74.68
Qwen2.5-VL-7B-Instruct  75.20  65.83  79.34  73.46
Gemma-3-12b-it  80.09  57.17  70.22  69.16
Llama-3.2-11B-Vision-Instruct  54.64  53.62  54.01  54.09

Table 3: Fine-Grained Action Control Performance

Success rate (%) on specific manipulation tasks.
Models  Place8  Place9  Place10  Grab1  Stack3  Avg.
GPT-5  66  83  50  79  56  66.80
Gemini-2.5-Pro  82  61  39  81  38  60.20
Gemini-2.5-flash  74  48  13  84  49  53.60
InternVL3-78B  8  50  0  79  1  27.60
Claude-sonnet-4.5  17  13  6  89  2  25.40
Qwen3-VL-235B-A22B-Instruct  41  28  9  46  2  25.20
Gemma-3-27b-it  8  13  3  7  0  6.20
Llama-4-Scout-17B-16E-Instruct  1  0  0  29  0  6.00

Error Analysis

GPT-5 Error Distribution

(a) GPT-5

Gemini Error Distribution

(b) Gemini-2.5-Pro

Figure 3: Comparison of error type distributions for (a) GPT-5 and (b) Gemini-2.5-Pro. Inner rings represent primary error categories (Perceptual vs. Planning), while outer rings detail specific failure modes such as misjudgment or sequencing errors. Detailed definitions are provided in the Appendix.

We analyzed failure modes for GPT-5 and Gemini-2.5-Pro, excluding environmental noise. As illustrated in Figure 3, the primary bottleneck for GPT-5 is perceptual (54%), largely driven by Task State Estimation Misjudgment (39%). Furthermore, it exhibits a notable inability to strictly adhere to prompt-specified execution parameters, categorized as Action Parameter Inconsistency (23%).

Conversely, while Gemini-2.5-Pro follows prompt constraints more reliably, it is significantly more limited by complex planning logic (56%). Its main hurdles are Action Sequencing (31%) and Bimanual Conflict (24%), indicating deeper struggles with the temporal and spatial synchronization essential for sophisticated dual-arm coordination.

BibTeX

@article{wu2026stbibench,
  title     = {ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs},
  author    = {Wu, Xin and Liang, Zhixuan and Ma, Yue and Hu, Mengkang and Qin, Zhiyuan and Li, Xiu},
  journal   = {arXiv preprint arXiv:2602.08392},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.08392}
}