PRISM: A Vision-Language Framework for Hierarchical Multi-Agent Collaboration


Abstract

Multi-agent systems hold the promise of distributed, parallel execution in domains ranging from manufacturing to assistive care. However, enabling multiple agents to follow natural language instructions in dynamic environments remains challenging because mechanisms for grounding language into temporally synchronized, vision-aware action plans are lacking. To address this limitation and enable robust coordination among multiple agents, we present PRISM, a hierarchical vision-language framework that grounds natural language instructions into synchronized multi-agent workflows. PRISM dynamically injects synchronization constraints through predicate reasoning over multimodal observations and execution history. To enable systematic evaluation of multi-agent coordination, we also introduce CoRL-Bench, an extended benchmark comprising ten two-robot manipulation tasks spanning sequential coordination, parallel coordination, coupled interaction, and behavior-aware reasoning. In our experiments, PRISM significantly outperforms state-of-the-art planning frameworks, achieving an average task success rate of 72% and an average subtask success rate of 89%. These results indicate that PRISM bridges the gap between adaptability and temporal coordination in multi-agent systems.
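The abstract does not specify how synchronization constraints are represented, so the following is only an illustrative sketch of the general idea, not PRISM's implementation. It assumes a symbolic state of boolean predicates and per-robot subtask queues. The names `Subtask`, `wait_for`, `effects`, and `run`, as well as the predicate strings, are hypothetical. A robot starts its next subtask only once every predicate that subtask waits for holds.

```python
from dataclasses import dataclass, field
from typing import Dict, List

State = Dict[str, bool]  # hypothetical symbolic state, e.g. {"block_presented": True}


@dataclass
class Subtask:
    robot: str
    name: str
    # Predicates that must hold before the subtask may start (the sync constraint).
    wait_for: List[str] = field(default_factory=list)
    # Predicates that become true once the subtask completes.
    effects: List[str] = field(default_factory=list)


def run(queues: Dict[str, List[Subtask]], state: State) -> List[str]:
    """Round-robin over robots; a robot advances only if its wait predicates hold."""
    log: List[str] = []
    pending = {robot: list(queue) for robot, queue in queues.items()}
    while any(pending.values()):
        progressed = False
        for robot, queue in pending.items():
            if queue and all(state.get(p, False) for p in queue[0].wait_for):
                task = queue.pop(0)
                for p in task.effects:
                    state[p] = True
                log.append(f"{robot}:{task.name}")
                progressed = True
        if not progressed:
            raise RuntimeError("deadlock: no robot can satisfy its wait predicates")
    return log


# Toy handover: robot_b may grasp only after robot_a presents the block,
# and robot_a may release only after robot_b has grasped it.
queues = {
    "robot_a": [
        Subtask("robot_a", "pick_green", effects=["a_holding"]),
        Subtask("robot_a", "present_block", effects=["block_presented"]),
        Subtask("robot_a", "release", wait_for=["b_grasped"], effects=["released"]),
    ],
    "robot_b": [
        Subtask("robot_b", "grasp_block", wait_for=["block_presented"], effects=["b_grasped"]),
        Subtask("robot_b", "retract", wait_for=["released"]),
    ],
}
log = run(queues, {})
print(log)
```

In this toy version the predicates are set by hand-written effects. In a vision-language setting they would presumably be evaluated from observations and execution history, which this sketch does not attempt to model.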

Real-World Experiments

Representative real-robot executions for different tasks.

Stack the red and yellow blocks on the stacking plane
Put the red block on the drawer
Hand over the green block to the other robot
Sort the green and yellow blocks on the trays

Simulation Results

Simulation in RLBench

CoRL-Bench task suite used for evaluation.

Successful Task Execution Examples

Stacking blocks
Stacking a pyramid
Putting in the drawer
Putting on the saucepan
Pushing buttons
The shell game
Sorting items
Handing over an item
Inserting rings
Pushing the box to a target

Quantitative Results

Each cell lists Task Success (TS ↑) / #Subtasks (↑) / Subtask Success (STS ↑).

Sequential Coordination and Coupled Interaction

| Model | Stacking blocks | Stacking a pyramid | Putting on the saucepan | Pushing the box to a target | Putting in the drawer |
|---|---|---|---|---|---|
| Centralized Planner (Oracle) | 0.80 / 10.9 / 1.00 | 0.60 / 11.0 / 0.90 | 0.80 / 8.0 / 0.93 | 1.00 / 16.0 / 1.00 | 1.00 / 10.0 / 1.00 |
| VoxPoser | 0.00 / 10.0 / 0.68 | 0.00 / 11.8 / 0.36 | 0.50 / 7.0 / 0.79 | 0.00 / 11.2 / 0.76 | 0.00 / 9.0 / 0.55 |
| PRISM (Ours) | 0.80 / 11.0 / 0.94 | 0.60 / 11.0 / 0.83 | 0.70 / 8.0 / 0.81 | 0.70 / 14.0 / 0.72 | 0.50 / 8.0 / 0.82 |

Parallel Coordination and Behavior-Aware Reasoning

| Model | Sorting items | Pushing buttons | Inserting rings | The shell game | Handing over an item |
|---|---|---|---|---|---|
| Centralized Planner (Oracle) | 0.90 / 10.0 / 0.99 | 1.00 / 8.0 / 1.00 | 1.00 / 8.0 / 1.00 | 0.90 / 9.4 / 0.98 | 1.00 / 12.0 / 1.00 |
| VoxPoser | 0.40 / 10.2 / 0.74 | 0.30 / 8.0 / 0.40 | 0.00 / 12.8 / 0.13 | 0.00 / 14.5 / 0.58 | 0.00 / 10.0 / 0.46 |
| PRISM (Ours) | 0.90 / 10.0 / 0.98 | 1.00 / 8.0 / 1.00 | 0.60 / 15.0 / 0.95 | 0.70 / 13.0 / 0.96 | 0.70 / 12.0 / 0.91 |
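
As a quick consistency check, the averages quoted in the abstract can be recomputed from the per-task results above. The sketch below assumes each fused cell splits into TS, #Subtasks, and STS in that order, and that the reported averages are unweighted means over the ten tasks. The helper name `macro_average` is illustrative.

```python
# (TS, STS) per task, in the order the tasks appear in the results above.
# #Subtasks is omitted because it is not averaged here.
results = {
    "Centralized Planner (Oracle)": [
        (0.80, 1.00), (0.60, 0.90), (0.80, 0.93), (1.00, 1.00), (1.00, 1.00),
        (0.90, 0.99), (1.00, 1.00), (1.00, 1.00), (0.90, 0.98), (1.00, 1.00),
    ],
    "VoxPoser": [
        (0.00, 0.68), (0.00, 0.36), (0.50, 0.79), (0.00, 0.76), (0.00, 0.55),
        (0.40, 0.74), (0.30, 0.40), (0.00, 0.13), (0.00, 0.58), (0.00, 0.46),
    ],
    "PRISM (Ours)": [
        (0.80, 0.94), (0.60, 0.83), (0.70, 0.81), (0.70, 0.72), (0.50, 0.82),
        (0.90, 0.98), (1.00, 1.00), (0.60, 0.95), (0.70, 0.96), (0.70, 0.91),
    ],
}


def macro_average(rows):
    """Unweighted mean of TS and of STS across tasks."""
    n = len(rows)
    return sum(ts for ts, _ in rows) / n, sum(sts for _, sts in rows) / n


means = {model: macro_average(rows) for model, rows in results.items()}
for model, (ts, sts) in means.items():
    print(f"{model}: mean TS = {ts:.3f}, mean STS = {sts:.3f}")
```

Under these assumptions PRISM's means come out to 0.720 and 0.892, consistent with the 72% task success and 89% subtask success stated in the abstract.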