Multi-agent systems promise distributed, parallel execution in domains ranging from manufacturing to assistive care. However, enabling multiple agents to follow natural language instructions in dynamic environments remains challenging because of the lack of mechanisms for grounding language into temporally synchronized, vision-aware action plans. To address this limitation, we present PRISM, a hierarchical vision-language framework that grounds natural language instructions into synchronized multi-agent workflows by dynamically injecting synchronization constraints through predicate reasoning over multimodal observations and execution history. To enable systematic evaluation of multi-agent coordination, we also introduce CoRL-Bench, an extended benchmark of ten two-robot manipulation tasks spanning sequential coordination, parallel coordination, coupled interaction, and behavior-aware reasoning. In our experiments, PRISM significantly outperforms state-of-the-art planning frameworks, achieving an average task success rate of 72% and an average subtask success rate of 89%. These results indicate that PRISM bridges the gap between adaptability and temporal coordination in multi-agent systems.
Representative real-robot executions for different tasks.
CoRL-Bench task suite used for evaluation.
Metrics: Task Success (TS, ↑), number of subtasks (#ST, ↑), and Subtask Success (STS, ↑); ↑ indicates that higher is better.
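
Task categories: Sequential Coordination and Coupled Interaction.

| Model | Stacking blocks (TS) | Stacking blocks (#ST) | Stacking blocks (STS) | Stacking a pyramid (TS) | Stacking a pyramid (#ST) | Stacking a pyramid (STS) | Putting on the saucepan (TS) | Putting on the saucepan (#ST) | Putting on the saucepan (STS) | Pushing the box to a target (TS) | Pushing the box to a target (#ST) | Pushing the box to a target (STS) | Putting in the drawer (TS) | Putting in the drawer (#ST) | Putting in the drawer (STS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|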
| Centralized Planner (Oracle) | 0.80 | 10.9 | 1.00 | 0.60 | 11.0 | 0.90 | 0.80 | 8.0 | 0.93 | 1.00 | 16.0 | 1.00 | 1.00 | 10.0 | 1.00 |
| VoxPoser | 0.00 | 10.0 | 0.68 | 0.00 | 11.8 | 0.36 | 0.50 | 7.0 | 0.79 | 0.00 | 11.2 | 0.76 | 0.00 | 9.0 | 0.55 |
| PRISM (Ours) | 0.80 | 11.0 | 0.94 | 0.60 | 11.0 | 0.83 | 0.70 | 8.0 | 0.81 | 0.70 | 14.0 | 0.72 | 0.50 | 8.0 | 0.82 |
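
Task categories: Parallel Coordination and Behavior-Aware Reasoning.

| Model | Sorting items (TS) | Sorting items (#ST) | Sorting items (STS) | Pushing buttons (TS) | Pushing buttons (#ST) | Pushing buttons (STS) | Inserting rings (TS) | Inserting rings (#ST) | Inserting rings (STS) | The shell game (TS) | The shell game (#ST) | The shell game (STS) | Handing over an item (TS) | Handing over an item (#ST) | Handing over an item (STS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|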
| Centralized Planner (Oracle) | 0.90 | 10.0 | 0.99 | 1.00 | 8.0 | 1.00 | 1.00 | 8.0 | 1.00 | 0.90 | 9.4 | 0.98 | 1.00 | 12.0 | 1.00 |
| VoxPoser | 0.40 | 10.2 | 0.74 | 0.30 | 8.0 | 0.40 | 0.00 | 12.8 | 0.13 | 0.00 | 14.5 | 0.58 | 0.00 | 10.0 | 0.46 |
| PRISM (Ours) | 0.90 | 10.0 | 0.98 | 1.00 | 8.0 | 1.00 | 0.60 | 15.0 | 0.95 | 0.70 | 13.0 | 0.96 | 0.70 | 12.0 | 0.91 |
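
The averages quoted in the abstract can be cross-checked against the per-task numbers above. The short sketch below copies the TS and STS columns from both tables (in task order) and takes an unweighted mean over the ten tasks. The names `RESULTS` and `average_scores` are illustrative, and the unweighted mean is an assumption about how the reported averages were aggregated.

```python
from statistics import fmean

# (TS, STS) per task, copied from the two result tables in task order:
# stacking blocks, stacking a pyramid, saucepan, pushing the box, drawer,
# sorting items, pushing buttons, inserting rings, shell game, handover.
RESULTS = {
    "Centralized Planner (Oracle)": [
        (0.80, 1.00), (0.60, 0.90), (0.80, 0.93), (1.00, 1.00), (1.00, 1.00),
        (0.90, 0.99), (1.00, 1.00), (1.00, 1.00), (0.90, 0.98), (1.00, 1.00),
    ],
    "VoxPoser": [
        (0.00, 0.68), (0.00, 0.36), (0.50, 0.79), (0.00, 0.76), (0.00, 0.55),
        (0.40, 0.74), (0.30, 0.40), (0.00, 0.13), (0.00, 0.58), (0.00, 0.46),
    ],
    "PRISM (Ours)": [
        (0.80, 0.94), (0.60, 0.83), (0.70, 0.81), (0.70, 0.72), (0.50, 0.82),
        (0.90, 0.98), (1.00, 1.00), (0.60, 0.95), (0.70, 0.96), (0.70, 0.91),
    ],
}


def average_scores(model):
    """Return the unweighted mean (TS, STS) over all ten tasks for one model."""
    rows = RESULTS[model]
    mean_ts = fmean(ts for ts, _ in rows)
    mean_sts = fmean(sts for _, sts in rows)
    return mean_ts, mean_sts


for model in RESULTS:
    ts, sts = average_scores(model)
    print(f"{model}: mean TS = {ts:.3f}, mean STS = {sts:.3f}")
```

Under this assumption, the PRISM rows average to a task success of about 0.72 and a subtask success of about 0.89, consistent with the 72% and 89% reported in the abstract.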