Agentic Scene Policies

Preprint

1Université de Montréal, 2Mila - Quebec AI Institute, 3Sapienza University of Rome

ASP implements a language-conditioned robot policy using scene query tools. Queries support (i) open-vocabulary object grounding, (ii) spatial reasoning, and (iii) part-level interactions, allowing LLM agents to achieve a wide range of tabletop and mobile manipulation behaviors.

Abstract

Executing open-ended natural language queries is a core problem in robotics. While recent advances in imitation learning and vision-language-action models (VLAs) have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. An alternative is to design an explicit scene representation as a queryable interface between the robot and the world, using query results to guide downstream motion planning. In this work, we present Agentic Scene Policies (ASP), an agentic framework that leverages the advanced semantic, spatial, and affordance-based querying capabilities of modern scene representations to implement a capable language-conditioned robot policy. ASP can execute open-vocabulary queries in a zero-shot manner, explicitly reasoning about object affordances for more complex skills. Through extensive experiments, we compare ASP with VLAs on tabletop manipulation problems and showcase how ASP can tackle room-level queries through affordance-guided navigation and a scaled-up scene representation.

Approach

A wide range of language-specified manipulation queries can be solved through object grounding, spatial reasoning, and affordance-level interactions. ASP is designed around a compact set of scene querying tools implementing these functionalities. Tools are functions that can be called by an LLM agent, and their outputs are fed back to the agent to inform the next tool call during execution.

All tools are implemented using an object-centric scene representation built from RGB-D data:

  • The object retrieval tool combines CLIP-based similarities and a VLM classification step to retrieve multiple relevant objects per query (if needed). Grounded objects are added to an inventory managed by the LLM agent.
  • Spatial reasoning tools check sizes, distances, and spatial predicates (e.g., is left of) over objects in the inventory.
  • The interact tool takes as input an object and a language description of the action the agent aims to achieve. Internally, interact processes the action description and object views to predict affordances (when needed) and the relevant skill to call.

Our skills are implemented using motion planning and do not require demonstrations, enabling strong zero-shot performance.
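
To make the tool interface concrete, the sketch below shows one way such tools could be exposed to a tool-calling LLM agent, with each result fed back to inform the next call. The schemas, the llm.next_tool_call interface, and the agent loop are illustrative assumptions, not the exact ASP implementation.

```python
# Illustrative sketch of a tool-calling loop around ASP-style scene tools.
# Names and signatures are assumptions for exposition.
import json

TOOL_SCHEMAS = [
    {"name": "retrieve_objects",
     "description": "Open-vocabulary grounding: CLIP ranking followed by a VLM "
                    "check; retrieved objects are added to the inventory.",
     "parameters": {"query": "str", "top_k": "int"}},
    {"name": "check_spatial_predicate",
     "description": "Evaluate sizes, distances, or relations (e.g., 'is left of') "
                    "between inventory objects.",
     "parameters": {"predicate": "str", "object_ids": "list[int]"}},
    {"name": "interact",
     "description": "Execute an action on an inventory object; maps the action "
                    "description to an affordance and a skill internally.",
     "parameters": {"object_id": "int", "action_description": "str"}},
]

def run_agent(llm, tools, user_query, max_steps=20):
    """Feed each tool result back to the LLM so it can decide the next call."""
    history = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        call = llm.next_tool_call(history, TOOL_SCHEMAS)   # assumed LLM interface
        if call is None:                                   # the agent declares the query done
            break
        result = tools[call["name"]](**call["arguments"])  # dispatch to the scene backend
        history.append({"role": "tool", "name": call["name"],
                        "content": json.dumps(result)})
    return history
```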



Affordance detection

ASP relies on affordance detection to implement part-level interactions for more complex skills. Our workflow builds on foundation models with strong visual grounding abilities and maps diverse queries to the relevant affordance types and object parts:

  • ring the desk bell → [tip_push] the bell button
  • pick up the mug → [grasp_part] the mug handle
  • open the drawer → [hook_pull] the drawer handle
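
A minimal sketch of this mapping step, assuming a generic VLM interface (vlm.query, vlm.ground_part) and the affordance types appearing on this page; the actual pipeline may differ:

```python
# Illustrative affordance-detection step: pick an affordance type and target
# part for the action, then ground that part visually on the object.
import json

AFFORDANCE_TYPES = ["tip_push", "grasp_part", "hook_pull", "pinch_pull"]

def detect_affordance(vlm, action_description, object_views):
    prompt = (
        f"The robot wants to: '{action_description}'. "
        f"Choose one affordance type from {AFFORDANCE_TYPES} and name the object "
        "part it applies to. Answer as JSON with keys 'type' and 'part'."
    )
    answer = json.loads(vlm.query(images=object_views, text=prompt))  # assumed VLM call
    part_region = vlm.ground_part(object_views, answer["part"])       # assumed visual grounding
    return answer["type"], part_region
```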

Manipulation queries

By combining an LLM agent with scene query tools, ASP spans a wide range of manipulation behaviors. For example, ASP can map different queries and objects to the same [pinch_pull] skill:

  • remove the thumbtack from the board → [pinch_pull] the thumbtack
  • unplug the power adapter → [pinch_pull] the power adapter
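
As an illustration, such a shared primitive might look like the sketch below; the planner API, offsets, and gripper widths are assumptions rather than the paper's skill code.

```python
# Hypothetical shared pinch-and-pull primitive used by both queries above.
def pinch_pull(planner, part_grasp_pose, pull_direction, pull_distance=0.08):
    """Approach the part, pinch it with the fingertips, then retract."""
    planner.move_to(part_grasp_pose, approach_offset=0.05)  # pre-grasp, then grasp
    planner.close_gripper(width=0.01)                       # fingertip pinch
    planner.move_linear(direction=pull_direction, distance=pull_distance)

# "remove the thumbtack from the board" -> pinch_pull targeting the thumbtack head
# "unplug the power adapter"            -> pinch_pull targeting the adapter body
```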

Mobile manipulation queries

Our scene representations can be extended to rooms. With an additional go_to tool, ASP can solve mobile manipulation queries involving spatial reasoning and multiple open-vocabulary objects such as:

  • place the egg that is near the tomato in the pan
  • put the headphones and the play-doh in the box
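
For instance, a hypothetical tool-call trace for the first query might look as follows, reusing the illustrative tool names from the sketches above; distance() is an assumed helper over inventory objects.

```python
# Hypothetical trace for "place the egg that is near the tomato in the pan".
eggs   = retrieve_objects("egg", top_k=3)          # several candidate eggs
tomato = retrieve_objects("tomato", top_k=1)[0]
pan    = retrieve_objects("pan", top_k=1)[0]

# Spatial reasoning picks the egg closest to the tomato.
target_egg = min(eggs, key=lambda egg: distance(egg, tomato))

go_to(target_egg, action_description="pick up the egg")
interact(target_egg, "pick up the egg")
go_to(pan, action_description="place the egg in the pan")
interact(pan, "place the held egg inside the pan")
```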

Affordance-guided navigation

Similar to the interact tool, go_to takes as input a description of the intended action, which it uses to detect affordances on the target object and inform navigation. This proves critical when the robot must face a specific part to interact with it, such as:

  • dial a number on the phone → navigate to face keypad and [tip_push] the keypad
  • open the metal cabinet → navigate to face handle and [hook_pull] the handle
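
A rough sketch of this behavior, assuming the detect_affordance helper from above and a generic scene API for 3D part grounding and navigation:

```python
# Illustrative affordance-guided navigation: detect the relevant part, then
# pick a base pose that faces it. All APIs here are assumptions.
import numpy as np

def go_to(obj, action_description, scene, standoff=0.6):
    """Navigate to a base pose that faces the object part needed for the action."""
    _, part_region = detect_affordance(scene.vlm, action_description, scene.views_of(obj))
    part_center, part_normal = scene.part_pose(obj, part_region)  # assumed 3D grounding
    base_xy = part_center[:2] + standoff * part_normal[:2]        # stand off along the part normal
    heading = part_center[:2] - base_xy
    yaw = np.arctan2(heading[1], heading[0])                      # face the part
    scene.navigate_to(base_xy, yaw)                               # assumed navigation call
```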

Comparison with π0-FAST and π0.5

We compare ASP with two state-of-the-art VLA models, π0-FAST and π0.5, on 15 manipulation tasks. We use the openpi DROID checkpoints out of the box.

While the π models perform well on some complex tasks, including double-grasp maneuvers to open drawers and some successful thumbtack removals, their overall success rate appears limited by fine-grained gripper interactions. The policies often "hover" around the correct objects without completing the grasp, or execute skills in midair above the object (e.g., the keyboard), suggesting sound high-level query understanding but potential issues with depth perception or overly conservative collision avoidance.




The 10x white drawer video is a π0-FAST rollout; all other videos show π0.5.

ASP Failures

The modular nature of ASP makes it possible to study specific failure modes. Most of our failures can be attributed to perception, including incorrect object retrieval or affordance detection. We provide a detailed breakdown of failure modes in the paper.


BibTeX

TODO