A wide range of language-specified manipulation queries can be solved through object grounding, spatial reasoning, and affordance-level interactions. ASP is designed around a compact set of scene-querying tools that implement these functionalities. Tools are functions that the LLM agent can call; their outputs are fed back to the agent to inform the next tool call during execution.
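The loop below is a minimal sketch of this tool-calling interface, assuming a Python implementation; the class, method, and field names (ToolAgent, run, history) are illustrative and not ASP's actual code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolAgent:
    llm: Callable[[list[dict]], dict]      # returns {"tool": ..., "args": ...} or {"answer": ...}
    tools: dict[str, Callable[..., Any]]   # e.g. {"retrieve": ..., "interact": ...}
    history: list[dict] = field(default_factory=list)

    def run(self, query: str, max_steps: int = 10) -> str:
        self.history.append({"role": "user", "content": query})
        for _ in range(max_steps):
            decision = self.llm(self.history)        # agent picks the next tool call
            if "answer" in decision:                 # task solved, stop
                return decision["answer"]
            result = self.tools[decision["tool"]](**decision["args"])
            # The tool output is fed back to the agent to inform the next call.
            self.history.append({"role": "tool",
                                 "name": decision["tool"],
                                 "content": str(result)})
        return "max steps reached"
```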
All tools are implemented using an object-centric scene representation built from RGB-D data:
- The object retrieval tool combines CLIP-based similarity ranking with a VLM classification step, and can retrieve multiple relevant objects per query when needed (see the retrieval sketch after this list). Grounded objects are added to an inventory managed by the LLM agent.
- Spatial reasoning tools let the agent check sizes, distances, and spatial predicates (e.g., is left of) over objects in the inventory (see the predicate sketch after this list).
- The interact tool takes as input an object and a language description of the action the agent aims to achieve. Internally, interact processes the action description and object views to select the relevant skill and, when needed, to predict affordances.
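The following retrieval sketch illustrates the idea under simple assumptions: each inventory entry stores a unit-normalized CLIP image embedding and a set of object views, and clip_text_embed and vlm_confirms are hypothetical wrappers around the CLIP text encoder and the VLM check, not ASP's actual interfaces.

```python
import numpy as np

def retrieve(query: str,
             inventory: list[dict],   # each entry: {"name", "clip_embed", "views", ...}
             clip_text_embed,         # str -> unit-norm np.ndarray (CLIP text encoder)
             vlm_confirms,            # (query, views) -> bool (VLM yes/no check)
             top_k: int = 5) -> list[dict]:
    """Shortlist objects by CLIP similarity, then keep those the VLM accepts."""
    text = clip_text_embed(query)
    scores = [float(np.dot(obj["clip_embed"], text)) for obj in inventory]
    ranked = sorted(zip(scores, inventory), key=lambda p: p[0], reverse=True)
    shortlist = [obj for _, obj in ranked[:top_k]]
    # The VLM step may accept several candidates, so a single query can
    # ground multiple relevant objects; all accepted objects are returned.
    return [obj for obj in shortlist if vlm_confirms(query, obj["views"])]
```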
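Spatial predicates can be evaluated directly on the object-centric representation. The sketch below assumes each inventory entry stores a 3D centroid in a camera frame whose x axis points right; the frame convention and field names are illustrative assumptions.

```python
import numpy as np

def is_left_of(obj_a: dict, obj_b: dict, margin: float = 0.01) -> bool:
    """True if obj_a lies left of obj_b by more than `margin` meters (camera x axis)."""
    return obj_a["centroid"][0] < obj_b["centroid"][0] - margin

def distance(obj_a: dict, obj_b: dict) -> float:
    """Euclidean distance between object centroids, in meters."""
    a = np.asarray(obj_a["centroid"], dtype=float)
    b = np.asarray(obj_b["centroid"], dtype=float)
    return float(np.linalg.norm(a - b))
```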
Our skills are implemented using motion planning and do not require demonstrations, enabling strong zero-shot performance.
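As a rough illustration of how interact could dispatch a language-specified action to one of these skills, the sketch below assumes hypothetical helpers select_skill and predict_affordance and motion-planning-based skill executors; none of these names are ASP's actual interfaces.

```python
def interact(obj: dict, action: str,
             select_skill,        # (action, views) -> skill name, e.g. "pick" or "open"
             predict_affordance,  # (skill name, views) -> affordance (e.g., grasp region) or None
             skills: dict):       # skill name -> motion-planning-based executor
    """Map a language action on a grounded object to a skill call."""
    skill_name = select_skill(action, obj["views"])
    # Affordances (e.g., where to grasp) are predicted only when the skill needs them.
    affordance = predict_affordance(skill_name, obj["views"])
    # The chosen skill plans a collision-free motion toward the affordance;
    # no demonstrations are used, so the call runs zero-shot.
    return skills[skill_name](obj, affordance)
```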