Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints
OWL-TAMP operating on a mobile manipulator to solve the problem "Put the orange on the table where the apple initially is". The skill currently being executed, along with any VLM-generated continuous constraints on its output, is overlaid on the video.
Approach Overview

Leveraging Vision-Language Models (VLMs) within TAMP. We solve open-world, long-horizon manipulation problems by using a VLM to generate both discrete and continuous constraints for a task and motion planning (TAMP) system. We first have the VLM generate a partial plan skeleton: a sequence of discrete action choices with no continuous parameters. The VLM also associates each action in this sequence with a language description ("place the orange where the apple is") of what the action should accomplish. We then have the VLM generate continuous constraint functions for each action based on its language description (e.g. testing that the orange is actually placed within the bounding box of the apple's original pose). We pass these constraints to a TAMP system, which produces a full plan that solves the task while respecting both built-in robotic constraints (e.g. inverse kinematics and reachability, collision avoidance) and the VLM-generated constraints.
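To make the pipeline above more concrete, here is a minimal Python sketch of a partial plan skeleton and one continuous constraint of the kind described (a placement must fall inside the bounding box of the apple's original pose). It is an illustration only, not the OWL-TAMP implementation: the names `Box2D`, `SkeletonStep`, `make_within_box_constraint`, and `sample_placement` are hypothetical, the coordinates are made up, and simple rejection sampling stands in for a real TAMP system, which would also enforce inverse kinematics, reachability, and collision avoidance.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) position on the table plane, in meters


@dataclass(frozen=True)
class Box2D:
    """Axis-aligned bounding box on the table plane (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, p: Point) -> bool:
        return (self.x_min <= p[0] <= self.x_max
                and self.y_min <= p[1] <= self.y_max)


@dataclass
class SkeletonStep:
    """One discrete action in a partial plan skeleton plus its description."""
    action: str
    obj: str
    description: str
    # Continuous constraint on the action's output; None means unconstrained.
    constraint: Optional[Callable[[Point], bool]] = None


def make_within_box_constraint(box: Box2D) -> Callable[[Point], bool]:
    """Build a test of the kind a VLM might write: is the placement in `box`?"""
    def constraint(placement: Point) -> bool:
        return box.contains(placement)
    return constraint


def sample_placement(step: SkeletonStep, table: Box2D, rng: random.Random,
                     max_tries: int = 10_000) -> Optional[Point]:
    """Rejection-sample a table position that satisfies the step's constraint.

    Stand-in for the planner's continuous search; returns None on failure.
    """
    for _ in range(max_tries):
        p = (rng.uniform(table.x_min, table.x_max),
             rng.uniform(table.y_min, table.y_max))
        if step.constraint is None or step.constraint(p):
            return p
    return None


# Assumed scene geometry (made-up numbers for illustration).
table = Box2D(0.0, -0.5, 1.0, 0.5)
apple_initial_box = Box2D(0.40, -0.05, 0.50, 0.05)

# Partial plan skeleton: discrete actions only, no continuous parameters yet.
skeleton: List[SkeletonStep] = [
    SkeletonStep("pick", "apple", "grasp the apple"),
    SkeletonStep("place", "apple", "move the apple out of the way"),
    SkeletonStep("pick", "orange", "grasp the orange"),
    SkeletonStep("place", "orange", "place the orange where the apple is",
                 constraint=make_within_box_constraint(apple_initial_box)),
]

rng = random.Random(0)
placement = sample_placement(skeleton[-1], table, rng)
print("orange placement:", placement)
```

In this sketch the constraint is an ordinary boolean function over a continuous parameter, so the planner can treat it as one more test alongside its built-in ones when it fills in the continuous values for each skeleton step.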
Real-World Demos
Each video below shows our method solving a different real-world manipulation task. For each task, we simply changed the natural-language goal and the set of objects in front of the robot. Our system automatically runs perception to identify and segment objects before calling OWL-TAMP. All videos are shown at 8x speed.
Goal Description: "Place the red block so that it's aligned with the other two blocks"
Goal Description: "Put the orange on the far right of the table and the apple on the far left."
Goal Description: "Put the orange and apple on the plate."
Goal Description: "Put the apple left of the plate and the orange on the table surface behind of the plate."
Goal Description: "Put the green block between the blue and red one."
Goal Description: "Stack the blocks into a tower by increasing hue."
Goal Description: "Put the blue block onto the plate"
Goal Description: "Put the brownie ingredients in front of the pan"
Goal Description: "Clean the plate"
Goal Description: "Fit one of the fruit in the cup"
Goal Description: "Put the banana near the other fruit"
Goal Description: "Fry two eggs at the front of the pan"
Goal Description: "Fry the spam on the pan and serve it on the plate"
Goal Description: "Setup the cutlery for someone to eat a meal from the plate. All the cutlery should be close to and lined-up with the plate, and should be oriented so each is straight and facing forwards, though you should pick which side of the plate each of the items are on"
Goal Description: "Place the strawberry and lime each in the bin that matches their color."
Goal Description: "Throw away anything not vegan in the purple bin"
Goal Description: "Place the cutlery in the utensil holder. All the cutlery should be oriented straight and facing forward"
Goal Description: "Weigh the shortest object and put it in the bin"
Simulation Results
The videos below show our approach operating in simulation to solve a variety of manipulation tasks. All videos are shown at 1x speed.
Goal Description: "Put the strawberry onto the light-grey region at the center of the table"
Goal Description: "Pack the citrus fruit onto the plate"
Goal Description: "Put the strawberry onto the light-grey region at the center of the table"
Goal Description: "Cook the strawberry by putting it in the pan, then finally simply place it in the bowl. The strawberry should only be in the bowl at the end!"
Goal Description: "Put all the fruit to the left of the line bisecting the table"
Goal Description: "I want to pour some coffee into the cup; can you set up the cup on the table so I can do this properly?"
Goal Description: "Setup the mug so it's upright, then put whatever object that fits inside of it"
Goal Description: "Place cutlery inside the mug and then place the mug itself on the table near the condiment"
Goal Description: "Place cutlery inside the mug and then place the mug itself on the table near the condiment"
Goal Description: "Serve the fruits on the white mat (make sure the peach is to the right of the apple) and pour soup into the red container"
Citation
@article{kumar2024owltamp,
  title={Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints},
  author={Kumar, Nishanth and Shen, William and Ramos, Fabio and Fox, Dieter and Lozano-P{\'e}rez, Tom{\'a}s and Kaelbling, Leslie Pack and Garrett, Caelan Reed},
  journal={arXiv preprint arXiv:2411.08253},
  year={2024}
}
February 2025 — Licensed under a Creative Commons Attribution 4.0 International License.