Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints

Overview Video

OWL-TAMP operating on a mobile manipulator to solve the problem "Put the orange on the table where the apple initially is". The skill currently being executed, along with any VLM-generated continuous constraints on its output, is overlaid on the video.


Approach Overview

pipeline figure

Leveraging Vision-Language Models (VLMs) within TAMP. We solve open-world, long-horizon manipulation problems by using a VLM to generate both discrete and continuous constraints for a task and motion planning (TAMP) system. We first have the VLM generate a partial plan skeleton (a sequence of discrete action choices with no continuous parameters). The VLM also associates each action in this sequence with a language description ("place the orange where the apple is") of what the action should accomplish. We then have the VLM generate continuous constraint functions for each action, conditioned on its language description (e.g., testing that the orange is actually placed within the bounding box of the apple's original pose). We pass these constraints to a TAMP system, which produces a full plan that solves the task while respecting both built-in robotic constraints (e.g., inverse kinematics, reachability, and collision avoidance) and the VLM-generated constraints.
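To make the idea of a VLM-generated continuous constraint concrete, here is a minimal Python sketch of the bounding-box example mentioned above. It is an illustration only, not the OWL-TAMP implementation: the names `AABB`, `SkeletonAction`, `make_within_footprint_constraint`, and `satisfies_all`, as well as the numeric coordinates, are hypothetical and chosen purely for this example. The sketch assumes placements are represented as (x, y, z) positions and that a constraint is a boolean test on a candidate placement.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Position = Tuple[float, float, float]  # (x, y, z) in metres; assumed world frame
Constraint = Callable[[Position], bool]


@dataclass(frozen=True)
class AABB:
    """Axis-aligned bounding box given by its lower and upper corners."""
    lower: Position
    upper: Position

    def contains_xy(self, point: Position) -> bool:
        """True if the point's x and y coordinates lie inside the box footprint."""
        return all(self.lower[i] <= point[i] <= self.upper[i] for i in range(2))


@dataclass
class SkeletonAction:
    """One discrete action in a plan skeleton (hypothetical representation).

    It carries a language description and a list of continuous constraints,
    but no continuous parameters; those are left for the planner to choose.
    """
    name: str
    args: Tuple[str, ...]
    description: str
    constraints: List[Constraint] = field(default_factory=list)


def make_within_footprint_constraint(region: AABB) -> Constraint:
    """Build a test that a placement position lies within a region's footprint."""
    def constraint(placement: Position) -> bool:
        return region.contains_xy(placement)
    return constraint


def satisfies_all(action: SkeletonAction, placement: Position) -> bool:
    """Check a candidate placement against every constraint on an action."""
    return all(check(placement) for check in action.constraints)


# Made-up bounding box of the apple at its initial pose.
apple_initial_box = AABB(lower=(0.40, -0.05, 0.70), upper=(0.50, 0.05, 0.80))

# A partial skeleton: only some actions are listed, and a planner would be
# expected to add any others it needs (for example, moving the apple away).
skeleton = [
    SkeletonAction("pick", ("orange",), "grasp the orange"),
    SkeletonAction(
        "place",
        ("orange", "table"),
        "place the orange where the apple is",
        constraints=[make_within_footprint_constraint(apple_initial_box)],
    ),
]

place_orange = skeleton[1]
inside = satisfies_all(place_orange, (0.45, 0.00, 0.75))   # within the footprint
outside = satisfies_all(place_orange, (0.60, 0.00, 0.75))  # x is past the box
```

In this sketch a planner could use `satisfies_all` as a filter on sampled placements, keeping only candidates that pass both its own built-in checks and the language-derived test. Representing constraints as plain callables is one possible design choice; it keeps the planner agnostic to how each test was produced.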

Real-World Demos

Each video below shows our method solving a different real-world manipulation task. For each task, we simply changed the natural-language goal and the set of objects in front of the robot. Our system automatically runs perception to identify and segment objects before calling OWL-TAMP. All videos are shown at 8x speed.

Task 1: Align Blocks

Goal Description: "Place the red block so that it's aligned with the other two blocks"

Task 2: Apple Orange Placement 1

Goal Description: "Put the orange on the far right of the table and the apple on the far left."

Task 3: Apple Orange Placement 2

Goal Description: "Put the orange and apple on the plate."

Task 4: Apple Orange Placement 3

Goal Description: "Put the apple left of the plate and the orange on the table surface behind of the plate."

Task 5: Between Blocks

Goal Description: "Put the green block between the blue and red one."

Task 6: Block Stacking

Goal Description: "Stack the blocks into a tower by increasing hue."

Task 7: Blue Plate Arrangement

Goal Description: "Put the blue block onto the plate"

Task 8: Brownie Placement

Goal Description: "Put the brownie ingredients in front of the pan"

Task 9: Clean Plate

Goal Description: "Clean the plate"

Task 10: Fit in Mug

Goal Description: "Fit one of the fruit in the cup"

Task 11: Fruit Near

Goal Description: "Put the banana near the other fruit"

Task 12: Fry Eggs

Goal Description: "Fry two eggs at the front of the pan"

Task 13: Fry Spam

Goal Description: "Fry the spam on the pan and serve it on the plate"

Task 14: Plate Setup

Goal Description: "Setup the cutlery for someone to eat a meal from the plate. All the cutlery should be close to and lined-up with the plate, and should be oriented so each is straight and facing forwards, though you should pick which side of the plate each of the items are on"

Task 15: Fruit Sorting

Goal Description: "Place the strawberry and lime each in the bin that matches their color."

Task 16: Trash Non-Vegan

Goal Description: "Throw away anything not vegan in the purple bin"

Task 17: Utensil Holder

Goal Description: "Place the cutlery in the utensil holder. All the cutlery should be oriented straight and facing forward"

Task 18: Weigh Object

Goal Description: "Weigh the shortest object and put it in the bin"

Simulation Results

The videos below show our approach operating in simulation to solve a variety of manipulation tasks. All videos are shown at 1x speed.

Berry1

Goal Description: "Put the strawberry onto the light-grey region at the center of the table"

Citrus

Goal Description: "Pack the citrus fruit onto the plate"

Berry2

Goal Description: "Put the strawberry onto the light-grey region at the center of the table"

BerryCook

Goal Description: "Cook the strawberry by putting it in the pan, then finally simply place it in the bowl. The strawberry should only be in the bowl at the end!"

FruitSort

Goal Description: "Put all the fruit to the left of the line bisecting the table"

Coffee

Goal Description: "I want to pour some coffee into the cup; can you set up the cup on the table so I can do this properly?"

Mug1

Goal Description: "Setup the mug so it's upright, then put whatever object that fits inside of it"

Mug2

Goal Description: "Place cutlery inside the mug and then place the mug itself on the table near the condiment"

Mug3

Goal Description: "Place cutlery inside the mug and then place the mug itself on the table near the condiment"

SoupPour

Goal Description: "Serve the fruits on the white mat (make sure the peach is to the right of the apple) and pour soup into the red container"

Citation

@article{kumar2024owltamp,
    title={Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints},
    author={Kumar, Nishanth and Shen, William and Ramos, Fabio and Fox, Dieter and Lozano-P{\'e}rez, Tom{\'a}s and Kaelbling, Leslie Pack and Garrett, Caelan Reed},
    journal={arXiv preprint arXiv:2411.08253},
    year={2024}
}

February 2025 — Licensed under a Creative Commons Attribution 4.0 International License.