A core challenge in scaling up Deep Reinforcement Learning (Deep RL) to robotic tasks of practical interest is the task specification problem, which typically manifests as the difficulty of reward design. To ease reward design in continuous robotics environments, we develop a method that leverages abstract task plans to automatically densify sparse, goal-based rewards while preserving the optimal policy.

We hypothesize that for many robotic tasks,

  • while it is difficult for humans to specify a dense reward that cannot be hacked, it is easy to specify an abstract plannable model in PDDL that conveys information about the dynamics of the domain, and
  • that valid abstract plans within this model can be leveraged to automatically densify the sparse reward via potential-based reward shaping (see the shaping equation below), sufficiently for state-of-the-art RL approaches to solve these tasks.
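
Concretely, potential-based reward shaping (Ng, Harada, and Russell, 1999) adds the discounted difference of a potential function Φ over successive states to the environment reward, and shaping of this form provably leaves the optimal policy unchanged. In our setting, Φ scores progress along the abstract plan; how Φ is derived is exactly what the variants listed below explore. In LaTeX notation:

    r'(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s)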

We perform an extensive empirical evaluation of our system across PDDL models of varying granularity, choices of potential function, learning algorithms (PPO and SAC), and tasks.
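
As a minimal sketch of how such shaping slots into training (assuming a Gymnasium-style environment; the wrapper and the example potential below are illustrative stand-ins, not our exact implementation):

    # Illustrative only: a Gymnasium wrapper that applies potential-based
    # shaping, r' = r + gamma * phi(s') - phi(s), on top of a sparse reward.
    import gymnasium as gym
    import numpy as np


    class ShapedReward(gym.Wrapper):
        def __init__(self, env, potential, gamma=0.99):
            super().__init__(env)
            self.potential = potential  # maps an observation to a scalar
            self.gamma = gamma
            self._last_phi = None

        def reset(self, **kwargs):
            obs, info = self.env.reset(**kwargs)
            self._last_phi = self.potential(obs)
            return obs, info

        def step(self, action):
            obs, reward, terminated, truncated, info = self.env.step(action)
            phi = self.potential(obs)
            shaped = reward + self.gamma * phi - self._last_phi
            self._last_phi = phi
            return obs, shaped, terminated, truncated, info


    def nearest_subgoal_potential(subgoals):
        # Hypothetical potential: negative distance to the nearest plan
        # subgoal, one plausible reading of a "distance varying" variant.
        # Assumes the first 3 observation dims are the end-effector position.
        def phi(obs):
            pos = np.asarray(obs[:3])
            return -min(np.linalg.norm(pos - g) for g in subgoals)
        return phi

Because only the reward signal changes, the same off-the-shelf PPO or SAC learner can be trained against any choice of potential without modification.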

Report | Presentation | Code


Environments

Reaching - PPO

Handcrafted: Sparse | Dense
Plan Based: Single Subgoal | Multi Subgoal | Grid Based
Time Varying: Single Subgoal | Multi Subgoal | Grid Based
Distance Varying: Single Subgoal | Multi Subgoal | Grid Based


Reaching - SAC

Handcrafted: Sparse | Dense
Plan Based: Single Subgoal | Multi Subgoal | Grid Based
Time Varying: Single Subgoal | Multi Subgoal | Grid Based
Distance Varying: Single Subgoal | Multi Subgoal | Grid Based


Pushing - PPO

Handcrafted: Sparse | Dense
Plan Based: Single Subgoal | Multi Subgoal | Grid Based
Time Varying: Single Subgoal | Multi Subgoal | Grid Based
Distance Varying: Single Subgoal | Multi Subgoal | Grid Based


Pushing - SAC

Handcrafted: Sparse | Dense
Plan Based: Single Subgoal | Multi Subgoal | Grid Based
Time Varying: Single Subgoal | Multi Subgoal | Grid Based
Distance Varying: Single Subgoal | Multi Subgoal | Grid Based


Maze-Reach - PPO

Handcrafted: Sparse
Distance Varying: Single Subgoal | Multi Subgoal | Grid Based