PWM: Policy Learning with Large World Models

Georgia Institute of Technology · UC San Diego · Nvidia


Reinforcement Learning (RL) has achieved impressive results on complex tasks but struggles in multi-task settings with different embodiments. World models offer scalability by learning a simulation of the environment, yet they often rely on inefficient gradient-free optimization methods. We introduce Policy learning with large World Models (PWM), a novel model-based RL algorithm that learns continuous control policies from large multi-task world models. By pre-training the world model on offline data and using it for first-order gradient policy learning, PWM effectively solves tasks with up to 152 action dimensions and outperforms methods using ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without the need for expensive online planning.


Method overview

We introduce Policy learning with large World Models (PWM), a novel Model-Based RL (MBRL) algorithm and framework aimed at deriving effective continuous control policies from large, multi-task world models. We use pre-trained TD-MPC2 world models to efficiently learn control policies with first-order gradients, in under 10 minutes per task. Our empirical evaluations on complex locomotion tasks indicate that PWM not only achieves higher rewards than baselines but also outperforms methods that use ground-truth simulation dynamics.
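The core training loop is simple: roll the policy out inside the frozen world model for a short horizon, sum the predicted rewards plus a terminal value estimate, and backpropagate through the rollout to update the policy. Below is a minimal PyTorch sketch of this idea under assumed interfaces; encoder, dynamics, reward_model, and value_fn stand in for the pre-trained TD-MPC2 components, and policy for a reparameterized actor. These names are illustrative, not the actual TD-MPC2 or PWM API.

import torch

def first_order_policy_loss(encoder, dynamics, reward_model, value_fn,
                            policy, obs, horizon=16, gamma=0.99):
    # Encode observations into the world model's latent space.
    z = encoder(obs)
    ret = torch.zeros(obs.shape[0], device=obs.device)
    discount = 1.0
    for _ in range(horizon):
        a = policy(z)  # reparameterized action sample, so gradients flow into the policy
        # Assumes reward_model returns shape (batch, 1).
        ret = ret + discount * reward_model(z, a).squeeze(-1)
        z = dynamics(z, a)  # gradients also flow through the learned dynamics
        discount *= gamma
    # Bootstrap the return beyond the horizon with a learned value function.
    ret = ret + discount * value_fn(z).squeeze(-1)
    return -ret.mean()  # minimize negative predicted return

Note that the world model stays frozen throughout; as we read the paper, only the policy (and a critic trained on TD targets) receive gradient updates, which is what makes per-task training take only minutes.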
[Figure: PWM teaser results — left: single-task control; right: 80-task setting]
We evaluate PWM on high-dimensional continuous control tasks (left figure) and find that it not only outperforms the model-free baselines SAC and PPO but also achieves higher rewards than SHAC, a method that directly uses the dynamics and reward function of the simulator. In an 80-task setting (right figure) using a large 48M-parameter world model, PWM consistently outperforms TD-MPC2, an MBRL method that uses the same world model, without the need for online planning.

Single-task results

[Figure: aggregated single-task results]
The figure shows the interquartile mean (IQM) with solid lines, the mean with dashed lines, and 95% confidence intervals, aggregated over all 5 tasks and 5 random seeds. PWM achieves higher rewards than the model-free baselines PPO and SAC, than TD-MPC2 (which uses the same world model as PWM), and than SHAC (which uses the ground-truth dynamics and reward functions of the simulator). These results indicate that well-regularized world models can smooth out the optimization landscape, allowing for better first-order gradient optimization.
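To make the smoothing argument concrete, the objective being differentiated can be written as follows (our notation, not necessarily the paper's): with latent dynamics $F_\phi$, learned reward $R_\phi$, critic $V_\psi$, and policy $\pi_\theta$,

\[
J(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{H-1} \gamma^t R_\phi(z_t, a_t) + \gamma^H V_\psi(z_H)\right],
\qquad z_{t+1} = F_\phi(z_t, a_t),\quad a_t \sim \pi_\theta(\cdot \mid z_t).
\]

The gradient $\nabla_\theta J$ is obtained by backpropagating through $F_\phi$ and $R_\phi$; when these learned functions are smoother than the raw simulator (e.g., around contact discontinuities), the resulting first-order gradients are better conditioned than gradients taken through the true dynamics.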

Multi-task results

[Figure: full multi-task results]
The figure shows the performance of PWM and TD-MPC2 on the 30-task and 80-task benchmarks, with results over 10 random seeds. PWM outperforms TD-MPC2 while using the same world model and no form of online planning, making it the more scalable approach to large world models. The right figure compares PWM, a single multi-task policy, with the single-task experts SAC and DreamerV3. Notably, PWM matches their performance despite being multi-task and trained only on offline data.


Citation

@misc{georgiev2024pwm,
  title={PWM: Policy Learning with Large World Models},
  author={Ignat Georgiev and Varun Giridhar and Nicklas Hansen and Animesh Garg},
  eprint={2407.02466},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  year={2024}
}