Adds distributed multi-GPU learning support for skrl (#574)

This PR updates `skrl` to version `>=1.2.0` and includes support for distributed multi-GPU learning in the Isaac Lab docs. See [skrl-v1.2.0](https://github.com/Toni-SM/skrl/releases/tag/1.2.0) for more details.

## Type of change

- New feature (non-breaking change which adds functionality)

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have run all the tests with `./isaaclab.sh --test` and they pass
- [x] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Commit 35345c63, authored Jun 27, 2024 by Toni-SM, committed by GitHub on Jun 27, 2024
7 changed files
--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -10,7 +10,7 @@ Multi-GPU Training
 ------------------
 For complex reinforcement learning environments, it may be desirable to scale up training across multiple GPUs.
-This is possible in Isaac Lab with the ``rl_games`` RL library through the use of the
+This is possible in Isaac Lab with the ``rl_games`` and ``skrl`` RL libraries through the use of the
 `PyTorch distributed <https://pytorch.org/docs/stable/distributed.html>`_ framework.
 In this workflow, ``torch.distributed`` is used to launch multiple processes of training, where the number of
 processes must be equal to or less than the number of GPUs available. Each process runs on
@@ -23,12 +23,23 @@ at the end of the epoch.
    :align: center
    :alt: Multi-GPU training paradigm
+|
 To train with multiple GPUs, use the following command, where ``--proc_per_node`` represents the number of available GPUs:
-.. code-block:: shell
+.. tabs::
-    python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+    .. group-tab:: rl_games
+        .. code-block:: shell
+            python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+    .. group-tab:: skrl
+        .. code-block:: shell
+            python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 source/standalone/workflows/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
 Due to limitations of NCCL on Windows, this feature is currently supported on Linux only.
@@ -41,17 +52,37 @@ To scale up training beyond multiple GPUs on a single machine, it is also possib
 To train across multiple nodes/machines, it is required to launch an individual process on each node.
 For the master node, use the following command, where ``--proc_per_node`` represents the number of available GPUs, and ``--nnodes`` represents the number of nodes:
-.. code-block:: shell
+.. tabs::
+    .. group-tab:: rl_games
+        .. code-block:: shell
+            python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
-    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+    .. group-tab:: skrl
+        .. code-block:: shell
+            python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 source/standalone/workflows/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
 Note that the port (``5555``) can be replaced with any other available port.
 For non-master nodes, use the following command, replacing ``--node_rank`` with the index of each machine:
-.. code-block:: shell
+.. tabs::
+    .. group-tab:: rl_games
+        .. code-block:: shell
+            python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=ip_of_master_machine:5555 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+    .. group-tab:: skrl
+        .. code-block:: shell
-    python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=ip_of_master_machine:5555 source/standalone/workflows/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+            python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=ip_of_master_machine:5555 source/standalone/workflows/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
 For more details on multi-node training with PyTorch, please visit the `PyTorch documentation <https://pytorch.org/tutorials/intermediate/ddp_series_multinode.html>`_. As mentioned in the PyTorch documentation, "multinode training is bottlenecked by inter-node communication latencies". When this latency is high, it is possible multi-node training will perform worse than running on a single node instance.
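
As a rough illustration of what each launched process sees, the sketch below reads the `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` environment variables that `torch.distributed.run` exports to every worker and derives a per-process device string from them. The `WorkerInfo` type and the `worker_info` helper are hypothetical names used only for this example; they are not part of Isaac Lab or skrl.

```python
import os
from typing import Mapping, NamedTuple


class WorkerInfo(NamedTuple):
    """Hypothetical container for the identity of one training process."""

    local_rank: int  # index of this process on its own node
    rank: int  # global index of this process across all nodes
    world_size: int  # total number of processes in the job
    device: str  # device string derived from the local rank


def worker_info(env: Mapping[str, str] = os.environ) -> WorkerInfo:
    """Read the launcher-provided variables, defaulting to a single-process run."""
    local_rank = int(env.get("LOCAL_RANK", "0"))
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    # one process per GPU: the local rank selects the GPU on this node
    return WorkerInfo(local_rank, rank, world_size, f"cuda:{local_rank}")


if __name__ == "__main__":
    print(worker_info())
```

With `--nnodes=2 --nproc_per_node=2`, the second process on the second node would be expected to see `LOCAL_RANK=1`, `RANK=3`, and `WORLD_SIZE=4`, and so would select `cuda:1` under this scheme.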

--- a/source/extensions/omni.isaac.lab_tasks/config/extension.toml
+++ b/source/extensions/omni.isaac.lab_tasks/config/extension.toml
 [package]
 # Note: Semantic Versioning is used: https://semver.org/
-version = "0.7.7"
+version = "0.7.8"
 # Description
 title = "Isaac Lab Environments"
@@ -22,7 +22,7 @@ requirements = [
    "stable-baselines3>=2.1",
    "rl-games==1.6.1",
    "rsl-rl@git+https://github.com/leggedrobotics/rsl_rl.git",
-    "skrl>=1.1.0"
+    "skrl>=1.2.0"
 ]
 modules = [

--- a/source/extensions/omni.isaac.lab_tasks/docs/CHANGELOG.rst
+++ b/source/extensions/omni.isaac.lab_tasks/docs/CHANGELOG.rst
 Changelog
 ---------
+0.7.8 (2024-06-26)
+~~~~~~~~~~~~~~~~~~
+Changed
+^^^^^^^
+* Updated the skrl RL library integration to the latest release (>= 1.2.0)
 0.7.7 (2024-06-14)
 ~~~~~~~~~~~~~~~~~~

--- a/source/extensions/omni.isaac.lab_tasks/omni/isaac/lab_tasks/utils/wrappers/skrl.py
+++ b/source/extensions/omni.isaac.lab_tasks/omni/isaac/lab_tasks/utils/wrappers/skrl.py
@@ -19,23 +19,16 @@ Or, equivalently, by directly calling the skrl library API as follows:
    from skrl.envs.torch.wrappers import wrap_env
-    env = wrap_env(env, wrapper="isaac-orbit")
+    env = wrap_env(env, wrapper="isaaclab")
 """
 # needed to import for type hinting: Agent | list[Agent]
 from __future__ import annotations
-import copy
+from skrl.envs.wrappers.torch import wrap_env
-import torch
-import tqdm
-from skrl.agents.torch import Agent
-from skrl.envs.wrappers.torch import Wrapper, wrap_env
 from skrl.resources.preprocessors.torch import RunningStandardScaler  # noqa: F401
 from skrl.resources.schedulers.torch import KLAdaptiveLR  # noqa: F401
-from skrl.trainers.torch import Trainer
-from skrl.trainers.torch.sequential import SEQUENTIAL_TRAINER_DEFAULT_CONFIG
 from skrl.utils.model_instantiators.torch import Shape  # noqa: F401
 from omni.isaac.lab.envs import DirectRLEnv, ManagerBasedRLEnv
@@ -114,173 +107,4 @@ def SkrlVecEnvWrapper(env: ManagerBasedRLEnv):
            f"The environment must be inherited from ManagerBasedRLEnv or DirectRLEnv. Environment type: {type(env)}"
        )
    # wrap and return the environment
-    return wrap_env(env, wrapper="isaac-orbit")
+    return wrap_env(env, wrapper="isaaclab")
-"""
-Custom trainer for skrl.
-"""
-class SkrlSequentialLogTrainer(Trainer):
-    """Sequential trainer with logging of episode information.
-    This trainer inherits from the :class:`skrl.trainers.base_class.Trainer` class. It is used to
-    train agents in a sequential manner (i.e., one after the other in each interaction with the
-    environment). It is most suitable for on-policy RL agents such as PPO, A2C, etc.
-    It modifies the :class:`skrl.trainers.torch.sequential.SequentialTrainer` class with the following
-    differences:
-    * It also log episode information to the agent's logger.
-    * It does not close the environment at the end of the training.
-    Reference:
-        https://skrl.readthedocs.io/en/latest/api/trainers.html#base-class
-    """
-    def __init__(
-        self,
-        env: Wrapper,
-        agents: Agent | list[Agent],
-        agents_scope: list[int] | None = None,
-        cfg: dict | None = None,
-    ):
-        """Initializes the trainer.
-        Args:
-            env: Environment to train on.
-            agents: Agents to train.
-            agents_scope: Number of environments for each agent to
-                train on. Defaults to None.
-            cfg: Configuration dictionary. Defaults to None.
-        """
-        # update the config
-        _cfg = copy.deepcopy(SEQUENTIAL_TRAINER_DEFAULT_CONFIG)
-        _cfg.update(cfg if cfg is not None else {})
-        # store agents scope
-        agents_scope = agents_scope if agents_scope is not None else []
-        # initialize the base class
-        super().__init__(env=env, agents=agents, agents_scope=agents_scope, cfg=_cfg)
-        # init agents
-        if self.env.num_agents > 1:
-            for agent in self.agents:
-                agent.init(trainer_cfg=self.cfg)
-        else:
-            self.agents.init(trainer_cfg=self.cfg)
-    def train(self):
-        """Train the agents sequentially.
-        This method executes the training loop for the agents. It performs the following steps:
-        * Pre-interaction: Perform any pre-interaction operations.
-        * Compute actions: Compute the actions for the agents.
-        * Step the environments: Step the environments with the computed actions.
-        * Record the environments' transitions: Record the transitions from the environments.
-        * Log custom environment data: Log custom environment data.
-        * Post-interaction: Perform any post-interaction operations.
-        * Reset the environments: Reset the environments if they are terminated or truncated.
-        """
-        # init agent
-        self.agents.init(trainer_cfg=self.cfg)
-        self.agents.set_running_mode("train")
-        # reset env
-        states, infos = self.env.reset()
-        # training loop
-        for timestep in tqdm.tqdm(range(self.timesteps), disable=self.disable_progressbar):
-            # pre-interaction
-            self.agents.pre_interaction(timestep=timestep, timesteps=self.timesteps)
-            # compute actions
-            with torch.no_grad():
-                actions = self.agents.act(states, timestep=timestep, timesteps=self.timesteps)[0]
-            # step the environments
-            next_states, rewards, terminated, truncated, infos = self.env.step(actions)
-            # note: here we do not call render scene since it is done in the env.step() method
-            # record the environments' transitions
-            with torch.no_grad():
-                self.agents.record_transition(
-                    states=states,
-                    actions=actions,
-                    rewards=rewards,
-                    next_states=next_states,
-                    terminated=terminated,
-                    truncated=truncated,
-                    infos=infos,
-                    timestep=timestep,
-                    timesteps=self.timesteps,
-                )
-            # log custom environment data
-            if "log" in infos:
-                for k, v in infos["log"].items():
-                    if isinstance(v, torch.Tensor) and v.numel() == 1:
-                        self.agents.track_data(f"EpisodeInfo / {k}", v.item())
-            # post-interaction
-            self.agents.post_interaction(timestep=timestep, timesteps=self.timesteps)
-            # reset the environments
-            # note: here we do not call reset scene since it is done in the env.step() method
-            # update states
-            states.copy_(next_states)
-    def eval(self) -> None:
-        """Evaluate the agents sequentially.
-        This method executes the following steps in loop:
-        * Compute actions: Compute the actions for the agents.
-        * Step the environments: Step the environments with the computed actions.
-        * Record the environments' transitions: Record the transitions from the environments.
-        * Log custom environment data: Log custom environment data.
-        """
-        # set running mode
-        if self.num_agents > 1:
-            for agent in self.agents:
-                agent.set_running_mode("eval")
-        else:
-            self.agents.set_running_mode("eval")
-        # single agent
-        if self.num_agents == 1:
-            self.single_agent_eval()
-            return
-        # reset env
-        states, infos = self.env.reset()
-        # evaluation loop
-        for timestep in tqdm.tqdm(range(self.initial_timestep, self.timesteps), disable=self.disable_progressbar):
-            # compute actions
-            with torch.no_grad():
-                actions = torch.vstack([
-                    agent.act(states[scope[0] : scope[1]], timestep=timestep, timesteps=self.timesteps)[0]
-                    for agent, scope in zip(self.agents, self.agents_scope)
-                ])
-            # step the environments
-            next_states, rewards, terminated, truncated, infos = self.env.step(actions)
-            with torch.no_grad():
-                # write data to TensorBoard
-                for agent, scope in zip(self.agents, self.agents_scope):
-                    # track data
-                    agent.record_transition(
-                        states=states[scope[0] : scope[1]],
-                        actions=actions[scope[0] : scope[1]],
-                        rewards=rewards[scope[0] : scope[1]],
-                        next_states=next_states[scope[0] : scope[1]],
-                        terminated=terminated[scope[0] : scope[1]],
-                        truncated=truncated[scope[0] : scope[1]],
-                        infos=infos,
-                        timestep=timestep,
-                        timesteps=self.timesteps,
-                    )
-                    # log custom environment data
-                    if "log" in infos:
-                        for k, v in infos["log"].items():
-                            if isinstance(v, torch.Tensor) and v.numel() == 1:
-                                agent.track_data(k, v.item())
-                    # perform post-interaction
-                    super(type(agent), agent).post_interaction(timestep=timestep, timesteps=self.timesteps)
-                # reset environments
-                # note: here we do not call reset scene since it is done in the env.step() method
-                states.copy_(next_states)
--- a/source/extensions/omni.isaac.lab_tasks/setup.py
+++ b/source/extensions/omni.isaac.lab_tasks/setup.py
@@ -39,7 +39,7 @@ PYTORCH_INDEX_URL = ["https://download.pytorch.org/whl/cu118"]
 # Extra dependencies for RL agents
 EXTRAS_REQUIRE = {
    "sb3": ["stable-baselines3>=2.1"],
-    "skrl": ["skrl>=1.1.0"],
+    "skrl": ["skrl>=1.2.0"],
    "rl-games": ["rl-games==1.6.1", "gym"],  # rl-games still needs gym :(
    "rsl-rl": ["rsl-rl@git+https://github.com/leggedrobotics/rsl_rl.git"],
    "robomimic": [],

--- a/source/standalone/workflows/skrl/play.py
+++ b/source/standalone/workflows/skrl/play.py
@@ -60,7 +60,7 @@ def main():
    # create isaac environment
    env = gym.make(args_cli.task, cfg=env_cfg)
    # wrap around environment for skrl
-    env = SkrlVecEnvWrapper(env)  # same as: `wrap_env(env, wrapper="isaac-orbit")`
+    env = SkrlVecEnvWrapper(env)  # same as: `wrap_env(env, wrapper="isaaclab")`
    # instantiate models using skrl model instantiator utility
    # https://skrl.readthedocs.io/en/latest/api/utils/model_instantiators.html

--- a/source/standalone/workflows/skrl/train.py
+++ b/source/standalone/workflows/skrl/train.py
@@ -29,7 +29,11 @@ parser.add_argument(
 parser.add_argument("--num_envs", type=int, default=None, help="Number of environments to simulate.")
 parser.add_argument("--task", type=str, default=None, help="Name of the task.")
 parser.add_argument("--seed", type=int, default=None, help="Seed used for the environment")
+parser.add_argument(
+    "--distributed", action="store_true", default=False, help="Run training with multiple GPUs or nodes."
+)
 parser.add_argument("--max_iterations", type=int, default=None, help="RL Policy training iterations.")
 # append AppLauncher cli args
 AppLauncher.add_app_launcher_args(parser)
 # parse the arguments
@@ -50,6 +54,7 @@ from datetime import datetime
 from skrl.agents.torch.ppo import PPO, PPO_DEFAULT_CONFIG
 from skrl.memories.torch import RandomMemory
+from skrl.trainers.torch import SequentialTrainer
 from skrl.utils import set_seed
 from skrl.utils.model_instantiators.torch import deterministic_model, gaussian_model, shared_model
@@ -58,7 +63,7 @@ from omni.isaac.lab.utils.io import dump_pickle, dump_yaml
 import omni.isaac.lab_tasks  # noqa: F401
 from omni.isaac.lab_tasks.utils import load_cfg_from_registry, parse_env_cfg
-from omni.isaac.lab_tasks.utils.wrappers.skrl import SkrlSequentialLogTrainer, SkrlVecEnvWrapper, process_skrl_cfg
+from omni.isaac.lab_tasks.utils.wrappers.skrl import SkrlVecEnvWrapper, process_skrl_cfg
 def main():
@@ -86,6 +91,11 @@ def main():
    # update log_dir
    log_dir = os.path.join(log_root_path, log_dir)
+    # multi-gpu training config
+    if args_cli.distributed:
+        # update env config device
+        env_cfg.sim.device = f"cuda:{app_launcher.local_rank}"
    # max iterations for training
    if args_cli.max_iterations:
        experiment_cfg["trainer"]["timesteps"] = args_cli.max_iterations * experiment_cfg["agent"]["rollouts"]
@@ -110,7 +120,7 @@ def main():
        print_dict(video_kwargs, nesting=4)
        env = gym.wrappers.RecordVideo(env, **video_kwargs)
    # wrap around environment for skrl
-    env = SkrlVecEnvWrapper(env)  # same as: `wrap_env(env, wrapper="isaac-orbit")`
+    env = SkrlVecEnvWrapper(env)  # same as: `wrap_env(env, wrapper="isaaclab")`
    # set seed for the experiment (override from command line)
    set_seed(args_cli_seed if args_cli_seed is not None else experiment_cfg["seed"])
@@ -173,7 +183,8 @@ def main():
    # configure and instantiate a custom RL trainer for logging episode events
    # https://skrl.readthedocs.io/en/latest/api/trainers.html
    trainer_cfg = experiment_cfg["trainer"]
-    trainer = SkrlSequentialLogTrainer(cfg=trainer_cfg, env=env, agents=agent)
+    trainer_cfg["close_environment_at_exit"] = False
+    trainer = SequentialTrainer(cfg=trainer_cfg, env=env, agents=agent)
    # train the agent
    trainer.train()
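
The configuration logic touched by the `train.py` changes above can be sketched without Isaac Lab or skrl installed. This is a minimal, dependency-free approximation: `build_run_config` is a hypothetical helper, `local_rank` stands in for `app_launcher.local_rank`, `rollouts` stands in for `experiment_cfg["agent"]["rollouts"]`, and the initial `timesteps` value is an arbitrary placeholder rather than a value from the repository.

```python
import argparse


def build_run_config(argv, local_rank, rollouts):
    """Hypothetical helper mirroring the CLI and config handling in the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--distributed", action="store_true", default=False)
    parser.add_argument("--max_iterations", type=int, default=None)
    args = parser.parse_args(argv)

    # stand-in for experiment_cfg["trainer"]; 1600 is a placeholder value
    trainer_cfg = {"timesteps": 1600}
    if args.max_iterations:
        # each training iteration consumes `rollouts` environment steps
        trainer_cfg["timesteps"] = args.max_iterations * rollouts
    # keep the environment open after training, as the diff sets explicitly
    trainer_cfg["close_environment_at_exit"] = False

    # the simulation device is only overridden for distributed runs
    device = f"cuda:{local_rank}" if args.distributed else None
    return {"device": device, "trainer": trainer_cfg}


if __name__ == "__main__":
    print(build_run_config(["--distributed", "--max_iterations", "10"], local_rank=1, rollouts=16))
```

In the actual script the resulting dictionary is passed as `cfg` to `SequentialTrainer`; here it is only returned so the values can be inspected.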