Adds optimizations and additional training configs for SB3 (#2022)

# Description

Implements part of https://github.com/isaac-sim/IsaacLab/issues/1769 (optimization).

This is a breaking change because the fast variant is now enabled by default.

I also improved the SB3 training script, fixed the loading of the normalization, and changed the humanoid hyperparameters to be similar to rsl-rl, so that we can compare apples to apples in terms of training speed.

I will probably open another PR for the rest of the proposals.

## Type of change

- Bug fix (non-breaking change which fixes an issue)
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
- This change requires a documentation update

With respect to testing, how do you run a single test? And is there anything I should add?

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [x] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

---------

Signed-off-by: Antonin RAFFIN <antonin.raffin@ensta.org>
Signed-off-by: Kelly Guo <kellyguo123@hotmail.com>
Co-authored-by: Kelly Guo <kellyguo123@hotmail.com>

Commit ad14a674, authored Jun 25, 2025 by Antonin RAFFIN; committed by GitHub Jun 25, 2025
14 changed files
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -40,6 +40,7 @@ Guidelines for modifications:
 * Amr Mousa
 * Andrej Orsula
 * Anton Bjørndahl Mortensen
+* Antonin Raffin
 * Arjun Bhardwaj
 * Ashwin Varghese Kuruttukulam
 * Bikram Pandit

--- a/docs/source/overview/environments.rst
+++ b/docs/source/overview/environments.rst
@@ -884,7 +884,7 @@ Comprehensive List of Environments
    * - Isaac-Velocity-Flat-Unitree-A1-v0
      - Isaac-Velocity-Flat-Unitree-A1-Play-v0
      - Manager Based
-      - **rsl_rl** (PPO), **skrl** (PPO)
+      - **rsl_rl** (PPO), **skrl** (PPO), **sb3** (PPO)
    * - Isaac-Velocity-Flat-Unitree-Go1-v0
      - Isaac-Velocity-Flat-Unitree-Go1-Play-v0
      - Manager Based
@@ -924,7 +924,7 @@ Comprehensive List of Environments
    * - Isaac-Velocity-Rough-Unitree-A1-v0
      - Isaac-Velocity-Rough-Unitree-A1-Play-v0
      - Manager Based
-      - **rsl_rl** (PPO), **skrl** (PPO)
+      - **rsl_rl** (PPO), **skrl** (PPO), **sb3** (PPO)
    * - Isaac-Velocity-Rough-Unitree-Go1-v0
      - Isaac-Velocity-Rough-Unitree-Go1-Play-v0
      - Manager Based

--- a/docs/source/overview/reinforcement-learning/rl_existing_scripts.rst
+++ b/docs/source/overview/reinforcement-learning/rl_existing_scripts.rst
@@ -187,7 +187,7 @@ Stable-Baselines3
 -  Training an agent with
   `Stable-Baselines3 <https://stable-baselines3.readthedocs.io/en/master/index.html>`__
-   on ``Isaac-Cartpole-v0``:
+   on ``Isaac-Velocity-Flat-Unitree-A1-v0``:
   .. tab-set::
      :sync-group: os
@@ -200,14 +200,13 @@ Stable-Baselines3
            # install python module (for stable-baselines3)
            ./isaaclab.sh -i sb3
            # run script for training
-            # note: we set the device to cpu since SB3 doesn't optimize for GPU anyway
+            ./isaaclab.sh -p scripts/reinforcement_learning/sb3/train.py --task Isaac-Velocity-Flat-Unitree-A1-v0 --headless
-            ./isaaclab.sh -p scripts/reinforcement_learning/sb3/train.py --task Isaac-Cartpole-v0 --headless --device cpu
            # run script for playing with 32 environments
-            ./isaaclab.sh -p scripts/reinforcement_learning/sb3/play.py --task Isaac-Cartpole-v0 --num_envs 32 --checkpoint /PATH/TO/model.zip
+            ./isaaclab.sh -p scripts/reinforcement_learning/sb3/play.py --task Isaac-Velocity-Flat-Unitree-A1-v0 --num_envs 32 --checkpoint /PATH/TO/model.zip
            # run script for playing a pre-trained checkpoint with 32 environments
-            ./isaaclab.sh -p scripts/reinforcement_learning/sb3/play.py --task Isaac-Cartpole-v0 --num_envs 32 --use_pretrained_checkpoint
+            ./isaaclab.sh -p scripts/reinforcement_learning/sb3/play.py --task Isaac-Velocity-Flat-Unitree-A1-v0 --num_envs 32 --use_pretrained_checkpoint
            # run script for recording video of a trained agent (requires installing `ffmpeg`)
-            ./isaaclab.sh -p scripts/reinforcement_learning/sb3/play.py --task Isaac-Cartpole-v0 --headless --video --video_length 200
+            ./isaaclab.sh -p scripts/reinforcement_learning/sb3/play.py --task Isaac-Velocity-Flat-Unitree-A1-v0 --headless --video --video_length 200
      .. tab-item:: :icon:`fa-brands fa-windows` Windows
         :sync: windows
@@ -217,14 +216,13 @@ Stable-Baselines3
            :: install python module (for stable-baselines3)
            isaaclab.bat -i sb3
            :: run script for training
-            :: note: we set the device to cpu since SB3 doesn't optimize for GPU anyway
+            isaaclab.bat -p scripts\reinforcement_learning\sb3\train.py --task Isaac-Velocity-Flat-Unitree-A1-v0 --headless
-            isaaclab.bat -p scripts\reinforcement_learning\sb3\train.py --task Isaac-Cartpole-v0 --headless --device cpu
            :: run script for playing with 32 environments
-            isaaclab.bat -p scripts\reinforcement_learning\sb3\play.py --task Isaac-Cartpole-v0 --num_envs 32 --checkpoint /PATH/TO/model.zip
+            isaaclab.bat -p scripts\reinforcement_learning\sb3\play.py --task Isaac-Velocity-Flat-Unitree-A1-v0 --num_envs 32 --checkpoint /PATH/TO/model.zip
            :: run script for playing a pre-trained checkpoint with 32 environments
-            isaaclab.bat -p scripts\reinforcement_learning\sb3\play.py --task Isaac-Cartpole-v0 --num_envs 32 --use_pretrained_checkpoint
+            isaaclab.bat -p scripts\reinforcement_learning\sb3\play.py --task Isaac-Velocity-Flat-Unitree-A1-v0 --num_envs 32 --use_pretrained_checkpoint
            :: run script for recording video of a trained agent (requires installing `ffmpeg`)
-            isaaclab.bat -p scripts\reinforcement_learning\sb3\play.py --task Isaac-Cartpole-v0 --headless --video --video_length 200
+            isaaclab.bat -p scripts\reinforcement_learning\sb3\play.py --task Isaac-Velocity-Flat-Unitree-A1-v0 --headless --video --video_length 200
 All the scripts above log the training progress to `Tensorboard`_ in the ``logs`` directory in the root of
 the repository. The logs directory follows the pattern ``logs/<library>/<task>/<date-time>``, where ``<library>``

--- a/docs/source/overview/reinforcement-learning/rl_frameworks.rst
+++ b/docs/source/overview/reinforcement-learning/rl_frameworks.rst
@@ -71,9 +71,12 @@ Training Performance
 --------------------
 We performed training with each RL library on the same ``Isaac-Humanoid-v0`` environment
-with ``--headless`` on a single RTX 4090 GPU
+with ``--headless`` on a single RTX 4090 GPU using 4096 environments
 and logged the total training time for 65.5M steps for each RL library.
+..
+  Note: SB3 need to be re-run (current number comes from a GeForce RTX 3070)
 +--------------------+-----------------+
 | RL Library         | Time in seconds |
 +====================+=================+
@@ -83,5 +86,5 @@ and logged the total training time for 65.5M steps for each RL library.
 +--------------------+-----------------+
 | RSL RL             | 207             |
 +--------------------+-----------------+
-| Stable-Baselines3  | 6320            |
+| Stable-Baselines3  | 550             |
 +--------------------+-----------------+
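
Note: the timings in the table above can be converted to throughput with simple arithmetic. The sketch below is only illustrative: the dictionary labels and the helper name `steps_per_second` are made up for this note, and the numbers are copied from the table rows shown in this diff. Because the diff comment says the current SB3 figure was measured on a GeForce RTX 3070, and the humanoid hyperparameters were changed in the same commit, the old/new ratio should not be read as a pure wrapper speedup.

```python
# Total environment steps reported in the documentation.
TOTAL_STEPS = 65.5e6

# Wall-clock times in seconds, copied from the table rows in this diff.
timings_s = {
    "RSL RL": 207,
    "Stable-Baselines3 (new row)": 550,
    "Stable-Baselines3 (removed row)": 6320,
}


def steps_per_second(total_steps: float, seconds: float) -> float:
    """Average number of environment steps processed per second."""
    return total_steps / seconds


throughput = {name: steps_per_second(TOTAL_STEPS, t) for name, t in timings_s.items()}

# Ratio of the removed SB3 timing to the new one (hardware and
# hyperparameters differ, so this is not a controlled comparison).
sb3_time_ratio = timings_s["Stable-Baselines3 (removed row)"] / timings_s["Stable-Baselines3 (new row)"]

for name, sps in throughput.items():
    print(f"{name}: {sps:,.0f} steps/s")
print(f"SB3 removed/new time ratio: {sb3_time_ratio:.1f}x")
```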
--- a/scripts/reinforcement_learning/rl_games/play.py
+++ b/scripts/reinforcement_learning/rl_games/play.py
@@ -188,7 +188,7 @@ def main():
                        s[:, dones, :] = 0.0
        if args_cli.video:
            timestep += 1
-            # Exit the play loop after recording one video
+            # exit the play loop after recording one video
            if timestep == args_cli.video_length:
                break

--- a/scripts/reinforcement_learning/sb3/play.py
+++ b/scripts/reinforcement_learning/sb3/play.py
@@ -8,6 +8,7 @@
 """Launch Isaac Sim Simulator first."""
 import argparse
+from pathlib import Path
 from isaaclab.app import AppLauncher
@@ -32,6 +33,12 @@ parser.add_argument(
    help="When no checkpoint provided, use the last saved model. Otherwise use the best saved model.",
 )
 parser.add_argument("--real-time", action="store_true", default=False, help="Run in real-time, if possible.")
+parser.add_argument(
+    "--keep_all_info",
+    action="store_true",
+    default=False,
+    help="Use a slower SB3 wrapper but keep all the extra training info.",
+)
 # append AppLauncher cli args
 AppLauncher.add_app_launcher_args(parser)
 # parse the arguments
@@ -47,7 +54,6 @@ simulation_app = app_launcher.app
 """Rest everything follows."""
 import gymnasium as gym
-import numpy as np
 import os
 import time
 import torch
@@ -57,12 +63,13 @@ from stable_baselines3.common.vec_env import VecNormalize
 from isaaclab.envs import DirectMARLEnv, multi_agent_to_single_agent
 from isaaclab.utils.dict import print_dict
+from isaaclab.utils.io import load_yaml
 from isaaclab.utils.pretrained_checkpoint import get_published_pretrained_checkpoint
 from isaaclab_rl.sb3 import Sb3VecEnvWrapper, process_sb3_cfg
 import isaaclab_tasks  # noqa: F401
-from isaaclab_tasks.utils.parse_cfg import get_checkpoint_path, load_cfg_from_registry, parse_env_cfg
+from isaaclab_tasks.utils.parse_cfg import get_checkpoint_path, parse_env_cfg
 # PLACEHOLDER: Extension template (do not remove this comment)
@@ -73,7 +80,6 @@ def main():
    env_cfg = parse_env_cfg(
        args_cli.task, device=args_cli.device, num_envs=args_cli.num_envs, use_fabric=not args_cli.disable_fabric
    )
-    agent_cfg = load_cfg_from_registry(args_cli.task, "sb3_cfg_entry_point")
    task_name = args_cli.task.split(":")[-1]
@@ -87,6 +93,7 @@ def main():
            print("[INFO] Unfortunately a pre-trained checkpoint is currently unavailable for this task.")
            return
    elif args_cli.checkpoint is None:
+        # FIXME: last checkpoint doesn't seem to really use the last one'
        if args_cli.use_last_checkpoint:
            checkpoint = "model_.*.zip"
        else:
@@ -96,12 +103,14 @@ def main():
        checkpoint_path = args_cli.checkpoint
    log_dir = os.path.dirname(checkpoint_path)
-    # post-process agent configuration
-    agent_cfg = process_sb3_cfg(agent_cfg)
    # create isaac environment
    env = gym.make(args_cli.task, cfg=env_cfg, render_mode="rgb_array" if args_cli.video else None)
+    # load the exact config used for training (instead of the default config)
+    agent_cfg = load_yaml(os.path.join(log_dir, "params", "agent.yaml"))
+    # post-process agent configuration
+    agent_cfg = process_sb3_cfg(agent_cfg, env.unwrapped.num_envs)
    # convert to single-agent instance if required by the RL algorithm
    if isinstance(env.unwrapped, DirectMARLEnv):
        env = multi_agent_to_single_agent(env)
@@ -118,18 +127,25 @@ def main():
        print_dict(video_kwargs, nesting=4)
        env = gym.wrappers.RecordVideo(env, **video_kwargs)
    # wrap around environment for stable baselines
-    env = Sb3VecEnvWrapper(env)
+    env = Sb3VecEnvWrapper(env, fast_variant=not args_cli.keep_all_info)
+    vec_norm_path = checkpoint_path.replace("/model", "/model_vecnormalize").replace(".zip", ".pkl")
+    vec_norm_path = Path(vec_norm_path)
    # normalize environment (if needed)
-    if "normalize_input" in agent_cfg:
+    if vec_norm_path.exists():
+        print(f"Loading saved normalization: {vec_norm_path}")
+        env = VecNormalize.load(vec_norm_path, env)
+        #  do not update them at test time
+        env.training = False
+        # reward normalization is not needed at test time
+        env.norm_reward = False
+    elif "normalize_input" in agent_cfg:
        env = VecNormalize(
            env,
            training=True,
            norm_obs="normalize_input" in agent_cfg and agent_cfg.pop("normalize_input"),
-            norm_reward="normalize_value" in agent_cfg and agent_cfg.pop("normalize_value"),
            clip_obs="clip_obs" in agent_cfg and agent_cfg.pop("clip_obs"),
-            gamma=agent_cfg["gamma"],
-            clip_reward=np.inf,
        )
    # create agent from stable baselines
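
Note: the normalization file is located by two chained `str.replace` calls on the checkpoint path. The sketch below mirrors those two calls in a standalone helper to show which paths they produce. The helper name `vecnormalize_path`, the run directory, and the step count are hypothetical; the `model_<N>_steps.zip` naming is an assumption based on `CheckpointCallback` with `name_prefix="model"`.

```python
def vecnormalize_path(checkpoint_path: str) -> str:
    """Mirror the two str.replace calls used in play.py (illustrative helper)."""
    return checkpoint_path.replace("/model", "/model_vecnormalize").replace(".zip", ".pkl")


# Hypothetical run directory following the logs/<library>/<task>/<date-time> pattern.
run_dir = "logs/sb3/Isaac-Velocity-Flat-Unitree-A1-v0/2025-06-25_12-00-00"

final_model = f"{run_dir}/model.zip"
intermediate = f"{run_dir}/model_4096000_steps.zip"  # assumed checkpoint naming

print(vecnormalize_path(final_model))
print(vecnormalize_path(intermediate))
```

With these inputs, only the final `model.zip` maps onto `model_vecnormalize.pkl`, which is the single file `train.py` writes at the end of training. If that reading is right, playing an intermediate checkpoint would not find a saved normalization file and would fall through to the `elif` branch. The string match on `"/model"` also assumes forward slashes in the path.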

--- a/scripts/reinforcement_learning/sb3/train.py
+++ b/scripts/reinforcement_learning/sb3/train.py
@@ -3,17 +3,16 @@
 #
 # SPDX-License-Identifier: BSD-3-Clause
-"""Script to train RL agent with Stable Baselines3.
-Since Stable-Baselines3 does not support buffers living on GPU directly,
+"""Script to train RL agent with Stable Baselines3."""
-we recommend using smaller number of environments. Otherwise,
-there will be significant overhead in GPU->CPU transfer.
-"""
 """Launch Isaac Sim Simulator first."""
 import argparse
+import contextlib
+import signal
 import sys
+from pathlib import Path
 from isaaclab.app import AppLauncher
@@ -25,7 +24,14 @@ parser.add_argument("--video_interval", type=int, default=2000, help="Interval b
 parser.add_argument("--num_envs", type=int, default=None, help="Number of environments to simulate.")
 parser.add_argument("--task", type=str, default=None, help="Name of the task.")
 parser.add_argument("--seed", type=int, default=None, help="Seed used for the environment")
+parser.add_argument("--log_interval", type=int, default=100_000, help="Log data every n timesteps.")
 parser.add_argument("--max_iterations", type=int, default=None, help="RL Policy training iterations.")
+parser.add_argument(
+    "--keep_all_info",
+    action="store_true",
+    default=False,
+    help="Use a slower SB3 wrapper but keep all the extra training info.",
+)
 # append AppLauncher cli args
 AppLauncher.add_app_launcher_args(parser)
 # parse the arguments
@@ -41,6 +47,24 @@ sys.argv = [sys.argv[0]] + hydra_args
 app_launcher = AppLauncher(args_cli)
 simulation_app = app_launcher.app
+def cleanup_pbar(*args):
+    """
+    A small helper to stop training and
+    cleanup progress bar properly on ctrl+c
+    """
+    import gc
+    tqdm_objects = [obj for obj in gc.get_objects() if "tqdm" in type(obj).__name__]
+    for tqdm_object in tqdm_objects:
+        if "tqdm_rich" in type(tqdm_object).__name__:
+            tqdm_object.close()
+    raise KeyboardInterrupt
+# disable KeyboardInterrupt override
+signal.signal(signal.SIGINT, cleanup_pbar)
 """Rest everything follows."""
 import gymnasium as gym
@@ -50,8 +74,7 @@ import random
 from datetime import datetime
 from stable_baselines3 import PPO
-from stable_baselines3.common.callbacks import CheckpointCallback
+from stable_baselines3.common.callbacks import CheckpointCallback, LogEveryNTimesteps
-from stable_baselines3.common.logger import configure
 from stable_baselines3.common.vec_env import VecNormalize
 from isaaclab.envs import (
@@ -104,8 +127,12 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
    dump_pickle(os.path.join(log_dir, "params", "env.pkl"), env_cfg)
    dump_pickle(os.path.join(log_dir, "params", "agent.pkl"), agent_cfg)
+    # save command used to run the script
+    command = " ".join(sys.orig_argv)
+    (Path(log_dir) / "command.txt").write_text(command)
    # post-process agent configuration
-    agent_cfg = process_sb3_cfg(agent_cfg)
+    agent_cfg = process_sb3_cfg(agent_cfg, env_cfg.scene.num_envs)
    # read configurations about the agent-training
    policy_arch = agent_cfg.pop("policy")
    n_timesteps = agent_cfg.pop("n_timesteps")
@@ -130,31 +157,49 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
        env = gym.wrappers.RecordVideo(env, **video_kwargs)
    # wrap around environment for stable baselines
-    env = Sb3VecEnvWrapper(env)
+    env = Sb3VecEnvWrapper(env, fast_variant=not args_cli.keep_all_info)
+    norm_keys = {"normalize_input", "normalize_value", "clip_obs"}
+    norm_args = {}
+    for key in norm_keys:
+        if key in agent_cfg:
+            norm_args[key] = agent_cfg.pop(key)
-    if "normalize_input" in agent_cfg:
+    if norm_args and norm_args.get("normalize_input"):
+        print(f"Normalizing input, {norm_args=}")
        env = VecNormalize(
            env,
            training=True,
-            norm_obs="normalize_input" in agent_cfg and agent_cfg.pop("normalize_input"),
+            norm_obs=norm_args["normalize_input"],
-            norm_reward="normalize_value" in agent_cfg and agent_cfg.pop("normalize_value"),
+            norm_reward=norm_args.get("normalize_value", False),
-            clip_obs="clip_obs" in agent_cfg and agent_cfg.pop("clip_obs"),
+            clip_obs=norm_args.get("clip_obs", 100.0),
            gamma=agent_cfg["gamma"],
            clip_reward=np.inf,
        )
    # create agent from stable baselines
-    agent = PPO(policy_arch, env, verbose=1, **agent_cfg)
+    agent = PPO(policy_arch, env, verbose=1, tensorboard_log=log_dir, **agent_cfg)
-    # configure the logger
-    new_logger = configure(log_dir, ["stdout", "tensorboard"])
-    agent.set_logger(new_logger)
    # callbacks for agent
    checkpoint_callback = CheckpointCallback(save_freq=1000, save_path=log_dir, name_prefix="model", verbose=2)
+    callbacks = [checkpoint_callback, LogEveryNTimesteps(n_steps=args_cli.log_interval)]
    # train the agent
-    agent.learn(total_timesteps=n_timesteps, callback=checkpoint_callback)
+    with contextlib.suppress(KeyboardInterrupt):
+        agent.learn(
+            total_timesteps=n_timesteps,
+            callback=callbacks,
+            progress_bar=True,
+            log_interval=None,
+        )
    # save the final model
    agent.save(os.path.join(log_dir, "model"))
+    print("Saving to:")
+    print(os.path.join(log_dir, "model.zip"))
+    if isinstance(env, VecNormalize):
+        print("Saving normalization")
+        env.save(os.path.join(log_dir, "model_vecnormalize.pkl"))
    # close the simulator
    env.close()
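
Note: the new training code first pops the normalization keys out of the agent config and only then decides whether to wrap the environment, so those keys never reach `PPO(**agent_cfg)`. The sketch below reproduces that splitting logic on a plain dictionary without importing SB3. The helper name `split_norm_args` is hypothetical, and the sample values are taken from the A1 config added in this commit.

```python
def split_norm_args(agent_cfg: dict) -> dict:
    """Pop normalization-related keys out of the agent config (illustrative helper)."""
    norm_keys = ("normalize_input", "normalize_value", "clip_obs")
    return {key: agent_cfg.pop(key) for key in norm_keys if key in agent_cfg}


# Subset of the A1 hyperparameters from this commit.
agent_cfg = {
    "gamma": 0.99,
    "n_epochs": 5,
    "normalize_input": True,
    "normalize_value": False,
    "clip_obs": 10.0,
}

norm_args = split_norm_args(agent_cfg)

# Same condition as in train.py: wrap only when input normalization is requested.
use_vec_normalize = bool(norm_args and norm_args.get("normalize_input"))

# Keyword arguments that would be forwarded to VecNormalize, with the same defaults.
vec_normalize_kwargs = {
    "norm_obs": norm_args.get("normalize_input", False),
    "norm_reward": norm_args.get("normalize_value", False),
    "clip_obs": norm_args.get("clip_obs", 100.0),
    "gamma": agent_cfg["gamma"],
}

print(use_vec_normalize, vec_normalize_kwargs)
print(sorted(agent_cfg))  # remaining keys, safe to pass on to the algorithm
```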

--- a/source/isaaclab_rl/config/extension.toml
+++ b/source/isaaclab_rl/config/extension.toml
 [package]
 # Note: Semantic Versioning is used: https://semver.org/
-version = "0.1.4"
+version = "0.1.5"
 # Description
 title = "Isaac Lab RL"

--- a/source/isaaclab_rl/docs/CHANGELOG.rst
+++ b/source/isaaclab_rl/docs/CHANGELOG.rst
 Changelog
 ---------
+0.1.5 (2025-04-11)
+~~~~~~~~~~~~~~~~~~
+Changed
+^^^^^^^
+* Optimized Stable-Baselines3 wrapper ``Sb3VecEnvWrapper`` (now 4x faster) by using Numpy buffers and only logging episode and truncation information by default.
+* Upgraded minimum SB3 version to 2.6.0 and added optional dependencies for progress bar
 0.1.4 (2025-04-10)
 ~~~~~~~~~~~~~~~~~~

--- a/source/isaaclab_rl/isaaclab_rl/sb3.py
+++ b/source/isaaclab_rl/isaaclab_rl/sb3.py
@@ -22,6 +22,7 @@ import gymnasium as gym
 import numpy as np
 import torch
 import torch.nn as nn  # noqa: F401
+import warnings
 from typing import Any
 from stable_baselines3.common.utils import constant_fn
@@ -29,16 +30,20 @@ from stable_baselines3.common.vec_env.base_vec_env import VecEnv, VecEnvObs, Vec
 from isaaclab.envs import DirectRLEnv, ManagerBasedRLEnv
+# remove SB3 warnings because PPO with bigger net actually benefits from GPU
+warnings.filterwarnings("ignore", message="You are trying to run PPO on the GPU")
 """
 Configuration Parser.
 """
-def process_sb3_cfg(cfg: dict) -> dict:
+def process_sb3_cfg(cfg: dict, num_envs: int) -> dict:
    """Convert simple YAML types to Stable-Baselines classes/components.
    Args:
        cfg: A configuration dictionary.
+        num_envs: the number of parallel environments (used to compute `batch_size` for a desired number of minibatches)
    Returns:
        A dictionary containing the converted configuration.
@@ -54,19 +59,24 @@ def process_sb3_cfg(cfg: dict) -> dict:
            else:
                if key in ["policy_kwargs", "replay_buffer_class", "replay_buffer_kwargs"]:
                    hyperparams[key] = eval(value)
-                elif key in ["learning_rate", "clip_range", "clip_range_vf", "delta_std"]:
+                elif key in ["learning_rate", "clip_range", "clip_range_vf"]:
                    if isinstance(value, str):
                        _, initial_value = value.split("_")
                        initial_value = float(initial_value)
                        hyperparams[key] = lambda progress_remaining: progress_remaining * initial_value
                    elif isinstance(value, (float, int)):
-                        # Negative value: ignore (ex: for clipping)
+                        # negative value: ignore (ex: for clipping)
                        if value < 0:
                            continue
                        hyperparams[key] = constant_fn(float(value))
                    else:
                        raise ValueError(f"Invalid value for {key}: {hyperparams[key]}")
+        # Convert to a desired batch_size (n_steps=2048 by default for SB3 PPO)
+        if "n_minibatches" in hyperparams:
+            hyperparams["batch_size"] = (hyperparams.get("n_steps", 2048) * num_envs) // hyperparams["n_minibatches"]
+            del hyperparams["n_minibatches"]
        return hyperparams
    # parse agent configuration and convert to classes
@@ -89,8 +99,8 @@ class Sb3VecEnvWrapper(VecEnv):
    Note:
        While Stable-Baselines3 supports Gym 0.26+ API, their vectorized environment
-        still uses the old API (i.e. it is closer to Gym 0.21). Thus, we implement
+        uses their own API (i.e. it is closer to Gym 0.21). Thus, we implement
-        the old API for the vectorized environment.
+        the API for the vectorized environment.
    We also add monitoring functionality that computes the un-discounted episode
    return and length. This information is added to the info dicts under key `episode`.
@@ -123,12 +133,13 @@ class Sb3VecEnvWrapper(VecEnv):
    """
-    def __init__(self, env: ManagerBasedRLEnv | DirectRLEnv):
+    def __init__(self, env: ManagerBasedRLEnv | DirectRLEnv, fast_variant: bool = True):
        """Initialize the wrapper.
        Args:
            env: The environment to wrap around.
+            fast_variant: Use fast variant for processing info
+                (Only episodic reward, lengths and truncation info are included)
        Raises:
            ValueError: When the environment is not an instance of :class:`ManagerBasedRLEnv` or :class:`DirectRLEnv`.
        """
@@ -140,6 +151,7 @@ class Sb3VecEnvWrapper(VecEnv):
            )
        # initialize the wrapper
        self.env = env
+        self.fast_variant = fast_variant
        # collect common information
        self.num_envs = self.unwrapped.num_envs
        self.sim_device = self.unwrapped.device
@@ -156,8 +168,8 @@ class Sb3VecEnvWrapper(VecEnv):
        # initialize vec-env
        VecEnv.__init__(self, self.num_envs, observation_space, action_space)
        # add buffer for logging episodic information
-        self._ep_rew_buf = torch.zeros(self.num_envs, device=self.sim_device)
+        self._ep_rew_buf = np.zeros(self.num_envs)
-        self._ep_len_buf = torch.zeros(self.num_envs, device=self.sim_device)
+        self._ep_len_buf = np.zeros(self.num_envs)
    def __str__(self):
        """Returns the wrapper name and the :attr:`env` representation string."""
@@ -190,11 +202,11 @@ class Sb3VecEnvWrapper(VecEnv):
    def get_episode_rewards(self) -> list[float]:
        """Returns the rewards of all the episodes."""
-        return self._ep_rew_buf.cpu().tolist()
+        return self._ep_rew_buf.tolist()
    def get_episode_lengths(self) -> list[int]:
        """Returns the number of time-steps of all the episodes."""
-        return self._ep_len_buf.cpu().tolist()
+        return self._ep_len_buf.tolist()
    """
    Operations - MDP
@@ -206,8 +218,8 @@ class Sb3VecEnvWrapper(VecEnv):
    def reset(self) -> VecEnvObs:  # noqa: D102
        obs_dict, _ = self.env.reset()
        # reset episodic information buffers
-        self._ep_rew_buf.zero_()
+        self._ep_rew_buf = np.zeros(self.num_envs)
-        self._ep_len_buf.zero_()
+        self._ep_len_buf = np.zeros(self.num_envs)
        # convert data types to numpy depending on backend
        return self._process_obs(obs_dict)
@@ -224,28 +236,30 @@ class Sb3VecEnvWrapper(VecEnv):
    def step_wait(self) -> VecEnvStepReturn:  # noqa: D102
        # record step information
        obs_dict, rew, terminated, truncated, extras = self.env.step(self._async_actions)
-        # update episode un-discounted return and length
-        self._ep_rew_buf += rew
-        self._ep_len_buf += 1
        # compute reset ids
        dones = terminated | truncated
-        reset_ids = (dones > 0).nonzero(as_tuple=False)
        # convert data types to numpy depending on backend
        # note: ManagerBasedRLEnv uses torch backend (by default).
        obs = self._process_obs(obs_dict)
-        rew = rew.detach().cpu().numpy()
+        rewards = rew.detach().cpu().numpy()
        terminated = terminated.detach().cpu().numpy()
        truncated = truncated.detach().cpu().numpy()
        dones = dones.detach().cpu().numpy()
+        reset_ids = dones.nonzero()[0]
+        # update episode un-discounted return and length
+        self._ep_rew_buf += rewards
+        self._ep_len_buf += 1
        # convert extra information to list of dicts
        infos = self._process_extras(obs, terminated, truncated, extras, reset_ids)
        # reset info for terminated environments
-        self._ep_rew_buf[reset_ids] = 0
+        self._ep_rew_buf[reset_ids] = 0.0
        self._ep_len_buf[reset_ids] = 0
-        return obs, rew, dones, infos
+        return obs, rewards, dones, infos
    def close(self):  # noqa: D102
        self.env.close()
@@ -279,7 +293,8 @@ class Sb3VecEnvWrapper(VecEnv):
            return env_method(*method_args, indices=indices, **method_kwargs)
    def env_is_wrapped(self, wrapper_class, indices=None):  # noqa: D102
-        raise NotImplementedError("Checking if environment is wrapped is not supported.")
+        # fake implementation to be able to use `evaluate_policy()` helper
+        return [False]
    def get_images(self):  # noqa: D102
        raise NotImplementedError("Getting images is not supported.")
@@ -306,6 +321,29 @@ class Sb3VecEnvWrapper(VecEnv):
        self, obs: np.ndarray, terminated: np.ndarray, truncated: np.ndarray, extras: dict, reset_ids: np.ndarray
    ) -> list[dict[str, Any]]:
        """Convert miscellaneous information into dictionary for each sub-environment."""
+        # faster version: only process env that terminated and add bootstrapping info
+        if self.fast_variant:
+            infos = [{} for _ in range(self.num_envs)]
+            for idx in reset_ids:
+                # fill-in episode monitoring info
+                infos[idx]["episode"] = {
+                    "r": self._ep_rew_buf[idx],
+                    "l": self._ep_len_buf[idx],
+                }
+                # fill-in bootstrap information
+                infos[idx]["TimeLimit.truncated"] = truncated[idx] and not terminated[idx]
+                # add information about terminal observation separately
+                if isinstance(obs, dict):
+                    terminal_obs = {key: value[idx] for key, value in obs.items()}
+                else:
+                    terminal_obs = obs[idx]
+                infos[idx]["terminal_observation"] = terminal_obs
+            return infos
        # create empty list of dictionaries to fill
        infos: list[dict[str, Any]] = [dict.fromkeys(extras.keys()) for _ in range(self.num_envs)]
        # fill-in information for each sub-environment
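
Note: the fast variant only builds info entries for environments that finished on the current step. The NumPy-only sketch below is a simplified stand-in for that branch, not the wrapper itself: the function name `build_fast_infos` is hypothetical, it takes the episode buffers as arguments instead of reading them from `self`, and it handles array observations only (the dict-observation case is omitted).

```python
import numpy as np


def build_fast_infos(obs, terminated, truncated, ep_rew, ep_len):
    """Build per-environment info dicts, filling only the environments that are done."""
    dones = terminated | truncated
    reset_ids = dones.nonzero()[0]
    infos = [{} for _ in range(len(dones))]
    for idx in reset_ids:
        # episode monitoring information
        infos[idx]["episode"] = {"r": ep_rew[idx], "l": ep_len[idx]}
        # bootstrap only when the episode was cut by a time limit
        infos[idx]["TimeLimit.truncated"] = bool(truncated[idx] and not terminated[idx])
        # observation at the end of the episode
        infos[idx]["terminal_observation"] = obs[idx]
    return infos


# Four environments: env 1 terminated, env 3 hit the time limit.
obs = np.arange(8, dtype=np.float32).reshape(4, 2)
terminated = np.array([False, True, False, False])
truncated = np.array([False, False, False, True])
ep_rew = np.array([1.0, 2.5, 0.5, 4.0])
ep_len = np.array([10.0, 20.0, 5.0, 40.0])

infos = build_fast_infos(obs, terminated, truncated, ep_rew, ep_len)
for env_id, info in enumerate(infos):
    print(env_id, sorted(info))
```

Environments that are still running keep an empty dict, which is where the saving over the slower per-key loop comes from.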

--- a/source/isaaclab_rl/setup.py
+++ b/source/isaaclab_rl/setup.py
@@ -41,7 +41,7 @@ PYTORCH_INDEX_URL = ["https://download.pytorch.org/whl/cu118"]
 # Extra dependencies for RL agents
 EXTRAS_REQUIRE = {
-    "sb3": ["stable-baselines3>=2.1"],
+    "sb3": ["stable-baselines3>=2.6", "tqdm", "rich"],  # tqdm/rich for progress bar
    "skrl": ["skrl>=1.4.2"],
    "rl-games": ["rl-games==1.6.1", "gym"],  # rl-games still needs gym :(
    "rsl-rl": ["rsl-rl-lib==2.3.3"],

--- a/source/isaaclab_tasks/isaaclab_tasks/manager_based/classic/humanoid/agents/sb3_ppo_cfg.yaml
+++ b/source/isaaclab_tasks/isaaclab_tasks/manager_based/classic/humanoid/agents/sb3_ppo_cfg.yaml
-# Reference: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/ppo.yml#L245
+# Adapted from rsl_rl config
 seed: 42
+policy: "MlpPolicy"
-policy: 'MlpPolicy'
 n_timesteps: !!float 5e7
-batch_size: 256
+# For 4 minibatches with 4096 envs
-n_steps: 512
+# batch_size = (n_envs * n_steps) / n_minibatches = 32768
+n_minibatches: 4
+n_steps: 32
 gamma: 0.99
-learning_rate: !!float 2.5e-4
+learning_rate: !!float 5e-4
 ent_coef: 0.0
 clip_range: 0.2
-n_epochs: 10
+n_epochs: 5
 gae_lambda: 0.95
 max_grad_norm: 1.0
 vf_coef: 0.5
-device: "cuda:0"
 policy_kwargs: "dict(
-                  log_std_init=-1,
+  activation_fn=nn.ELU,
-                  ortho_init=False,
+  net_arch=[400, 200, 100],
-                  activation_fn=nn.ReLU,
+  optimizer_kwargs=dict(eps=1e-8),
-                  net_arch=dict(pi=[256, 256], vf=[256, 256])
+  ortho_init=False,
-                )"
+  )"
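
Note: the `n_minibatches` key is converted to a `batch_size` by `process_sb3_cfg` using integer division. The sketch below restates that formula as a standalone function (the name `batch_size_from_minibatches` is hypothetical) and checks it against the values quoted in the humanoid and A1 config comments, assuming 4096 environments as those comments do.

```python
def batch_size_from_minibatches(n_steps: int, num_envs: int, n_minibatches: int) -> int:
    """Rollout size (n_steps * num_envs) split into n_minibatches, rounded down."""
    return (n_steps * num_envs) // n_minibatches


# Humanoid config: n_steps=32, n_minibatches=4
humanoid_batch = batch_size_from_minibatches(n_steps=32, num_envs=4096, n_minibatches=4)

# A1 config: n_steps=24, n_minibatches=4
a1_batch = batch_size_from_minibatches(n_steps=24, num_envs=4096, n_minibatches=4)

# The batch size scales with the number of environments passed on the command line.
small_run_batch = batch_size_from_minibatches(n_steps=32, num_envs=1024, n_minibatches=4)

print(humanoid_batch, a1_batch, small_run_batch)
```

Because the batch size is derived from `num_envs`, running with a different `--num_envs` keeps the number of minibatches fixed rather than the batch size.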
--- a/source/isaaclab_tasks/isaaclab_tasks/manager_based/locomotion/velocity/config/a1/__init__.py
+++ b/source/isaaclab_tasks/isaaclab_tasks/manager_based/locomotion/velocity/config/a1/__init__.py
@@ -19,6 +19,7 @@ gym.register(
        "env_cfg_entry_point": f"{__name__}.flat_env_cfg:UnitreeA1FlatEnvCfg",
        "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:UnitreeA1FlatPPORunnerCfg",
        "skrl_cfg_entry_point": f"{agents.__name__}:skrl_flat_ppo_cfg.yaml",
+        "sb3_cfg_entry_point": f"{agents.__name__}:sb3_ppo_cfg.yaml",
    },
 )
@@ -30,6 +31,7 @@ gym.register(
        "env_cfg_entry_point": f"{__name__}.flat_env_cfg:UnitreeA1FlatEnvCfg_PLAY",
        "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:UnitreeA1FlatPPORunnerCfg",
        "skrl_cfg_entry_point": f"{agents.__name__}:skrl_flat_ppo_cfg.yaml",
+        "sb3_cfg_entry_point": f"{agents.__name__}:sb3_ppo_cfg.yaml",
    },
 )
@@ -41,6 +43,7 @@ gym.register(
        "env_cfg_entry_point": f"{__name__}.rough_env_cfg:UnitreeA1RoughEnvCfg",
        "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:UnitreeA1RoughPPORunnerCfg",
        "skrl_cfg_entry_point": f"{agents.__name__}:skrl_rough_ppo_cfg.yaml",
+        "sb3_cfg_entry_point": f"{agents.__name__}:sb3_ppo_cfg.yaml",
    },
 )
@@ -52,5 +55,6 @@ gym.register(
        "env_cfg_entry_point": f"{__name__}.rough_env_cfg:UnitreeA1RoughEnvCfg_PLAY",
        "rsl_rl_cfg_entry_point": f"{agents.__name__}.rsl_rl_ppo_cfg:UnitreeA1RoughPPORunnerCfg",
        "skrl_cfg_entry_point": f"{agents.__name__}:skrl_rough_ppo_cfg.yaml",
+        "sb3_cfg_entry_point": f"{agents.__name__}:sb3_ppo_cfg.yaml",
    },
 )
--- a/source/isaaclab_tasks/isaaclab_tasks/manager_based/locomotion/velocity/config/a1/agents/sb3_ppo_cfg.yaml
+++ b/source/isaaclab_tasks/isaaclab_tasks/manager_based/locomotion/velocity/config/a1/agents/sb3_ppo_cfg.yaml
+# Adapted from rsl_rl config
+seed: 42
+n_timesteps: !!float 5e7
+policy: 'MlpPolicy'
+n_steps: 24
+n_minibatches: 4  # batch_size=24576 for n_envs=4096 and n_steps=24
+gae_lambda: 0.95
+gamma: 0.99
+n_epochs: 5
+ent_coef: 0.005
+learning_rate: !!float 1e-3
+clip_range: !!float 0.2
+policy_kwargs: "dict(
+                  activation_fn=nn.ELU,
+                  net_arch=[512, 256, 128],
+                  optimizer_kwargs=dict(eps=1e-8),
+                  ortho_init=False,
+                )"
+vf_coef: 1.0
+max_grad_norm: 1.0
+normalize_input: True
+normalize_value: False
+clip_obs: 10.0