Adds PBT algorithm to rl games (#3399)

# Description

This PR introduces the Population Based Training (PBT) algorithm originally implemented in Petrenko, Aleksei, et al. "Dexpbt: Scaling up dexterous manipulation for hand-arm systems with population based training." arXiv preprint arXiv:2305.12127 (2023).

PBT offers an alternative way to scale training when increasing the number of environments yields only marginal gains. It borrows the idea of natural selection and exploits the stochasticity of RL training: the top-performing agents are always kept, while weak agents are replaced with copies of top performers. This helps overcome catastrophic failures and improves exploration.

Training view: underperformers are rescued by the best performers, later surpass them, and become the best performers themselves.

<img width="1078" height="509" alt="Screenshot from 2025-09-09 00-55-11" src="https://github.com/user-attachments/assets/34434bf1-5cb6-4956-a344-49c9969d4861" />

Note: PBT is still in beta and has the following limitations:

1. In theory it can work with any RL algorithm, but the current implementation only works with rl-games.
2. The API could be further simplified so that `num_policies` and `policy_idx` need not be passed explicitly, which would allow a dynamic `max_population`; this is left for future work.

## Screenshots

Please attach before and after screenshots of the change if applicable.

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [x] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Unverified Commit 40c8d16d authored Sep 09, 2025 by ooctipus Committed by GitHub Sep 09, 2025
12 changed files
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -116,6 +116,7 @@ Table of Contents

   source/features/hydra
   source/features/multi_gpu
+   source/features/population_based_training
   Tiled Rendering</source/overview/core-concepts/sensors/camera>
   source/features/ray
   source/features/reproducibility

--- a/docs/source/features/population_based_training.rst
+++ b/docs/source/features/population_based_training.rst
+Population Based Training
+=========================
+
+What PBT Does
+-------------
+
+* Trains *N* policies in parallel (a "population") on the **same task**.
+* Every ``interval_steps``:
+
+  #. Save each policy's checkpoint and objective.
+  #. Score the population and identify **leaders** and **underperformers**.
+  #. For each underperformer, replace its weights with those of a random leader and **mutate** selected hyperparameters.
+  #. Restart that process with the new weights/params automatically.
+
+Leader / Underperformer Selection
+---------------------------------
+
+Let ``o_i`` be each initialized policy's objective, with mean ``μ`` and std ``σ``.
+
+Upper and lower performance cuts are::
+
+  upper_cut = max(μ + threshold_std * σ, μ + threshold_abs)
+  lower_cut = min(μ - threshold_std * σ, μ - threshold_abs)
+
+* **Leaders**: ``o_i > upper_cut``
+* **Underperformers**: ``o_i < lower_cut``
+
+The "Natural-Selection" rules:
+
+1. Only underperformers are acted on (mutated or replaced).
+2. If leaders exist, replace an underperformer with a random leader; otherwise, self-mutate.
+
+Mutation (Hyperparameters)
+--------------------------
+
+* Each param has a mutation function (e.g., ``mutate_float``, ``mutate_discount``, etc.).
+* A param is mutated with probability ``mutation_rate``.
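+* When mutated, its value is multiplied or divided by a random factor drawn from ``change_range = (min, max)``;
+  ``mutate_discount`` instead scales ``1 - x`` by a fixed factor in ``[1.1, 1.2]``.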
+* Only whitelisted keys (from the PBT config) are considered.
+
+Example Config
+--------------
+
+.. code-block:: yaml
+
+   pbt:
+     enabled: True
+     policy_idx: 0
+     num_policies: 8
+     directory: .
+     workspace: "pbt_workspace"
+     objective: Curriculum/difficulty_level
+     interval_steps: 50000000
+     threshold_std: 0.1
+     threshold_abs: 0.025
+     mutation_rate: 0.25
+     change_range: [1.1, 2.0]
+     mutation:
+       agent.params.config.learning_rate: "mutate_float"
+       agent.params.config.grad_norm: "mutate_float"
+       agent.params.config.entropy_coef: "mutate_float"
+       agent.params.config.critic_coef: "mutate_float"
+       agent.params.config.bounds_loss_coef: "mutate_float"
+       agent.params.config.kl_threshold: "mutate_float"
+       agent.params.config.gamma: "mutate_discount"
+       agent.params.config.tau: "mutate_discount"
+
+
+``objective: Curriculum/difficulty_level`` uses ``infos["episode"]["Curriculum/difficulty_level"]`` as the scalar to
+**rank policies** (higher is better).  With ``num_policies: 8``, launch eight processes sharing the same ``workspace``
+and unique ``policy_idx`` (0-7).
+
+
+Launching PBT
+-------------
+
+You must start **one process per policy** and point them to the **same workspace**. Set a unique
+``policy_idx`` for each process and the common ``num_policies``.
+
+Minimal flags you need:
+
+* ``agent.pbt.enabled=True``
+* ``agent.pbt.workspace=<path/to/shared_folder>``
+* ``agent.pbt.policy_idx=<0..num_policies-1>``
+* ``agent.pbt.num_policies=<N>``
+
+.. note::
+   All processes must use the same ``agent.pbt.workspace`` so they can see each other's checkpoints.
+
+.. caution::
+   PBT is currently supported **only** with the **rl_games** library. Other RL libraries are not supported yet.
+
+Tips
+----
+
+* Keep checkpoints fast: reduce ``interval_steps`` only if you really need tighter PBT cadence.
+* Running 6 or more workers is recommended to see the benefit of PBT.
+
+
+References
+----------
+
+This PBT implementation is a reimplementation inspired by *Dexpbt: Scaling up dexterous manipulation for hand-arm systems with population based training* (Petrenko et al., 2023).
+
+.. code-block:: bibtex
+
+   @article{petrenko2023dexpbt,
+     title={Dexpbt: Scaling up dexterous manipulation for hand-arm systems with population based training},
+     author={Petrenko, Aleksei and Allshire, Arthur and State, Gavriel and Handa, Ankur and Makoviychuk, Viktor},
+     journal={arXiv preprint arXiv:2305.12127},
+     year={2023}
+   }
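
The leader/underperformer rules documented in `population_based_training.rst` above can be sketched in plain Python. This is a minimal illustration, not the code in `pbt.py`: the function names `classify_population` and `plan_replacements` are hypothetical, and using the population standard deviation is an assumption, since the documentation does not say which estimator is used.

```python
import random
import statistics


def classify_population(objectives, threshold_std=0.1, threshold_abs=0.025):
    """Return (leaders, underperformers) as lists of policy indices."""
    mu = statistics.fmean(objectives)
    # Assumption: population standard deviation over all policies.
    sigma = statistics.pstdev(objectives)
    upper_cut = max(mu + threshold_std * sigma, mu + threshold_abs)
    lower_cut = min(mu - threshold_std * sigma, mu - threshold_abs)
    leaders = [i for i, o in enumerate(objectives) if o > upper_cut]
    underperformers = [i for i, o in enumerate(objectives) if o < lower_cut]
    return leaders, underperformers


def plan_replacements(leaders, underperformers, rng):
    """Map each underperformer to the policy whose weights it restarts from.

    With at least one leader, a random leader is picked; otherwise the
    underperformer maps to itself, meaning it only self-mutates.
    """
    return {idx: (rng.choice(leaders) if leaders else idx) for idx in underperformers}


# Hypothetical objective values for a population of eight policies.
scores = [0.10, 0.12, 0.50, 0.48, 0.30, 0.31, 0.29, 0.05]
leaders, underperformers = classify_population(scores)
plan = plan_replacements(leaders, underperformers, random.Random(0))
print("leaders:", leaders)
print("underperformers:", underperformers)
print("restart plan:", plan)
```

With these scores the absolute margin dominates (`threshold_std * σ` is about 0.016, below `threshold_abs = 0.025`), so policies 2 through 5 are leaders and policies 0, 1, and 7 are underperformers. Policy 6 sits between the two cuts and is left untouched.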
--- a/scripts/reinforcement_learning/rl_games/train.py
+++ b/scripts/reinforcement_learning/rl_games/train.py
@@ -81,7 +81,7 @@ from isaaclab.utils.assets import retrieve_file_path
 from isaaclab.utils.dict import print_dict
 from isaaclab.utils.io import dump_pickle, dump_yaml

-from isaaclab_rl.rl_games import RlGamesGpuEnv, RlGamesVecEnvWrapper
+from isaaclab_rl.rl_games import MultiObserver, PbtAlgoObserver, RlGamesGpuEnv, RlGamesVecEnvWrapper

 import isaaclab_tasks  # noqa: F401
 from isaaclab_tasks.utils.hydra import hydra_task_config
@@ -127,7 +127,12 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
    # specify directory for logging experiments
    config_name = agent_cfg["params"]["config"]["name"]
    log_root_path = os.path.join("logs", "rl_games", config_name)
-    log_root_path = os.path.abspath(log_root_path)
+    if "pbt" in agent_cfg:
+        if agent_cfg["pbt"]["directory"] == ".":
+            log_root_path = os.path.abspath(log_root_path)
+        else:
+            log_root_path = os.path.join(agent_cfg["pbt"]["directory"], log_root_path)
+
    print(f"[INFO] Logging experiment in directory: {log_root_path}")
    # specify directory for logging runs
    log_dir = agent_cfg["params"]["config"].get("full_experiment_name", datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
@@ -192,7 +197,13 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
    # set number of actors into agent config
    agent_cfg["params"]["config"]["num_actors"] = env.unwrapped.num_envs
    # create runner from rl-games
-    runner = Runner(IsaacAlgoObserver())
+
+    if "pbt" in agent_cfg and agent_cfg["pbt"]["enabled"]:
+        observers = MultiObserver([IsaacAlgoObserver(), PbtAlgoObserver(agent_cfg, args_cli)])
+        runner = Runner(observers)
+    else:
+        runner = Runner(IsaacAlgoObserver())
+
    runner.load(agent_cfg)

    # reset the agent and env
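
Since the PBT branch in `train.py` is only taken when `agent.pbt.enabled` is set, and each process handles a single policy, a launcher has to start one process per `policy_idx`. The sketch below only builds the command lines and does not run them. The helper name `build_pbt_commands`, the `python` interpreter, the `--task` flag, and the `<task-name>` placeholder are assumptions for illustration; the `agent.pbt.*` overrides are the ones listed in the documentation added by this commit.

```python
import shlex

TRAIN_SCRIPT = "scripts/reinforcement_learning/rl_games/train.py"


def build_pbt_commands(task, workspace, num_policies):
    """Return one argv list per policy, all sharing the same workspace."""
    commands = []
    for policy_idx in range(num_policies):
        commands.append([
            "python",  # assumption: interpreter used to start the script
            TRAIN_SCRIPT,
            f"--task={task}",  # assumption: task selection flag
            "agent.pbt.enabled=True",
            f"agent.pbt.workspace={workspace}",
            f"agent.pbt.policy_idx={policy_idx}",
            f"agent.pbt.num_policies={num_policies}",
        ])
    return commands


# "<task-name>" is a placeholder, not a real task identifier.
commands = build_pbt_commands("<task-name>", "pbt_workspace", 8)
for argv in commands:
    print(shlex.join(argv))
```

Each printed line differs only in `agent.pbt.policy_idx`, which is what lets the processes find each other's checkpoints in the shared workspace.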

--- a/source/isaaclab_rl/config/extension.toml
+++ b/source/isaaclab_rl/config/extension.toml
 [package]

 # Note: Semantic Versioning is used: https://semver.org/
-version = "0.3.0"
+version = "0.4.0"

 # Description
 title = "Isaac Lab RL"

--- a/source/isaaclab_rl/docs/CHANGELOG.rst
+++ b/source/isaaclab_rl/docs/CHANGELOG.rst
 Changelog
 ---------

+0.4.0 (2025-09-09)
+~~~~~~~~~~~~~~~~~~
+
+Added
+^^^^^
+
+* Introduced PBT to rl-games.
+
+
 0.3.0 (2025-09-03)
 ~~~~~~~~~~~~~~~~~~


--- a/source/isaaclab_rl/isaaclab_rl/rl_games/__init__.py
+++ b/source/isaaclab_rl/isaaclab_rl/rl_games/__init__.py
+# Copyright (c) 2022-2025, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
+# All rights reserved.
+#
+# SPDX-License-Identifier: BSD-3-Clause
+
+"""Wrappers and utilities to configure an environment for rl-games library."""
+
+from .pbt import *
+from .rl_games import *
--- a/source/isaaclab_rl/isaaclab_rl/rl_games/pbt/__init__.py
+++ b/source/isaaclab_rl/isaaclab_rl/rl_games/pbt/__init__.py
+# Copyright (c) 2022-2025, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
+# All rights reserved.
+#
+# SPDX-License-Identifier: BSD-3-Clause
+
+from .pbt import MultiObserver, PbtAlgoObserver
+from .pbt_cfg import PbtCfg
--- a/source/isaaclab_rl/isaaclab_rl/rl_games/pbt/mutation.py
+++ b/source/isaaclab_rl/isaaclab_rl/rl_games/pbt/mutation.py
+# Copyright (c) 2022-2025, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
+# All rights reserved.
+#
+# SPDX-License-Identifier: BSD-3-Clause
+
+import random
+from collections.abc import Callable
+from typing import Any
+
+
+def mutate_float(x: float, change_min: float = 1.1, change_max: float = 1.5) -> float:
+    """Multiply or divide by a random factor in [change_min, change_max]."""
+    k = random.uniform(change_min, change_max)
+    return x / k if random.random() < 0.5 else x * k
+
+
+def mutate_discount(x: float, **kwargs) -> float:
+    """Conservative change near 1.0: scale (1 - x) by a factor in [1.1, 1.2]; change-range kwargs are ignored."""
+    inv = 1.0 - x
+    new_inv = mutate_float(inv, change_min=1.1, change_max=1.2)
+    return 1.0 - new_inv
+
+
+MUTATION_FUNCS: dict[str, Callable[..., Any]] = {
+    "mutate_float": mutate_float,
+    "mutate_discount": mutate_discount,
+}
+
+
+def mutate(
+    params: dict[str, Any],
+    mutations: dict[str, str],
+    mutation_rate: float,
+    change_range: tuple[float, float],
+) -> dict[str, Any]:
+    cmin, cmax = change_range
+    out: dict[str, Any] = {}
+    for name, val in params.items():
+        fn_name = mutations.get(name)
+        # skip if no rule or coin flip says "no"
+        if fn_name is None or random.random() > mutation_rate:
+            out[name] = val
+            continue
+        fn = MUTATION_FUNCS.get(fn_name)
+        if fn is None:
+            raise KeyError(f"Unknown mutation function: {fn_name!r}")
+        out[name] = fn(val, change_min=cmin, change_max=cmax)
+    return out
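
To make the effect of the two mutation functions concrete, the sketch below copies `mutate_float` and `mutate_discount` from `mutation.py` above so that it runs standalone, then checks the size of the perturbation on a hypothetical learning rate and discount factor. The starting values, the seed, and the tolerance are illustrative assumptions.

```python
import random


def mutate_float(x, change_min=1.1, change_max=1.5):
    """Multiply or divide by a random factor in [change_min, change_max]."""
    k = random.uniform(change_min, change_max)
    return x / k if random.random() < 0.5 else x * k


def mutate_discount(x, **kwargs):
    """Scale (1 - x) by a factor in [1.1, 1.2]; extra kwargs are ignored."""
    inv = 1.0 - x
    return 1.0 - mutate_float(inv, change_min=1.1, change_max=1.2)


random.seed(0)  # fixed seed so repeated runs give the same values
EPS = 1e-9      # tolerance for floating-point rounding

# A learning rate moves by a factor between 1.1 and 2.0, up or down.
lr = 3e-4
new_lr = mutate_float(lr, change_min=1.1, change_max=2.0)
lr_factor = max(new_lr / lr, lr / new_lr)

# A discount factor stays close to 1.0: only (1 - gamma) is rescaled,
# and the wider change range passed here has no effect.
gamma = 0.99
new_gamma = mutate_discount(gamma, change_min=1.1, change_max=2.0)
inv_factor = max((1 - new_gamma) / (1 - gamma), (1 - gamma) / (1 - new_gamma))

print(f"learning_rate: {lr} -> {new_lr} (factor {lr_factor:.3f})")
print(f"gamma: {gamma} -> {new_gamma} (factor on 1 - gamma: {inv_factor:.3f})")
```

Because `mutate` passes `change_min` and `change_max` to every function, the `**kwargs` in `mutate_discount` is what allows it to accept and discard the configured `change_range`.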
--- a/source/isaaclab_rl/isaaclab_rl/rl_games/pbt/pbt.py
+++ b/source/isaaclab_rl/isaaclab_rl/rl_games/pbt/pbt.py
--- a/source/isaaclab_rl/isaaclab_rl/rl_games/pbt/pbt_cfg.py
+++ b/source/isaaclab_rl/isaaclab_rl/rl_games/pbt/pbt_cfg.py
+# Copyright (c) 2022-2025, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
+# All rights reserved.
+#
+# SPDX-License-Identifier: BSD-3-Clause
+
+from isaaclab.utils import configclass
+
+
+@configclass
+class PbtCfg:
+    """
+    Population-Based Training (PBT) configuration.
+
+    leaders are policies with score > max(mean + threshold_std*std, mean + threshold_abs).
+    underperformers are policies with score < min(mean - threshold_std*std, mean - threshold_abs).
+    On replacement, selected hyperparameters are mutated multiplicatively in [change_min, change_max].
+    """
+
+    enabled: bool = False
+    """Enable/disable PBT logic."""
+
+    policy_idx: int = 0
+    """Index of this learner in the population (unique in [0, num_policies-1])."""
+
+    num_policies: int = 8
+    """Total number of learners participating in PBT."""
+
+    directory: str = ""
+    """Root directory for PBT artifacts (checkpoints, metadata)."""
+
+    workspace: str = "pbt_workspace"
+    """Subfolder under the training dir to isolate this PBT run."""
+
+    objective: str = "Episode_Reward/success"
+    """The key in the info dict returned by env.step that PBT uses to determine leaders and underperformers.
+    If the reward is stationary, the term that corresponds to task success is usually enough; when the reward
+    is non-stationary, consider using a better objective.
+    """
+
+    interval_steps: int = 100_000
+    """Environment steps between PBT iterations (save, compare, replace/mutate)."""
+
+    threshold_std: float = 0.10
+    """Std-based margin k in the cuts mean ± max(k·std, threshold_abs) for leaders/underperformers."""
+
+    threshold_abs: float = 0.05
+    """Absolute margin A in the cuts mean ± max(threshold_std·std, A) for leaders/underperformers."""
+
+    mutation_rate: float = 0.25
+    """Per-parameter probability of mutation when a policy is replaced."""
+
+    change_range: tuple[float, float] = (1.1, 2.0)
+    """Lower and upper bound of multiplicative change factor (sampled in [change_min, change_max])."""
+
+    mutation: dict[str, str] = {}
+    """Maps each parameter to the name of the mutation function applied to it when PBT restarts a policy.
+    Example:
+        {
+            "agent.params.config.learning_rate": "mutate_float",
+            "agent.params.config.grad_norm": "mutate_float",
+            "agent.params.config.entropy_coef": "mutate_float",
+        }
+    """
--- a/source/isaaclab_rl/isaaclab_rl/rl_games/pbt/pbt_utils.py
+++ b/source/isaaclab_rl/isaaclab_rl/rl_games/pbt/pbt_utils.py
--- a/source/isaaclab_rl/isaaclab_rl/rl_games.py
+++ b/source/isaaclab_rl/isaaclab_rl/rl_games.py