Unverified commit 40c8d16d, authored by ooctipus, committed by GitHub

Adds PBT algorithm to rl games (#3399)

# Description

This PR introduces the Population Based Training algorithm originally
implemented in

Petrenko, Aleksei, et al. "Dexpbt: Scaling up dexterous manipulation for
hand-arm systems with population based training." arXiv preprint
arXiv:2305.12127 (2023).

The PBT algorithm offers an alternative way to scale when increasing the
number of environments yields only marginal gains.
It borrows the idea of natural selection and exploits the stochasticity of
RL training: the top-performing agents are always kept, while weak agents
are replaced by top performers. This helps overcome catastrophic failures
and improves exploration.

Training view: underperformers are rescued by the best performers, later
surpass them, and become the best performers themselves.
<img width="1078" height="509" alt="Screenshot from 2025-09-09 00-55-11"
src="https://github.com/user-attachments/assets/34434bf1-5cb6-4956-a344-49c9969d4861"
/>


Note:
PBT is still in beta and has the following limitations:

1. In theory it can work with any RL algorithm, but the current
implementation only works with rl-games.
2. The API could be further simplified so that `num_policies` and
`policy_idx` do not need to be passed explicitly, which would allow a
dynamic max_population; this is left for future work.


## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [x] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there

parent c7dde1b7
......@@ -116,6 +116,7 @@ Table of Contents
source/features/hydra
source/features/multi_gpu
source/features/population_based_training
Tiled Rendering</source/overview/core-concepts/sensors/camera>
source/features/ray
source/features/reproducibility
......
Population Based Training
=========================

What PBT Does
-------------

* Trains *N* policies in parallel (a "population") on the **same task**.
* Every ``interval_steps``:

  #. Save each policy's checkpoint and objective.
  #. Score the population and identify **leaders** and **underperformers**.
  #. For each underperformer, replace its weights with those of a random leader and **mutate** selected hyperparameters.
  #. Restart that process with the new weights/params automatically.

Leader / Underperformer Selection
---------------------------------

Let ``o_i`` be each initialized policy's objective, with mean ``μ`` and std ``σ``.

Upper and lower performance cuts are::

    upper_cut = max(μ + threshold_std * σ, μ + threshold_abs)
    lower_cut = min(μ - threshold_std * σ, μ - threshold_abs)

* **Leaders**: ``o_i > upper_cut``
* **Underperformers**: ``o_i < lower_cut``

The "Natural-Selection" rules:

1. Only underperformers are acted on (mutated or replaced).
2. If leaders exist, replace an underperformer with a random leader; otherwise, self-mutate.
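
The selection rules above can be sketched in a few lines of Python. This is only an illustration, not the code from this PR: the helper names ``pbt_cuts`` and ``plan_replacements`` are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import random
import statistics


def pbt_cuts(objectives, threshold_std, threshold_abs):
    """Return (lower_cut, upper_cut) following the formulas above."""
    mu = statistics.fmean(objectives)
    # Assumption: population standard deviation; the docs only say "std".
    sigma = statistics.pstdev(objectives)
    upper_cut = max(mu + threshold_std * sigma, mu + threshold_abs)
    lower_cut = min(mu - threshold_std * sigma, mu - threshold_abs)
    return lower_cut, upper_cut


def plan_replacements(objectives, threshold_std=0.1, threshold_abs=0.025, rng=random):
    """Map each underperformer index to a leader index, or None to self-mutate."""
    lower_cut, upper_cut = pbt_cuts(objectives, threshold_std, threshold_abs)
    leaders = [i for i, o in enumerate(objectives) if o > upper_cut]
    underperformers = [i for i, o in enumerate(objectives) if o < lower_cut]
    # Rule 1: only underperformers are acted on.
    # Rule 2: copy from a random leader if any exist, otherwise self-mutate.
    return {i: (rng.choice(leaders) if leaders else None) for i in underperformers}


# Policy 0 is far below the mean and policy 4 is far above it.
print(plan_replacements([0.10, 0.50, 0.52, 0.48, 0.90]))
```

Policies whose objective lies between the two cuts are left untouched, which is why a tight cluster of scores triggers no replacement at all.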

Mutation (Hyperparameters)
--------------------------

* Each param has a mutation function (e.g., ``mutate_float``, ``mutate_discount``, etc.).
* A param is mutated with probability ``mutation_rate``.
* When mutated, its value is multiplied or divided by a random factor sampled from ``change_range = (min, max)``.
  ``mutate_discount`` is more conservative: it rescales ``1 - x`` by a fixed factor in ``[1.1, 1.2]``.
* Only whitelisted keys (from the PBT config) are considered.
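
To get a feel for how far a single mutation can move a value, the sketch below computes the reachable intervals for the two mutation styles. The helpers ``float_bounds`` and ``discount_bounds`` are hypothetical and exist only for this illustration; they assume a positive input and the multiply-or-divide behaviour of ``mutate_float`` and ``mutate_discount`` shown in this PR.

```python
def float_bounds(x, change_min, change_max):
    """Intervals reachable by dividing or multiplying x by k in [change_min, change_max]."""
    shrink = (x / change_max, x / change_min)
    grow = (x * change_min, x * change_max)
    return shrink, grow


def discount_bounds(x, change_min=1.1, change_max=1.2):
    """Same idea applied to (1 - x), which keeps a discount factor close to 1."""
    inv_shrink, inv_grow = float_bounds(1.0 - x, change_min, change_max)
    # A larger (1 - x) means a smaller discount, so the intervals flip.
    lower = (1.0 - inv_grow[1], 1.0 - inv_grow[0])
    higher = (1.0 - inv_shrink[1], 1.0 - inv_shrink[0])
    return lower, higher


# A learning rate of 3e-4 with change_range = (1.1, 2.0).
print(float_bounds(3e-4, 1.1, 2.0))
# A discount of 0.99 with the fixed (1.1, 1.2) range.
print(discount_bounds(0.99))
```

With these numbers the learning rate lands roughly in ``[1.5e-4, 2.7e-4]`` or ``[3.3e-4, 6e-4]``, while the discount moves by only about 0.001 to 0.002 in either direction.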

Example Config
--------------

.. code-block:: yaml

    pbt:
      enabled: True
      policy_idx: 0
      num_policies: 8
      directory: .
      workspace: "pbt_workspace"
      objective: Curriculum/difficulty_level
      interval_steps: 50000000
      threshold_std: 0.1
      threshold_abs: 0.025
      mutation_rate: 0.25
      change_range: [1.1, 2.0]
      mutation:
        agent.params.config.learning_rate: "mutate_float"
        agent.params.config.grad_norm: "mutate_float"
        agent.params.config.entropy_coef: "mutate_float"
        agent.params.config.critic_coef: "mutate_float"
        agent.params.config.bounds_loss_coef: "mutate_float"
        agent.params.config.kl_threshold: "mutate_float"
        agent.params.config.gamma: "mutate_discount"
        agent.params.config.tau: "mutate_discount"

``objective: Curriculum/difficulty_level`` uses ``infos["episode"]["Curriculum/difficulty_level"]`` as the scalar to
**rank policies** (higher is better). With ``num_policies: 8``, launch eight processes sharing the same ``workspace``
and unique ``policy_idx`` (0-7).

Launching PBT
-------------

You must start **one process per policy** and point them to the **same workspace**. Set a unique
``policy_idx`` for each process and the common ``num_policies``.

Minimal flags you need:

* ``agent.pbt.enabled=True``
* ``agent.pbt.workspace=<path/to/shared_folder>``
* ``agent.pbt.policy_idx=<0..num_policies-1>``
* ``agent.pbt.num_policies=<N>``

.. note::
   All processes must use the same ``agent.pbt.workspace`` so they can see each other's checkpoints.

.. caution::
   PBT is currently supported **only** with the **rl_games** library. Other RL libraries are not supported yet.
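
As a rough sketch of the per-process flags, the snippet below builds one override list per policy. Only the four ``agent.pbt.*`` flags come from this page; the helper name ``pbt_overrides`` is hypothetical, and the training command that these overrides are appended to is deliberately left out because it depends on your setup.

```python
def pbt_overrides(num_policies, workspace):
    """Return one list of override strings per policy in the population."""
    return [
        [
            "agent.pbt.enabled=True",
            f"agent.pbt.workspace={workspace}",  # shared by every process
            f"agent.pbt.policy_idx={idx}",  # unique per process
            f"agent.pbt.num_policies={num_policies}",
        ]
        for idx in range(num_policies)
    ]


# Print the overrides for a population of eight; append each line to
# your own training command, one process per line.
for flags in pbt_overrides(8, "pbt_workspace"):
    print(" ".join(flags))
```

Each printed line differs only in ``policy_idx``, which is the property the note above relies on: same workspace, distinct index.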

Tips
----

* Keep checkpoints fast: reduce ``interval_steps`` only if you really need tighter PBT cadence.
* Run at least six workers; smaller populations are unlikely to show the benefit of PBT.

References
----------

This PBT implementation is a reimplementation of, and is inspired by, *Dexpbt: Scaling up dexterous manipulation for hand-arm systems with population based training* (Petrenko et al., 2023).

.. code-block:: bibtex

    @article{petrenko2023dexpbt,
      title={Dexpbt: Scaling up dexterous manipulation for hand-arm systems with population based training},
      author={Petrenko, Aleksei and Allshire, Arthur and State, Gavriel and Handa, Ankur and Makoviychuk, Viktor},
      journal={arXiv preprint arXiv:2305.12127},
      year={2023}
    }
......@@ -81,7 +81,7 @@ from isaaclab.utils.assets import retrieve_file_path
 from isaaclab.utils.dict import print_dict
 from isaaclab.utils.io import dump_pickle, dump_yaml
-from isaaclab_rl.rl_games import RlGamesGpuEnv, RlGamesVecEnvWrapper
+from isaaclab_rl.rl_games import MultiObserver, PbtAlgoObserver, RlGamesGpuEnv, RlGamesVecEnvWrapper
 import isaaclab_tasks  # noqa: F401
 from isaaclab_tasks.utils.hydra import hydra_task_config
......@@ -127,7 +127,12 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
# specify directory for logging experiments
config_name = agent_cfg["params"]["config"]["name"]
log_root_path = os.path.join("logs", "rl_games", config_name)
log_root_path = os.path.abspath(log_root_path)
if "pbt" in agent_cfg:
if agent_cfg["pbt"]["directory"] == ".":
log_root_path = os.path.abspath(log_root_path)
else:
log_root_path = os.path.join(agent_cfg["pbt"]["directory"], log_root_path)
print(f"[INFO] Logging experiment in directory: {log_root_path}")
# specify directory for logging runs
log_dir = agent_cfg["params"]["config"].get("full_experiment_name", datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
......@@ -192,7 +197,13 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
     # set number of actors into agent config
     agent_cfg["params"]["config"]["num_actors"] = env.unwrapped.num_envs
     # create runner from rl-games
-    runner = Runner(IsaacAlgoObserver())
+    if "pbt" in agent_cfg and agent_cfg["pbt"]["enabled"]:
+        observers = MultiObserver([IsaacAlgoObserver(), PbtAlgoObserver(agent_cfg, args_cli)])
+        runner = Runner(observers)
+    else:
+        runner = Runner(IsaacAlgoObserver())
     runner.load(agent_cfg)
     # reset the agent and env
......
[package]
# Note: Semantic Versioning is used: https://semver.org/
-version = "0.3.0"
+version = "0.4.0"
# Description
title = "Isaac Lab RL"
......
Changelog
---------
0.4.0 (2025-09-09)
~~~~~~~~~~~~~~~~~~
Added
^^^^^
* Introduced PBT to rl-games.
0.3.0 (2025-09-03)
~~~~~~~~~~~~~~~~~~
......
# Copyright (c) 2022-2025, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
"""Wrappers and utilities to configure an environment for rl-games library."""
from .pbt import *
from .rl_games import *
# Copyright (c) 2022-2025, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
from .pbt import MultiObserver, PbtAlgoObserver
from .pbt_cfg import PbtCfg
# Copyright (c) 2022-2025, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
import random
from collections.abc import Callable
from typing import Any


def mutate_float(x: float, change_min: float = 1.1, change_max: float = 1.5) -> float:
    """Multiply or divide by a random factor in [change_min, change_max]."""
    k = random.uniform(change_min, change_max)
    return x / k if random.random() < 0.5 else x * k


def mutate_discount(x: float, **kwargs) -> float:
    """Conservative change near 1.0 by mutating (1 - x) in [1.1, 1.2]."""
    inv = 1.0 - x
    new_inv = mutate_float(inv, change_min=1.1, change_max=1.2)
    return 1.0 - new_inv


MUTATION_FUNCS: dict[str, Callable[..., Any]] = {
    "mutate_float": mutate_float,
    "mutate_discount": mutate_discount,
}


def mutate(
    params: dict[str, Any],
    mutations: dict[str, str],
    mutation_rate: float,
    change_range: tuple[float, float],
) -> dict[str, Any]:
    cmin, cmax = change_range
    out: dict[str, Any] = {}
    for name, val in params.items():
        fn_name = mutations.get(name)
        # skip if no rule or coin flip says "no"
        if fn_name is None or random.random() > mutation_rate:
            out[name] = val
            continue
        fn = MUTATION_FUNCS.get(fn_name)
        if fn is None:
            raise KeyError(f"Unknown mutation function: {fn_name!r}")
        out[name] = fn(val, change_min=cmin, change_max=cmax)
    return out
# Copyright (c) 2022-2025, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
from isaaclab.utils import configclass


@configclass
class PbtCfg:
    """
    Population-Based Training (PBT) configuration.

    Leaders are policies with score > max(mean + threshold_std*std, mean + threshold_abs).
    Underperformers are policies with score < min(mean - threshold_std*std, mean - threshold_abs).
    On replacement, selected hyperparameters are mutated multiplicatively in [change_min, change_max].
    """

    enabled: bool = False
    """Enable/disable PBT logic."""

    policy_idx: int = 0
    """Index of this learner in the population (unique in [0, num_policies-1])."""

    num_policies: int = 8
    """Total number of learners participating in PBT."""

    directory: str = ""
    """Root directory for PBT artifacts (checkpoints, metadata)."""

    workspace: str = "pbt_workspace"
    """Subfolder under the training dir to isolate this PBT run."""

    objective: str = "Episode_Reward/success"
    """Key in the info returned by env.step that PBT measures to determine leaders and underperformers.

    If the reward is stationary, the term that corresponds to task success is usually enough; when rewards
    are non-stationary, consider using a better objective.
    """

    interval_steps: int = 100_000
    """Environment steps between PBT iterations (save, compare, replace/mutate)."""

    threshold_std: float = 0.10
    """Std-based margin k in the cuts max(mean + k·std, mean + threshold_abs) for leaders and
    min(mean - k·std, mean - threshold_abs) for underperformers."""

    threshold_abs: float = 0.05
    """Absolute margin A in the cuts max(mean + threshold_std·std, mean + A) for leaders and
    min(mean - threshold_std·std, mean - A) for underperformers."""

    mutation_rate: float = 0.25
    """Per-parameter probability of mutation when a policy is replaced."""

    change_range: tuple[float, float] = (1.1, 2.0)
    """Lower and upper bound of multiplicative change factor (sampled in [change_min, change_max])."""

    mutation: dict[str, str] = {}
    """Mapping from parameter name to the mutation function applied to it when PBT restarts.

    Example:
    {
        "agent.params.config.learning_rate": "mutate_float",
        "agent.params.config.grad_norm": "mutate_float",
        "agent.params.config.entropy_coef": "mutate_float",
    }
    """