Update documentation on pytorch multi gpu setup (#2687)

# Description

Update the Multi GPU documentation to include more information about how we integrate with PyTorch and include more documentation links.

Fixes # (issue)

## Type of change

- This change requires a documentation update

## Screenshots

Please attach before and after screenshots of the change if applicable.

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Commit aa421304 authored Jul 30, 2025 by Alex-Omar-Nvidia Committed by Kelly Guo Jul 31, 2025

Showing with 52 additions and 13 deletions

CONTRIBUTORS.md +1 -0

docs/source/features/multi_gpu.rst +51 -13

--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -37,6 +37,7 @@ Guidelines for modifications:
 ## Contributors

 * Alessandro Assirelli
+* Alex Omar
 * Alice Zhou
 * Amr Mousa
 * Andrej Orsula

--- a/docs/source/features/multi_gpu.rst
+++ b/docs/source/features/multi_gpu.rst
@@ -16,19 +16,54 @@ other workflows.
 Multi-GPU Training
 ------------------

-For complex reinforcement learning environments, it may be desirable to scale up training across multiple GPUs.
-This is possible in Isaac Lab through the use of the
-`PyTorch distributed <https://pytorch.org/docs/stable/distributed.html>`_ framework or the
-`JAX distributed <https://jax.readthedocs.io/en/latest/jax.distributed.html>`_ module respectively.
-
-In PyTorch, the :meth:`torch.distributed` API is used to launch multiple processes of training, where the number of
-processes must be equal to or less than the number of GPUs available. Each process runs on
-a dedicated GPU and launches its own instance of Isaac Sim and the Isaac Lab environment.
-Each process collects its own rollouts during the training process and has its own copy of the policy
-network. During training, gradients are aggregated across the processes and broadcasted back to the process
-at the end of the epoch.
-
-In JAX, since the ML framework doesn't automatically start multiple processes from a single program invocation,
+Isaac Lab supports the following multi-GPU training frameworks:
+* `Torchrun <https://docs.pytorch.org/docs/stable/elastic/run.html>`_ through `PyTorch distributed <https://pytorch.org/docs/stable/distributed.html>`_
+* `JAX distributed <https://jax.readthedocs.io/en/latest/jax.distributed.html>`_
+
+PyTorch Torchrun Implementation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Isaac Lab uses `PyTorch Torchrun <https://docs.pytorch.org/docs/stable/elastic/run.html>`_ to manage multi-GPU
+training. Torchrun manages distributed training by:
+
+* **Process Management**: Launching one process per GPU, where each process is assigned to a specific GPU.
+* **Script Execution**: Running the same training script (e.g., RL Games trainer) on each process.
+* **Environment Instances**: Each process creates its own instance of the Isaac Lab environment.
+* **Gradient Synchronization**: Aggregating gradients across all processes and broadcasting the synchronized
+  gradients back to each process after each training step.
+
+.. tip::
+    Check out this `3-minute YouTube video from PyTorch <https://www.youtube.com/watch?v=Cvdhwx-OBBo&list=PL_lsbAsL_o2CSuhUhJIiW0IkdT5C2wGWj&index=2>`_
+    to understand how Torchrun works.
+
+The key components in this setup are:
+
+* **Torchrun**: Handles process spawning, communication, and gradient synchronization.
+* **RL Library**: The reinforcement learning library that runs the actual training algorithm.
+* **Isaac Lab**: Provides the simulation environment that each process instantiates independently.
+
+Under the hood, Torchrun uses the `DistributedDataParallel <https://docs.pytorch.org/docs/2.7/notes/ddp.html#internal-design>`_
+module to manage the distributed training. When training with multiple GPUs using Torchrun, the following happens:
+
+* Each GPU runs an independent process
+* Each process executes the full training script
+* Each process maintains its own:
+  * Isaac Lab environment instance (with *n* parallel environments)
+  * Policy network copy
+  * Experience buffer for rollout collection
+* All processes synchronize only for gradient updates
+
+For a deeper dive into how Torchrun works, check out
+`PyTorch Docs: DistributedDataParallel - Internal Design <https://pytorch.org/docs/stable/notes/ddp.html#internal-design>`_.
+
+JAX Implementation
+^^^^^^^^^^^^^^^^^^
+
+.. tip::
+    JAX is only supported with the skrl library.
+
+With JAX, we are using `skrl.utils.distributed.jax <https://skrl.readthedocs.io/en/latest/api/utils/distributed.html>`_.
+Since the ML framework doesn't automatically start multiple processes from a single program invocation,
 the skrl library provides a module to start them.

 .. image:: ../_static/multi-gpu-rl/a3c-light.svg
@@ -45,6 +80,9 @@ the skrl library provides a module to start them.

 |

+Running Multi-GPU Training
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 To train with multiple GPUs, use the following command, where ``--nproc_per_node`` represents the number of available GPUs:

 .. tab-set::
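
Note on the added Torchrun section: the per-process setup and the gradient synchronization step it describes can be illustrated with a small, framework-free sketch. This is not code from Isaac Lab or from any RL library. The names `WorkerInfo`, `read_worker_info`, and `all_reduce_mean` are hypothetical, and the averaging function is a pure-Python stand-in for a collective all-reduce. The only external fact it relies on is that Torchrun exports the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables to each process it launches.

```python
import os
from dataclasses import dataclass


@dataclass
class WorkerInfo:
    """Hypothetical container for the identity of one training process."""

    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its node; typically selects the GPU
    world_size: int  # total number of processes in the job

    @property
    def device(self) -> str:
        # One process per GPU: the local rank is used as the CUDA device index.
        return f"cuda:{self.local_rank}"


def read_worker_info(env=os.environ) -> WorkerInfo:
    """Read the variables Torchrun exports; the defaults describe a single-process run."""
    return WorkerInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


def all_reduce_mean(per_process_grads):
    """Simulate gradient synchronization.

    Takes one gradient vector per process, averages them element-wise,
    and returns one copy of the averaged vector for every process.
    """
    world_size = len(per_process_grads)
    mean = [sum(values) / world_size for values in zip(*per_process_grads)]
    return [list(mean) for _ in range(world_size)]


if __name__ == "__main__":
    # Pretend to be the second of two processes launched on one node.
    info = read_worker_info({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"})
    print(info.device)

    # Two processes computed different gradients from their own rollouts.
    print(all_reduce_mean([[1.0, 2.0], [3.0, 6.0]]))
```

In a real run each process would hold only its own gradient vector and the averaging would happen through the distributed backend. The sketch gathers all vectors in one list only to make the arithmetic visible.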