Commit aa421304 authored by Alex-Omar-Nvidia, committed by Kelly Guo

Update documentation on pytorch multi gpu setup (#2687)

# Description

Update the multi-GPU documentation to explain in more detail how Isaac Lab
integrates with PyTorch, and add more documentation links.

## Type of change

- This change requires a documentation update

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there

parent ed584581
......@@ -37,6 +37,7 @@ Guidelines for modifications:
## Contributors
* Alessandro Assirelli
* Alex Omar
* Alice Zhou
* Amr Mousa
* Andrej Orsula
......
......@@ -16,19 +16,54 @@ other workflows.
Multi-GPU Training
------------------
For complex reinforcement learning environments, it may be desirable to scale up training across multiple GPUs.
This is possible in Isaac Lab through the use of the
`PyTorch distributed <https://pytorch.org/docs/stable/distributed.html>`_ framework or the
`JAX distributed <https://jax.readthedocs.io/en/latest/jax.distributed.html>`_ module.
In PyTorch, the :mod:`torch.distributed` API is used to launch multiple training processes, where the number of
processes must be equal to or less than the number of GPUs available. Each process runs on
a dedicated GPU and launches its own instance of Isaac Sim and the Isaac Lab environment.
Each process collects its own rollouts during training and has its own copy of the policy
network. During training, gradients are aggregated across the processes and broadcast back to each process
at the end of the epoch.
In JAX, since the ML framework doesn't automatically start multiple processes from a single program invocation,
Isaac Lab supports the following multi-GPU training frameworks:
* `Torchrun <https://docs.pytorch.org/docs/stable/elastic/run.html>`_ through `PyTorch distributed <https://pytorch.org/docs/stable/distributed.html>`_
* `JAX distributed <https://jax.readthedocs.io/en/latest/jax.distributed.html>`_
PyTorch Torchrun Implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We are using `PyTorch Torchrun <https://docs.pytorch.org/docs/stable/elastic/run.html>`_ to manage multi-GPU
training. Torchrun manages the distributed training by:
* **Process Management**: Launching one process per GPU, where each process is assigned to a specific GPU.
* **Script Execution**: Running the same training script (e.g., RL Games trainer) on each process.
* **Environment Instances**: Each process creates its own instance of the Isaac Lab environment.
* **Gradient Synchronization**: Aggregating gradients across all processes and broadcasting the synchronized
  gradients back to each process after each training step.
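As a rough illustration of the process management described above, the sketch below shows how a training script might work out which GPU it was assigned. It assumes the ``LOCAL_RANK``, ``RANK``, and ``WORLD_SIZE`` environment variables that Torchrun sets for each process. The ``WorkerInfo`` class and ``read_worker_info`` helper are hypothetical names used only for this example and are not part of Isaac Lab or any RL library.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerInfo:
    """Hypothetical container for the identity of one training process."""

    local_rank: int   # index of this process on the current node
    global_rank: int  # index of this process across all nodes
    world_size: int   # total number of processes in the job
    device: str       # GPU this process is expected to use


def read_worker_info(env: Optional[Mapping[str, str]] = None) -> WorkerInfo:
    """Read the variables Torchrun exports, falling back to a single-process setup."""
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    global_rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    if not 0 <= global_rank < world_size:
        raise ValueError(f"rank {global_rank} is outside world size {world_size}")
    # Assumption for this sketch: one process per GPU, so the local rank
    # doubles as the CUDA device index.
    return WorkerInfo(local_rank, global_rank, world_size, f"cuda:{local_rank}")
```

When the script is started without Torchrun, none of the variables are set and the helper falls back to a single process on ``cuda:0``.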
.. tip::
   Check out this `3 minute YouTube video from PyTorch <https://www.youtube.com/watch?v=Cvdhwx-OBBo&list=PL_lsbAsL_o2CSuhUhJIiW0IkdT5C2wGWj&index=2>`_
   to understand how Torchrun works.
The key components in this setup are:
* **Torchrun**: Handles process spawning, communication, and gradient synchronization.
* **RL Library**: The reinforcement learning library that runs the actual training algorithm.
* **Isaac Lab**: Provides the simulation environment that each process instantiates independently.
Under the hood, Torchrun uses the `DistributedDataParallel <https://docs.pytorch.org/docs/2.7/notes/ddp.html#internal-design>`_
module to manage the distributed training. When training with multiple GPUs using Torchrun, the following happens:
* Each GPU runs an independent process
* Each process executes the full training script
* Each process maintains its own:
  * Isaac Lab environment instance (with *n* parallel environments)
  * Policy network copy
  * Experience buffer for rollout collection
* All processes synchronize only for gradient updates
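The synchronization step in the list above can be sketched in plain Python. This is a simulation under simplifying assumptions, not the actual implementation: two workers are modelled as lists inside one process, whereas a real run would exchange gradients between GPUs through PyTorch's distributed communication. The function names ``all_reduce_mean`` and ``sgd_step`` are hypothetical.

```python
from typing import List


def all_reduce_mean(local_grads: List[List[float]]) -> List[float]:
    """Average each parameter's gradient across all simulated workers."""
    num_workers = len(local_grads)
    return [sum(column) / num_workers for column in zip(*local_grads)]


def sgd_step(params: List[float], grads: List[float], lr: float) -> List[float]:
    """Apply one plain gradient-descent update."""
    return [p - lr * g for p, g in zip(params, grads)]


# Two simulated workers start from identical copies of the policy parameters.
params_per_worker = [[1.0, -2.0], [1.0, -2.0]]

# Each worker computes different gradients from its own rollouts.
local_grads = [[0.5, 1.0], [1.5, 3.0]]

# The only point of synchronization: every worker receives the same average.
synced = all_reduce_mean(local_grads)

# Applying the same gradients to identical parameters keeps the copies in sync.
params_per_worker = [sgd_step(p, synced, lr=0.5) for p in params_per_worker]
```

Because every worker applies the same averaged gradients, the policy copies stay identical even though environments and rollout buffers are never shared.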
For a deeper dive into how Torchrun works, check out
`PyTorch Docs: DistributedDataParallel - Internal Design <https://pytorch.org/docs/stable/notes/ddp.html#internal-design>`_.
JAX Implementation
^^^^^^^^^^^^^^^^^^
.. tip::
   JAX is only supported with the skrl library.
With JAX, we are using `skrl.utils.distributed.jax <https://skrl.readthedocs.io/en/latest/api/utils/distributed.html>`_.
Since the ML framework doesn't automatically start multiple processes from a single program invocation,
the skrl library provides a module to start them.
.. image:: ../_static/multi-gpu-rl/a3c-light.svg
......@@ -45,6 +80,9 @@ the skrl library provides a module to start them.
|
Running Multi-GPU Training
^^^^^^^^^^^^^^^^^^^^^^^^^^
To train with multiple GPUs, use the following command, where ``--nproc_per_node`` represents the number of available GPUs:
.. tab-set::
......