Unverified commit ec38e601, authored by Kelly Guo, committed by GitHub

Updates multi-node training commands to also support Spark (#3978)

# Description

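Removes the rendezvous backend arguments (`--rdzv_id`, `--rdzv_backend`,
`--rdzv_endpoint`) from the multi-node training commands, since they do not
seem to be necessary and prevent multi-node setup on the DGX Spark. The
commands now use `--master_addr` and `--master_port` instead.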
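
As a rough illustration of the updated launch pattern, the sketch below assembles the per-node command from the flags that appear in the changed docs. The helper name `build_launch_command` and the address `192.0.2.10` (a documentation-only placeholder) are assumptions for this example and are not part of Isaac Lab.

```python
import shlex


def build_launch_command(
    node_rank,
    master_addr,
    nnodes=2,
    nproc_per_node=2,
    master_port=5555,
    library="rsl_rl",
    task="Isaac-Cartpole-v0",
):
    """Return the argument list for launching one node of a multi-node run.

    Hypothetical helper: it only mirrors the flags shown in the updated docs.
    """
    return [
        "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",  # processes (GPUs) on this node
        f"--nnodes={nnodes}",                  # total number of nodes
        f"--node_rank={node_rank}",            # 0 for the master node
        f"--master_addr={master_addr}",        # same address on every node
        f"--master_port={master_port}",
        f"scripts/reinforcement_learning/{library}/train.py",
        f"--task={task}",
        "--headless",
        "--distributed",
    ]


if __name__ == "__main__":
    # Print the command for each of the two nodes; only --node_rank differs.
    for rank in range(2):
        print(shlex.join(build_launch_command(rank, "192.0.2.10")))
```

Every node receives the same `--master_addr` and `--master_port`; only `--node_rank` changes between machines.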


## Type of change

- Documentation update


## Checklist

- [x] I have read and understood the [contribution
guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there


---------
Signed-off-by: Kelly Guo <kellyg@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
parent a736a00b
@@ -141,14 +141,14 @@ For the master node, use the following command, where ``--nproc_per_node`` repre
 .. code-block:: shell
-python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 scripts/reinforcement_learning/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
 .. tab-item:: rsl_rl
 :sync: rsl_rl
 .. code-block:: shell
-python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
 .. tab-item:: skrl
 :sync: skrl
@@ -160,7 +160,7 @@ For the master node, use the following command, where ``--nproc_per_node`` repre
 .. code-block:: shell
-python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
 .. tab-item:: JAX
 :sync: jax
@@ -181,14 +181,14 @@ For non-master nodes, use the following command, replacing ``--node_rank`` with
 .. code-block:: shell
-python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=ip_of_master_machine:5555 scripts/reinforcement_learning/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/rl_games/train.py --task=Isaac-Cartpole-v0 --headless --distributed
 .. tab-item:: rsl_rl
 :sync: rsl_rl
 .. code-block:: shell
-python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=ip_of_master_machine:5555 scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
 .. tab-item:: skrl
 :sync: skrl
@@ -200,7 +200,7 @@ For non-master nodes, use the following command, replacing ``--node_rank`` with
 .. code-block:: shell
-python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=ip_of_master_machine:5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
+python -m torch.distributed.run --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<ip_of_master> --master_port=5555 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed
 .. tab-item:: JAX
 :sync: jax
......
@@ -89,8 +89,6 @@ Other notable limitations with respect to Isaac Lab include...
 #. Livestream and Hub Workstation Cache are not supported on the DGX spark.
-#. Multi-node training may require direct connections between Spark machines or additional network configurations.
 #. :ref:`Running Cosmos Transfer1 <running-cosmos>` is not currently supported on the DGX Spark.
 Troubleshooting
......