Commit 09590912 authored by Kelly Guo, committed by GitHub

Resets cuda device after each app.update call (#2283)

# Description

Calling `app.update()` may change the CUDA device that Isaac Lab previously
set. This change sets the CUDA device back to the desired device after each
`app.update()` call that `SimulationContext` makes in `reset`, `step`, and
`render`. This fixes NCCL errors on distributed setups for certain
environments (especially when rendering is enabled), where NCCL previously
reported that different ranks were running on the same device.
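
The pattern described above can be sketched in isolation as follows. This is a minimal illustration, not the code from this PR: `update_and_restore_device` and its `set_device` parameter are hypothetical names introduced here so the logic can be exercised without a GPU, and the only real API assumed is `torch.cuda.set_device`, which the diff itself calls.

```python
from typing import Callable, Optional


def update_and_restore_device(
    update: Callable[[], None],
    device: str,
    set_device: Optional[Callable[[str], None]] = None,
) -> bool:
    """Run ``update`` and then re-pin the CUDA device if ``device`` names one.

    Hypothetical helper for illustration only.
    Returns True if the device was re-pinned, False otherwise.
    """
    # The call that may silently switch the current CUDA device.
    update()
    # Same guard as in the diff: CPU runs need no re-pinning.
    if "cuda" not in device:
        return False
    if set_device is None:
        # Imported lazily so the sketch can be tried without PyTorch installed.
        import torch

        set_device = torch.cuda.set_device
    # Force the current device back to the one this process is meant to use.
    set_device(device)
    return True
```

Passing a stand-in for `set_device` (for example a list's `append`) makes it easy to confirm that the re-pin happens after the update and only for CUDA devices.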

## Type of change

- Bug fix (non-breaking change which fixes an issue)


## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there

parent 203955e4
[package]
# Note: Semantic Versioning is used: https://semver.org/
version = "0.36.5"
version = "0.36.6"
# Description
title = "Isaac Lab framework for Robot Learning"
......
Changelog
---------

0.36.6 (2025-04-09)
~~~~~~~~~~~~~~~~~~~

Changed
^^^^^^^

* Added call to set cuda device after each ``app.update()`` call in :class:`~isaaclab.sim.SimulationContext`.
  This is now required for multi-GPU workflows because some underlying logic in ``app.update()`` is modifying
  the cuda device, which results in NCCL errors on distributed setups.

0.36.5 (2025-04-01)
~~~~~~~~~~~~~~~~~~~
......
......@@ -452,6 +452,9 @@ class SimulationContext(_SimulationContext):
    def reset(self, soft: bool = False):
        super().reset(soft=soft)
        # app.update() may be changing the cuda device in reset, so we force it back to our desired device here
        if "cuda" in self.device:
            torch.cuda.set_device(self.device)
        # enable kinematic rendering with fabric
        if self.physics_sim_view:
            self.physics_sim_view._backend.initialize_kinematic_bodies()
......@@ -488,6 +491,10 @@ class SimulationContext(_SimulationContext):
        # step the simulation
        super().step(render=render)

        # app.update() may be changing the cuda device in step, so we force it back to our desired device here
        if "cuda" in self.device:
            torch.cuda.set_device(self.device)

    def render(self, mode: RenderMode | None = None):
        """Refreshes the rendering components including UI elements and view-ports depending on the render mode.
......@@ -527,6 +534,10 @@ class SimulationContext(_SimulationContext):
            self._app.update()
            self.set_setting("/app/player/playSimulations", True)

        # app.update() may be changing the cuda device, so we force it back to our desired device here
        if "cuda" in self.device:
            torch.cuda.set_device(self.device)

    """
    Operations - Override (extension)
    """
......
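
The failure mode this change addresses, several ranks ending up on the same CUDA device, can be checked for with a small diagnostic. The sketch below is an assumption-laden illustration rather than part of this PR: `find_device_collisions` is a hypothetical helper, and it assumes each rank's current device index has already been collected into a dictionary (for instance from `torch.cuda.current_device()` on each rank, gathered with something like `torch.distributed.all_gather_object`).

```python
from collections import defaultdict
from typing import Dict, List


def find_device_collisions(rank_to_device: Dict[int, int]) -> Dict[int, List[int]]:
    """Return {device_index: [ranks, ...]} for every device shared by more than one rank.

    Hypothetical helper for illustration only.
    """
    ranks_by_device: Dict[int, List[int]] = defaultdict(list)
    # Iterate in rank order so the reported rank lists are sorted.
    for rank, device_index in sorted(rank_to_device.items()):
        ranks_by_device[device_index].append(rank)
    # Keep only devices claimed by two or more ranks.
    return {dev: ranks for dev, ranks in ranks_by_device.items() if len(ranks) > 1}
```

An empty result means every rank reported a distinct device; a non-empty result names the ranks that have drifted onto a shared device.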