Commit 3edc06c0 authored by Kelly Guo, committed by GitHub

Fixes distributed training hanging issue (#3273)

# Description

We have been hunting down a strange issue in distributed training setups
with rendering enabled, where the process would often hang midway
through training and cause NCCL timeouts. We found a workaround: setting
`app.execution.debug.forceSerial = true`, which forces serialized
scheduling of OmniGraph within the same thread. This appears to resolve
the hang and did not cause performance regressions.
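
To try the flag on a single run without editing the `.kit` file, the setting can likely be passed as a command-line override instead. The sketch below assumes Kit's usual `--/path/to/setting=value` override syntax. `kit_override` is a hypothetical helper written for illustration and is not part of Isaac Lab; how the resulting argument reaches the app depends on your launcher.

```python
def kit_override(dotted_key: str, value) -> str:
    """Turn a dotted .kit setting into a Kit-style command-line override.

    Hypothetical helper, shown only to illustrate the key-to-path mapping.
    """
    # Kit addresses settings by slash-separated path, e.g. /app/audio/enabled.
    path = "/" + dotted_key.strip().replace(".", "/")
    # Booleans are written lowercase, matching the .kit file.
    text = str(value).lower() if isinstance(value, bool) else str(value)
    return f"--{path}={text}"


print(kit_override("app.execution.debug.forceSerial", True))
# prints: --/app/execution/debug/forceSerial=true
```

Passing the override per run makes it easier to compare behavior with and without serialized scheduling before relying on the default in the `.kit` file.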

## Type of change

- Bug fix (non-breaking change which fixes an issue)

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there

parent 66f48774
@@ -83,6 +83,9 @@ app.updateOrder.checkForHydraRenderComplete = 1000
app.renderer.waitIdle=true
app.hydraEngine.waitIdle=true
# Forces serial processing for omni graph to avoid NCCL timeout hangs in distributed training
app.execution.debug.forceSerial = true
app.audio.enabled = false
# Enable Vulkan - avoids torch+cu12 error on windows