Fixes distributed training hanging issue (#3273)
# Description We have been hunting down a strange issue in distributed training setups with rendering enabled, where often the process would hang midway through training and causes NCCL timeouts. A workaround was discovered to set `app.execution.debug.forceSerial = true`, which forces serialized scheduling of omni graph within the same thread. This appears to have resolved the hanging issue and did not cause performance regressions. ## Type of change <!-- As you go through the list, delete the ones that are not applicable. --> - Bug fix (non-breaking change which fixes an issue) ## Checklist - [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format` - [x] I have made corresponding changes to the documentation - [x] My changes generate no new warnings - [ ] I have added tests that prove my fix is effective or that my feature works - [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file - [ ] I have added my name to the `CONTRIBUTORS.md` or my name already exists there <!-- As you go through the checklist above, you can mark something as done by putting an x character in it For example, - [x] I have done this task - [ ] I have not done this task -->
Showing
Please register or sign in to comment