• Kelly Guo's avatar
    Fixes distributed training hanging issue (#3273) · 3edc06c0
    Kelly Guo authored
    # Description
    
    We have been hunting down a strange issue in distributed training setups
    with rendering enabled, where often the process would hang midway
    through training and causes NCCL timeouts. A workaround was discovered
    to set `app.execution.debug.forceSerial = true`, which forces serialized
    scheduling of omni graph within the same thread. This appears to have
    resolved the hanging issue and did not cause performance regressions.
    
    ## Type of change
    
    <!-- As you go through the list, delete the ones that are not
    applicable. -->
    
    - Bug fix (non-breaking change which fixes an issue)
    
    ## Checklist
    
    - [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
    `./isaaclab.sh --format`
    - [x] I have made corresponding changes to the documentation
    - [x] My changes generate no new warnings
    - [ ] I have added tests that prove my fix is effective or that my
    feature works
    - [ ] I have updated the changelog and the corresponding version in the
    extension's `config/extension.toml` file
    - [ ] I have added my name to the `CONTRIBUTORS.md` or my name already
    exists there
    
    <!--
    As you go through the checklist above, you can mark something as done by
    putting an x character in it
    
    For example,
    - [x] I have done this task
    - [ ] I have not done this task
    -->
    3edc06c0
isaaclab.python.headless.rendering.kit 5.89 KB