Fixes docs for Mimic Cosmos pipeline (#493)

# Description Fixes to Mimic-Cosmos documentation as per QA feedback. ## Type of change - Bug fix (non-breaking change which fixes an issue) - This change requires a documentation update ## Checklist - [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format` - [x] I have made corresponding changes to the documentation - [x] My changes generate no new warnings - [ ] I have added tests that prove my fix is effective or that my feature works - [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file - [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Fixes docs for Mimic Cosmos pipeline (#493)
# Description Fixes to Mimic-Cosmos documentation as per QA feedback. ## Type of change - Bug fix (non-breaking change which fixes an issue) - This change requires a documentation update ## Checklist - [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format` - [x] I have made corresponding changes to the documentation - [x] My changes generate no new warnings - [ ] I have added tests that prove my fix is effective or that my feature works - [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file - [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there
21009741 · shauryadNv · Kelly Guo · 2afe12cf · 21009741
Commit 21009741 authored Jun 18, 2025 by shauryadNv Committed by Kelly Guo Jul 12, 2025
Hide whitespace changes
Inline Side-by-side

Showing with 45 additions and 8 deletions

augmented_imitation.rst docs/source/overview/augmented_imitation.rst +45 -8

No files found.
--- a/docs/source/overview/augmented_imitation.rst
+++ b/docs/source/overview/augmented_imitation.rst
@@ -24,7 +24,7 @@ In the following example, we will show you how to use Isaac Lab Mimic to generat
    --device cpu --enable_cameras --headless --num_envs 10 --generation_num_trials 1000 \
    --input_file ./datasets/annotated_dataset.hdf5 --output_file ./datasets/mimic_dataset_1k.hdf5 \
    --task Isaac-Stack-Cube-Franka-IK-Rel-Visuomotor-Cosmos-Mimic-v0 \
-    --rendering_mode performance
+    --rendering_mode balanced
 The number of demonstrations can be increased or decreased, 1000 demonstrations have been shown to provide good training results for this task.
@@ -92,14 +92,14 @@ We use the RGB, depth and shaded segmentation videos from the previous step as i
   :align: center
   :alt: RGB, depth and segmentation control inputs to Cosmos
-We provide an example augmentation output from `Cosmos Transfer1 <https://github.com/nvidia-cosmos/cosmos-transfer1>`_ below:
+We provide an example augmentation output from `Cosmos Transfer1 <https://github.com/nvidia-cosmos/cosmos-transfer1/tree/1dca0bd1b3060d112bfec23c287572beabff9758>`_ below:
 .. figure:: https://download.isaacsim.omniverse.nvidia.com/isaaclab/images/cosmos_output.gif
   :width: 100%
   :align: center
   :alt: Cosmos Transfer1 augmentation output
-We recommend using the `Cosmos Transfer1 <https://github.com/nvidia-cosmos/cosmos-transfer1>`_ model for visual augmentation as we found it to produce the best results in the form of a highly diverse dataset with a wide range of visual variations. You can refer to `this example <https://github.com/nvidia-cosmos/cosmos-transfer1/blob/main/examples/inference_cosmos_transfer1_7b.md#example-2-multimodal-control>`_ for reference on how to use Transfer1 for this usecase. We further recommend the following settings to be used with the Transfer1 model for this task:
+We recommend using the `Cosmos Transfer1 <https://github.com/nvidia-cosmos/cosmos-transfer1/tree/1dca0bd1b3060d112bfec23c287572beabff9758>`_ model for visual augmentation as we found it to produce the best results in the form of a highly diverse dataset with a wide range of visual variations. You can refer to `this example <https://github.com/nvidia-cosmos/cosmos-transfer1/blob/1dca0bd1b3060d112bfec23c287572beabff9758/examples/inference_cosmos_transfer1_7b.md#example-2-multimodal-control>`_ for reference on how to use Transfer1 for this usecase. We further recommend the following settings to be used with the Transfer1 model for this task:
 .. rubric:: Hyperparameters
@@ -109,16 +109,12 @@ We recommend using the `Cosmos Transfer1 <https://github.com/nvidia-cosmos/cosmo
    * - ``negative_prompt``
      - "The video captures a game playing, with bad crappy graphics and cartoonish frames. It represents a recording of old outdated games. The images are very pixelated and of poor CG quality. There are many subtitles in the footage. Overall, the video is unrealistic and appears cg. Plane background."
-    * - ``positive_prompt``
-      - "realistic, photorealistic, high fidelity, varied lighting, varied background"
    * - ``sigma_max``
      - 50
    * - ``control_weight``
      - "0.3,0.3,0.6,0.7"
    * - ``hint_key``
      - "blur,canny,depth,segmentation"
-    * - ``control_input_preset_strength``
-      - "low"
 Another crucial aspect to get good augmentations is the set of prompts used to control the Cosmos generation. We provide a script, ``cosmos_prompt_gen.py``, to construct prompts from a set of carefully chosen templates that handle various aspects of the augmentation process.
@@ -158,6 +154,47 @@ In case you want to create your own prompts, we suggest you refer to the followi
 4. It is vital to include details on key aspects of the input control video(s) that should be retained or left unchanged. In our prompts, we very clearly mention that the cube colors should be left unchanged such that the bottom cube is blue, the middle is red and the top is green. Note that we not only mention what should be left unchanged but also give details on what form that aspect currently has.
+Example command to use the Cosmos Transfer1 model for this usecase:
+.. code:: bash
+    export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
+    export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
+    export NUM_GPU="${NUM_GPU:=1}"
+    PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
+        --checkpoint_dir $CHECKPOINT_DIR \
+        --video_save_folder outputs/mimic_cosmos \
+        --controlnet_specs ./controlnet_specs/demo_0.json \
+        --offload_text_encoder_model \
+        --offload_guardrail_models \
+        --num_gpus $NUM_GPU
+Example ``./controlnet_specs/demo_0.json`` json file to use with the above command:
+.. code:: json
+    {
+        "prompt": "A robotic arm is picking up and stacking cubes inside a foggy industrial scrapyard at dawn, surrounded by piles of old robotic parts and twisted metal. The background includes large magnetic cranes, rusted conveyor belts, and flickering yellow floodlights struggling to penetrate the fog. The robot arm is bright teal with a glossy surface and silver stripes on the outer edges; the joints rotate smoothly and the pistons reflect a pale cyan hue. The robot arm is mounted on a table that is light oak wood with a natural grain pattern and a glossy varnish that reflects overhead lights softly; small burn marks dot one corner. The arm is connected to the base mounted on the table. The bottom cube is deep blue, the second cube is bright red, and the top cube is vivid green, maintaining their correct order after stacking. Sunlight pouring in from a large, open window bathes the table and robotic arm in a warm golden light. The shadows are soft, and the scene feels natural and inviting with a slight contrast between light and shadow.",
+        "negative_prompt": "The video captures a game playing, with bad crappy graphics and cartoonish frames. It represents a recording of old outdated games. The images are very pixelated and of poor CG quality. There are many subtitles in the footage. Overall, the video is unrealistic and appears cg. Plane background.",
+        "input_video_path" : "mimic_generated_dataset_mp4/demo_0_table_cam.mp4",
+        "sigma_max": 50,
+        "vis": {
+            "input_control": "mimic_generated_dataset_mp4/demo_0_table_cam.mp4",
+            "control_weight": 0.3
+        },
+        "edge": {
+            "control_weight": 0.3
+        },
+        "depth": {
+            "input_control": "mimic_dataset_1k_mp4/demo_0_table_cam_depth.mp4",
+            "control_weight": 0.6
+        },
+        "seg": {
+            "input_control": "mimic_dataset_1k_mp4/demo_0_table_cam_shaded_segmentation.mp4",
+            "control_weight": 0.7
+        }
+    }
 MP4 to HDF5 Conversion
 ^^^^^^^^^^^^^^^^^^^^^^
@@ -346,7 +383,7 @@ Example usage for the cube stacking task:
    --enable_cameras \
    --seeds 0 \
    --num_rollouts 15 \
-    --rendering_mode performance
+    --rendering_mode balanced
 We use the above script to compare models trained with 1000 Mimic-generated demonstrations, 2000 Mimic-generated demonstrations and 2000 Cosmos-Mimic-generated demonstrations (1000 original mimic + 1000 Cosmos augmented) respectively. We use the same seeds (0, 1000 and 5000) for all three models and provide the metrics (averaged across best checkpoints for each seed) below: