Clarifies Ray Documentation and Fixes Minor Issues (#1717)

# Description

This PR cleans up the Ray documentation to be more clear, and fixes some small issues in the code.

- Moved local set up to be an easier Docker-based thing
- Added Wget to docker container to fix issue where Ray head would fail health check on GKE where the workers wouldn't start
- Removed redundant information from Documentation
- Added a local quickstart
- (investigated whether https mlflow was possible, added flag to jinja env) @kellyguo11
- Added better compatibility with other workflows to address #1703
- Avoided early exit due to buffer overflow to address #1703 (thank you @giulioturrisi for helping find this)

## Type of change

- Bug fix (non-breaking change which fixes an issue)
- This change requires a documentation update

![image](https://github.com/user-attachments/assets/eb38b3c8-8e9c-438d-9218-8b0662146f96)

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

---------

Signed-off-by: garylvov <67614381+garylvov@users.noreply.github.com>
Co-authored-by: Gary Lvov <glvov@theaiinstitute.com>

Unverified Commit ea766783 authored Jan 27, 2025 by garylvov Committed by GitHub Jan 27, 2025
12 changed files
--- a/docs/source/features/ray.rst
+++ b/docs/source/features/ray.rst
@@ -20,33 +20,33 @@ the general workflow is the same.
  :depth: 3
  :local:

-Overview
--------
+**Overview**
+------------

-The Ray integration is useful for the following:
+The Ray integration is useful for the following.

- Dispatching several training jobs in parallel or sequentially with minimal interaction
- Tuning hyperparameters; in parallel or sequentially with support for multiple GPUs and/or multiple GPU Nodes
- Using the same training setup everywhere (on cloud and local) with minimal overhead
- Resource Isolation for training jobs
+- Dispatching several training jobs in parallel or sequentially with minimal interaction.
+- Tuning hyperparameters; in parallel or sequentially with support for multiple GPUs and/or multiple GPU Nodes.
+- Using the same training setup everywhere (on cloud and local) with minimal overhead.
+- Resource Isolation for training jobs (resource-wrapped jobs).

 The core functionality of the Ray workflow consists of two main scripts that enable the orchestration
-of resource-wrapped and tuning aggregate jobs. These scripts facilitate the decomposition of
-aggregate jobs (overarching experiments) into individual jobs, which are discrete commands
-executed on the cluster. An aggregate job can include multiple individual jobs.
-For clarity, this guide refers to the jobs one layer below the topmost aggregate level as sub-jobs.
+of resource-wrapped and tuning aggregate jobs. In resource-wrapped aggregate jobs, each sub-job and its
+resource requirements are defined manually, enabling resource isolation.
+For tuning aggregate jobs, individual jobs are generated automatically based on a hyperparameter
+sweep configuration.

 Both resource-wrapped and tuning aggregate jobs dispatch individual jobs to a designated Ray
 cluster, which leverages the cluster's resources (e.g., a single workstation node or multiple nodes)
-to execute these jobs with workers in parallel and/or sequentially. By default, aggregate jobs use all \
+to execute these jobs with workers in parallel and/or sequentially.
+
+By default, jobs use all \
 available resources on each available GPU-enabled node for each sub-job worker. This can be changed through
-specifying the ``--num_workers`` argument, especially critical for parallel aggregate
-job processing on local or virtual multi-GPU machines
+specifying the ``--num_workers`` argument for resource-wrapped jobs, or ``--num_workers_per_node``
+for tuning jobs, which is especially critical for parallel aggregate
+job processing on local/virtual multi-GPU machines. Tuning jobs assume homogeneous node resource composition for nodes with GPUs.

-In resource-wrapped aggregate jobs, each sub-job and its
-resource requirements are defined manually, enabling resource isolation.
-For tuning aggregate jobs, individual jobs are generated automatically based on a hyperparameter
-sweep configuration. This assumes homogeneous node resource composition for nodes with GPUs.
+The two following files contain the core functionality of the Ray integration.

 .. dropdown:: source/standalone/workflows/ray/wrap_resources.py
  :icon: code
@@ -66,7 +66,7 @@ sweep configuration. This assumes homogeneous node resource composition for node
 The following script can be used to submit aggregate
 jobs to one or more Ray cluster(s), which can be used for
 running jobs on a remote cluster or simultaneous jobs with heterogeneous
-resource requirements:
+resource requirements.

 .. dropdown:: source/standalone/workflows/ray/submit_job.py
  :icon: code
@@ -75,7 +75,7 @@ resource requirements:
    :language: python
    :emphasize-lines: 12-53

-The following script can be used to extract KubeRay Cluster information for aggregate job submission.
+The following script can be used to extract KubeRay cluster information for aggregate job submission.

 .. dropdown:: source/standalone/workflows/ray/grok_cluster_with_kubectl.py
  :icon: code
@@ -93,106 +93,56 @@ The following script can be used to easily create clusters on Google GKE.
    :language: python
    :emphasize-lines: 16-37

-**Installation**
----------------
-
-The Ray functionality requires additional dependencies be installed.
-
-To use Ray without Kubernetes, like on a local computer or VM,
-``kubectl`` is not required. For use on Kubernetes clusters with KubeRay,
-such as Google Kubernetes Engine or Amazon Elastic Kubernetes Service, ``kubectl`` is required, and can
-be installed via the `Kubernetes website <https://kubernetes.io/docs/tasks/tools/>`_
-
-The pythonic dependencies can be installed with:
+**Docker-based Local Quickstart**
+-----------------------------------

-.. code-block:: bash
+First, follow the `Docker Guide <https://isaac-sim.github.io/IsaacLab/main/source/deployment/docker.html>`_
+to set up the NVIDIA Container Toolkit and Docker Compose.

-  # For multi-run support and resource isolation
-  ./isaaclab.sh -p -m pip install ray[default]==2.31.0
-  # For hyperparameter tuning
-  ./isaaclab.sh -p -m pip install ray[tune]==2.31.0
-  ./isaaclab.sh -p -m pip install optuna bayesian-optimization
-  # MLFlow is needed only for fetching logs on clusters, not needed for local
-  ./isaaclab.sh -p -m pip install mlflow
-
-If using KubeRay clusters on Google GKE with the batteries-included cluster launch file,
-the following dependencies are also needed.
-
-.. code-block:: bash
-
-  ./isaaclab.sh -p -m pip install kubernetes Jinja2
-
-**Setup Overview: Cluster Configuration**
-----------------------------------------
-
-Select one of the following methods to create a Ray Cluster to accept and execute dispatched jobs.
-
-Single-Node Ray Cluster (Recommended for Beginners)
-'''''''''''''''''''''''''''''''''''''''''''''''''''
-For use on a single machine (node) such as a local computer or VM, the
-following command can be used start a ray server. This is compatible with
-multiple-GPU machines. This Ray server will run indefinitely until it is stopped with ``CTRL + C``
+Then, run the following steps to start a tuning run.

 .. code-block:: bash

+  # Build the base image, but we don't need to run it
+  python3 docker/container.py start && python3 docker/container.py stop
+  # Build the tuning image with extra deps
+  docker build -t isaacray -f source/standalone/workflows/ray/cluster_configs/Dockerfile .
+  # Start the tuning image - symlink so that changes in the source folder show up in the container
+  docker run -v $(pwd)/source:/workspace/isaaclab/source -it --gpus all --net=host --entrypoint /bin/bash isaacray
+  # Start the Ray server within the tuning image
  echo "import ray; ray.init(); import time; [time.sleep(10) for _ in iter(int, 1)]" | ./isaaclab.sh -p

-KubeRay Clusters
-''''''''''''''''
-.. attention::
-  The ``ray`` command should be modified to use Isaac python, which could be achieved in a fashion similar to
-  ``sed -i "1i $(echo "#!/workspace/isaaclab/_isaac_sim/python.sh")" \
-  /isaac-sim/kit/python/bin/ray && ln -s /isaac-sim/kit/python/bin/ray /usr/local/bin/ray``.
-
-Google Cloud is currently the only platform tested, although
-any cloud provider should work if one configures the following:
-
- An container registry (NGC, GCS artifact registry, AWS ECR, etc) with
-  an Isaac Lab image configured to support Ray. See ``cluster_configs/Dockerfile`` to see how to modify the ``isaac-lab-base``
-  container for Ray compatibility. Ray should use the isaac sim python shebang, and ``nvidia-smi``
-  should work within the container. Be careful with the setup here as
-  paths need to be configured correctly for everything to work. It's likely that
-  the example dockerfile will work out of the box and can be pushed to the registry, as
-  long as the base image has already been built as in the container guide
- A Kubernetes setup with available NVIDIA RTX (likely ``l4`` or ``l40`` or ``tesla-t4`` or ``a10``) GPU-passthrough node-pool resources,
-  that has access to your container registry/storage bucket and has the Ray operator enabled with correct IAM
-  permissions. This can be easily achieved with services such as Google GKE or AWS EKS,
-  provided that your account or organization has been granted a GPU-budget. It is recommended
-  to use manual kubernetes services as opposed to "autopilot" services for cost-effective
-  experimentation as this way clusters can be completely shut down when not in use, although
-  this may require installing the `Nvidia GPU Operator <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html>`_
- An MLFlow server that your cluster has access to.
- A ``kuberay.yaml.ninja`` file that describes how to allocate resources (already included for
-  Google Cloud, which can be referenced for the format and MLFlow integration)
-
-Ray Clusters (Without Kubernetes)
-'''''''''''''''''''''''''''''''''
-.. attention::
-  Modify the Ray command to use Isaac Python like in KubeRay Clusters, and follow the same
-  steps for creating an image/cluster permissions/bucket access.

-See the `Ray Clusters Overview <https://docs.ray.io/en/latest/cluster/getting-started.html>`_ or
-`Anyscale <https://www.anyscale.com/product>`_ for more information

+In a different terminal, run the following.

-**Dispatching Jobs and Tuning**
-------------------------------

-Select one of the following guides that matches your desired Cluster configuration.
+.. code-block:: bash

-Simple Ray Cluster (Local/VM)
-'''''''''''''''''''''''''''''
+  # In a new terminal (don't close the above), enter the image with a new shell.
+  docker container ps
+  docker exec -it <ISAAC_RAY_IMAGE_ID_FROM_CONTAINER_PS> /bin/bash
+  # Start a tuning run, with one parallel worker per GPU
+  ./isaaclab.sh -p source/standalone/workflows/ray/tuner.py \
+    --cfg_file source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py \
+    --cfg_class CartpoleTheiaJobCfg \
+    --run_mode local \
+    --workflow source/standalone/workflows/rl_games/train.py \
+    --num_workers_per_node <NUMBER_OF_GPUS_IN_COMPUTER>

-This guide assumes that there is a Ray cluster already running, and that this script is run locally on the cluster, or
-that the cluster job submission address is known.

-1.) Testing that the cluster works can be done as follows.
+To view the training logs, in a different terminal, run the following and visit ``localhost:6006`` in a browser afterwards.

 .. code-block:: bash

-  ./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py --test
+  # In a new terminal (don't close the above), enter the image with a new shell.
+  docker container ps
+  docker exec -it <ISAAC_RAY_IMAGE_ID_FROM_CONTAINER_PS> /bin/bash
+  # Start TensorBoard to serve the training logs
+  tensorboard --logdir=.

-2.) Submitting resource-wrapped sub-jobs can be done as described in the following file:
+
+Submitting resource-wrapped individual jobs instead of automatic tuning runs is described in the following file.

 .. dropdown:: source/standalone/workflows/ray/wrap_resources.py
  :icon: code
@@ -201,13 +151,28 @@ that the cluster job submission address is known.
    :language: python
    :emphasize-lines: 14-66

-3.) For tuning jobs, specify the hyperparameter sweep similar to the following two files.
+Transferring files from the running container can be done as follows.
+
+.. code-block:: bash
+
+  docker container ps
+  docker cp <ISAAC_RAY_IMAGE_ID_FROM_CONTAINER_PS>:</path/in/container/file>  </path/on/host/>
+

-.. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cfg.py
+For tuning jobs, specify the tuning job / hyperparameter sweep as a child class of :class:`JobCfg`.
+The included :class:`JobCfg` only supports the ``rl_games`` workflow due to differences in
+environment entrypoints and hydra arguments, although other workflows will work if provided a compatible
+:class:`JobCfg`.
+
+.. dropdown:: source/standalone/workflows/ray/tuner.py (JobCfg definition)
  :icon: code

-  .. literalinclude:: ../../../source/standalone/workflows/ray/hyperparameter_tuning/vision_cfg.py
+  .. literalinclude:: ../../../source/standalone/workflows/ray/tuner.py
    :language: python
+    :start-at: class JobCfg
+    :end-at: self.cfg = cfg
+
+For example, see the following Cartpole Example configurations.

 .. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
  :icon: code
@@ -215,42 +180,87 @@ that the cluster job submission address is known.
  .. literalinclude:: ../../../source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
    :language: python

-Then, see the local examples in the following file to see how to start a tuning run.

-.. dropdown:: source/standalone/workflows/ray/tuner.py
-  :icon: code
+**Remote Clusters**
+-------------------------

-  .. literalinclude:: ../../../source/standalone/workflows/ray/tuner.py
-    :language: python
-    :emphasize-lines: 18-53
+Select one of the following methods to create a Ray cluster to accept and execute dispatched jobs.
+
+KubeRay Setup
+~~~~~~~~~~~~~
+
+If using KubeRay clusters on Google GKE with the batteries-included cluster launch file,
+the following dependencies are also needed.
+
+.. code-block:: bash

+  python3 -m pip install kubernetes Jinja2

+For use on Kubernetes clusters with KubeRay,
+such as Google Kubernetes Engine or Amazon Elastic Kubernetes Service, ``kubectl`` is required, and can
+be installed via the `Kubernetes website <https://kubernetes.io/docs/tasks/tools/>`_ .

-To view the logs, simply run ``tensorboard --logdir=<LOCAL_STORAGE_PATH_READ_FROM_OUTPUT>``
+Google Cloud is currently the only platform tested, although
+any cloud provider should work if one configures the following.

-Remote Ray Cluster Setup and Use
-'''''''''''''''''''''''''''''''''
-This guide assumes that one desires to create a cluster on a remote host or server. This
-guide includes shared steps, and KubeRay or Ray specific steps. Follow all shared steps (part I and II), and then
-only the KubeRay or Ray steps depending on your desired configuration, in order of shared steps part I, then
-the configuration specific steps, then shared steps part II.
+.. attention::
+  The ``ray`` command should be modified to use Isaac python, which could be achieved in a fashion similar to
+  ``sed -i "1i $(echo "#!/workspace/isaaclab/_isaac_sim/python.sh")" \
+  /isaac-sim/kit/python/bin/ray && ln -s /isaac-sim/kit/python/bin/ray /usr/local/bin/ray``.
+
+- A container registry (NGC, GCS artifact registry, AWS ECR, etc.) with
+  an Isaac Lab image configured to support Ray. See ``cluster_configs/Dockerfile`` to see how to modify the ``isaac-lab-base``
+  container for Ray compatibility. Ray should use the isaac sim python shebang, and ``nvidia-smi``
+  should work within the container. Be careful with the setup here as
+  paths need to be configured correctly for everything to work. It's likely that
+  the example dockerfile will work out of the box and can be pushed to the registry, as
+  long as the base image has already been built as in the container guide.
+- A Kubernetes setup with available NVIDIA RTX (likely ``l4`` or ``l40`` or ``tesla-t4`` or ``a10``) GPU-passthrough node-pool resources,
+  that has access to your container registry/storage bucket and has the Ray operator enabled with correct IAM
+  permissions. This can be easily achieved with services such as Google GKE or AWS EKS,
+  provided that your account or organization has been granted a GPU-budget. It is recommended
+  to use manual kubernetes services as opposed to "autopilot" services for cost-effective
+  experimentation as this way clusters can be completely shut down when not in use, although
+  this may require installing the `Nvidia GPU Operator <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html>`_ .
+- An `MLFlow server <https://mlflow.org/docs/latest/getting-started/logging-first-model/step1-tracking-server.html>`_ that your cluster has access to
+  (already included for Google Cloud, which can be referenced for the format and MLFlow integration).
+- A ``kuberay.yaml.jinja`` file that describes how to allocate resources (already included for
+  Google Cloud, which can be referenced for the format and MLFlow integration).
+
+Ray Clusters (Without Kubernetes) Setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. attention::
+  Modify the Ray command to use Isaac Python like in KubeRay clusters, and follow the same
+  steps for creating an image/cluster permissions.
+
+See the `Ray Clusters Overview <https://docs.ray.io/en/latest/cluster/getting-started.html>`_ or
+`Anyscale <https://www.anyscale.com/product>`_ for more information.
+
+Also, create an `MLFlow server <https://mlflow.org/docs/latest/getting-started/logging-first-model/step1-tracking-server.html>`_ that your local
+host and cluster have access to.

 Shared Steps Between KubeRay and Pure Ray Part I
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-1.) Build the Isaac Ray image, and upload it to your container registry of choice.
+1.) Install Ray on your local machine.
+
+.. code-block:: bash
+
+  python3 -m pip install ray[default]==2.31.0
+
+2.) Build the Isaac Ray image, and upload it to your container registry of choice.

 .. code-block:: bash

  # Login with NGC (nvcr.io) registry first, see docker steps in repo.
-  ./isaaclab.sh -p docker/container.py start
+  python3 docker/container.py start
  # Build the special Isaac Lab Ray Image
  docker build -t <REGISTRY/IMAGE_NAME> -f source/standalone/workflows/ray/cluster_configs/Dockerfile .
  # Push the image to your registry of choice.
  docker push <REGISTRY/IMAGE_NAME>

-KubeRay Specific
-~~~~~~~~~~~~~~~~
+KubeRay Clusters Only
+~~~~~~~~~~~~~~~~~~~~~
 `k9s <https://github.com/derailed/k9s>`_ is a great tool for monitoring your clusters that can
 easily be installed with ``snap install k9s --devmode``.

@@ -273,11 +283,7 @@ easily be installed with ``snap install k9s --devmode``.
 2.) Create the KubeRay cluster and an MLFlow server for receiving logs
 that your cluster has access to.
 This can be done automatically for Google GKE,
-where instructions are included in the following creation file. More than once cluster
-can be created at once. Each cluster can have heterogeneous resources if so desired,
-although only
-For other cloud services, the ``kuberay.yaml.ninja`` will be similar to that of
-Google's.
+where instructions are included in the following creation file.

 .. dropdown:: source/standalone/workflows/ray/launch.py
  :icon: code
@@ -286,6 +292,18 @@ Google's.
    :language: python
    :emphasize-lines: 15-37

+For other cloud services, the ``kuberay.yaml.jinja`` will be similar to that of
+Google's.
+
+
+.. dropdown:: source/standalone/workflows/ray/cluster_configs/google_cloud/kuberay.yaml.jinja
+  :icon: code
+
+  .. literalinclude:: ../../../source/standalone/workflows/ray/cluster_configs/google_cloud/kuberay.yaml.jinja
+      :language: python
+
+
+
 3.) Fetch the KubeRay cluster IP addresses, and the MLFLow Server IP.
 This can be done automatically for KubeRay clusters,
 where instructions are included in the following fetching file.
@@ -299,8 +317,8 @@ printed.
    :language: python
    :emphasize-lines: 14-26

-Ray Specific
-~~~~~~~~~~~~
+Ray Clusters Only (Without Kubernetes)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


 1.) Verify cluster access.
@@ -311,17 +329,18 @@ a new line for each unique cluster. For one cluster, there should only be one li
 3.) Start an MLFLow Server to receive the logs that the ray cluster has access to,
 and determine the server URI.

-Shared Steps Between KubeRay and Pure Ray Part II
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Dispatching Steps Shared Between KubeRay and Pure Ray Part II
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
 1.) Test that your cluster is operational with the following.

 .. code-block:: bash

  # Test that NVIDIA GPUs are visible and that Ray is operational with the following command:
-  ./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py
-	--jobs wrap_resources.py --test
+  python3 source/standalone/workflows/ray/submit_job.py --aggregate_jobs wrap_resources.py --test

-2.) Submitting Jobs can be done in the following manner, with the following script.
+2.) Submitting tuning and/or resource-wrapped jobs is described in the :file:`submit_job.py` file.

 .. dropdown:: source/standalone/workflows/ray/submit_job.py
  :icon: code
@@ -330,16 +349,20 @@ Shared Steps Between KubeRay and Pure Ray Part II
    :language: python
    :emphasize-lines: 12-53

-3.) For tuning jobs, specify the hyperparameter sweep similar to :class:`RLGamesCameraJobCfg` in the following file:
+3.) For tuning jobs, specify the tuning job / hyperparameter sweep as a :class:`JobCfg` .
+The included :class:`JobCfg` only supports the ``rl_games`` workflow due to differences in
+environment entrypoints and hydra arguments, although other workflows will work if provided a compatible
+:class:`JobCfg`.

-.. dropdown:: source/standalone/workflows/ray/tuner.py
+.. dropdown:: source/standalone/workflows/ray/tuner.py (JobCfg definition)
  :icon: code

  .. literalinclude:: ../../../source/standalone/workflows/ray/tuner.py
    :language: python
-    :emphasize-lines: 18-53
+    :start-at: class JobCfg
+    :end-at: self.cfg = cfg

-For example, see the Cartpole Example configurations.
+For example, see the following Cartpole Example configurations.

 .. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
  :icon: code
@@ -348,14 +371,12 @@ For example, see the Cartpole Example configurations.
    :language: python


-Tuning jobs can also be submitted via ``submit_job.py``
-
 To view the tuning results, view the MLFlow dashboard of the server that you created.
-For KubeRay, this can be done through port forwarding the MLFlow dashboard, with
+For KubeRay, this can be done through port forwarding the MLFlow dashboard with the following.

 ``kubectl port-forward service/isaacray-mlflow 5000:5000``

-and visiting the following address in a browser.
+Then visit the following address in a browser.

 ``localhost:5000``

@@ -366,8 +387,8 @@ this following command.
 --uri http://localhost:5000 --experiment-name IsaacRay-<CLASS_JOB_CFG>-tune --download-dir test``


-**Cluster Cleanup**
-'''''''''''''''''''
+Kubernetes Cluster Cleanup
+''''''''''''''''''''''''''

 For the sake of conserving resources, and potentially freeing precious GPU resources for other people to use
 on shared compute platforms, please destroy the Ray cluster after use. They can be easily
@@ -377,4 +398,5 @@ recreated! For KubeRay clusters, this can be done as follows.

  kubectl get raycluster | egrep 'isaacray' | awk '{print $1}' | xargs kubectl delete raycluster &&
  kubectl get deployments | egrep 'mlflow' | awk '{print $1}' | xargs kubectl delete deployment &&
-  kubectl get services | egrep 'mlflow' | awk '{print $1}' | xargs kubectl delete service
+  kubectl get services | egrep 'mlflow' | awk '{print $1}' | xargs kubectl delete service &&
+  kubectl get services | egrep 'isaacray' | awk '{print $1}' | xargs kubectl delete service
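
The tuning command from the Docker-based quickstart in `ray.rst` can also be assembled programmatically, which may help when the GPU count differs between machines. The following is a minimal sketch and not part of this commit: `build_tuner_command` is a hypothetical helper, and it uses only the paths and flags that appear in the quickstart. It assumes one parallel worker per GPU, as the quickstart comment suggests.

```python
import shlex


def build_tuner_command(num_gpus: int) -> list[str]:
    """Assemble the quickstart tuning command as an argv list (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    ray_dir = "source/standalone/workflows/ray"
    return [
        "./isaaclab.sh", "-p", f"{ray_dir}/tuner.py",
        "--cfg_file", f"{ray_dir}/hyperparameter_tuning/vision_cartpole_cfg.py",
        "--cfg_class", "CartpoleTheiaJobCfg",
        "--run_mode", "local",
        "--workflow", "source/standalone/workflows/rl_games/train.py",
        # One parallel worker per GPU, matching the quickstart comment.
        "--num_workers_per_node", str(num_gpus),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command for a machine with two GPUs.
    print(shlex.join(build_tuner_command(num_gpus=2)))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.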
--- a/source/standalone/workflows/ray/cluster_configs/Dockerfile
+++ b/source/standalone/workflows/ray/cluster_configs/Dockerfile
 FROM isaac-lab-base:latest

+# WGet is needed so that GCS or other cloud providers can mark the container as ready.
+# Otherwise the Ray liveliness checks fail.
+RUN apt-get update && apt-get install wget
+
 # Set NVIDIA paths
 ENV PATH="/usr/local/nvidia/bin:$PATH"
 ENV LD_LIBRARY_PATH="/usr/local/nvidia/lib64"

--- a/source/standalone/workflows/ray/cluster_configs/google_cloud/kuberay.yaml.jinja
+++ b/source/standalone/workflows/ray/cluster_configs/google_cloud/kuberay.yaml.jinja
@@ -19,7 +19,6 @@ spec:
      block: "true"
      dashboard-host: 0.0.0.0
      dashboard-port: "8265"
-      node-ip-address: "0.0.0.0"
      port: "6379"
      include-dashboard: "true"
      ray-debugger-external: "true"
@@ -30,7 +29,7 @@ spec:
      apiVersion: v1
      kind: Service
      metadata:
-        name: head
+        name: {{ name }}-head
      spec:
        type: LoadBalancer
    template:
@@ -130,7 +129,7 @@ spec:
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
-              command: ["/bin/bash", "-c", "ray start --address=head.{{ namespace }}.svc.cluster.local:6379 && tail -f /dev/null"]
+              command: ["/bin/bash", "-c", "ray start --address={{name}}-head.{{ namespace }}.svc.cluster.local:6379 && tail -f /dev/null"]
            - image: fluent/fluent-bit:1.9.6
              name: fluentbit
              resources:
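
The template change above renames the head service from a fixed `head` to `{{ name }}-head`, and the worker `ray start --address=...` line follows suit. A plausible reading is that this lets several clusters share a namespace without their head services colliding, although the commit does not state the motivation. The sketch below only illustrates the resulting in-cluster DNS names; the function names and the cluster names `isaacray-0` and `isaacray-1` are hypothetical.

```python
def head_service_address(name: str, namespace: str, port: int = 6379) -> str:
    """Return the in-cluster DNS address of a Ray head service named '<name>-head'."""
    return f"{name}-head.{namespace}.svc.cluster.local:{port}"


def legacy_head_service_address(namespace: str, port: int = 6379) -> str:
    """Address produced by the old template, where the service was always named 'head'."""
    return f"head.{namespace}.svc.cluster.local:{port}"


if __name__ == "__main__":
    # Two hypothetical clusters in the same namespace now resolve to distinct services.
    for cluster in ("isaacray-0", "isaacray-1"):
        print(head_service_address(cluster, "default"))
    # With the old template, every cluster in a namespace shared this one address.
    print(legacy_head_service_address("default"))
```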

--- a/source/standalone/workflows/ray/grok_cluster_with_kubectl.py
+++ b/source/standalone/workflows/ray/grok_cluster_with_kubectl.py
@@ -21,7 +21,7 @@ Usage:

 .. code-block:: bash

-    ./isaaclab.sh -p source/standalone/workflows/ray/grok_cluster_with_kubectl.py
+    python3 source/standalone/workflows/ray/grok_cluster_with_kubectl.py
    # For options, supply -h arg
 """

@@ -67,9 +67,10 @@ def get_clusters(pods: list, cluster_name_prefix: str) -> set:

        match = re.match(r"(" + re.escape(cluster_name_prefix) + r"[-\w]+)", pod_name)
        if match:
-            # Get base name without head/worker suffix
-            base_name = match.group(1).split("-head")[0].split("-worker")[0]
-            clusters.add(base_name)
+            # Get base name without head/worker suffix (skip workers)
+            if "head" in pod_name:
+                base_name = match.group(1).split("-head")[0]
+                clusters.add(base_name)
    return sorted(clusters)


@@ -90,9 +91,7 @@ def get_mlflow_info(namespace: str = None, cluster_prefix: str = "isaacray") ->
    clusters = get_clusters(pods=pods, cluster_name_prefix=cluster_prefix)
    if len(clusters) > 1:
        raise ValueError("More than one cluster matches prefix, could not automatically determine mlflow info.")
-
-    base_name = cluster_prefix.split("-head")[0].split("-worker")[0]
-    mlflow_name = f"{base_name}-mlflow"
+    mlflow_name = f"{cluster_prefix}-mlflow"

    cmd = ["kubectl", "get", "svc", mlflow_name, "-n", namespace, "--no-headers"]
    try:
@@ -102,7 +101,8 @@ def get_mlflow_info(namespace: str = None, cluster_prefix: str = "isaacray") ->
        # Get cluster IP
        cluster_ip = fields[2]
        port = "5000"  # Default MLflow port
-
+        # This needs to be http to be resolved. HTTPS can't be resolved
+        # This should be fine as it is on a subnet on the cluster regardless
        return f"http://{cluster_ip}:{port}"
    except subprocess.CalledProcessError as e:
        raise ValueError(f"Could not grok MLflow: {e}")  # Fixed f-string
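
To make the effect of the two hunks above concrete, here is a minimal standalone sketch of the new behavior: cluster names are derived from head pods only, and the MLflow URI is built from the cluster IP with plain `http`. It is an illustration under assumptions, not the file's actual code: pods are passed as plain name strings, the `kubectl` call is replaced by a sample output line, and all pod names and IP addresses are made up.

```python
import re


def get_head_clusters(pod_names: list[str], cluster_name_prefix: str) -> list[str]:
    """Return sorted cluster base names, derived from head pods only."""
    clusters = set()
    pattern = re.compile(r"(" + re.escape(cluster_name_prefix) + r"[-\w]+)")
    for pod_name in pod_names:
        match = pattern.match(pod_name)
        # Workers are skipped: only a pod whose name contains "head" names a cluster.
        if match and "head" in pod_name:
            clusters.add(match.group(1).split("-head")[0])
    return sorted(clusters)


def mlflow_uri_from_service_line(line: str, port: str = "5000") -> str:
    """Build the MLflow URI from one `kubectl get svc --no-headers` output line."""
    fields = line.split()
    cluster_ip = fields[2]  # Columns: NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
    return f"http://{cluster_ip}:{port}"


if __name__ == "__main__":
    pods = ["isaacray-0-head-abcde", "isaacray-0-worker-group-xyz", "isaacray-1-head-fghij"]
    print(get_head_clusters(pods, "isaacray"))
    print(mlflow_uri_from_service_line("isaacray-mlflow LoadBalancer 10.0.0.12 34.1.2.3 5000:31000/TCP 5m"))
```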

--- a/source/standalone/workflows/ray/launch.py
+++ b/source/standalone/workflows/ray/launch.py
@@ -8,29 +8,28 @@ import pathlib
 import subprocess
 import yaml

+import util
 from jinja2 import Environment, FileSystemLoader
 from kubernetes import config

-import source.standalone.workflows.ray.util as util
-
 """This script helps create one or more KubeRay clusters.

 Usage:

 .. code-block:: bash
    # If the head node is stuck on container creating, make sure to create a secret
-    ./isaaclab.sh -p source/standalone/workflows/ray/launch.py -h
+    python3 source/standalone/workflows/ray/launch.py -h

    # Examples

    # The following creates 8 GPUx1 nvidia l4 workers
-    ./isaaclab.sh -p source/standalone/workflows/ray/launch.py --cluster_host google_cloud \
+    python3 source/standalone/workflows/ray/launch.py --cluster_host google_cloud \
    --namespace <NAMESPACE> --image <YOUR_ISAAC_RAY_IMAGE> \
    --num_workers 8 --num_clusters 1 --worker_accelerator nvidia-l4 --gpu_per_worker 1

    # The following creates 1 GPUx1 nvidia l4 worker, 2 GPUx2 nvidia-tesla-t4 workers,
    # and 2 GPUx4 nvidia-tesla-t4 GPU workers
-    ./isaaclab.sh -p source/standalone/workflows/ray/launch.py --cluster_host google_cloud \
+    python3 source/standalone/workflows/ray/launch.py --cluster_host google_cloud \
    --namespace <NAMESPACE> --image <YOUR_ISAAC_RAY_IMAGE> \
    --num_workers 1 2 --num_clusters 1 \
    --worker_accelerator nvidia-l4 nvidia-tesla-t4 --gpu_per_worker 1 2 4
@@ -53,7 +52,7 @@ def apply_manifest(args: argparse.Namespace) -> None:
    # Set up Jinja2 environment for loading templates
    templates_dir = RAY_DIR / "cluster_configs" / args.cluster_host
    file_loader = FileSystemLoader(str(templates_dir))
-    jinja_env = Environment(loader=file_loader, keep_trailing_newline=True)
+    jinja_env = Environment(loader=file_loader, keep_trailing_newline=True, autoescape=True)

    # Define template filename
    template_file = "kuberay.yaml.jinja"
@@ -79,6 +78,7 @@ def apply_manifest(args: argparse.Namespace) -> None:

    # Apply the Kubernetes manifest using kubectl
    try:
+        print(cleaned_yaml_string)
        subprocess.run(["kubectl", "apply", "-f", "-"], input=cleaned_yaml_string, text=True, check=True)
    except subprocess.CalledProcessError as e:
        exit(f"An error occurred while running `kubectl`: {e}")

--- a/source/standalone/workflows/ray/submit_job.py
+++ b/source/standalone/workflows/ray/submit_job.py
@@ -40,16 +40,16 @@ Usage:
 .. code-block:: bash

    # Example; submitting a tuning job
-    ./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py \
+    python3 source/standalone/workflows/ray/submit_job.py \
    --aggregate_jobs /workspace/isaaclab/source/standalone/workflows/ray/tuner.py \
        --cfg_file hyperparameter_tuning/vision_cartpole_cfg.py \
-        --cfg_class CartpoleRGBNoTuneJobCfg --mlflow_uri <ML_FLOW_URI>
+        --cfg_class CartpoleTheiaJobCfg --mlflow_uri <ML_FLOW_URI>

    # Example: Submitting resource wrapped job
-    ./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py --aggregate_jobs wrap_resources.py --sub_jobs ./isaaclab.sh -p source/standalone/workflows/rl_games/train.py --task Isaac-Cartpole-v0 --headless+./isaaclab.sh -p source/standalone/workflows/rl_games/train.py --task Isaac-Cartpole-RGB-Camera-Direct-v0 --headless --enable_cameras agent.params.config.max_epochs=150
+    python3 source/standalone/workflows/ray/submit_job.py --aggregate_jobs wrap_resources.py --test

    # For all command line arguments
-    ./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py -h
+    python3 source/standalone/workflows/ray/submit_job.py -h
 """
 script_directory = os.path.dirname(os.path.abspath(__file__))
 CONFIG = {"working_dir": script_directory, "executable": "/workspace/isaaclab/isaaclab.sh -p"}

--- a/source/standalone/workflows/ray/tuner.py
+++ b/source/standalone/workflows/ray/tuner.py
@@ -17,8 +17,9 @@ from ray.tune.search.repeater import Repeater
 """
 This script breaks down an aggregate tuning job, as defined by a hyperparameter sweep configuration,
 into individual jobs (shell commands) to run on the GPU-enabled nodes of the cluster.
-By default, (unless combined as a sub-job in a resource-wrapped aggregate job), one worker is created
-for each GPU-enabled node in the cluster for each individual job.
+By default, one worker is created for each GPU-enabled node in the cluster for each individual job.
+To use more than one worker per node (likely the case for multi-GPU machines), supply the
+num_workers_per_node argument.

 Each hyperparameter sweep configuration should include the workflow,
 runner arguments, and hydra arguments to vary.
@@ -39,16 +40,15 @@ Usage:
    ./isaaclab.sh -p source/standalone/workflows/ray/tuner.py -h

    # Examples
-    # Local (not within a docker container, when within a local docker container, do not supply run_mode argument)
+    # Local
    ./isaaclab.sh -p source/standalone/workflows/ray/tuner.py --run_mode local \
    --cfg_file source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py \
-    --cfg_class CartpoleRGBNoTuneJobCfg
-    # Local docker: start the ray server and run above command in the same running container without run_mode arg
+    --cfg_class CartpoleTheiaJobCfg
    # Remote (run grok cluster or create config file mentioned in :file:`submit_job.py`)
    ./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py \
    --aggregate_jobs tuner.py \
    --cfg_file hyperparameter_tuning/vision_cartpole_cfg.py \
-    --cfg_class CartpoleRGBNoTuneJobCfg --mlflow_uri <MLFLOW_URI_FROM_GROK_OR_MANUAL>
+    --cfg_class CartpoleTheiaJobCfg --mlflow_uri <MLFLOW_URI_FROM_GROK_OR_MANUAL>

 """

@@ -74,7 +74,7 @@ class IsaacLabTuneTrainable(tune.Trainable):
        print(f"[INFO]: Recovered invocation with {self.invoke_cmd}")
        self.experiment = None

-    def reset_config(self, new_config):
+    def reset_config(self, new_config: dict):
        """Allow environments to be re-used by fetching a new invocation command"""
        self.setup(new_config)
        return True
@@ -95,15 +95,15 @@ class IsaacLabTuneTrainable(tune.Trainable):
            self.proc = experiment["proc"]
            self.experiment_name = experiment["experiment_name"]
            self.isaac_logdir = experiment["logdir"]
-            self.tensorboard_logdir = self.isaac_logdir + f"/{self.experiment_name}/summaries"
+            self.tensorboard_logdir = self.isaac_logdir + "/" + self.experiment_name
            self.done = False

        if self.proc is None:
            raise ValueError("Could not start trial.")
-
-        if self.proc.poll() is not None:  # process finished, signal finish
+        proc_status = self.proc.poll()
+        if proc_status is not None:  # process finished, signal finish
            self.data["done"] = True
-            print("[INFO]: Process finished, returning...")
+            print(f"[INFO]: Process finished with {proc_status}, returning...")
        else:  # wait until the logs are ready or fresh
            data = util.load_tensorboard_logs(self.tensorboard_logdir)

@@ -220,10 +220,24 @@ class JobCfg:
    """To be compatible with :meth: invoke_tuning_run and :class:IsaacLabTuneTrainable,
    at a minimum, the tune job should inherit from this class."""

-    def __init__(self, cfg):
+    def __init__(self, cfg: dict):
+        """
+        Runner args include command line arguments passed to the task.
+        For example:
+        cfg["runner_args"]["headless_singleton"] = "--headless"
+        cfg["runner_args"]["enable_cameras_singleton"] = "--enable_cameras"
+        """
        assert "runner_args" in cfg, "No runner arguments specified."
+        """
+        Task is the desired task to train on. For example:
+        cfg["runner_args"]["--task"] = tune.choice(["Isaac-Cartpole-RGB-TheiaTiny-v0"])
+        """
        assert "--task" in cfg["runner_args"], "No task specified."
-        assert "hydra_args" in cfg, "No hypeparameters specified."
+        """
+        Hydra args define the hyperparameters varied within the sweep. For example:
+        cfg["hydra_args"]["agent.params.network.cnn.activation"] = tune.choice(["relu", "elu"])
+        """
+        assert "hydra_args" in cfg, "No hyperparameters specified."
        self.cfg = cfg



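The three assertions in `JobCfg.__init__` imply a minimum layout for the sweep configuration. The following is a minimal sketch of that layout, not the project's implementation: `check_cfg` is a hypothetical helper that only mirrors the assertions, and plain lists stand in for Ray Tune search spaces such as `tune.choice([...])` so the snippet runs without Ray installed.

```python
def check_cfg(cfg: dict) -> dict:
    """Hypothetical helper mirroring the three assertions in JobCfg.__init__."""
    assert "runner_args" in cfg, "No runner arguments specified."
    assert "--task" in cfg["runner_args"], "No task specified."
    assert "hydra_args" in cfg, "No hyperparameters specified."
    return cfg


# Plain lists are stand-ins (an assumption of this sketch) for tune.choice([...]).
cfg = {
    "runner_args": {
        # Flags without a value, passed to the task as-is
        "headless_singleton": "--headless",
        "enable_cameras_singleton": "--enable_cameras",
        # The task to train on
        "--task": ["Isaac-Cartpole-RGB-TheiaTiny-v0"],
    },
    "hydra_args": {
        # Hyperparameters varied within the sweep
        "agent.params.network.cnn.activation": ["relu", "elu"],
    },
}

checked = check_cfg(cfg)

# A configuration without hydra_args is rejected.
try:
    check_cfg({"runner_args": {"--task": ["Isaac-Cartpole-RGB-TheiaTiny-v0"]}})
    rejected = False
except AssertionError:
    rejected = True
```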
--- a/source/standalone/workflows/ray/util.py
+++ b/source/standalone/workflows/ray/util.py
@@ -6,15 +6,18 @@ import argparse
 import os
 import re
 import subprocess
+import threading
 from datetime import datetime
 from math import isclose

 import ray
+from tensorboard.backend.event_processing.directory_watcher import DirectoryDeletedError
 from tensorboard.backend.event_processing.event_accumulator import EventAccumulator


 def load_tensorboard_logs(directory: str) -> dict:
-    """From a tensorboard directory, get the latest scalar values.
+    """From a tensorboard directory, get the latest scalar values. If no scalars are
+    found there, fall back to the ``summaries`` subdirectory.

    Args:
        directory: The directory of the tensorboard logging.
@@ -22,19 +25,23 @@ def load_tensorboard_logs(directory: str) -> dict:
    Returns:
        The latest available scalar values.
    """
+
    # Initialize the event accumulator with a size guidance for only the latest entry
-    size_guidance = {"scalars": 1}  # Load only the latest entry for scalars
-    event_acc = EventAccumulator(directory, size_guidance=size_guidance)
-    event_acc.Reload()  # Load all data from the directory
+    def get_latest_scalars(path: str) -> dict:
+        event_acc = EventAccumulator(path, size_guidance={"scalars": 1})
+        try:
+            event_acc.Reload()
+            if event_acc.Tags()["scalars"]:
+                return {
+                    tag: event_acc.Scalars(tag)[-1].value
+                    for tag in event_acc.Tags()["scalars"]
+                    if event_acc.Scalars(tag)
+                }
+        except (KeyError, OSError, RuntimeError, DirectoryDeletedError):
+            return {}

-    # Extract the latest scalars logged
-    latest_scalars = {}
-    for tag in event_acc.Tags()["scalars"]:
-        events = event_acc.Scalars(tag)
-        if events:  # Check if there is at least one entry
-            latest_event = events[-1]  # Get the latest entry
-            latest_scalars[tag] = latest_event.value
-    return latest_scalars
+    scalars = get_latest_scalars(directory)
+    return scalars or get_latest_scalars(os.path.join(directory, "summaries"))


 def get_invocation_command_from_cfg(
@@ -190,47 +197,62 @@ def execute_job(
        experiment_info_pattern = re.compile("Exact experiment name requested from command line: (.+)")
        logdir_pattern = re.compile(r"\[INFO\] Logging experiment in directory: (.+)$")
        err_pattern = re.compile("There was an error (.+)$")
-        with process.stdout as stdout:
-            for line in iter(stdout.readline, ""):
+
+        def stream_reader(stream, identifier_string, result_details):
+            for line in iter(stream.readline, ""):
                line = line.strip()
-                result_details.append(f"{identifier_string}: {line} \n")
+                result_details.append(f"{identifier_string}: {line}\n")
                if log_all_output:
                    print(f"{identifier_string}: {line}")

-                if extract_experiment:
-                    exp_match = experiment_info_pattern.search(line)
-                    log_match = logdir_pattern.search(line)
-                    err_match = err_pattern.search(line)
-                    if err_match:
-                        raise ValueError(f"Encountered an error during trial run. {' '.join(result_details)}")
-
-                    if exp_match:
-                        experiment_name = exp_match.group(1)
-                    if log_match:
-                        logdir = log_match.group(1)
-
-                    if experiment_name and logdir:
-                        result = {
-                            "experiment_name": experiment_name,
-                            "logdir": logdir,
-                            "proc": process,
-                            "result": " ".join(result_details),
-                        }
-                        return result
-
-        with process.stderr as stderr:
-            for line in iter(stderr.readline, ""):
-                line = line.strip()
-                result_details.append(f"{identifier_string}: {line}")
+        # Read stdout until the experiment name and log directory are found.
+        # Keep draining both pipes afterwards so the process does not exit early with error 141.
+        for line in iter(process.stdout.readline, ""):
+            line = line.strip()
+            result_details.append(f"{identifier_string}: {line}\n")
+            if log_all_output:
                print(f"{identifier_string}: {line}")

-        process.wait()  # Wait for the subprocess to finish naturally if not exited early
-
-    now = datetime.now().strftime("%H:%M:%S.%f")
-    completion_info = f"\n[INFO]: {identifier_string}: Job Started at {start_time}, completed at {now}\n"
-    print(completion_info)
-    result_details.append(completion_info)
-    return " ".join(result_details)
+            if extract_experiment:
+                exp_match = experiment_info_pattern.search(line)
+                log_match = logdir_pattern.search(line)
+                err_match = err_pattern.search(line)
+
+                if err_match:
+                    raise ValueError(f"Encountered an error during trial run. {' '.join(result_details)}")
+
+                if exp_match:
+                    experiment_name = exp_match.group(1)
+                if log_match:
+                    logdir = log_match.group(1)
+
+                if experiment_name and logdir:
+                    # Start stderr reader after finding experiment info
+                    stderr_thread = threading.Thread(
+                        target=stream_reader, args=(process.stderr, identifier_string, result_details)
+                    )
+                    stderr_thread.daemon = True
+                    stderr_thread.start()
+
+                    # Keep reading stdout in the background so its pipe buffer does not fill up
+                    stdout_thread = threading.Thread(
+                        target=stream_reader, args=(process.stdout, identifier_string, result_details)
+                    )
+                    stdout_thread.daemon = True
+                    stdout_thread.start()
+
+                    return {
+                        "experiment_name": experiment_name,
+                        "logdir": logdir,
+                        "proc": process,
+                        "result": " ".join(result_details),
+                    }
+        process.wait()
+        now = datetime.now().strftime("%H:%M:%S.%f")
+        completion_info = f"\n[INFO]: {identifier_string}: Job Started at {start_time}, completed at {now}\n"
+        print(completion_info)
+        result_details.append(completion_info)
+        return " ".join(result_details)


 def get_gpu_node_resources(

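The change to `execute_job` reads stdout synchronously until the experiment info appears, then hands both pipes to daemon threads so the child process can keep writing. The sketch below shows the same pattern in isolation under stated assumptions: `drain` and the inline child program are hypothetical stand-ins for `stream_reader` and a training run. Exit status 141 is the shell convention for a process terminated by SIGPIPE (128 + 13), which can occur when a child keeps writing to a pipe that its reader has closed.

```python
import subprocess
import sys
import threading


def drain(stream, sink):
    """Read a text stream line by line until EOF, collecting stripped lines."""
    for line in iter(stream.readline, ""):
        sink.append(line.strip())


# Hypothetical child: announces itself, then writes more than a pipe buffer typically holds.
child_code = (
    "import sys\n"
    "print('ready', flush=True)\n"
    "for i in range(20000):\n"
    "    print('out', i)\n"
    "    print('err', i, file=sys.stderr)\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True,
)

# Read stdout synchronously only until the line of interest appears.
first_line = proc.stdout.readline().strip()

# Hand both pipes to daemon threads so the child never blocks on a full buffer.
out_lines, err_lines = [], []
threads = [
    threading.Thread(target=drain, args=(proc.stdout, out_lines), daemon=True),
    threading.Thread(target=drain, args=(proc.stderr, err_lines), daemon=True),
]
for thread in threads:
    thread.start()

return_code = proc.wait()
for thread in threads:
    thread.join()

print(first_line, return_code, len(out_lines), len(err_lines))
```

Unlike this sketch, `execute_job` returns without joining the threads, since the caller polls the returned process handle later; marking the threads as daemons keeps them from blocking interpreter shutdown.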
--- a/source/standalone/workflows/ray/wrap_resources.py
+++ b/source/standalone/workflows/ray/wrap_resources.py
@@ -6,12 +6,11 @@
 import argparse

 import ray
+import util
 from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

-import source.standalone.workflows.ray.util as util
-
 """
-This script dispatches sub-job(s) (either individual jobs or tuning aggregate jobs)
+This script dispatches sub-job(s) (individual jobs; use :file:`tuner.py` for tuning jobs)
 to worker(s) on GPU-enabled node(s) of a specific cluster as part of an resource-wrapped aggregate
 job. If no desired compute resources for each sub-job are specified,
 this script creates one worker per available node for each node with GPU(s) in the cluster.

--- a/source/standalone/workflows/rsl_rl/train.py
+++ b/source/standalone/workflows/rsl_rl/train.py
@@ -93,6 +93,8 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
    print(f"[INFO] Logging experiment in directory: {log_root_path}")
    # specify directory for logging runs: {time-stamp}_{run_name}
    log_dir = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
+    # Print the experiment name so that the Ray Tune workflow can extract it.
+    print(f"Exact experiment name requested from command line: {log_dir}")
    if agent_cfg.run_name:
        log_dir += f"_{agent_cfg.run_name}"
    log_dir = os.path.join(log_root_path, log_dir)

--- a/source/standalone/workflows/sb3/train.py
+++ b/source/standalone/workflows/sb3/train.py
@@ -89,7 +89,11 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
    env_cfg.sim.device = args_cli.device if args_cli.device is not None else env_cfg.sim.device

    # directory for logging into
-    log_dir = os.path.join("logs", "sb3", args_cli.task, datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
+    run_info = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
+    log_root_path = os.path.abspath(os.path.join("logs", "sb3", args_cli.task))
+    print(f"[INFO] Logging experiment in directory: {log_root_path}")
+    print(f"Exact experiment name requested from command line: {run_info}")
+    log_dir = os.path.join(log_root_path, run_info)
    # dump the configuration into log-directory
    dump_yaml(os.path.join(log_dir, "params", "env.yaml"), env_cfg)
    dump_yaml(os.path.join(log_dir, "params", "agent.yaml"), agent_cfg)

--- a/source/standalone/workflows/skrl/train.py
+++ b/source/standalone/workflows/skrl/train.py
@@ -135,6 +135,7 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
    print(f"[INFO] Logging experiment in directory: {log_root_path}")
    # specify directory for logging runs: {time-stamp}_{run_name}
    log_dir = datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + f"_{algorithm}_{args_cli.ml_framework}"
+    print(f"Exact experiment name requested from command line: {log_dir}")
    if agent_cfg["agent"]["experiment"]["experiment_name"]:
        log_dir += f'_{agent_cfg["agent"]["experiment"]["experiment_name"]}'
    # set directory into agent config
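The print statements added to the training scripts exist so that `execute_job` in `util.py` can recover the log directory and experiment name from the process output. The sketch below replays that extraction on made-up output lines; the two patterns are copied from `util.py`, while the path and timestamp are hypothetical. The match is literal, so the colon after "command line" is required.

```python
import re

# Patterns copied from execute_job in util.py.
experiment_info_pattern = re.compile("Exact experiment name requested from command line: (.+)")
logdir_pattern = re.compile(r"\[INFO\] Logging experiment in directory: (.+)$")

# Hypothetical training output; the path and timestamp are made up for illustration.
sample_output = [
    "[INFO] Logging experiment in directory: /tmp/logs/sb3/Isaac-Cartpole-v0",
    "Exact experiment name requested from command line: 2025-01-01_12-00-00",
    "Exact experiment name requested from command line 2025-01-01_12-00-00",  # no colon
]

experiment_name, logdir = None, None
unmatched = []
for line in sample_output:
    exp_match = experiment_info_pattern.search(line)
    log_match = logdir_pattern.search(line)
    if exp_match:
        experiment_name = exp_match.group(1)
    elif log_match:
        logdir = log_match.group(1)
    else:
        unmatched.append(line)

# The tuner joins both values to locate the tensorboard logs.
tensorboard_logdir = logdir + "/" + experiment_name
print(tensorboard_logdir)
print(unmatched)
```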