Commit ea766783 authored by garylvov, committed by GitHub

Clarifies Ray Documentation and Fixes Minor Issues (#1717)

# Description
This PR clarifies the Ray documentation and fixes some small issues in
the code.

- Moved the local setup to a simpler Docker-based workflow
- Added `wget` to the Docker container to fix an issue on GKE where the
Ray head failed its health check and the workers would not start
- Removed redundant information from the documentation
- Added a local quickstart
- Investigated whether HTTPS MLflow was possible and added a flag to the
Jinja environment (@kellyguo11)
- Improved compatibility with other workflows to address #1703
- Avoided an early exit due to buffer overflow to address #1703 (thank you
@giulioturrisi for helping find this)

## Type of change


- Bug fix (non-breaking change which fixes an issue)
- This change requires a documentation update


![image](https://github.com/user-attachments/assets/eb38b3c8-8e9c-438d-9218-8b0662146f96)


## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there


---------
Signed-off-by: garylvov <67614381+garylvov@users.noreply.github.com>
Co-authored-by: Gary Lvov <glvov@theaiinstitute.com>
parent 21173c3e
......@@ -20,33 +20,33 @@ the general workflow is the same.
:depth: 3
:local:
Overview
--------
**Overview**
------------
The Ray integration is useful for the following:
The Ray integration is useful for the following.
- Dispatching several training jobs in parallel or sequentially with minimal interaction
- Tuning hyperparameters; in parallel or sequentially with support for multiple GPUs and/or multiple GPU Nodes
- Using the same training setup everywhere (on cloud and local) with minimal overhead
- Resource Isolation for training jobs
- Dispatching several training jobs in parallel or sequentially with minimal interaction.
- Tuning hyperparameters in parallel or sequentially, with support for multiple GPUs and/or multiple GPU nodes.
- Using the same training setup everywhere (on cloud and local) with minimal overhead.
- Resource Isolation for training jobs (resource-wrapped jobs).
The core functionality of the Ray workflow consists of two main scripts that enable the orchestration
of resource-wrapped and tuning aggregate jobs. These scripts facilitate the decomposition of
aggregate jobs (overarching experiments) into individual jobs, which are discrete commands
executed on the cluster. An aggregate job can include multiple individual jobs.
For clarity, this guide refers to the jobs one layer below the topmost aggregate level as sub-jobs.
of resource-wrapped and tuning aggregate jobs. In resource-wrapped aggregate jobs, each sub-job and its
resource requirements are defined manually, enabling resource isolation.
For tuning aggregate jobs, individual jobs are generated automatically based on a hyperparameter
sweep configuration.
Both resource-wrapped and tuning aggregate jobs dispatch individual jobs to a designated Ray
cluster, which leverages the cluster's resources (e.g., a single workstation node or multiple nodes)
to execute these jobs with workers in parallel and/or sequentially. By default, aggregate jobs use all \
to execute these jobs with workers in parallel and/or sequentially.
By default, jobs use all \
available resources on each available GPU-enabled node for each sub-job worker. This can be changed through
specifying the ``--num_workers`` argument, especially critical for parallel aggregate
job processing on local or virtual multi-GPU machines
specifying the ``--num_workers`` argument for resource-wrapped jobs, or ``--num_workers_per_node``
for tuning jobs, which is especially critical for parallel aggregate
job processing on local/virtual multi-GPU machines. Tuning jobs assume homogeneous node resource composition for nodes with GPUs.
In resource-wrapped aggregate jobs, each sub-job and its
resource requirements are defined manually, enabling resource isolation.
For tuning aggregate jobs, individual jobs are generated automatically based on a hyperparameter
sweep configuration. This assumes homogeneous node resource composition for nodes with GPUs.
The two following files contain the core functionality of the Ray integration.
.. dropdown:: source/standalone/workflows/ray/wrap_resources.py
:icon: code
......@@ -66,7 +66,7 @@ sweep configuration. This assumes homogeneous node resource composition for node
The following script can be used to submit aggregate
jobs to one or more Ray cluster(s), which can be used for
running jobs on a remote cluster or simultaneous jobs with heterogeneous
resource requirements:
resource requirements.
.. dropdown:: source/standalone/workflows/ray/submit_job.py
:icon: code
......@@ -75,7 +75,7 @@ resource requirements:
:language: python
:emphasize-lines: 12-53
The following script can be used to extract KubeRay Cluster information for aggregate job submission.
The following script can be used to extract KubeRay cluster information for aggregate job submission.
.. dropdown:: source/standalone/workflows/ray/grok_cluster_with_kubectl.py
:icon: code
......@@ -93,106 +93,56 @@ The following script can be used to easily create clusters on Google GKE.
:language: python
:emphasize-lines: 16-37
**Installation**
----------------
The Ray functionality requires additional dependencies be installed.
To use Ray without Kubernetes, like on a local computer or VM,
``kubectl`` is not required. For use on Kubernetes clusters with KubeRay,
such as Google Kubernetes Engine or Amazon Elastic Kubernetes Service, ``kubectl`` is required, and can
be installed via the `Kubernetes website <https://kubernetes.io/docs/tasks/tools/>`_
The pythonic dependencies can be installed with:
**Docker-based Local Quickstart**
-----------------------------------
.. code-block:: bash
First, follow the `Docker Guide <https://isaac-sim.github.io/IsaacLab/main/source/deployment/docker.html>`_
to set up the NVIDIA Container Toolkit and Docker Compose.
# For multi-run support and resource isolation
./isaaclab.sh -p -m pip install ray[default]==2.31.0
# For hyperparameter tuning
./isaaclab.sh -p -m pip install ray[tune]==2.31.0
./isaaclab.sh -p -m pip install optuna bayesian-optimization
# MLFlow is needed only for fetching logs on clusters, not needed for local
./isaaclab.sh -p -m pip install mlflow
If using KubeRay clusters on Google GKE with the batteries-included cluster launch file,
the following dependencies are also needed.
.. code-block:: bash
./isaaclab.sh -p -m pip install kubernetes Jinja2
**Setup Overview: Cluster Configuration**
-----------------------------------------
Select one of the following methods to create a Ray Cluster to accept and execute dispatched jobs.
Single-Node Ray Cluster (Recommended for Beginners)
'''''''''''''''''''''''''''''''''''''''''''''''''''
For use on a single machine (node) such as a local computer or VM, the
following command can be used start a ray server. This is compatible with
multiple-GPU machines. This Ray server will run indefinitely until it is stopped with ``CTRL + C``
Then, run the following steps to start a tuning run.
.. code-block:: bash
# Build the base image, but we don't need to run it
python3 docker/container.py start && python3 docker/container.py stop
# Build the tuning image with extra deps
docker build -t isaacray -f source/standalone/workflows/ray/cluster_configs/Dockerfile .
# Start the tuning image - symlink so that changes in the source folder show up in the container
docker run -v $(pwd)/source:/workspace/isaaclab/source -it --gpus all --net=host --entrypoint /bin/bash isaacray
# Start the Ray server within the tuning image
echo "import ray; ray.init(); import time; [time.sleep(10) for _ in iter(int, 1)]" | ./isaaclab.sh -p
KubeRay Clusters
''''''''''''''''
.. attention::
The ``ray`` command should be modified to use Isaac python, which could be achieved in a fashion similar to
``sed -i "1i $(echo "#!/workspace/isaaclab/_isaac_sim/python.sh")" \
/isaac-sim/kit/python/bin/ray && ln -s /isaac-sim/kit/python/bin/ray /usr/local/bin/ray``.
Google Cloud is currently the only platform tested, although
any cloud provider should work if one configures the following:
- A container registry (NGC, GCS artifact registry, AWS ECR, etc.) with
an Isaac Lab image configured to support Ray. See ``cluster_configs/Dockerfile`` to see how to modify the ``isaac-lab-base``
container for Ray compatibility. Ray should use the isaac sim python shebang, and ``nvidia-smi``
should work within the container. Be careful with the setup here as
paths need to be configured correctly for everything to work. It's likely that
the example dockerfile will work out of the box and can be pushed to the registry, as
long as the base image has already been built as in the container guide
- A Kubernetes setup with available NVIDIA RTX (likely ``l4`` or ``l40`` or ``tesla-t4`` or ``a10``) GPU-passthrough node-pool resources,
that has access to your container registry/storage bucket and has the Ray operator enabled with correct IAM
permissions. This can be easily achieved with services such as Google GKE or AWS EKS,
provided that your account or organization has been granted a GPU-budget. It is recommended
to use manual kubernetes services as opposed to "autopilot" services for cost-effective
experimentation as this way clusters can be completely shut down when not in use, although
this may require installing the `Nvidia GPU Operator <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html>`_
- An MLFlow server that your cluster has access to.
- A ``kuberay.yaml.ninja`` file that describes how to allocate resources (already included for
Google Cloud, which can be referenced for the format and MLFlow integration)
Ray Clusters (Without Kubernetes)
'''''''''''''''''''''''''''''''''
.. attention::
Modify the Ray command to use Isaac Python like in KubeRay Clusters, and follow the same
steps for creating an image/cluster permissions/bucket access.
See the `Ray Clusters Overview <https://docs.ray.io/en/latest/cluster/getting-started.html>`_ or
`Anyscale <https://www.anyscale.com/product>`_ for more information
In a different terminal, run the following.
**Dispatching Jobs and Tuning**
-------------------------------
Select one of the following guides that matches your desired Cluster configuration.
.. code-block:: bash
Simple Ray Cluster (Local/VM)
'''''''''''''''''''''''''''''
# In a new terminal (don't close the above), enter the image with a new shell.
docker container ps
docker exec -it <ISAAC_RAY_IMAGE_ID_FROM_CONTAINER_PS> /bin/bash
# Start a tuning run, with one parallel worker per GPU
./isaaclab.sh -p source/standalone/workflows/ray/tuner.py \
--cfg_file source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleTheiaJobCfg \
--run_mode local \
--workflow source/standalone/workflows/rl_games/train.py \
--num_workers_per_node <NUMBER_OF_GPUS_IN_COMPUTER>
This guide assumes that there is a Ray cluster already running, and that this script is run locally on the cluster, or
that the cluster job submission address is known.
1.) Testing that the cluster works can be done as follows.
To view the training logs, in a different terminal, run the following and visit ``localhost:6006`` in a browser afterwards.
.. code-block:: bash
./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py --test
# In a new terminal (don't close the above), enter the image with a new shell.
docker container ps
docker exec -it <ISAAC_RAY_IMAGE_ID_FROM_CONTAINER_PS> /bin/bash
# Start TensorBoard to view the training logs
tensorboard --logdir=.
2.) Submitting resource-wrapped sub-jobs can be done as described in the following file:
Submitting resource-wrapped individual jobs instead of automatic tuning runs is described in the following file.
.. dropdown:: source/standalone/workflows/ray/wrap_resources.py
:icon: code
......@@ -201,13 +151,28 @@ that the cluster job submission address is known.
:language: python
:emphasize-lines: 14-66
3.) For tuning jobs, specify the hyperparameter sweep similar to the following two files.
Transferring files from the running container can be done as follows.
.. code-block:: bash
docker container ps
docker cp <ISAAC_RAY_IMAGE_ID_FROM_CONTAINER_PS>:</path/in/container/file> </path/on/host/>
.. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cfg.py
For tuning jobs, specify the tuning job / hyperparameter sweep as a child class of :class:`JobCfg`.
The included :class:`JobCfg` only supports the ``rl_games`` workflow due to differences in
environment entrypoints and hydra arguments, although other workflows will work if provided a compatible
:class:`JobCfg`.
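
As a rough illustration of the paragraph above, the sketch below shows the general shape of a child configuration class. It is not the repository's code: the `JobCfg` stand-in mirrors only the constructor visible in `tuner.py`, while `ExampleSweepJobCfg`, the `runner_args` and `hydra_args` key names, and the plain Python lists used in place of Ray Tune search-space objects are all assumptions made for the example.

```python
from typing import Optional


class JobCfg:
    """Stand-in that mirrors only the constructor shown in tuner.py."""

    def __init__(self, cfg: dict):
        self.cfg = cfg


class ExampleSweepJobCfg(JobCfg):
    """Hypothetical sweep; key names and values are illustrative assumptions."""

    def __init__(self, cfg: Optional[dict] = None):
        cfg = dict(cfg or {})
        # Arguments passed unchanged to every training run (assumed key name).
        cfg.setdefault("runner_args", {"headless": "--headless"})
        # Hydra overrides to vary between runs (assumed key name). A real
        # configuration would likely use Ray Tune search spaces here.
        cfg.setdefault("hydra_args", {"agent.params.config.max_epochs": [100, 150]})
        super().__init__(cfg)


job_cfg = ExampleSweepJobCfg()
print(job_cfg.cfg)
```

A compatible class for another workflow would presumably follow the same pattern while supplying that workflow's own entrypoint and Hydra argument names.
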
.. dropdown:: source/standalone/workflows/ray/tuner.py (JobCfg definition)
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/hyperparameter_tuning/vision_cfg.py
.. literalinclude:: ../../../source/standalone/workflows/ray/tuner.py
:language: python
:start-at: class JobCfg
:end-at: self.cfg = cfg
For example, see the following Cartpole Example configurations.
.. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
:icon: code
......@@ -215,42 +180,87 @@ that the cluster job submission address is known.
.. literalinclude:: ../../../source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
:language: python
Then, see the local examples in the following file to see how to start a tuning run.
.. dropdown:: source/standalone/workflows/ray/tuner.py
:icon: code
**Remote Clusters**
-------------------------
.. literalinclude:: ../../../source/standalone/workflows/ray/tuner.py
:language: python
:emphasize-lines: 18-53
Select one of the following methods to create a Ray cluster to accept and execute dispatched jobs.
KubeRay Setup
~~~~~~~~~~~~~
If using KubeRay clusters on Google GKE with the batteries-included cluster launch file,
the following dependencies are also needed.
.. code-block:: bash
python3 -m pip install kubernetes Jinja2
For use on Kubernetes clusters with KubeRay,
such as Google Kubernetes Engine or Amazon Elastic Kubernetes Service, ``kubectl`` is required, and can
be installed via the `Kubernetes website <https://kubernetes.io/docs/tasks/tools/>`_ .
To view the logs, simply run ``tensorboard --logdir=<LOCAL_STORAGE_PATH_READ_FROM_OUTPUT>``
Google Cloud is currently the only platform tested, although
any cloud provider should work if one configures the following.
Remote Ray Cluster Setup and Use
'''''''''''''''''''''''''''''''''
This guide assumes that one desires to create a cluster on a remote host or server. This
guide includes shared steps, and KubeRay or Ray specific steps. Follow all shared steps (part I and II), and then
only the KubeRay or Ray steps depending on your desired configuration, in order of shared steps part I, then
the configuration specific steps, then shared steps part II.
.. attention::
The ``ray`` command should be modified to use Isaac python, which could be achieved in a fashion similar to
``sed -i "1i $(echo "#!/workspace/isaaclab/_isaac_sim/python.sh")" \
/isaac-sim/kit/python/bin/ray && ln -s /isaac-sim/kit/python/bin/ray /usr/local/bin/ray``.
- A container registry (NGC, GCS artifact registry, AWS ECR, etc.) with
an Isaac Lab image configured to support Ray. See ``cluster_configs/Dockerfile`` to see how to modify the ``isaac-lab-base``
container for Ray compatibility. Ray should use the isaac sim python shebang, and ``nvidia-smi``
should work within the container. Be careful with the setup here as
paths need to be configured correctly for everything to work. It's likely that
the example dockerfile will work out of the box and can be pushed to the registry, as
long as the base image has already been built as in the container guide.
- A Kubernetes setup with available NVIDIA RTX (likely ``l4`` or ``l40`` or ``tesla-t4`` or ``a10``) GPU-passthrough node-pool resources,
that has access to your container registry/storage bucket and has the Ray operator enabled with correct IAM
permissions. This can be easily achieved with services such as Google GKE or AWS EKS,
provided that your account or organization has been granted a GPU-budget. It is recommended
to use manual kubernetes services as opposed to "autopilot" services for cost-effective
experimentation as this way clusters can be completely shut down when not in use, although
this may require installing the `Nvidia GPU Operator <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html>`_ .
- An `MLFlow server <https://mlflow.org/docs/latest/getting-started/logging-first-model/step1-tracking-server.html>`_ that your cluster has access to
(already included for Google Cloud, which can be referenced for the format and MLFlow integration).
- A ``kuberay.yaml.jinja`` file that describes how to allocate resources (already included for
Google Cloud, which can be referenced for the format and MLFlow integration).
Ray Clusters (Without Kubernetes) Setup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. attention::
Modify the Ray command to use Isaac Python like in KubeRay clusters, and follow the same
steps for creating an image/cluster permissions.
See the `Ray Clusters Overview <https://docs.ray.io/en/latest/cluster/getting-started.html>`_ or
`Anyscale <https://www.anyscale.com/product>`_ for more information.
Also, create an `MLFlow server <https://mlflow.org/docs/latest/getting-started/logging-first-model/step1-tracking-server.html>`_ that your local
host and cluster have access to.
Shared Steps Between KubeRay and Pure Ray Part I
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.) Build the Isaac Ray image, and upload it to your container registry of choice.
1.) Install Ray on your local machine.
.. code-block:: bash
python3 -m pip install ray[default]==2.31.0
2.) Build the Isaac Ray image, and upload it to your container registry of choice.
.. code-block:: bash
# Login with NGC (nvcr.io) registry first, see docker steps in repo.
./isaaclab.sh -p docker/container.py start
python3 docker/container.py start
# Build the special Isaac Lab Ray Image
docker build -t <REGISTRY/IMAGE_NAME> -f source/standalone/workflows/ray/cluster_configs/Dockerfile .
# Push the image to your registry of choice.
docker push <REGISTRY/IMAGE_NAME>
KubeRay Specific
~~~~~~~~~~~~~~~~
KubeRay Clusters Only
~~~~~~~~~~~~~~~~~~~~~
`k9s <https://github.com/derailed/k9s>`_ is a great tool for monitoring your clusters that can
easily be installed with ``snap install k9s --devmode``.
......@@ -273,11 +283,7 @@ easily be installed with ``snap install k9s --devmode``.
2.) Create the KubeRay cluster and an MLFlow server for receiving logs
that your cluster has access to.
This can be done automatically for Google GKE,
where instructions are included in the following creation file. More than once cluster
can be created at once. Each cluster can have heterogeneous resources if so desired,
although only
For other cloud services, the ``kuberay.yaml.ninja`` will be similar to that of
Google's.
where instructions are included in the following creation file.
.. dropdown:: source/standalone/workflows/ray/launch.py
:icon: code
......@@ -286,6 +292,18 @@ Google's.
:language: python
:emphasize-lines: 15-37
For other cloud services, the ``kuberay.yaml.jinja`` file will be similar to
Google's.
.. dropdown:: source/standalone/workflows/ray/cluster_configs/google_cloud/kuberay.yaml.jinja
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/cluster_configs/google_cloud/kuberay.yaml.jinja
:language: python
3.) Fetch the KubeRay cluster IP addresses, and the MLFLow Server IP.
This can be done automatically for KubeRay clusters,
where instructions are included in the following fetching file.
......@@ -299,8 +317,8 @@ printed.
:language: python
:emphasize-lines: 14-26
Ray Specific
~~~~~~~~~~~~
Ray Clusters Only (Without Kubernetes)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.) Verify cluster access.
......@@ -311,17 +329,18 @@ a new line for each unique cluster. For one cluster, there should only be one li
3.) Start an MLFLow Server to receive the logs that the ray cluster has access to,
and determine the server URI.
Shared Steps Between KubeRay and Pure Ray Part II
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dispatching Steps Shared Between KubeRay and Pure Ray Part II
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.) Test that your cluster is operational with the following.
.. code-block:: bash
# Test that NVIDIA GPUs are visible and that Ray is operational with the following command:
./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py
--jobs wrap_resources.py --test
python3 source/standalone/workflows/ray/submit_job.py --aggregate_jobs wrap_resources.py --test
2.) Submitting Jobs can be done in the following manner, with the following script.
2.) Submitting tuning and/or resource-wrapped jobs is described in the :file:`submit_job.py` file.
.. dropdown:: source/standalone/workflows/ray/submit_job.py
:icon: code
......@@ -330,16 +349,20 @@ Shared Steps Between KubeRay and Pure Ray Part II
:language: python
:emphasize-lines: 12-53
3.) For tuning jobs, specify the hyperparameter sweep similar to :class:`RLGamesCameraJobCfg` in the following file:
3.) For tuning jobs, specify the tuning job / hyperparameter sweep as a :class:`JobCfg` .
The included :class:`JobCfg` only supports the ``rl_games`` workflow due to differences in
environment entrypoints and hydra arguments, although other workflows will work if provided a compatible
:class:`JobCfg`.
.. dropdown:: source/standalone/workflows/ray/tuner.py
.. dropdown:: source/standalone/workflows/ray/tuner.py (JobCfg definition)
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/tuner.py
:language: python
:emphasize-lines: 18-53
:start-at: class JobCfg
:end-at: self.cfg = cfg
For example, see the Cartpole Example configurations.
For example, see the following Cartpole Example configurations.
.. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
:icon: code
......@@ -348,14 +371,12 @@ For example, see the Cartpole Example configurations.
:language: python
Tuning jobs can also be submitted via ``submit_job.py``
To view the tuning results, view the MLFlow dashboard of the server that you created.
For KubeRay, this can be done through port forwarding the MLFlow dashboard, with
For KubeRay, this can be done through port forwarding the MLFlow dashboard with the following.
``kubectl port-forward service/isaacray-mlflow 5000:5000``
and visiting the following address in a browser.
Then visit the following address in a browser.
``localhost:5000``
......@@ -366,8 +387,8 @@ this following command.
--uri http://localhost:5000 --experiment-name IsaacRay-<CLASS_JOB_CFG>-tune --download-dir test``
**Cluster Cleanup**
'''''''''''''''''''
Kubernetes Cluster Cleanup
''''''''''''''''''''''''''
For the sake of conserving resources, and potentially freeing precious GPU resources for other people to use
on shared compute platforms, please destroy the Ray cluster after use. They can be easily
......@@ -377,4 +398,5 @@ recreated! For KubeRay clusters, this can be done as follows.
kubectl get raycluster | egrep 'isaacray' | awk '{print $1}' | xargs kubectl delete raycluster &&
kubectl get deployments | egrep 'mlflow' | awk '{print $1}' | xargs kubectl delete deployment &&
kubectl get services | egrep 'mlflow' | awk '{print $1}' | xargs kubectl delete service
kubectl get services | egrep 'mlflow' | awk '{print $1}' | xargs kubectl delete service &&
kubectl get services | egrep 'isaacray' | awk '{print $1}' | xargs kubectl delete service
FROM isaac-lab-base:latest
# wget is needed so that GCS or other cloud providers can mark the container as ready.
# Otherwise the Ray liveness checks fail.
RUN apt-get update && apt-get install -y wget
# Set NVIDIA paths
ENV PATH="/usr/local/nvidia/bin:$PATH"
ENV LD_LIBRARY_PATH="/usr/local/nvidia/lib64"
......
......@@ -19,7 +19,6 @@ spec:
block: "true"
dashboard-host: 0.0.0.0
dashboard-port: "8265"
node-ip-address: "0.0.0.0"
port: "6379"
include-dashboard: "true"
ray-debugger-external: "true"
......@@ -30,7 +29,7 @@ spec:
apiVersion: v1
kind: Service
metadata:
name: head
name: {{ name }}-head
spec:
type: LoadBalancer
template:
......@@ -130,7 +129,7 @@ spec:
volumeMounts:
- mountPath: /tmp/ray
name: ray-logs
command: ["/bin/bash", "-c", "ray start --address=head.{{ namespace }}.svc.cluster.local:6379 && tail -f /dev/null"]
command: ["/bin/bash", "-c", "ray start --address={{name}}-head.{{ namespace }}.svc.cluster.local:6379 && tail -f /dev/null"]
- image: fluent/fluent-bit:1.9.6
name: fluentbit
resources:
......
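
The template change above prefixes the head service with the cluster name, so the workers' `ray start --address=...` target becomes unique per cluster. The following minimal sketch shows how that address is composed under the standard Kubernetes service DNS pattern (`<service>.<namespace>.svc.cluster.local`). The helper name `head_service_address` and the sample cluster names are hypothetical and do not appear in the repository.

```python
def head_service_address(name: str, namespace: str, port: int = 6379) -> str:
    """Compose the in-cluster address of a Ray head service named '<name>-head'."""
    return f"{name}-head.{namespace}.svc.cluster.local:{port}"


# Two clusters in the same namespace now resolve to distinct head services,
# whereas a fixed service name of "head" would be shared by both.
addresses = [head_service_address(n, "default") for n in ("isaacray-0", "isaacray-1")]
for address in addresses:
    print(address)
```
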
......@@ -21,7 +21,7 @@ Usage:
.. code-block:: bash
./isaaclab.sh -p source/standalone/workflows/ray/grok_cluster_with_kubectl.py
python3 source/standalone/workflows/ray/grok_cluster_with_kubectl.py
# For options, supply -h arg
"""
......@@ -67,9 +67,10 @@ def get_clusters(pods: list, cluster_name_prefix: str) -> set:
match = re.match(r"(" + re.escape(cluster_name_prefix) + r"[-\w]+)", pod_name)
if match:
# Get base name without head/worker suffix
base_name = match.group(1).split("-head")[0].split("-worker")[0]
clusters.add(base_name)
# Get base name without head/worker suffix (skip workers)
if "head" in pod_name:
base_name = match.group(1).split("-head")[0]
clusters.add(base_name)
return sorted(clusters)
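
The effect of the head-only filter in the hunk above can be sketched in isolation. This is a simplified stand-in rather than the script's function: it assumes the input is a plain list of pod-name strings (the real `pods` argument may carry more information), and the sample pod names are invented for illustration.

```python
import re


def cluster_names_from_pods(pod_names: list, cluster_name_prefix: str) -> list:
    """Return sorted cluster base names, derived from head pods only."""
    clusters = set()
    for pod_name in pod_names:
        match = re.match(r"(" + re.escape(cluster_name_prefix) + r"[-\w]+)", pod_name)
        # Workers are skipped; only head pods contribute a cluster name.
        if match and "head" in pod_name:
            clusters.add(match.group(1).split("-head")[0])
    return sorted(clusters)


sample_pods = [
    "isaacray-0-head-abc12",
    "isaacray-0-worker-group-xyz",
    "isaacray-1-head-def34",
    "mlflow-123",
]
names = cluster_names_from_pods(sample_pods, "isaacray")
print(names)
```

Because worker pod names are no longer split on `-worker`, a worker group whose name contains extra suffixes cannot produce a spurious cluster entry.
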
......@@ -90,9 +91,7 @@ def get_mlflow_info(namespace: str = None, cluster_prefix: str = "isaacray") ->
clusters = get_clusters(pods=pods, cluster_name_prefix=cluster_prefix)
if len(clusters) > 1:
raise ValueError("More than one cluster matches prefix, could not automatically determine mlflow info.")
base_name = cluster_prefix.split("-head")[0].split("-worker")[0]
mlflow_name = f"{base_name}-mlflow"
mlflow_name = f"{cluster_prefix}-mlflow"
cmd = ["kubectl", "get", "svc", mlflow_name, "-n", namespace, "--no-headers"]
try:
......@@ -102,7 +101,8 @@ def get_mlflow_info(namespace: str = None, cluster_prefix: str = "isaacray") ->
# Get cluster IP
cluster_ip = fields[2]
port = "5000" # Default MLflow port
# This needs to be http to be resolved. HTTPS can't be resolved
# This should be fine as it is on a subnet on the cluster regardless
return f"http://{cluster_ip}:{port}"
except subprocess.CalledProcessError as e:
raise ValueError(f"Could not grok MLflow: {e}") # Fixed f-string
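
To make the `fields[2]` lookup above concrete, here is a small sketch that parses one line of `kubectl get svc --no-headers` output, whose default columns are name, type, cluster IP, external IP, ports, and age. The function name `mlflow_uri_from_service_line` and the sample line are assumptions for illustration, not part of the script.

```python
def mlflow_uri_from_service_line(line: str, port: str = "5000") -> str:
    """Build an http URI from one line of `kubectl get svc --no-headers` output."""
    fields = line.split()
    cluster_ip = fields[2]  # Third column is CLUSTER-IP
    # Plain http is used, matching the comment in the diff above.
    return f"http://{cluster_ip}:{port}"


# Invented sample line in the default column layout.
sample_line = "isaacray-mlflow   ClusterIP   10.0.0.12   <none>   5000/TCP   5m"
uri = mlflow_uri_from_service_line(sample_line)
print(uri)
```
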
......
......@@ -8,29 +8,28 @@ import pathlib
import subprocess
import yaml
import util
from jinja2 import Environment, FileSystemLoader
from kubernetes import config
import source.standalone.workflows.ray.util as util
"""This script helps create one or more KubeRay clusters.
Usage:
.. code-block:: bash
# If the head node is stuck on container creating, make sure to create a secret
./isaaclab.sh -p source/standalone/workflows/ray/launch.py -h
python3 source/standalone/workflows/ray/launch.py -h
# Examples
# The following creates 8 GPUx1 nvidia l4 workers
./isaaclab.sh -p source/standalone/workflows/ray/launch.py --cluster_host google_cloud \
python3 source/standalone/workflows/ray/launch.py --cluster_host google_cloud \
--namespace <NAMESPACE> --image <YOUR_ISAAC_RAY_IMAGE> \
--num_workers 8 --num_clusters 1 --worker_accelerator nvidia-l4 --gpu_per_worker 1
# The following creates 1 GPUx1 nvidia l4 worker, 2 GPUx2 nvidia-tesla-t4 workers,
# and 2 GPUx4 nvidia-tesla-t4 GPU workers
./isaaclab.sh -p source/standalone/workflows/ray/launch.py --cluster_host google_cloud \
python3 source/standalone/workflows/ray/launch.py --cluster_host google_cloud \
--namespace <NAMESPACE> --image <YOUR_ISAAC_RAY_IMAGE> \
--num_workers 1 2 --num_clusters 1 \
--worker_accelerator nvidia-l4 nvidia-tesla-t4 --gpu_per_worker 1 2 4
......@@ -53,7 +52,7 @@ def apply_manifest(args: argparse.Namespace) -> None:
# Set up Jinja2 environment for loading templates
templates_dir = RAY_DIR / "cluster_configs" / args.cluster_host
file_loader = FileSystemLoader(str(templates_dir))
jinja_env = Environment(loader=file_loader, keep_trailing_newline=True)
jinja_env = Environment(loader=file_loader, keep_trailing_newline=True, autoescape=True)
# Define template filename
template_file = "kuberay.yaml.jinja"
......@@ -79,6 +78,7 @@ def apply_manifest(args: argparse.Namespace) -> None:
# Apply the Kubernetes manifest using kubectl
try:
print(cleaned_yaml_string)
subprocess.run(["kubectl", "apply", "-f", "-"], input=cleaned_yaml_string, text=True, check=True)
except subprocess.CalledProcessError as e:
exit(f"An error occurred while running `kubectl`: {e}")
......
......@@ -40,16 +40,16 @@ Usage:
.. code-block:: bash
# Example; submitting a tuning job
./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py \
python3 source/standalone/workflows/ray/submit_job.py \
--aggregate_jobs /workspace/isaaclab/source/standalone/workflows/ray/tuner.py \
--cfg_file hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleRGBNoTuneJobCfg --mlflow_uri <ML_FLOW_URI>
--cfg_class CartpoleTheiaJobCfg --mlflow_uri <ML_FLOW_URI>
# Example: Submitting resource wrapped job
./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py --aggregate_jobs wrap_resources.py --sub_jobs ./isaaclab.sh -p source/standalone/workflows/rl_games/train.py --task Isaac-Cartpole-v0 --headless+./isaaclab.sh -p source/standalone/workflows/rl_games/train.py --task Isaac-Cartpole-RGB-Camera-Direct-v0 --headless --enable_cameras agent.params.config.max_epochs=150
python3 source/standalone/workflows/ray/submit_job.py --aggregate_jobs wrap_resources.py --test
# For all command line arguments
./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py -h
python3 source/standalone/workflows/ray/submit_job.py -h
"""
script_directory = os.path.dirname(os.path.abspath(__file__))
CONFIG = {"working_dir": script_directory, "executable": "/workspace/isaaclab/isaaclab.sh -p"}
......
......@@ -17,8 +17,9 @@ from ray.tune.search.repeater import Repeater
"""
This script breaks down an aggregate tuning job, as defined by a hyperparameter sweep configuration,
into individual jobs (shell commands) to run on the GPU-enabled nodes of the cluster.
By default, (unless combined as a sub-job in a resource-wrapped aggregate job), one worker is created
for each GPU-enabled node in the cluster for each individual job.
By default, one worker is created for each GPU-enabled node in the cluster for each individual job.
To use more than one worker per node (likely the case for multi-GPU machines), supply the
num_workers_per_node argument.
Each hyperparameter sweep configuration should include the workflow,
runner arguments, and hydra arguments to vary.
......@@ -39,16 +40,15 @@ Usage:
./isaaclab.sh -p source/standalone/workflows/ray/tuner.py -h
# Examples
# Local (not within a docker container, when within a local docker container, do not supply run_mode argument)
# Local
./isaaclab.sh -p source/standalone/workflows/ray/tuner.py --run_mode local \
--cfg_file source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleRGBNoTuneJobCfg
# Local docker: start the ray server and run above command in the same running container without run_mode arg
--cfg_class CartpoleTheiaJobCfg
# Remote (run grok cluster or create config file mentioned in :file:`submit_job.py`)
./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py \
--aggregate_jobs tuner.py \
--cfg_file hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleRGBNoTuneJobCfg --mlflow_uri <MLFLOW_URI_FROM_GROK_OR_MANUAL>
--cfg_class CartpoleTheiaJobCfg --mlflow_uri <MLFLOW_URI_FROM_GROK_OR_MANUAL>
"""
......@@ -74,7 +74,7 @@ class IsaacLabTuneTrainable(tune.Trainable):
print(f"[INFO]: Recovered invocation with {self.invoke_cmd}")
self.experiment = None
def reset_config(self, new_config):
def reset_config(self, new_config: dict):
"""Allow environments to be re-used by fetching a new invocation command"""
self.setup(new_config)
return True
......@@ -95,15 +95,15 @@ class IsaacLabTuneTrainable(tune.Trainable):
self.proc = experiment["proc"]
self.experiment_name = experiment["experiment_name"]
self.isaac_logdir = experiment["logdir"]
self.tensorboard_logdir = self.isaac_logdir + f"/{self.experiment_name}/summaries"
self.tensorboard_logdir = self.isaac_logdir + "/" + self.experiment_name
self.done = False
if self.proc is None:
raise ValueError("Could not start trial.")
if self.proc.poll() is not None: # process finished, signal finish
proc_status = self.proc.poll()
if proc_status is not None: # process finished, signal finish
self.data["done"] = True
print("[INFO]: Process finished, returning...")
print(f"[INFO]: Process finished with {proc_status}, returning...")
else: # wait until the logs are ready or fresh
data = util.load_tensorboard_logs(self.tensorboard_logdir)
......@@ -220,10 +220,24 @@ class JobCfg:
"""To be compatible with :meth:`invoke_tuning_run` and :class:`IsaacLabTuneTrainable`,
at a minimum, the tune job should inherit from this class."""
def __init__(self, cfg):
def __init__(self, cfg: dict):
"""
Runner args include command line arguments passed to the task.
For example:
cfg["runner_args"]["headless_singleton"] = "--headless"
cfg["runner_args"]["enable_cameras_singleton"] = "--enable_cameras"
"""
assert "runner_args" in cfg, "No runner arguments specified."
"""
Task is the desired task to train on. For example:
cfg["runner_args"]["--task"] = tune.choice(["Isaac-Cartpole-RGB-TheiaTiny-v0"])
"""
assert "--task" in cfg["runner_args"], "No task specified."
assert "hydra_args" in cfg, "No hypeparameters specified."
"""
Hydra args define the hyperparameters varied within the sweep. For example:
cfg["hydra_args"]["agent.params.network.cnn.activation"] = tune.choice(["relu", "elu"])
"""
assert "hydra_args" in cfg, "No hyperparameters specified."
self.cfg = cfg
......
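For reviewers, a minimal sketch of a config dict that would pass the three assertions documented in `JobCfg.__init__` above. `JobCfgSketch` and `make_cartpole_cfg` are hypothetical names used only for illustration. Plain lists stand in for the `tune.choice(...)` search spaces so the snippet runs without Ray installed.

```python
class JobCfgSketch:
    """Hypothetical, dependency-free stand-in that mirrors the checks in JobCfg.__init__."""

    def __init__(self, cfg: dict):
        assert "runner_args" in cfg, "No runner arguments specified."
        assert "--task" in cfg["runner_args"], "No task specified."
        assert "hydra_args" in cfg, "No hyperparameters specified."
        self.cfg = cfg


def make_cartpole_cfg() -> dict:
    """Build a config with the three required pieces: runner args, a task, and hydra args."""
    cfg = {"runner_args": {}, "hydra_args": {}}
    # Flags passed to the training script as-is.
    cfg["runner_args"]["headless_singleton"] = "--headless"
    cfg["runner_args"]["enable_cameras_singleton"] = "--enable_cameras"
    # Assumption: in a real sweep these values would be Ray Tune search spaces
    # such as tune.choice([...]); plain lists are used here as placeholders.
    cfg["runner_args"]["--task"] = ["Isaac-Cartpole-RGB-TheiaTiny-v0"]
    cfg["hydra_args"]["agent.params.network.cnn.activation"] = ["relu", "elu"]
    return cfg


job = JobCfgSketch(make_cartpole_cfg())
```

Dropping the `--task` key from `runner_args` makes the second assertion fail with "No task specified.", which is the behavior the docstrings in this hunk describe.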
......@@ -6,15 +6,18 @@ import argparse
import os
import re
import subprocess
import threading
from datetime import datetime
from math import isclose
import ray
from tensorboard.backend.event_processing.directory_watcher import DirectoryDeletedError
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
def load_tensorboard_logs(directory: str) -> dict:
"""From a tensorboard directory, get the latest scalar values.
"""From a tensorboard directory, get the latest scalar values. If the logs can't be
found, check the summaries sublevel.
Args:
directory: The directory of the tensorboard logging.
......@@ -22,19 +25,23 @@ def load_tensorboard_logs(directory: str) -> dict:
Returns:
The latest available scalar values.
"""
# Initialize the event accumulator with a size guidance for only the latest entry
size_guidance = {"scalars": 1} # Load only the latest entry for scalars
event_acc = EventAccumulator(directory, size_guidance=size_guidance)
event_acc.Reload() # Load all data from the directory
def get_latest_scalars(path: str) -> dict:
event_acc = EventAccumulator(path, size_guidance={"scalars": 1})
try:
event_acc.Reload()
if event_acc.Tags()["scalars"]:
return {
tag: event_acc.Scalars(tag)[-1].value
for tag in event_acc.Tags()["scalars"]
if event_acc.Scalars(tag)
}
except (KeyError, OSError, RuntimeError, DirectoryDeletedError):
return {}
# Extract the latest scalars logged
latest_scalars = {}
for tag in event_acc.Tags()["scalars"]:
events = event_acc.Scalars(tag)
if events: # Check if there is at least one entry
latest_event = events[-1] # Get the latest entry
latest_scalars[tag] = latest_event.value
return latest_scalars
scalars = get_latest_scalars(directory)
return scalars or get_latest_scalars(os.path.join(directory, "summaries"))
def get_invocation_command_from_cfg(
......@@ -190,47 +197,62 @@ def execute_job(
experiment_info_pattern = re.compile("Exact experiment name requested from command line: (.+)")
logdir_pattern = re.compile(r"\[INFO\] Logging experiment in directory: (.+)$")
err_pattern = re.compile("There was an error (.+)$")
with process.stdout as stdout:
for line in iter(stdout.readline, ""):
def stream_reader(stream, identifier_string, result_details):
for line in iter(stream.readline, ""):
line = line.strip()
result_details.append(f"{identifier_string}: {line} \n")
result_details.append(f"{identifier_string}: {line}\n")
if log_all_output:
print(f"{identifier_string}: {line}")
if extract_experiment:
exp_match = experiment_info_pattern.search(line)
log_match = logdir_pattern.search(line)
err_match = err_pattern.search(line)
if err_match:
raise ValueError(f"Encountered an error during trial run. {' '.join(result_details)}")
if exp_match:
experiment_name = exp_match.group(1)
if log_match:
logdir = log_match.group(1)
if experiment_name and logdir:
result = {
"experiment_name": experiment_name,
"logdir": logdir,
"proc": process,
"result": " ".join(result_details),
}
return result
with process.stderr as stderr:
for line in iter(stderr.readline, ""):
line = line.strip()
result_details.append(f"{identifier_string}: {line}")
# Read stdout until we find experiment info
# Handle the pipes carefully to prevent overflowing the pipe reading buffer (exit code 141)
for line in iter(process.stdout.readline, ""):
line = line.strip()
result_details.append(f"{identifier_string}: {line} \n")
if log_all_output:
print(f"{identifier_string}: {line}")
process.wait() # Wait for the subprocess to finish naturally if not exited early
now = datetime.now().strftime("%H:%M:%S.%f")
completion_info = f"\n[INFO]: {identifier_string}: Job Started at {start_time}, completed at {now}\n"
print(completion_info)
result_details.append(completion_info)
return " ".join(result_details)
if extract_experiment:
exp_match = experiment_info_pattern.search(line)
log_match = logdir_pattern.search(line)
err_match = err_pattern.search(line)
if err_match:
raise ValueError(f"Encountered an error during trial run. {' '.join(result_details)}")
if exp_match:
experiment_name = exp_match.group(1)
if log_match:
logdir = log_match.group(1)
if experiment_name and logdir:
# Start stderr reader after finding experiment info
stderr_thread = threading.Thread(
target=stream_reader, args=(process.stderr, identifier_string, result_details)
)
stderr_thread.daemon = True
stderr_thread.start()
# Start stdout reader to continue reading to flush buffer
stdout_thread = threading.Thread(
target=stream_reader, args=(process.stdout, identifier_string, result_details)
)
stdout_thread.daemon = True
stdout_thread.start()
return {
"experiment_name": experiment_name,
"logdir": logdir,
"proc": process,
"result": " ".join(result_details),
}
process.wait()
now = datetime.now().strftime("%H:%M:%S.%f")
completion_info = f"\n[INFO]: {identifier_string}: Job Started at {start_time}, completed at {now}\n"
print(completion_info)
result_details.append(completion_info)
return " ".join(result_details)
def get_gpu_node_resources(
......
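The buffer handling in the `execute_job` hunk above can be illustrated in isolation. The sketch below is not the code from `util.py`. It is a reduced example of the same idea, assuming a text-mode `subprocess.Popen` child: read stdout until a marker line is found, then hand both pipes to daemon threads so the child never blocks on a full pipe after the caller returns early. `drain` and `run_and_find` are hypothetical helper names.

```python
import subprocess
import sys
import threading


def drain(stream, sink):
    """Read a text stream line by line until EOF, appending each line to sink."""
    for line in iter(stream.readline, ""):
        sink.append(line.rstrip("\n"))


def run_and_find(cmd, marker):
    """Start cmd and read stdout until a line contains marker.

    Returns (process, matching line or None, captured lines, drain threads).
    The drain threads keep emptying stdout and stderr in the background.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    captured = []
    found = None
    for line in iter(proc.stdout.readline, ""):
        captured.append(line.rstrip("\n"))
        if marker in line:
            found = line.strip()
            break
    # Without these readers, a chatty child would stall once the OS pipe buffer fills.
    threads = [
        threading.Thread(target=drain, args=(stream, captured), daemon=True)
        for stream in (proc.stdout, proc.stderr)
    ]
    for thread in threads:
        thread.start()
    return proc, found, captured, threads
```

Daemon threads are used, as in the diff, so that lingering readers do not keep the interpreter alive once the parent process is done.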
......@@ -6,12 +6,11 @@
import argparse
import ray
import util
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy
import source.standalone.workflows.ray.util as util
"""
This script dispatches sub-job(s) (either individual jobs or tuning aggregate jobs)
This script dispatches sub-job(s) (individual jobs, use :file:`tuner.py` for tuning jobs)
to worker(s) on GPU-enabled node(s) of a specific cluster as part of a resource-wrapped aggregate
job. If no desired compute resources for each sub-job are specified,
this script creates one worker per available node for each node with GPU(s) in the cluster.
......
......@@ -93,6 +93,8 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
print(f"[INFO] Logging experiment in directory: {log_root_path}")
# specify directory for logging runs: {time-stamp}_{run_name}
log_dir = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
# This way, the Ray Tune workflow can extract experiment name.
print(f"Exact experiment name requested from command line: {log_dir}")
if agent_cfg.run_name:
log_dir += f"_{agent_cfg.run_name}"
log_dir = os.path.join(log_root_path, log_dir)
......
......@@ -89,7 +89,11 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
env_cfg.sim.device = args_cli.device if args_cli.device is not None else env_cfg.sim.device
# directory for logging into
log_dir = os.path.join("logs", "sb3", args_cli.task, datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
run_info = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
log_root_path = os.path.abspath(os.path.join("logs", "sb3", args_cli.task))
print(f"[INFO] Logging experiment in directory: {log_root_path}")
print(f"Exact experiment name requested from command line: {run_info}")
log_dir = os.path.join(log_root_path, run_info)
# dump the configuration into log-directory
dump_yaml(os.path.join(log_dir, "params", "env.yaml"), env_cfg)
dump_yaml(os.path.join(log_dir, "params", "agent.yaml"), agent_cfg)
......
......@@ -135,6 +135,7 @@ def main(env_cfg: ManagerBasedRLEnvCfg | DirectRLEnvCfg | DirectMARLEnvCfg, agen
print(f"[INFO] Logging experiment in directory: {log_root_path}")
# specify directory for logging runs: {time-stamp}_{run_name}
log_dir = datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + f"_{algorithm}_{args_cli.ml_framework}"
print(f"Exact experiment name requested from command line: {log_dir}")
if agent_cfg["agent"]["experiment"]["experiment_name"]:
log_dir += f'_{agent_cfg["agent"]["experiment"]["experiment_name"]}'
# set directory into agent config
......
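The print statements added to the training scripts in the hunks above are what `execute_job` parses. As a quick check, the two patterns from `util.py` can be run against sample output lines. The sample paths and timestamps below are made up for illustration, and `extract_experiment` is a hypothetical helper that only mirrors the matching logic, not the full function.

```python
import re

# Patterns copied from the execute_job hunk in util.py.
experiment_info_pattern = re.compile("Exact experiment name requested from command line: (.+)")
logdir_pattern = re.compile(r"\[INFO\] Logging experiment in directory: (.+)$")


def extract_experiment(lines):
    """Return the experiment name and log directory once both have been seen, else None."""
    experiment_name = None
    logdir = None
    for line in lines:
        line = line.strip()
        exp_match = experiment_info_pattern.search(line)
        log_match = logdir_pattern.search(line)
        if exp_match:
            experiment_name = exp_match.group(1)
        if log_match:
            logdir = log_match.group(1)
        if experiment_name and logdir:
            return {"experiment_name": experiment_name, "logdir": logdir}
    return None


# Made-up sample output in the format the training scripts print.
sample = [
    "[INFO] Logging experiment in directory: /tmp/logs/sb3/Isaac-Cartpole-v0",
    "Exact experiment name requested from command line: 2025-01-01_12-00-00",
]
info = extract_experiment(sample)
```

The experiment-name pattern requires the literal `: ` after "command line", so a print statement that omits the colon would never be matched and the tuner would keep waiting for experiment info.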