Unverified Commit 286e1eea authored by glvov-bdai's avatar glvov-bdai Committed by GitHub

Adds Ray Workflow: Multiple Run Support, Distributed Hyperparameter Tuning,...

Adds Ray Workflow: Multiple Run Support, Distributed Hyperparameter Tuning, and Consistent Setup Across Local/Cloud (#1301)

This PR adds Ray support, which enables a lot of really cool stuff by
leveraging the existing Hydra support, including but not limited to:

- Several training runs at once in parallel or consecutively with
minimal interaction
- Using the same training setup everywhere (on cloud and local) with
minimal overhead
- Tuning hyperparameters
- Tuning hyperparameters in parallel on multiple GPUs and/or multiple
GPU Nodes
- Simultaneously tuning model hyperparameters for different
environments/agents
- Resource Isolation
parent d8bc7256
......@@ -100,6 +100,7 @@ Table of Contents
source/features/hydra
source/features/multi_gpu
Tiled Rendering</source/overview/sensors/camera>
source/features/ray
source/features/reproducibility
.. toctree::
......
===========================
Ray Job Dispatch and Tuning
===========================
.. currentmodule:: omni.isaac.lab
Isaac Lab supports `Ray <https://docs.ray.io/en/latest/index.html>`_ for streamlining dispatching multiple training jobs (in parallel and in series),
and hyperparameter tuning, both on local and remote configurations.
This `independent community contributed walkthrough video <https://youtu.be/z7MDgSga2Ho?feature=shared>`_
demonstrates some of the core functionality of the Ray integration covered in this overview. Although there may be some
differences in the codebase (such as file names being shortened) since the creation of the video,
the general workflow is the same.
.. attention::
This functionality is experimental, and has been tested only on Linux.
.. contents:: Table of Contents
:depth: 3
:local:
Overview
--------
The Ray integration is useful for the following:
- Dispatching several training jobs in parallel or sequentially with minimal interaction
- Tuning hyperparameters; in parallel or sequentially with support for multiple GPUs and/or multiple GPU Nodes
- Using the same training setup everywhere (on cloud and local) with minimal overhead
- Resource Isolation for training jobs
The core functionality of the Ray workflow consists of two main scripts that enable the orchestration
of resource-wrapped and tuning aggregate jobs. These scripts facilitate the decomposition of
aggregate jobs (overarching experiments) into individual jobs, which are discrete commands
executed on the cluster. An aggregate job can include multiple individual jobs.
For clarity, this guide refers to the jobs one layer below the topmost aggregate level as sub-jobs.
Both resource-wrapped and tuning aggregate jobs dispatch individual jobs to a designated Ray
cluster, which leverages the cluster's resources (e.g., a single workstation node or multiple nodes)
to execute these jobs with workers in parallel and/or sequentially. By default, aggregate jobs use all \
available resources on each available GPU-enabled node for each sub-job worker. This can be changed through
specifying the ``--num_workers`` argument, especially critical for parallel aggregate
job processing on local or virtual multi-GPU machines
In resource-wrapped aggregate jobs, each sub-job and its
resource requirements are defined manually, enabling resource isolation.
For tuning aggregate jobs, individual jobs are generated automatically based on a hyperparameter
sweep configuration. This assumes homogeneous node resource composition for nodes with GPUs.
.. dropdown:: source/standalone/workflows/ray/wrap_resources.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/wrap_resources.py
:language: python
:emphasize-lines: 14-66
.. dropdown:: source/standalone/workflows/ray/tuner.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/tuner.py
:language: python
:emphasize-lines: 18-53
The following script can be used to submit aggregate
jobs to one or more Ray cluster(s), which can be used for
running jobs on a remote cluster or simultaneous jobs with heterogeneous
resource requirements:
.. dropdown:: source/standalone/workflows/ray/submit_job.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/submit_job.py
:language: python
:emphasize-lines: 12-53
The following script can be used to extract KubeRay Cluster information for aggregate job submission.
.. dropdown:: source/standalone/workflows/ray/grok_cluster_with_kubectl.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/grok_cluster_with_kubectl.py
:language: python
:emphasize-lines: 14-26
The following script can be used to easily create clusters on Google GKE.
.. dropdown:: source/standalone/workflows/ray/launch.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/launch.py
:language: python
:emphasize-lines: 16-37
**Installation**
----------------
The Ray functionality requires additional dependencies be installed.
To use Ray without Kubernetes, like on a local computer or VM,
``kubectl`` is not required. For use on Kubernetes clusters with KubeRay,
such as Google Kubernetes Engine or Amazon Elastic Kubernetes Service, ``kubectl`` is required, and can
be installed via the `Kubernetes website <https://kubernetes.io/docs/tasks/tools/>`_
The pythonic dependencies can be installed with:
.. code-block:: bash
# For multi-run support and resource isolation
./isaaclab.sh -p -m pip install ray[default]==2.31.0
# For hyperparameter tuning
./isaaclab.sh -p -m pip install ray[tune]==2.31.0
./isaaclab.sh -p -m pip install optuna bayesian-optimization
# MLFlow is needed only for fetching logs on clusters, not needed for local
./isaaclab.sh -p -m pip install mlflow
If using KubeRay clusters on Google GKE with the batteries-included cluster launch file,
the following dependencies are also needed.
.. code-block:: bash
./isaaclab.sh -p -m pip install kubernetes Jinja2
**Setup Overview: Cluster Configuration**
-----------------------------------------
Select one of the following methods to create a Ray Cluster to accept and execute dispatched jobs.
Single-Node Ray Cluster (Recommended for Beginners)
'''''''''''''''''''''''''''''''''''''''''''''''''''
For use on a single machine (node) such as a local computer or VM, the
following command can be used start a ray server. This is compatible with
multiple-GPU machines. This Ray server will run indefinitely until it is stopped with ``CTRL + C``
.. code-block:: bash
echo "import ray; ray.init(); import time; [time.sleep(10) for _ in iter(int, 1)]" | ./isaaclab.sh -p
KubeRay Clusters
''''''''''''''''
.. attention::
The ``ray`` command should be modified to use Isaac python, which could be achieved in a fashion similar to
``sed -i "1i $(echo "#!/workspace/isaaclab/_isaac_sim/python.sh")" \
/isaac-sim/kit/python/bin/ray && ln -s /isaac-sim/kit/python/bin/ray /usr/local/bin/ray``.
Google Cloud is currently the only platform tested, although
any cloud provider should work if one configures the following:
- An container registry (NGC, GCS artifact registry, AWS ECR, etc) with
an Isaac Lab image configured to support Ray. See ``cluster_configs/Dockerfile`` to see how to modify the ``isaac-lab-base``
container for Ray compatibility. Ray should use the isaac sim python shebang, and ``nvidia-smi``
should work within the container. Be careful with the setup here as
paths need to be configured correctly for everything to work. It's likely that
the example dockerfile will work out of the box and can be pushed to the registry, as
long as the base image has already been built as in the container guide
- A Kubernetes setup with available NVIDIA RTX (likely ``l4`` or ``l40`` or ``tesla-t4`` or ``a10``) GPU-passthrough node-pool resources,
that has access to your container registry/storage bucket and has the Ray operator enabled with correct IAM
permissions. This can be easily achieved with services such as Google GKE or AWS EKS,
provided that your account or organization has been granted a GPU-budget. It is recommended
to use manual kubernetes services as opposed to "autopilot" services for cost-effective
experimentation as this way clusters can be completely shut down when not in use, although
this may require installing the `Nvidia GPU Operator <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/google-gke.html>`_
- An MLFlow server that your cluster has access to.
- A ``kuberay.yaml.ninja`` file that describes how to allocate resources (already included for
Google Cloud, which can be referenced for the format and MLFlow integration)
Ray Clusters (Without Kubernetes)
'''''''''''''''''''''''''''''''''
.. attention::
Modify the Ray command to use Isaac Python like in KubeRay Clusters, and follow the same
steps for creating an image/cluster permissions/bucket access.
See the `Ray Clusters Overview <https://docs.ray.io/en/latest/cluster/getting-started.html>`_ or
`Anyscale <https://www.anyscale.com/product>`_ for more information
**Dispatching Jobs and Tuning**
-------------------------------
Select one of the following guides that matches your desired Cluster configuration.
Simple Ray Cluster (Local/VM)
'''''''''''''''''''''''''''''
This guide assumes that there is a Ray cluster already running, and that this script is run locally on the cluster, or
that the cluster job submission address is known.
1.) Testing that the cluster works can be done as follows.
.. code-block:: bash
./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py --test
2.) Submitting resource-wrapped sub-jobs can be done as described in the following file:
.. dropdown:: source/standalone/workflows/ray/wrap_resources.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/wrap_resources.py
:language: python
:emphasize-lines: 14-66
3.) For tuning jobs, specify the hyperparameter sweep similar to the following two files.
.. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cfg.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/hyperparameter_tuning/vision_cfg.py
:language: python
.. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
:language: python
Then, see the local examples in the following file to see how to start a tuning run.
.. dropdown:: source/standalone/workflows/ray/tuner.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/tuner.py
:language: python
:emphasize-lines: 18-53
To view the logs, simply run ``tensorboard --logdir=<LOCAL_STORAGE_PATH_READ_FROM_OUTPUT>``
Remote Ray Cluster Setup and Use
'''''''''''''''''''''''''''''''''
This guide assumes that one desires to create a cluster on a remote host or server. This
guide includes shared steps, and KubeRay or Ray specific steps. Follow all shared steps (part I and II), and then
only the KubeRay or Ray steps depending on your desired configuration, in order of shared steps part I, then
the configuration specific steps, then shared steps part II.
Shared Steps Between KubeRay and Pure Ray Part I
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.) Build the Isaac Ray image, and upload it to your container registry of choice.
.. code-block:: bash
# Login with NGC (nvcr.io) registry first, see docker steps in repo.
./isaaclab.sh -p docker/container.py start
# Build the special Isaac Lab Ray Image
docker build -t <REGISTRY/IMAGE_NAME> -f source/standalone/workflows/ray/cluster_configs/Dockerfile .
# Push the image to your registry of choice.
docker push <REGISTRY/IMAGE_NAME>
KubeRay Specific
~~~~~~~~~~~~~~~~
`k9s <https://github.com/derailed/k9s>`_ is a great tool for monitoring your clusters that can
easily be installed with ``snap install k9s --devmode``.
1.) Verify cluster access, and that the correct operators are installed.
.. code-block:: bash
# Verify cluster access
kubectl cluster-info
# If using a manually managed cluster (not Autopilot or the like)
# verify that there are node pools
kubectl get nodes
# Check that the ray operator is installed on the cluster
# should list rayclusters.ray.io , rayjobs.ray.io , and rayservices.ray.io
kubectl get crds | grep ray
# Check that the NVIDIA Driver Operator is installed on the cluster
# should list clusterpolicies.nvidia.com
kubectl get crds | grep nvidia
2.) Create the KubeRay cluster and an MLFlow server for receiving logs
that your cluster has access to.
This can be done automatically for Google GKE,
where instructions are included in the following creation file. More than once cluster
can be created at once. Each cluster can have heterogeneous resources if so desired,
although only
For other cloud services, the ``kuberay.yaml.ninja`` will be similar to that of
Google's.
.. dropdown:: source/standalone/workflows/ray/launch.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/launch.py
:language: python
:emphasize-lines: 15-37
3.) Fetch the KubeRay cluster IP addresses, and the MLFLow Server IP.
This can be done automatically for KubeRay clusters,
where instructions are included in the following fetching file.
The KubeRay clusters are saved to a file, but the MLFLow Server IP is
printed.
.. dropdown:: source/standalone/workflows/ray/grok_cluster_with_kubectl.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/grok_cluster_with_kubectl.py
:language: python
:emphasize-lines: 14-26
Ray Specific
~~~~~~~~~~~~
1.) Verify cluster access.
2.) Create a ``~/.cluster_config`` file, where ``name: <NAME> address: http://<IP>:<PORT>`` is on
a new line for each unique cluster. For one cluster, there should only be one line in this file.
3.) Start an MLFLow Server to receive the logs that the ray cluster has access to,
and determine the server URI.
Shared Steps Between KubeRay and Pure Ray Part II
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.) Test that your cluster is operational with the following.
.. code-block:: bash
# Test that NVIDIA GPUs are visible and that Ray is operation with the following command:
./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py
--jobs wrap_resources.py --test
2.) Submitting Jobs can be done in the following manner, with the following script.
.. dropdown:: source/standalone/workflows/ray/submit_job.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/submit_job.py
:language: python
:emphasize-lines: 12-53
3.) For tuning jobs, specify the hyperparameter sweep similar to :class:`RLGamesCameraJobCfg` in the following file:
.. dropdown:: source/standalone/workflows/ray/tuner.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/tuner.py
:language: python
:emphasize-lines: 18-53
For example, see the Cartpole Example configurations.
.. dropdown:: source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
:icon: code
.. literalinclude:: ../../../source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py
:language: python
Tuning jobs can also be submitted via ``submit_job.py``
To view the tuning results, view the MLFlow dashboard of the server that you created.
For KubeRay, this can be done through port forwarding the MLFlow dashboard, with
``kubectl port-forward service/isaacray-mlflow 5000:5000``
and visiting the following address in a browser.
``localhost:5000``
If the MLFlow port is forwarded like above, it can be converted into tensorboard logs with
this following command.
``./isaaclab.sh -p source/standalone/workflows/ray/mlflow_to_local_tensorboard.py \
--uri http://localhost:5000 --experiment-name IsaacRay-<CLASS_JOB_CFG>-tune --download-dir test``
**Cluster Cleanup**
'''''''''''''''''''
For the sake of conserving resources, and potentially freeing precious GPU resources for other people to use
on shared compute platforms, please destroy the Ray cluster after use. They can be easily
recreated! For KubeRay clusters, this can be done as follows.
.. code-block:: bash
kubectl get raycluster | egrep 'isaacray' | awk '{print $1}' | xargs kubectl delete raycluster &&
kubectl get deployments | egrep 'mlflow' | awk '{print $1}' | xargs kubectl delete deployment &&
kubectl get services | egrep 'mlflow' | awk '{print $1}' | xargs kubectl delete service
FROM isaac-lab-base:latest
# Set NVIDIA paths
ENV PATH="/usr/local/nvidia/bin:$PATH"
ENV LD_LIBRARY_PATH="/usr/local/nvidia/lib64"
# Link NVIDIA binaries
RUN ln -sf /usr/local/nvidia/bin/nvidia* /usr/bin
# Install Ray and configure it
RUN /workspace/isaaclab/_isaac_sim/python.sh -m pip install "ray[default, tune]"==2.31.0 && \
sed -i "1i $(echo "#!/workspace/isaaclab/_isaac_sim/python.sh")" \
/isaac-sim/kit/python/bin/ray && ln -s /isaac-sim/kit/python/bin/ray /usr/local/bin/ray
# Install tuning dependencies
RUN /workspace/isaaclab/_isaac_sim/python.sh -m pip install optuna bayesian-optimization
# Install MLflow for logging
RUN /workspace/isaaclab/_isaac_sim/python.sh -m pip install mlflow
# Jinja is used for templating here as full helm setup is excessive for application
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
name: {{ name }}
namespace: {{ namespace }}
spec:
rayVersion: "2.8.0"
enableInTreeAutoscaling: true
autoscalerOptions:
upscalingMode: Default
idleTimeoutSeconds: 120
imagePullPolicy: Always
securityContext: {}
envFrom: []
headGroupSpec:
rayStartParams:
block: "true"
dashboard-host: 0.0.0.0
dashboard-port: "8265"
node-ip-address: "0.0.0.0"
port: "6379"
include-dashboard: "true"
ray-debugger-external: "true"
object-manager-port: "8076"
num-gpus: "0"
num-cpus: "0" # prevent scheduling jobs to the head node - workers only
headService:
apiVersion: v1
kind: Service
metadata:
name: head
spec:
type: LoadBalancer
template:
metadata:
labels:
app.kubernetes.io/instance: tuner
app.kubernetes.io/name: kuberay
cloud.google.com/gke-ray-node-type: head
spec:
serviceAccountName: {{ service_account_name }}
affinity: {}
securityContext:
fsGroup: 100
containers:
- env:
image: {{ image }}
imagePullPolicy: Always
name: head
resources:
limits:
cpu: "{{ num_head_cpu }}"
memory: {{ head_ram_gb }}G
nvidia.com/gpu: "0"
requests:
cpu: "{{ num_head_cpu }}"
memory: {{ head_ram_gb }}G
nvidia.com/gpu: "0"
securityContext: {}
volumeMounts:
- mountPath: /tmp/ray
name: ray-logs
command: ["/bin/bash", "-c", "ray start --head --port=6379 --object-manager-port=8076 --dashboard-host=0.0.0.0 --dashboard-port=8265 --include-dashboard=true && tail -f /dev/null"]
- image: fluent/fluent-bit:1.9.6
name: fluentbit
resources:
limits:
cpu: 100m
memory: 128Mi
requests:
cpu: 100m
memory: 128Mi
volumeMounts:
- mountPath: /tmp/ray
name: ray-logs
imagePullSecrets: []
nodeSelector:
iam.gke.io/gke-metadata-server-enabled: "true"
volumes:
- configMap:
name: fluentbit-config
name: fluentbit-config
- name: ray-logs
emptyDir: {}
workerGroupSpecs:
{% for it in range(gpu_per_worker|length) %}
- groupName: "{{ worker_accelerator[it] }}x{{ gpu_per_worker[it] }}-cpu-{{ cpu_per_worker[it] }}-ram-gb-{{ ram_gb_per_worker[it] }}"
replicas: {{ num_workers[it] }}
maxReplicas: {{ num_workers[it] }}
minReplicas: {{ num_workers[it] }}
rayStartParams:
block: "true"
ray-debugger-external: "true"
replicas: "{{num_workers[it]}}"
template:
metadata:
annotations: {}
labels:
app.kubernetes.io/instance: tuner
app.kubernetes.io/name: kuberay
cloud.google.com/gke-ray-node-type: worker
spec:
serviceAccountName: {{ service_account_name }}
affinity: {}
securityContext:
fsGroup: 100
containers:
- env:
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
- name: NVIDIA_DRIVER_CAPABILITIES
value: "compute,utility"
image: {{ image }}
imagePullPolicy: Always
name: ray-worker
resources:
limits:
cpu: "{{ cpu_per_worker[it] }}"
memory: {{ ram_gb_per_worker[it] }}G
nvidia.com/gpu: "{{ gpu_per_worker[it] }}"
requests:
cpu: "{{ cpu_per_worker[it] }}"
memory: {{ ram_gb_per_worker[it] }}G
nvidia.com/gpu: "{{ gpu_per_worker[it] }}"
securityContext: {}
volumeMounts:
- mountPath: /tmp/ray
name: ray-logs
command: ["/bin/bash", "-c", "ray start --address=head.{{ namespace }}.svc.cluster.local:6379 && tail -f /dev/null"]
- image: fluent/fluent-bit:1.9.6
name: fluentbit
resources:
limits:
cpu: 100m
memory: 128Mi
requests:
cpu: 100m
memory: 128Mi
volumeMounts:
- mountPath: /tmp/ray
name: ray-logs
imagePullSecrets: []
nodeSelector:
cloud.google.com/gke-accelerator: {{ worker_accelerator[it] }}
iam.gke.io/gke-metadata-server-enabled: "true"
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
volumes:
- configMap:
name: fluentbit-config
name: fluentbit-config
- name: ray-logs
emptyDir: {}
{% endfor %}
---
# ML Flow Server - for fetching logs
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{name}}-mlflow
namespace: {{ namespace }}
spec:
replicas: 1
selector:
matchLabels:
app: mlflow
template:
metadata:
labels:
app: mlflow
spec:
containers:
- name: mlflow
image: ghcr.io/mlflow/mlflow:v2.9.2
ports:
- containerPort: 5000
command: ["mlflow"]
args:
- server
- --host=0.0.0.0
- --port=5000
- --backend-store-uri=sqlite:///mlflow.db
---
# ML Flow Service (for port forwarding, kubectl port-forward service/{name}-mlflow 5000:5000)
apiVersion: v1
kind: Service
metadata:
name: {{name}}-mlflow
namespace: {{ namespace }}
spec:
selector:
app: mlflow
ports:
- port: 5000
targetPort: 5000
type: ClusterIP
# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
import argparse
import os
import re
import subprocess
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
"""
This script requires that kubectl is installed and KubeRay was used to create the cluster.
Creates a config file containing ``name: <NAME> address: http://<IP>:<PORT>`` on
a new line for each cluster, and also fetches the MLFlow URI.
Usage:
.. code-block:: bash
./isaaclab.sh -p source/standalone/workflows/ray/grok_cluster_with_kubectl.py
# For options, supply -h arg
"""
def get_namespace() -> str:
"""Get the current Kubernetes namespace from the context, fallback to default if not set"""
try:
namespace = (
subprocess.check_output(["kubectl", "config", "view", "--minify", "--output", "jsonpath={..namespace}"])
.decode()
.strip()
)
if not namespace:
namespace = "default"
except subprocess.CalledProcessError:
namespace = "default"
return namespace
def get_pods(namespace: str = "default") -> list[tuple]:
"""Get a list of all of the pods in the namespace"""
cmd = ["kubectl", "get", "pods", "-n", namespace, "--no-headers"]
output = subprocess.check_output(cmd).decode()
pods = []
for line in output.strip().split("\n"):
fields = line.split()
pod_name = fields[0]
status = fields[2]
pods.append((pod_name, status))
return pods
def get_clusters(pods: list, cluster_name_prefix: str) -> set:
"""
Get unique cluster name(s). Works for one or more clusters, based off of the number of head nodes.
Excludes MLflow deployments.
"""
clusters = set()
for pod_name, _ in pods:
# Skip MLflow pods
if "-mlflow" in pod_name:
continue
match = re.match(r"(" + re.escape(cluster_name_prefix) + r"[-\w]+)", pod_name)
if match:
# Get base name without head/worker suffix
base_name = match.group(1).split("-head")[0].split("-worker")[0]
clusters.add(base_name)
return sorted(clusters)
def get_mlflow_info(namespace: str = None, cluster_prefix: str = "isaacray") -> str:
"""
Get MLflow service information if it exists in the namespace with the given prefix.
Only works for a single cluster instance.
Args:
namespace: Kubernetes namespace
cluster_prefix: Base cluster name (without -head/-worker suffixes)
Returns:
MLflow service URL
"""
# Strip any -head or -worker suffixes to get base name
if namespace is None:
namespace = get_namespace()
pods = get_pods(namespace=namespace)
clusters = get_clusters(pods=pods, cluster_name_prefix=cluster_prefix)
if len(clusters) > 1:
raise ValueError("More than one cluster matches prefix, could not automatically determine mlflow info.")
base_name = cluster_prefix.split("-head")[0].split("-worker")[0]
mlflow_name = f"{base_name}-mlflow"
cmd = ["kubectl", "get", "svc", mlflow_name, "-n", namespace, "--no-headers"]
try:
output = subprocess.check_output(cmd).decode()
fields = output.strip().split()
# Get cluster IP
cluster_ip = fields[2]
port = "5000" # Default MLflow port
return f"http://{cluster_ip}:{port}"
except subprocess.CalledProcessError as e:
raise ValueError(f"Could not grok MLflow: {e}") # Fixed f-string
def check_clusters_running(pods: list, clusters: set) -> bool:
"""
Check that all of the pods in all provided clusters are running.
Args:
pods (list): A list of tuples where each tuple contains the pod name and its status.
clusters (set): A set of cluster names to check.
Returns:
bool: True if all pods in any of the clusters are running, False otherwise.
"""
clusters_running = False
for cluster in clusters:
cluster_pods = [p for p in pods if p[0].startswith(cluster)]
total_pods = len(cluster_pods)
running_pods = len([p for p in cluster_pods if p[1] == "Running"])
if running_pods == total_pods and running_pods > 0:
clusters_running = True
break
return clusters_running
def get_ray_address(head_pod: str, namespace: str = "default", ray_head_name: str = "head") -> str:
"""
Given a cluster head pod, check its logs, which should include the ray address which can accept job requests.
Args:
head_pod (str): The name of the head pod.
namespace (str, optional): The Kubernetes namespace. Defaults to "default".
ray_head_name (str, optional): The name of the ray head container. Defaults to "head".
Returns:
str: The ray address if found, None otherwise.
Raises:
ValueError: If the logs cannot be retrieved or the ray address is not found.
"""
cmd = ["kubectl", "logs", head_pod, "-c", ray_head_name, "-n", namespace]
try:
output = subprocess.check_output(cmd).decode()
except subprocess.CalledProcessError as e:
raise ValueError(
f"Could not enter head container with cmd {cmd}: {e}Perhaps try a different namespace or ray head name."
)
match = re.search(r"RAY_ADDRESS='([^']+)'", output)
if match:
return match.group(1)
else:
return None
def process_cluster(cluster_info: dict, ray_head_name: str = "head") -> str:
"""
For each cluster, check that it is running, and get the Ray head address that will accept jobs.
Args:
cluster_info (dict): A dictionary containing cluster information with keys 'cluster', 'pods', and 'namespace'.
ray_head_name (str, optional): The name of the ray head container. Defaults to "head".
Returns:
str: A string containing the cluster name and its Ray head address, or an error message if the head pod or Ray address is not found.
"""
cluster, pods, namespace = cluster_info
head_pod = None
for pod_name, status in pods:
if pod_name.startswith(cluster + "-head"):
head_pod = pod_name
break
if not head_pod:
return f"Error: Could not find head pod for cluster {cluster}\n"
# Get RAY_ADDRESS and status
ray_address = get_ray_address(head_pod, namespace=namespace, ray_head_name=ray_head_name)
if not ray_address:
return f"Error: Could not find RAY_ADDRESS for cluster {cluster}\n"
# Return only cluster and ray address
return f"name: {cluster} address: {ray_address}\n"
def main():
# Parse command-line arguments
parser = argparse.ArgumentParser(description="Process Ray clusters and save their specifications.")
parser.add_argument("--prefix", default="isaacray", help="The prefix for the cluster names.")
parser.add_argument("--output", default="~/.cluster_config", help="The file to save cluster specifications.")
parser.add_argument("--ray_head_name", default="head", help="The metadata name for the ray head container")
parser.add_argument(
"--namespace", help="Kubernetes namespace to use. If not provided, will detect from current context."
)
args = parser.parse_args()
# Get namespace from args or detect it
current_namespace = args.namespace if args.namespace else get_namespace()
print(f"Using namespace: {current_namespace}")
cluster_name_prefix = args.prefix
cluster_spec_file = os.path.expanduser(args.output)
# Get all pods
pods = get_pods(namespace=current_namespace)
# Get clusters
clusters = get_clusters(pods, cluster_name_prefix)
if not clusters:
print(f"No clusters found with prefix {cluster_name_prefix}")
return
# Wait for clusters to be running
while True:
pods = get_pods(namespace=current_namespace)
if check_clusters_running(pods, clusters):
break
print("Waiting for all clusters to spin up...")
time.sleep(5)
print("Checking for MLflow:")
# Check MLflow status for each cluster
for cluster in clusters:
try:
mlflow_address = get_mlflow_info(current_namespace, cluster)
print(f"MLflow address for {cluster}: {mlflow_address}")
except ValueError as e:
print(f"ML Flow not located: {e}")
print()
# Prepare cluster info for parallel processing
cluster_infos = []
for cluster in clusters:
cluster_pods = [p for p in pods if p[0].startswith(cluster)]
cluster_infos.append((cluster, cluster_pods, current_namespace))
# Use ThreadPoolExecutor to process clusters in parallel
results = []
results_lock = threading.Lock()
with ThreadPoolExecutor() as executor:
future_to_cluster = {
executor.submit(process_cluster, info, args.ray_head_name): info[0] for info in cluster_infos
}
for future in as_completed(future_to_cluster):
cluster_name = future_to_cluster[future]
try:
result = future.result()
with results_lock:
results.append(result)
except Exception as exc:
print(f"{cluster_name} generated an exception: {exc}")
# Sort results alphabetically by cluster name
results.sort()
# Write sorted results to the output file (Ray info only)
with open(cluster_spec_file, "w") as f:
for result in results:
f.write(result)
print(f"Cluster spec information saved to {cluster_spec_file}")
# Display the contents of the config file
with open(cluster_spec_file) as f:
print(f.read())
if __name__ == "__main__":
main()
# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
import pathlib
import sys
# Allow for import of items from the ray workflow.
CUR_DIR = pathlib.Path(__file__).parent
UTIL_DIR = CUR_DIR.parent
sys.path.extend([str(UTIL_DIR), str(CUR_DIR)])
import util
import vision_cfg
from ray import tune
class CartpoleRGBNoTuneJobCfg(vision_cfg.CameraJobCfg):
def __init__(self, cfg: dict = {}):
cfg = util.populate_isaac_ray_cfg_args(cfg)
cfg["runner_args"]["--task"] = tune.choice(["Isaac-Cartpole-RGB-v0"])
super().__init__(cfg, vary_env_count=False, vary_cnn=False, vary_mlp=False)
class CartpoleRGBCNNOnlyJobCfg(vision_cfg.CameraJobCfg):
def __init__(self, cfg: dict = {}):
cfg = util.populate_isaac_ray_cfg_args(cfg)
cfg["runner_args"]["--task"] = tune.choice(["Isaac-Cartpole-RGB-v0"])
super().__init__(cfg, vary_env_count=False, vary_cnn=True, vary_mlp=False)
class CartpoleRGBJobCfg(vision_cfg.CameraJobCfg):
def __init__(self, cfg: dict = {}):
cfg = util.populate_isaac_ray_cfg_args(cfg)
cfg["runner_args"]["--task"] = tune.choice(["Isaac-Cartpole-RGB-v0"])
super().__init__(cfg, vary_env_count=True, vary_cnn=True, vary_mlp=True)
class CartpoleResNetJobCfg(vision_cfg.ResNetCameraJob):
def __init__(self, cfg: dict = {}):
cfg = util.populate_isaac_ray_cfg_args(cfg)
cfg["runner_args"]["--task"] = tune.choice(["Isaac-Cartpole-RGB-ResNet18-v0"])
super().__init__(cfg)
class CartpoleTheiaJobCfg(vision_cfg.TheiaCameraJob):
def __init__(self, cfg: dict = {}):
cfg = util.populate_isaac_ray_cfg_args(cfg)
cfg["runner_args"]["--task"] = tune.choice(["Isaac-Cartpole-RGB-TheiaTiny-v0"])
super().__init__(cfg)
# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
import pathlib
import sys
# Allow for import of items from the ray workflow.
UTIL_DIR = pathlib.Path(__file__).parent.parent.parent
sys.path.append(str(UTIL_DIR))
import tuner
import util
from ray import tune
class CameraJobCfg(tuner.JobCfg):
"""In order to be compatible with :meth: invoke_tuning_run, and
:class:IsaacLabTuneTrainable , configurations should
be in a similar format to this class. This class can vary env count/horizon length,
CNN structure, and MLP structure. Broad possible ranges are set, the specific values
that work can be found via tuning. Tuning results can inform better ranges for a second tuning run.
These ranges were selected for demonstration purposes. Best ranges are run/task specific."""
@staticmethod
def _get_batch_size_divisors(batch_size: int, min_size: int = 128) -> list[int]:
"""Get valid batch divisors to combine with num_envs and horizon length"""
divisors = [i for i in range(min_size, batch_size + 1) if batch_size % i == 0]
return divisors if divisors else [min_size]
def __init__(self, cfg={}, vary_env_count: bool = False, vary_cnn: bool = False, vary_mlp: bool = False):
cfg = util.populate_isaac_ray_cfg_args(cfg)
# Basic configuration
cfg["runner_args"]["headless_singleton"] = "--headless"
cfg["runner_args"]["enable_cameras_singleton"] = "--enable_cameras"
cfg["hydra_args"]["agent.params.config.max_epochs"] = 200
if vary_env_count: # Vary the env count, and horizon length, and select a compatible mini-batch size
# Check from 512 to 8196 envs in powers of 2
# check horizon lengths of 8 to 256
# More envs should be better, but different batch sizes can improve gradient estimation
env_counts = [2**x for x in range(9, 13)]
horizon_lengths = [2**x for x in range(3, 8)]
selected_env_count = tune.choice(env_counts)
selected_horizon = tune.choice(horizon_lengths)
cfg["runner_args"]["--num_envs"] = selected_env_count
cfg["hydra_args"]["agent.params.config.horizon_length"] = selected_horizon
def get_valid_batch_size(config):
num_envs = config["runner_args"]["--num_envs"]
horizon_length = config["hydra_args"]["agent.params.config.horizon_length"]
total_batch = horizon_length * num_envs
divisors = self._get_batch_size_divisors(total_batch)
return divisors[0]
cfg["hydra_args"]["agent.params.config.minibatch_size"] = tune.sample_from(get_valid_batch_size)
if vary_cnn: # Vary the depth, and size of the layers in the CNN part of the agent
# Also varies kernel size, and stride.
num_layers = tune.randint(2, 3)
cfg["hydra_args"]["agent.params.network.cnn.type"] = "conv2d"
cfg["hydra_args"]["agent.params.network.cnn.activation"] = tune.choice(["relu", "elu"])
cfg["hydra_args"]["agent.params.network.cnn.initializer"] = "{name:default}"
cfg["hydra_args"]["agent.params.network.cnn.regularizer"] = "{name:None}"
def get_cnn_layers(_):
layers = []
size = 64 # Initial input size
for _ in range(num_layers.sample()):
# Get valid kernel sizes for current size
valid_kernels = [k for k in [3, 4, 6, 8, 10, 12] if k <= size]
if not valid_kernels:
break
kernel = int(tune.choice([str(k) for k in valid_kernels]).sample())
stride = int(tune.choice(["1", "2", "3", "4"]).sample())
padding = int(tune.choice(["0", "1"]).sample())
# Calculate next size
next_size = ((size + 2 * padding - kernel) // stride) + 1
if next_size <= 0:
break
layers.append({
"filters": tune.randint(16, 32).sample(),
"kernel_size": str(kernel),
"strides": str(stride),
"padding": str(padding),
})
size = next_size
return layers
cfg["hydra_args"]["agent.params.network.cnn.convs"] = tune.sample_from(get_cnn_layers)
if vary_mlp: # Vary the MLP structure; neurons (units) per layer, number of layers,
max_num_layers = 6
max_neurons_per_layer = 128
if "env.observations.policy.image.params.model_name" in cfg["hydra_args"]:
# By decreasing MLP size when using pretrained helps prevent out of memory on L4
max_num_layers = 3
max_neurons_per_layer = 32
if "agent.params.network.cnn.convs" in cfg["hydra_args"]:
# decrease MLP size to prevent running out of memory on L4
max_num_layers = 2
max_neurons_per_layer = 32
num_layers = tune.randint(1, max_num_layers)
def get_mlp_layers(_):
return [tune.randint(4, max_neurons_per_layer).sample() for _ in range(num_layers.sample())]
cfg["hydra_args"]["agent.params.network.mlp.units"] = tune.sample_from(get_mlp_layers)
cfg["hydra_args"]["agent.params.network.mlp.initializer.name"] = tune.choice(["default"]).sample()
cfg["hydra_args"]["agent.params.network.mlp.activation"] = tune.choice(
["relu", "tanh", "sigmoid", "elu"]
).sample()
super().__init__(cfg)
class ResNetCameraJob(CameraJobCfg):
"""Try different ResNet sizes."""
def __init__(self, cfg: dict = {}):
cfg = util.populate_isaac_ray_cfg_args(cfg)
cfg["hydra_args"]["env.observations.policy.image.params.model_name"] = tune.choice(
["resnet18", "resnet34", "resnet50", "resnet101"]
)
super().__init__(cfg, vary_env_count=True, vary_cnn=False, vary_mlp=True)
class TheiaCameraJob(CameraJobCfg):
"""Try different Theia sizes."""
def __init__(self, cfg: dict = {}):
cfg = util.populate_isaac_ray_cfg_args(cfg)
cfg["hydra_args"]["env.observations.policy.image.params.model_name"] = tune.choice([
"theia-tiny-patch16-224-cddsv",
"theia-tiny-patch16-224-cdiv",
"theia-small-patch16-224-cdiv",
"theia-base-patch16-224-cdiv",
"theia-small-patch16-224-cddsv",
"theia-base-patch16-224-cddsv",
])
super().__init__(cfg, vary_env_count=True, vary_cnn=False, vary_mlp=True)
# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
import argparse
import pathlib
import subprocess
import yaml
from jinja2 import Environment, FileSystemLoader
from kubernetes import config
import source.standalone.workflows.ray.util as util
"""This script helps create one or more KubeRay clusters.
Usage:
.. code-block:: bash
# If the head node is stuck on container creating, make sure to create a secret
./isaaclab.sh -p source/standalone/workflows/ray/launch.py -h
# Examples
# The following creates 8 GPUx1 nvidia l4 workers
./isaaclab.sh -p source/standalone/workflows/ray/launch.py --cluster_host google_cloud \
--namespace <NAMESPACE> --image <YOUR_ISAAC_RAY_IMAGE> \
--num_workers 8 --num_clusters 1 --worker_accelerator nvidia-l4 --gpu_per_worker 1
# The following creates 1 GPUx1 nvidia l4 worker, 2 GPUx2 nvidia-tesla-t4 workers,
# and 2 GPUx4 nvidia-tesla-t4 GPU workers
./isaaclab.sh -p source/standalone/workflows/ray/launch.py --cluster_host google_cloud \
--namespace <NAMESPACE> --image <YOUR_ISAAC_RAY_IMAGE> \
--num_workers 1 2 --num_clusters 1 \
--worker_accelerator nvidia-l4 nvidia-tesla-t4 --gpu_per_worker 1 2 4
"""
RAY_DIR = pathlib.Path(__file__).parent
def apply_manifest(args: argparse.Namespace) -> None:
"""Provided a Jinja templated ray.io/v1alpha1 file,
populate the arguments and create the cluster. Additionally, create
kubernetes containers for resources separated by '---' from the rest
of the file.
Args:
args: Possible arguments concerning cluster parameters.
"""
# Load Kubernetes configuration
config.load_kube_config()
# Set up Jinja2 environment for loading templates
templates_dir = RAY_DIR / "cluster_configs" / args.cluster_host
file_loader = FileSystemLoader(str(templates_dir))
jinja_env = Environment(loader=file_loader, keep_trailing_newline=True)
# Define template filename
template_file = "kuberay.yaml.jinja"
# Convert args namespace to a dictionary
template_params = vars(args)
# Load and render the template
template = jinja_env.get_template(template_file)
file_contents = template.render(template_params)
# Parse all YAML documents in the rendered template
all_yamls = []
for doc in yaml.safe_load_all(file_contents):
all_yamls.append(doc)
# Convert back to YAML string, preserving multiple documents
cleaned_yaml_string = ""
for i, doc in enumerate(all_yamls):
if i > 0:
cleaned_yaml_string += "\n---\n"
cleaned_yaml_string += yaml.dump(doc)
# Apply the Kubernetes manifest using kubectl
try:
subprocess.run(["kubectl", "apply", "-f", "-"], input=cleaned_yaml_string, text=True, check=True)
except subprocess.CalledProcessError as e:
exit(f"An error occurred while running `kubectl`: {e}")
def parse_args() -> argparse.Namespace:
"""
Parse command-line arguments for Kubernetes deployment script.
Returns:
argparse.Namespace: Parsed command-line arguments.
"""
arg_parser = argparse.ArgumentParser(
description="Script to apply manifests to create Kubernetes objects for Ray clusters.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
arg_parser.add_argument(
"--cluster_host",
type=str,
default="google_cloud",
choices=["google_cloud"],
help=(
"In the cluster_configs directory, the name of the folder where a tune.yaml.jinja"
"file exists defining the KubeRay config. Currently only google_cloud is supported."
),
)
arg_parser.add_argument(
"--name",
type=str,
required=False,
default="isaacray",
help="Name of the Kubernetes deployment.",
)
arg_parser.add_argument(
"--namespace",
type=str,
required=False,
default="default",
help="Kubernetes namespace to deploy the Ray cluster.",
)
arg_parser.add_argument(
"--service_acount_name", type=str, required=False, default="default", help="The service account name to use."
)
arg_parser.add_argument(
"--image",
type=str,
required=True,
help="Docker image for the Ray cluster pods.",
)
arg_parser.add_argument(
"--worker_accelerator",
nargs="+",
type=str,
default=["nvidia-l4"],
help="GPU accelerator name. Supply more than one for heterogeneous resources.",
)
arg_parser = util.add_resource_arguments(arg_parser, cluster_create_defaults=True)
arg_parser.add_argument(
"--num_clusters",
type=int,
default=1,
help="How many Ray Clusters to create.",
)
arg_parser.add_argument(
"--num_head_cpu",
type=float, # to be able to schedule partial CPU heads
default=8,
help="The number of CPUs to give the Ray head.",
)
arg_parser.add_argument("--head_ram_gb", type=int, default=8, help="How many gigs of ram to give the Ray head")
args = arg_parser.parse_args()
return util.fill_in_missing_resources(args, cluster_creation_flag=True)
def main():
args = parse_args()
if "head" in args.name:
raise ValueError("For compatibility with other scripts, do not include head in the name")
if args.num_clusters == 1:
apply_manifest(args)
else:
default_name = args.name
for i in range(args.num_clusters):
args.name = default_name + "-" + str(i)
apply_manifest(args)
if __name__ == "__main__":
main()
# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
import argparse
import logging
import multiprocessing as mp
import os
import sys
from concurrent.futures import ProcessPoolExecutor, as_completed
from torch.utils.tensorboard import SummaryWriter
import mlflow
from mlflow.tracking import MlflowClient
def setup_logging(level=logging.INFO):
logging.basicConfig(level=level, format="%(asctime)s - %(levelname)s - %(message)s")
def get_existing_runs(download_dir: str) -> set[str]:
"""Get set of run IDs that have already been downloaded."""
existing_runs = set()
tensorboard_dir = os.path.join(download_dir, "tensorboard")
if os.path.exists(tensorboard_dir):
for entry in os.listdir(tensorboard_dir):
if entry.startswith("run_"):
existing_runs.add(entry[4:])
return existing_runs
def process_run(args):
"""Convert MLflow run to TensorBoard format."""
run_id, download_dir, tracking_uri = args
try:
# Set up MLflow client
mlflow.set_tracking_uri(tracking_uri)
client = MlflowClient()
run = client.get_run(run_id)
# Create TensorBoard writer
tensorboard_log_dir = os.path.join(download_dir, "tensorboard", f"run_{run_id}")
writer = SummaryWriter(log_dir=tensorboard_log_dir)
# Log parameters
for key, value in run.data.params.items():
writer.add_text(f"params/{key}", str(value))
# Log metrics with history
for key in run.data.metrics.keys():
history = client.get_metric_history(run_id, key)
for m in history:
writer.add_scalar(f"metrics/{key}", m.value, m.step)
# Log tags
for key, value in run.data.tags.items():
writer.add_text(f"tags/{key}", str(value))
writer.close()
return run_id, True
except Exception:
return run_id, False
def download_experiment_tensorboard_logs(uri: str, experiment_name: str, download_dir: str) -> None:
"""Download MLflow experiment logs and convert to TensorBoard format."""
logger = logging.getLogger(__name__)
try:
# Set up MLflow
mlflow.set_tracking_uri(uri)
logger.info(f"Connected to MLflow tracking server at {uri}")
# Get experiment
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is None:
raise ValueError(f"Experiment '{experiment_name}' not found at URI '{uri}'.")
# Get all runs
runs = mlflow.search_runs([experiment.experiment_id])
logger.info(f"Found {len(runs)} total runs in experiment '{experiment_name}'")
# Check existing runs
existing_runs = get_existing_runs(download_dir)
logger.info(f"Found {len(existing_runs)} existing runs in {download_dir}")
# Create directory structure
os.makedirs(os.path.join(download_dir, "tensorboard"), exist_ok=True)
# Process new runs
new_run_ids = [run.run_id for _, run in runs.iterrows() if run.run_id not in existing_runs]
if not new_run_ids:
logger.info("No new runs to process")
return
logger.info(f"Processing {len(new_run_ids)} new runs...")
# Process runs in parallel
num_processes = min(mp.cpu_count(), len(new_run_ids))
processed = 0
with ProcessPoolExecutor(max_workers=num_processes) as executor:
future_to_run = {
executor.submit(process_run, (run_id, download_dir, uri)): run_id for run_id in new_run_ids
}
for future in as_completed(future_to_run):
run_id = future_to_run[future]
try:
run_id, success = future.result()
processed += 1
if success:
logger.info(f"[{processed}/{len(new_run_ids)}] Successfully processed run {run_id}")
else:
logger.error(f"[{processed}/{len(new_run_ids)}] Failed to process run {run_id}")
except Exception as e:
logger.error(f"Error processing run {run_id}: {e}")
logger.info(f"\nAll data saved to {download_dir}/tensorboard")
except Exception as e:
logger.error(f"Error during download: {e}")
raise
def main():
parser = argparse.ArgumentParser(description="Download MLflow experiment logs for TensorBoard visualization.")
parser.add_argument("--uri", required=True, help="The MLflow tracking URI (e.g., http://localhost:5000)")
parser.add_argument("--experiment-name", required=True, help="Name of the experiment to download")
parser.add_argument("--download-dir", required=True, help="Directory to save TensorBoard logs")
parser.add_argument("--debug", action="store_true", help="Enable debug logging")
args = parser.parse_args()
setup_logging(level=logging.DEBUG if args.debug else logging.INFO)
try:
download_experiment_tensorboard_logs(args.uri, args.experiment_name, args.download_dir)
print("\nSuccess! To view the logs, run:")
print(f"tensorboard --logdir {os.path.join(args.download_dir, 'tensorboard')}")
except Exception as e:
logging.error(f"Failed to download experiment logs: {e}")
sys.exit(1)
if __name__ == "__main__":
main()
# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
import argparse
import os
import time
from concurrent.futures import ThreadPoolExecutor
from ray import job_submission
"""
This script submits aggregate job(s) to cluster(s) described in a
config file containing ``name: <NAME> address: http://<IP>:<PORT>`` on
a new line for each cluster. For KubeRay clusters, this file
can be automatically created with :file:`grok_cluster_with_kubectl.py`
Aggregate job(s) are matched with cluster(s) via the following relation:
cluster_line_index_submitted_to = job_index % total_cluster_count
Aggregate jobs are separated by the * delimiter. The ``--aggregate_jobs`` argument must be
the last argument supplied to the script.
An aggregate job could be a :file:`../tuner.py` tuning job, which automatically
creates several individual jobs when started on a cluster. Alternatively, an aggregate job
could be a :file:'../wrap_resources.py` resource-wrapped job,
which may contain several individual sub-jobs separated by
the + delimiter.
If there are more aggregate jobs than cluster(s), aggregate jobs will be submitted
as clusters become available via the defined relation above. If there are less aggregate job(s)
than clusters, some clusters will not receive aggregate job(s). The maximum number of
aggregate jobs that can be run simultaneously is equal to the number of workers created by
default by a ThreadPoolExecutor on the machine submitting jobs due to fetching the log output after
jobs finish, which is unlikely to constrain overall-job submission.
Usage:
.. code-block:: bash
# Example; submitting a tuning job
./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py \
--aggregate_jobs /workspace/isaaclab/source/standalone/workflows/ray/tuner.py \
--cfg_file hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleRGBNoTuneJobCfg --mlflow_uri <ML_FLOW_URI>
# Example: Submitting resource wrapped job
./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py --aggregate_jobs wrap_resources.py --sub_jobs ./isaaclab.sh -p source/standalone/workflows/rl_games/train.py --task Isaac-Cartpole-v0 --headless+./isaaclab.sh -p source/standalone/workflows/rl_games/train.py --task Isaac-Cartpole-RGB-Camera-Direct-v0 --headless --enable_cameras agent.params.config.max_epochs=150
# For all command line arguments
./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py -h
"""
script_directory = os.path.dirname(os.path.abspath(__file__))
CONFIG = {"working_dir": script_directory, "executable": "/workspace/isaaclab/isaaclab.sh -p"}
def read_cluster_spec(fn: str | None = None) -> list[dict]:
if fn is None:
cluster_spec_path = os.path.expanduser("~/.cluster_config")
else:
cluster_spec_path = os.path.expanduser(fn)
if not os.path.exists(cluster_spec_path):
raise FileNotFoundError(f"Cluster spec file not found at {cluster_spec_path}")
clusters = []
with open(cluster_spec_path) as f:
for line in f:
parts = line.strip().split(" ")
http_address = parts[3]
cluster_info = {"name": parts[1], "address": http_address}
print(f"[INFO] Setting {cluster_info['name']}") # with {cluster_info['num_gpu']} GPUs.")
clusters.append(cluster_info)
return clusters
def submit_job(cluster: dict, job_command: str) -> None:
"""
Submits a job to a single cluster, prints the final result and Ray dashboard URL at the end.
"""
address = cluster["address"]
cluster_name = cluster["name"]
print(f"[INFO]: Submitting job to cluster '{cluster_name}' at {address}") # with {num_gpus} GPUs.")
client = job_submission.JobSubmissionClient(address)
runtime_env = {"working_dir": CONFIG["working_dir"], "executable": CONFIG["executable"]}
print(f"[INFO]: Checking contents of the directory: {CONFIG['working_dir']}")
try:
dir_contents = os.listdir(CONFIG["working_dir"])
print(f"[INFO]: Directory contents: {dir_contents}")
except Exception as e:
print(f"[INFO]: Failed to list directory contents: {str(e)}")
entrypoint = f"{CONFIG['executable']} {job_command}"
print(f"[INFO]: Attempting entrypoint {entrypoint=} in cluster {cluster}")
job_id = client.submit_job(entrypoint=entrypoint, runtime_env=runtime_env)
status = client.get_job_status(job_id)
while status in [job_submission.JobStatus.PENDING, job_submission.JobStatus.RUNNING]:
time.sleep(5)
status = client.get_job_status(job_id)
final_logs = client.get_job_logs(job_id)
print("----------------------------------------------------")
print(f"[INFO]: Cluster {cluster_name} Logs: \n")
print(final_logs)
print("----------------------------------------------------")
def submit_jobs_to_clusters(jobs: list[str], clusters: list[dict]) -> None:
"""
Submit all jobs to their respective clusters, cycling through clusters if there are more jobs than clusters.
"""
if not clusters:
raise ValueError("No clusters available for job submission.")
if len(jobs) < len(clusters):
print("[INFO]: Less jobs than clusters, some clusters will not receive jobs")
elif len(jobs) == len(clusters):
print("[INFO]: Exactly one job per cluster")
else:
print("[INFO]: More jobs than clusters, jobs submitted as clusters become available.")
with ThreadPoolExecutor() as executor:
for idx, job_command in enumerate(jobs):
# Cycle through clusters using modulus to wrap around if there are more jobs than clusters
cluster = clusters[idx % len(clusters)]
executor.submit(submit_job, cluster, job_command)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Submit multiple GPU jobs to multiple Ray clusters.")
parser.add_argument("--config_file", default="~/.cluster_config", help="The cluster config path.")
parser.add_argument(
"--aggregate_jobs",
type=str,
nargs=argparse.REMAINDER,
help="This should be last argument. The aggregate jobs to submit separated by the * delimiter.",
)
args = parser.parse_args()
if args.aggregate_jobs is not None:
jobs = " ".join(args.aggregate_jobs)
formatted_jobs = jobs.split("*")
if len(formatted_jobs) > 1:
print("Warning; Split jobs by cluster with the * delimiter")
else:
formatted_jobs = []
print(f"[INFO]: Isaac Ray Wrapper received jobs {formatted_jobs=}")
clusters = read_cluster_spec(args.config_file)
submit_jobs_to_clusters(formatted_jobs, clusters)
# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
import argparse
import importlib.util
import os
import sys
from time import sleep
import ray
import util
from ray import air, tune
from ray.tune.search.optuna import OptunaSearch
from ray.tune.search.repeater import Repeater
"""
This script breaks down an aggregate tuning job, as defined by a hyperparameter sweep configuration,
into individual jobs (shell commands) to run on the GPU-enabled nodes of the cluster.
By default, (unless combined as a sub-job in a resource-wrapped aggregate job), one worker is created
for each GPU-enabled node in the cluster for each individual job.
Each hyperparameter sweep configuration should include the workflow,
runner arguments, and hydra arguments to vary.
This assumes that all workers in a cluster are homogeneous. For heterogeneous workloads,
create several heterogeneous clusters (with homogeneous nodes in each cluster),
then submit several overall-cluster jobs with :file:`../submit_job.py`.
KubeRay clusters on Google GKE can be created with :file:`../launch.py`
To report tune metrics on clusters, a running MLFlow server with a known URI that the cluster has
access to is required. For KubeRay clusters configured with :file:`../launch.py`, this is included
automatically, and can be easily found with with :file:`grok_cluster_with_kubectl.py`
Usage:
.. code-block:: bash
./isaaclab.sh -p source/standalone/workflows/ray/tuner.py -h
# Examples
# Local (not within a docker container, when within a local docker container, do not supply run_mode argument)
./isaaclab.sh -p source/standalone/workflows/ray/tuner.py --run_mode local \
--cfg_file source/standalone/workflows/ray/hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleRGBNoTuneJobCfg
# Local docker: start the ray server and run above command in the same running container without run_mode arg
# Remote (run grok cluster or create config file mentioned in :file:`submit_job.py`)
./isaaclab.sh -p source/standalone/workflows/ray/submit_job.py \
--aggregate_jobs tuner.py \
--cfg_file hyperparameter_tuning/vision_cartpole_cfg.py \
--cfg_class CartpoleRGBNoTuneJobCfg --mlflow_uri <MLFLOW_URI_FROM_GROK_OR_MANUAL>
"""
DOCKER_PREFIX = "/workspace/isaaclab/"
BASE_DIR = os.path.expanduser("~")
PYTHON_EXEC = "./isaaclab.sh -p"
WORKFLOW = "source/standalone/workflows/rl_games/train.py"
NUM_WORKERS_PER_NODE = 1 # needed for local parallelism
class IsaacLabTuneTrainable(tune.Trainable):
"""The Isaac Lab Ray Tune Trainable.
This class uses the standalone workflows to start jobs, along with the hydra integration.
This class achieves Ray-based logging through reading the tensorboard logs from
the standalone workflows. This depends on a config generated in the format of
:class:`JobCfg`
"""
def setup(self, config: dict) -> None:
"""Get the invocation command, return quick for easy scheduling."""
self.data = None
self.invoke_cmd = util.get_invocation_command_from_cfg(cfg=config, python_cmd=PYTHON_EXEC, workflow=WORKFLOW)
print(f"[INFO]: Recovered invocation with {self.invoke_cmd}")
self.experiment = None
def reset_config(self, new_config):
"""Allow environments to be re-used by fetching a new invocation command"""
self.setup(new_config)
return True
def step(self) -> dict:
if self.experiment is None: # start experiment
# When including this as first step instead of setup, experiments get scheduled faster
# Don't want to block the scheduler while the experiment spins up
print(f"[INFO]: Invoking experiment as first step with {self.invoke_cmd}...")
experiment = util.execute_job(
self.invoke_cmd,
identifier_string="",
extract_experiment=True,
persistent_dir=BASE_DIR,
)
self.experiment = experiment
print(f"[INFO]: Tuner recovered experiment info {experiment}")
self.proc = experiment["proc"]
self.experiment_name = experiment["experiment_name"]
self.isaac_logdir = experiment["logdir"]
self.tensorboard_logdir = self.isaac_logdir + f"/{self.experiment_name}/summaries"
self.done = False
if self.proc is None:
raise ValueError("Could not start trial.")
if self.proc.poll() is not None: # process finished, signal finish
self.data["done"] = True
print("[INFO]: Process finished, returning...")
else: # wait until the logs are ready or fresh
data = util.load_tensorboard_logs(self.tensorboard_logdir)
while data is None:
data = util.load_tensorboard_logs(self.tensorboard_logdir)
sleep(2) # Lazy report metrics to avoid performance overhead
if self.data is not None:
while util._dicts_equal(data, self.data):
data = util.load_tensorboard_logs(self.tensorboard_logdir)
sleep(2) # Lazy report metrics to avoid performance overhead
self.data = data
self.data["done"] = False
return self.data
def default_resource_request(self):
"""How many resources each trainable uses. Assumes homogeneous resources across gpu nodes,
and that each trainable is meant for one node, where it uses all available resources."""
resources = util.get_gpu_node_resources(one_node_only=True)
if NUM_WORKERS_PER_NODE != 1:
print("[WARNING]: Splitting node into more than one worker")
return tune.PlacementGroupFactory(
[{"CPU": resources["CPU"] / NUM_WORKERS_PER_NODE, "GPU": resources["GPU"] / NUM_WORKERS_PER_NODE}],
strategy="STRICT_PACK",
)
def invoke_tuning_run(cfg: dict, args: argparse.Namespace) -> None:
"""Invoke an Isaac-Ray tuning run.
Log either to a local directory or to MLFlow.
Args:
cfg: Configuration dictionary extracted from job setup
args: Command-line arguments related to tuning.
"""
# Allow for early exit
os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"
print("[WARNING]: Not saving checkpoints, just running experiment...")
print("[INFO]: Model parameters and metrics will be preserved.")
print("[WARNING]: For homogeneous cluster resources only...")
# Get available resources
resources = util.get_gpu_node_resources()
print(f"[INFO]: Available resources {resources}")
if not ray.is_initialized():
ray.init(
address=args.ray_address,
log_to_driver=True,
num_gpus=len(resources),
)
print(f"[INFO]: Using config {cfg}")
# Configure the search algorithm and the repeater
searcher = OptunaSearch(
metric=args.metric,
mode=args.mode,
)
repeat_search = Repeater(searcher, repeat=args.repeat_run_count)
if args.run_mode == "local": # Standard config, to file
run_config = air.RunConfig(
storage_path="/tmp/ray",
name=f"IsaacRay-{args.cfg_class}-tune",
verbose=1,
checkpoint_config=air.CheckpointConfig(
checkpoint_frequency=0, # Disable periodic checkpointing
checkpoint_at_end=False, # Disable final checkpoint
),
)
elif args.run_mode == "remote": # MLFlow, to MLFlow server
mlflow_callback = MLflowLoggerCallback(
tracking_uri=args.mlflow_uri,
experiment_name=f"IsaacRay-{args.cfg_class}-tune",
save_artifact=False,
tags={"run_mode": "remote", "cfg_class": args.cfg_class},
)
run_config = ray.train.RunConfig(
name="mlflow",
storage_path="/tmp/ray",
callbacks=[mlflow_callback],
checkpoint_config=ray.train.CheckpointConfig(checkpoint_frequency=0, checkpoint_at_end=False),
)
else:
raise ValueError("Unrecognized run mode.")
# Configure the tuning job
tuner = tune.Tuner(
IsaacLabTuneTrainable,
param_space=cfg,
tune_config=tune.TuneConfig(
search_alg=repeat_search,
num_samples=args.num_samples,
reuse_actors=True,
),
run_config=run_config,
)
# Execute the tuning
tuner.fit()
# Save results to mounted volume
if args.run_mode == "local":
print("[DONE!]: Check results with tensorboard dashboard")
else:
print("[DONE!]: Check results with MLFlow dashboard")
class JobCfg:
"""To be compatible with :meth: invoke_tuning_run and :class:IsaacLabTuneTrainable,
at a minimum, the tune job should inherit from this class."""
def __init__(self, cfg):
assert "runner_args" in cfg, "No runner arguments specified."
assert "--task" in cfg["runner_args"], "No task specified."
assert "hydra_args" in cfg, "No hypeparameters specified."
self.cfg = cfg
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Tune Isaac Lab hyperparameters.")
parser.add_argument("--ray_address", type=str, default="auto", help="the Ray address.")
parser.add_argument(
"--cfg_file",
type=str,
default="hyperparameter_tuning/vision_cartpole_cfg.py",
required=False,
help="The relative filepath where a hyperparameter sweep is defined",
)
parser.add_argument(
"--cfg_class",
type=str,
default="CartpoleRGBNoTuneJobCfg",
required=False,
help="Name of the hyperparameter sweep class to use",
)
parser.add_argument(
"--run_mode",
choices=["local", "remote"],
default="remote",
help=(
"Set to local to use ./isaaclab.sh -p python, set to "
"remote to use /workspace/isaaclab/isaaclab.sh -p python"
),
)
parser.add_argument(
"--workflow",
default=None, # populated with RL Games
help="The absolute path of the workflow to use for the experiment. By default, RL Games is used.",
)
parser.add_argument(
"--mlflow_uri",
type=str,
default=None,
required=False,
help="The MLFlow Uri.",
)
parser.add_argument(
"--num_workers_per_node",
type=int,
default=1,
help="Number of workers to run on each GPU node. Only supply for parallelism on multi-gpu nodes",
)
parser.add_argument("--metric", type=str, default="rewards/time", help="What metric to tune for.")
parser.add_argument(
"--mode",
choices=["max", "min"],
default="max",
help="What to optimize the metric to while tuning",
)
parser.add_argument(
"--num_samples",
type=int,
default=100,
help="How many hyperparameter runs to try total.",
)
parser.add_argument(
"--repeat_run_count",
type=int,
default=3,
help="How many times to repeat each hyperparameter config.",
)
args = parser.parse_args()
NUM_WORKERS_PER_NODE = args.num_workers_per_node
print(f"[INFO]: Using {NUM_WORKERS_PER_NODE} workers per node.")
if args.run_mode == "remote":
BASE_DIR = DOCKER_PREFIX # ensure logs are dumped to persistent location
PYTHON_EXEC = DOCKER_PREFIX + PYTHON_EXEC[2:]
if args.workflow is None:
WORKFLOW = DOCKER_PREFIX + WORKFLOW
else:
WORKFLOW = args.workflow
print(f"[INFO]: Using remote mode {PYTHON_EXEC=} {WORKFLOW=}")
if args.mlflow_uri is not None:
import mlflow
mlflow.set_tracking_uri(args.mlflow_uri)
from ray.air.integrations.mlflow import MLflowLoggerCallback
else:
raise ValueError("Please provide a result MLFLow URI server.")
else: # local
PYTHON_EXEC = os.getcwd() + "/" + PYTHON_EXEC[2:]
if args.workflow is None:
WORKFLOW = os.getcwd() + "/" + WORKFLOW
else:
WORKFLOW = args.workflow
BASE_DIR = os.getcwd()
print(f"[INFO]: Using local mode {PYTHON_EXEC=} {WORKFLOW=}")
file_path = args.cfg_file
class_name = args.cfg_class
print(f"[INFO]: Attempting to use sweep config from {file_path=} {class_name=}")
module_name = os.path.splitext(os.path.basename(file_path))[0]
spec = importlib.util.spec_from_file_location(module_name, file_path)
module = importlib.util.module_from_spec(spec)
sys.modules[module_name] = module
spec.loader.exec_module(module)
print(f"[INFO]: Successfully imported {module_name} from {file_path}")
if hasattr(module, class_name):
ClassToInstantiate = getattr(module, class_name)
print(f"[INFO]: Found correct class {ClassToInstantiate}")
instance = ClassToInstantiate()
print(f"[INFO]: Successfully instantiated class '{class_name}' from {file_path}")
cfg = instance.cfg
print(f"[INFO]: Grabbed the following hyperparameter sweep config: \n {cfg}")
invoke_tuning_run(cfg, args)
else:
raise AttributeError(f"[ERROR]:Class '{class_name}' not found in {file_path}")
# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
import argparse
import os
import re
import subprocess
from datetime import datetime
from math import isclose
import ray
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
def load_tensorboard_logs(directory: str) -> dict:
"""From a tensorboard directory, get the latest scalar values.
Args:
directory: The directory of the tensorboard logging.
Returns:
The latest available scalar values.
"""
# Initialize the event accumulator with a size guidance for only the latest entry
size_guidance = {"scalars": 1} # Load only the latest entry for scalars
event_acc = EventAccumulator(directory, size_guidance=size_guidance)
event_acc.Reload() # Load all data from the directory
# Extract the latest scalars logged
latest_scalars = {}
for tag in event_acc.Tags()["scalars"]:
events = event_acc.Scalars(tag)
if events: # Check if there is at least one entry
latest_event = events[-1] # Get the latest entry
latest_scalars[tag] = latest_event.value
return latest_scalars
def get_invocation_command_from_cfg(
cfg: dict,
python_cmd: str = "/workspace/isaaclab/isaaclab.sh -p",
workflow: str = "source/standalone/workflows/rl_games/train.py",
) -> str:
"""Generate command with proper Hydra arguments"""
runner_args = []
hydra_args = []
def process_args(args, target_list, is_hydra=False):
for key, value in args.items():
if not is_hydra:
if key.endswith("_singleton"):
target_list.append(value)
elif key.startswith("--"):
target_list.append(f"{key} {value}") # Space instead of = for runner args
else:
target_list.append(f"{value}")
else:
if isinstance(value, list):
# Check the type of the first item to determine formatting
if value and isinstance(value[0], dict):
# Handle list of dictionaries (e.g., CNN convs)
formatted_items = [f"{{{','.join(f'{k}:{v}' for k, v in item.items())}}}" for item in value]
else:
# Handle list of primitives (e.g., MLP units)
formatted_items = [str(x) for x in value]
target_list.append(f"'{key}=[{','.join(formatted_items)}]'")
elif isinstance(value, str) and ("{" in value or "}" in value):
target_list.append(f"'{key}={value}'")
else:
target_list.append(f"{key}={value}")
print(f"[INFO]: Starting workflow {workflow}")
process_args(cfg["runner_args"], runner_args)
print(f"[INFO]: Retrieved workflow runner args: {runner_args}")
process_args(cfg["hydra_args"], hydra_args, is_hydra=True)
print(f"[INFO]: Retrieved hydra args: {hydra_args}")
invoke_cmd = f"{python_cmd} {workflow} "
invoke_cmd += " ".join(runner_args) + " " + " ".join(hydra_args)
return invoke_cmd
@ray.remote
def remote_execute_job(
job_cmd: str, identifier_string: str, test_mode: bool = False, extract_experiment: bool = False
) -> str | dict:
"""This method has an identical signature to :meth:`execute_job`, with the ray remote decorator"""
return execute_job(
job_cmd=job_cmd, identifier_string=identifier_string, test_mode=test_mode, extract_experiment=extract_experiment
)
def execute_job(
job_cmd: str,
identifier_string: str = "job 0",
test_mode: bool = False,
extract_experiment: bool = False,
persistent_dir: str | None = None,
log_all_output: bool = False,
) -> str | dict:
"""Issue a job (shell command).
Args:
job_cmd: The shell command to run.
identifier_string: What prefix to add to make logs easier to differentiate
across clusters or jobs. Defaults to "job 0".
test_mode: When true, only run 'nvidia-smi'. Defaults to False.
extract_experiment: When true, search for experiment details from a training run. Defaults to False.
persistent_dir: When supplied, change to run the directory in a persistent
directory. Can be used to avoid losing logs in the /tmp directory. Defaults to None.
log_all_output: When true, print all output to the console. Defaults to False.
Raises:
ValueError: If the job is unable to start, or throws an error. Most likely to happen
due to running out of memory.
Returns:
Relevant information from the job
"""
start_time = datetime.now().strftime("%H:%M:%S.%f")
result_details = [f"{identifier_string}: ---------------------------------\n"]
result_details.append(f"{identifier_string}:[INFO]: Invocation {job_cmd} \n")
node_id = ray.get_runtime_context().get_node_id()
result_details.append(f"{identifier_string}:[INFO]: Ray Node ID: {node_id} \n")
if test_mode:
import torch
try:
result = subprocess.run(
["nvidia-smi", "--query-gpu=name,memory.free,serial", "--format=csv,noheader,nounits"],
capture_output=True,
check=True,
text=True,
)
output = result.stdout.strip().split("\n")
for gpu_info in output:
name, memory_free, serial = gpu_info.split(", ")
result_details.append(
f"{identifier_string}[INFO]: Name: {name}|Memory Available: {memory_free} MB|Serial Number"
f" {serial} \n"
)
# Get GPU count from PyTorch
num_gpus_detected = torch.cuda.device_count()
result_details.append(f"{identifier_string}[INFO]: Detected GPUs from PyTorch: {num_gpus_detected} \n")
# Check CUDA_VISIBLE_DEVICES and count the number of visible GPUs
cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES")
if cuda_visible_devices:
visible_devices_count = len(cuda_visible_devices.split(","))
result_details.append(
f"{identifier_string}[INFO]: GPUs visible via CUDA_VISIBLE_DEVICES: {visible_devices_count} \n"
)
else:
visible_devices_count = len(output) # All GPUs visible if CUDA_VISIBLE_DEVICES is not set
result_details.append(
f"{identifier_string}[INFO]: CUDA_VISIBLE_DEVICES not set; all GPUs visible"
f" ({visible_devices_count}) \n"
)
# If PyTorch GPU count disagrees with nvidia-smi, reset CUDA_VISIBLE_DEVICES and rerun detection
if num_gpus_detected != len(output):
result_details.append(
f"{identifier_string}[WARNING]: PyTorch and nvidia-smi disagree on GPU count! Re-running with all"
" GPUs visible. \n"
)
result_details.append(f"{identifier_string}[INFO]: This shows that GPU resources were isolated.\n")
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join([str(i) for i in range(len(output))])
num_gpus_detected_after_reset = torch.cuda.device_count()
result_details.append(
f"{identifier_string}[INFO]: After setting CUDA_VISIBLE_DEVICES, PyTorch detects"
f" {num_gpus_detected_after_reset} GPUs \n"
)
except subprocess.CalledProcessError as e:
print(f"Error calling nvidia-smi: {e.stderr}")
result_details.append({"error": "Failed to retrieve GPU information"})
else:
if persistent_dir:
og_dir = os.getcwd()
os.chdir(persistent_dir)
process = subprocess.Popen(
job_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, bufsize=1
)
if persistent_dir:
os.chdir(og_dir)
experiment_name = None
logdir = None
experiment_info_pattern = re.compile("Exact experiment name requested from command line: (.+)")
logdir_pattern = re.compile(r"\[INFO\] Logging experiment in directory: (.+)$")
err_pattern = re.compile("There was an error (.+)$")
with process.stdout as stdout:
for line in iter(stdout.readline, ""):
line = line.strip()
result_details.append(f"{identifier_string}: {line} \n")
if log_all_output:
print(f"{identifier_string}: {line}")
if extract_experiment:
exp_match = experiment_info_pattern.search(line)
log_match = logdir_pattern.search(line)
err_match = err_pattern.search(line)
if err_match:
raise ValueError(f"Encountered an error during trial run. {' '.join(result_details)}")
if exp_match:
experiment_name = exp_match.group(1)
if log_match:
logdir = log_match.group(1)
if experiment_name and logdir:
result = {
"experiment_name": experiment_name,
"logdir": logdir,
"proc": process,
"result": " ".join(result_details),
}
return result
with process.stderr as stderr:
for line in iter(stderr.readline, ""):
line = line.strip()
result_details.append(f"{identifier_string}: {line}")
print(f"{identifier_string}: {line}")
process.wait() # Wait for the subprocess to finish naturally if not exited early
now = datetime.now().strftime("%H:%M:%S.%f")
completion_info = f"\n[INFO]: {identifier_string}: Job Started at {start_time}, completed at {now}\n"
print(completion_info)
result_details.append(completion_info)
return " ".join(result_details)
def get_gpu_node_resources(
total_resources: bool = False,
one_node_only: bool = False,
include_gb_ram: bool = False,
include_id: bool = False,
ray_address: str = "auto",
) -> list[dict] | dict:
"""Get information about available GPU node resources.
Args:
total_resources: When true, return total available resources. Defaults to False.
one_node_only: When true, return resources for a single node. Defaults to False.
include_gb_ram: Set to true to convert MB to GB in result
include_id: Set to true to include node ID
ray_address: The ray address to connect to.
Returns:
Resource information for all nodes, sorted by descending GPU count, then descending CPU
count, then descending RAM capacity, and finally by node ID in ascending order if available,
or simply the resource for a single node if requested.
"""
if not ray.is_initialized():
ray.init(address=ray_address)
nodes = ray.nodes()
node_resources = []
total_cpus = 0
total_gpus = 0
total_memory = 0 # in bytes
for node in nodes:
if node["Alive"] and "GPU" in node["Resources"]:
node_id = node["NodeID"]
resources = node["Resources"]
cpus = resources.get("CPU", 0)
gpus = resources.get("GPU", 0)
memory = resources.get("memory", 0)
node_resources.append({"CPU": cpus, "GPU": gpus, "memory": memory})
if include_id:
node_resources[-1]["id"] = node_id
if include_gb_ram:
node_resources[-1]["ram_gb"] = memory / 1024**3
total_cpus += cpus
total_gpus += gpus
total_memory += memory
node_resources = sorted(node_resources, key=lambda x: (-x["GPU"], -x["CPU"], -x["memory"], x.get("id", "")))
if total_resources:
# Return summed total resources
return {"CPU": total_cpus, "GPU": total_gpus, "memory": total_memory}
if one_node_only and node_resources:
return node_resources[0]
return node_resources
def add_resource_arguments(
arg_parser: argparse.ArgumentParser,
defaults: list | None = None,
cluster_create_defaults: bool = False,
) -> argparse.ArgumentParser:
"""Add resource arguments to a cluster; this is shared across both
wrapping resources and launching clusters.
Args:
arg_parser: the argparser to add the arguments to. This argparser is mutated.
defaults: The default values for GPUs, CPUs, RAM, and Num Workers
cluster_create_defaults: Set to true to populate reasonable defaults for creating clusters.
Returns:
The argparser with the standard resource arguments.
"""
if defaults is None:
if cluster_create_defaults:
defaults = [[1], [8], [16], [1]]
else:
defaults = [None, None, None, [1]]
arg_parser.add_argument(
"--gpu_per_worker",
nargs="+",
type=int,
default=defaults[0],
help="Number of GPUs per worker node. Supply more than one for heterogeneous resources",
)
arg_parser.add_argument(
"--cpu_per_worker",
nargs="+",
type=int,
default=defaults[1],
help="Number of CPUs per worker node. Supply more than one for heterogeneous resources",
)
arg_parser.add_argument(
"--ram_gb_per_worker",
nargs="+",
type=int,
default=defaults[2],
help="RAM in GB per worker node. Supply more than one for heterogeneous resources.",
)
arg_parser.add_argument(
"--num_workers",
nargs="+",
type=int,
default=defaults[3],
help="Number of desired workers. Supply more than one for heterogeneous resources.",
)
return arg_parser
def fill_in_missing_resources(
args: argparse.Namespace, resources: dict | None = None, cluster_creation_flag: bool = False, policy: callable = max
):
"""Normalize the lengths of resource lists based on the longest list provided."""
print("[INFO]: Filling in missing command line arguments with best guess...")
if resources is None:
resources = {
"gpu_per_worker": args.gpu_per_worker,
"cpu_per_worker": args.cpu_per_worker,
"ram_gb_per_worker": args.ram_gb_per_worker,
"num_workers": args.num_workers,
}
if cluster_creation_flag:
cluster_creation_resources = {"worker_accelerator": args.worker_accelerator}
resources.update(cluster_creation_resources)
# Calculate the maximum length of any list
max_length = max(len(v) for v in resources.values())
print("[INFO]: Resource list lengths:")
for key, value in resources.items():
print(f"[INFO] {key}: {len(value)} values {value}")
# Extend each list to match the maximum length using the maximum value in each list
for key, value in resources.items():
potential_value = getattr(args, key)
if potential_value is not None:
max_value = policy(policy(value), policy(potential_value))
else:
max_value = policy(value)
extension_length = max_length - len(value)
if extension_length > 0: # Only extend if the current list is shorter than max_length
print(f"\n[WARNING]: Resource '{key}' needs extension:")
print(f"[INFO] Current length: {len(value)}")
print(f"[INFO] Target length: {max_length}")
print(f"[INFO] Filling in {extension_length} missing values with {max_value}")
print(f"[INFO] To avoid auto-filling, provide {extension_length} more {key} value(s)")
value.extend([max_value] * extension_length)
setattr(args, key, value)
resources[key] = value
print(f"[INFO] Final {key} values: {getattr(args, key)}")
print("[INFO]: Done filling in command line arguments...\n\n")
return args
def populate_isaac_ray_cfg_args(cfg: dict = {}) -> dict:
"""Small utility method to create empty fields if needed for a configuration."""
if "runner_args" not in cfg:
cfg["runner_args"] = {}
if "hydra_args" not in cfg:
cfg["hydra_args"] = {}
return cfg
def _dicts_equal(d1: dict, d2: dict, tol=1e-9) -> bool:
"""Check if two dicts are equal; helps ensure only new logs are returned."""
if d1.keys() != d2.keys():
return False
for key in d1:
if isinstance(d1[key], float) and isinstance(d2[key], float):
if not isclose(d1[key], d2[key], abs_tol=tol):
return False
elif d1[key] != d2[key]:
return False
return True
# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
import argparse
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy
import source.standalone.workflows.ray.util as util
"""
This script dispatches sub-job(s) (either individual jobs or tuning aggregate jobs)
to worker(s) on GPU-enabled node(s) of a specific cluster as part of an resource-wrapped aggregate
job. If no desired compute resources for each sub-job are specified,
this script creates one worker per available node for each node with GPU(s) in the cluster.
If the desired resources for each sub-job is specified,
the maximum number of workers possible with the desired resources are created for each node
with GPU(s) in the cluster. It is also possible to split available node resources for each node
into the desired number of workers with the ``--num_workers`` flag, to be able to easily
parallelize sub-jobs on multi-GPU nodes. Due to Isaac Lab requiring a GPU,
this ignores all CPU only nodes such as loggers.
Sub-jobs are matched with node(s) in a cluster via the following relation:
sorted_nodes = Node sorted by descending GPUs, then descending CPUs, then descending RAM, then node ID
node_submitted_to = sorted_nodes[job_index % total_node_count]
To check the ordering of sorted nodes, supply the ``--test`` argument and run the script.
Sub-jobs are separated by the + delimiter. The ``--sub_jobs`` argument must be the last
argument supplied to the script.
If there is more than one available worker, and more than one sub-job,
sub-jobs will be executed in parallel. If there are more sub-jobs than workers, sub-jobs will
be dispatched to workers as they become available. There is no limit on the number
of sub-jobs that can be near-simultaneously submitted.
This script is meant to be executed on a Ray cluster head node as an aggregate cluster job.
To submit aggregate cluster jobs such as this script to one or more remote clusters,
see :file:`../submit_isaac_ray_job.py`.
KubeRay clusters on Google GKE can be created with :file:`../launch.py`
Usage:
.. code-block:: bash
# **Ensure that sub-jobs are separated by the ``+`` delimiter.**
# Generic Templates-----------------------------------
./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py -h
# No resource isolation; no parallelization:
./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py
--sub_jobs <JOB0>+<JOB1>+<JOB2>
# Automatic Resource Isolation; Example A: needed for parallelization
./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py \
--num_workers <NUM_TO_DIVIDE_TOTAL_RESOURCES_BY> \
--sub_jobs <JOB0>+<JOB1>
# Manual Resource Isolation; Example B: needed for parallelization
./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py --num_cpu_per_worker <CPU> \
--gpu_per_worker <GPU> --ram_gb_per_worker <RAM> --sub_jobs <JOB0>+<JOB1>
# Manual Resource Isolation; Example C: Needed for parallelization, for heterogeneous workloads
./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py --num_cpu_per_worker <CPU> \
--gpu_per_worker <GPU1> <GPU2> --ram_gb_per_worker <RAM> --sub_jobs <JOB0>+<JOB1>
# to see all arguments
./isaaclab.sh -p source/standalone/workflows/ray/wrap_resources.py -h
"""
def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None:
"""
Provided a list of jobs, dispatch jobs to one worker per available node,
unless otherwise specified by resource constraints.
Args:
jobs: bash commands to execute on a Ray cluster
args: The arguments for resource allocation
"""
if not ray.is_initialized():
ray.init(address=args.ray_address, log_to_driver=True)
job_results = []
gpu_node_resources = util.get_gpu_node_resources(include_id=True, include_gb_ram=True)
if any([args.gpu_per_worker, args.cpu_per_worker, args.ram_gb_per_worker]) and args.num_workers:
raise ValueError("Either specify only num_workers or only granular resources(GPU,CPU,RAM_GB).")
num_nodes = len(gpu_node_resources)
# Populate arguments
formatted_node_resources = {
"gpu_per_worker": [gpu_node_resources[i]["GPU"] for i in range(num_nodes)],
"cpu_per_worker": [gpu_node_resources[i]["CPU"] for i in range(num_nodes)],
"ram_gb_per_worker": [gpu_node_resources[i]["ram_gb"] for i in range(num_nodes)],
"num_workers": args.num_workers, # By default, 1 worker por node
}
args = util.fill_in_missing_resources(args, resources=formatted_node_resources, policy=min)
print(f"[INFO]: Number of GPU nodes found: {num_nodes}")
if args.test:
jobs = ["nvidia-smi"] * num_nodes
for i, job in enumerate(jobs):
gpu_node = gpu_node_resources[i % num_nodes]
print(f"[INFO]: Submitting job {i + 1} of {len(jobs)} with job '{job}' to node {gpu_node}")
print(
f"[INFO]: Resource parameters: GPU: {args.gpu_per_worker[i]}"
f" CPU: {args.cpu_per_worker[i]} RAM {args.ram_gb_per_worker[i]}"
)
print(f"[INFO] For the node parameters, creating {args.num_workers[i]} workers")
num_gpus = args.gpu_per_worker[i] / args.num_workers[i]
num_cpus = args.cpu_per_worker[i] / args.num_workers[i]
memory = (args.ram_gb_per_worker[i] * 1024**3) / args.num_workers[i]
print(f"[INFO]: Requesting {num_gpus=} {num_cpus=} {memory=} id={gpu_node['id']}")
job = util.remote_execute_job.options(
num_gpus=num_gpus,
num_cpus=num_cpus,
memory=memory,
scheduling_strategy=NodeAffinitySchedulingStrategy(gpu_node["id"], soft=False),
).remote(job, f"Job {i}", args.test)
job_results.append(job)
results = ray.get(job_results)
for i, result in enumerate(results):
print(f"[INFO]: Job {i} result: {result}")
print("[INFO]: All jobs completed.")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Submit multiple jobs with optional GPU testing.")
parser = util.add_resource_arguments(arg_parser=parser)
parser.add_argument("--ray_address", type=str, default="auto", help="the Ray address.")
parser.add_argument(
"--test",
action="store_true",
help=(
"Run nvidia-smi test instead of the arbitrary job,"
"can use as a sanity check prior to any jobs to check "
"that GPU resources are correctly isolated."
),
)
parser.add_argument(
"--sub_jobs",
type=str,
nargs=argparse.REMAINDER,
help="This should be last wrapper argument. Jobs separated by the + delimiter to run on a cluster.",
)
args = parser.parse_args()
if args.sub_jobs is not None:
jobs = " ".join(args.sub_jobs)
formatted_jobs = jobs.split("+")
else:
formatted_jobs = []
print(f"[INFO]: Isaac Ray Wrapper received jobs {formatted_jobs=}")
wrap_resources_to_jobs(jobs=formatted_jobs, args=args)
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment