Adds YAML Resource Specification To Ray Integration (#2847)

# Description

This PR:

- Add task_runner.py to support specifying resources, py_modules, and pip.

Fixes [# (issue)](https://github.com/isaac-sim/IsaacLab/issues/2632)

## Type of change

- New feature (non-breaking change which adds functionality)

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

---------

Signed-off-by: garylvov <67614381+garylvov@users.noreply.github.com>
Co-authored-by: 松翊 <songyi.wb@alibaba-inc.com>
Co-authored-by: garylvov <67614381+garylvov@users.noreply.github.com>

Unverified Commit dddd51db authored Sep 02, 2025 by Willbon Committed by GitHub Sep 02, 2025
6 changed files
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -140,6 +140,7 @@ Guidelines for modifications:
 * Ziqi Fan
 * Zoe McCarthy
 * David Leon
+* Song Yi
 ## Acknowledgements

--- a/docs/source/features/ray.rst
+++ b/docs/source/features/ray.rst
@@ -44,22 +44,28 @@ specifying the ``--num_workers`` argument for resource-wrapped jobs, or ``--num_
 for tuning jobs, which is especially critical for parallel aggregate
 job processing on local/virtual multi-GPU machines. Tuning jobs assume homogeneous node resource composition for nodes with GPUs.
-The two following files contain the core functionality of the Ray integration.
+The three following files contain the core functionality of the Ray integration.
 .. dropdown:: scripts/reinforcement_learning/ray/wrap_resources.py
  :icon: code
  .. literalinclude:: ../../../scripts/reinforcement_learning/ray/wrap_resources.py
    :language: python
-    :emphasize-lines: 14-66
+    :emphasize-lines: 10-63
 .. dropdown:: scripts/reinforcement_learning/ray/tuner.py
  :icon: code
  .. literalinclude:: ../../../scripts/reinforcement_learning/ray/tuner.py
    :language: python
-    :emphasize-lines: 18-53
+    :emphasize-lines: 18-54
+.. dropdown:: scripts/reinforcement_learning/ray/task_runner.py
+  :icon: code
+  .. literalinclude:: ../../../scripts/reinforcement_learning/ray/task_runner.py
+    :language: python
+    :emphasize-lines: 13-105
 The following script can be used to submit aggregate
 jobs to one or more Ray cluster(s), which can be used for
@@ -71,7 +77,7 @@ resource requirements.
  .. literalinclude:: ../../../scripts/reinforcement_learning/ray/submit_job.py
    :language: python
-    :emphasize-lines: 12-53
+    :emphasize-lines: 13-61
 The following script can be used to extract KubeRay cluster information for aggregate job submission.
@@ -89,7 +95,7 @@ The following script can be used to easily create clusters on Google GKE.
  .. literalinclude:: ../../../scripts/reinforcement_learning/ray/launch.py
    :language: python
-    :emphasize-lines: 16-37
+    :emphasize-lines: 15-36
 Docker-based Local Quickstart
 -----------------------------
@@ -147,7 +153,26 @@ Submitting resource-wrapped individual jobs instead of automatic tuning runs is
  .. literalinclude:: ../../../scripts/reinforcement_learning/ray/wrap_resources.py
    :language: python
-    :emphasize-lines: 14-66
+    :emphasize-lines: 10-63
+The ``task_runner.py`` script dispatches Python tasks to a Ray cluster from a single declarative YAML file. Each run can specify additional pip packages and Python modules. Resource allocation is fine-grained, with explicit control over the number of CPUs, the number of GPUs, and the amount of memory assigned to each task. Tasks can be restricted to specific nodes by hostname or node ID. Two launch modes are supported: tasks can be executed independently as resources become available, or grouped into a simultaneous batch that launches only when sufficient resources are available across the cluster for every task, which is ideal for multi-node training jobs.
+.. dropdown:: scripts/reinforcement_learning/ray/task_runner.py
+  :icon: code
+  .. literalinclude:: ../../../scripts/reinforcement_learning/ray/task_runner.py
+    :language: python
+    :emphasize-lines: 13-105
+To use this script, run a command similar to the following (replace ``tasks.yaml`` with your actual configuration file):
+.. code-block:: bash
+  python3 scripts/reinforcement_learning/ray/submit_job.py --aggregate_jobs task_runner.py --task_cfg tasks.yaml
+For detailed instructions on how to write your ``tasks.yaml`` file, please refer to the comments in ``task_runner.py``.
+**Tip:** Place the ``tasks.yaml`` file in the ``scripts/reinforcement_learning/ray`` directory so that it is included when the ``working_dir`` is uploaded. You can then reference it using a relative path in the command.
 Transferring files from the running container can be done as follows.
@@ -288,7 +313,7 @@ where instructions are included in the following creation file.
  .. literalinclude:: ../../../scripts/reinforcement_learning/ray/launch.py
    :language: python
-    :emphasize-lines: 15-37
+    :emphasize-lines: 15-36
 For other cloud services, the ``kuberay.yaml.ninja`` will be similar to that of
 Google's.
@@ -345,7 +370,7 @@ Dispatching Steps Shared Between KubeRay and Pure Ray Part II
  .. literalinclude:: ../../../scripts/reinforcement_learning/ray/submit_job.py
    :language: python
-    :emphasize-lines: 12-53
+    :emphasize-lines: 13-61
 3.) For tuning jobs, specify the tuning job / hyperparameter sweep as a :class:`JobCfg` .
 The included :class:`JobCfg` only supports the ``rl_games`` workflow due to differences in

--- a/scripts/reinforcement_learning/ray/submit_job.py
+++ b/scripts/reinforcement_learning/ray/submit_job.py
@@ -3,13 +3,6 @@
 #
 # SPDX-License-Identifier: BSD-3-Clause
-import argparse
-import os
-import time
-from concurrent.futures import ThreadPoolExecutor
-from ray import job_submission
 """
 This script submits aggregate job(s) to cluster(s) described in a
 config file containing ``name: <NAME> address: http://<IP>:<PORT>`` on
@@ -26,7 +19,11 @@ An aggregate job could be a :file:`../tuner.py` tuning job, which automatically
 creates several individual jobs when started on a cluster. Alternatively, an aggregate job
 could be a :file:`../wrap_resources.py` resource-wrapped job,
 which may contain several individual sub-jobs separated by
-the + delimiter.
+the + delimiter. An aggregate job could also be a :file:`../task_runner.py` multi-task submission job,
+where each sub-job and its resource requirements are defined in a YAML configuration file.
+In this mode, :file:`../task_runner.py` will read the YAML file (via --task_cfg), and
+submit all defined sub-tasks to the Ray cluster, supporting per-job resource specification and
+real-time streaming of sub-job outputs.
 If there are more aggregate jobs than cluster(s), aggregate jobs will be submitted
 as clusters become available via the defined relation above. If there are fewer aggregate job(s)
@@ -48,9 +45,21 @@ Usage:
    # Example: Submitting resource wrapped job
    python3 scripts/reinforcement_learning/ray/submit_job.py --aggregate_jobs wrap_resources.py --test
+    # Example: submitting tasks with specific resources, and supporting pip packages and py_modules
+    # You may use relative paths for task_cfg and py_modules, placing them in the scripts/reinforcement_learning/ray directory, which will be uploaded to the cluster.
+    python3 scripts/reinforcement_learning/ray/submit_job.py --aggregate_jobs task_runner.py --task_cfg tasks.yaml
    # For all command line arguments
    python3 scripts/reinforcement_learning/ray/submit_job.py -h
 """
+import argparse
+import os
+import time
+from concurrent.futures import ThreadPoolExecutor
+from ray import job_submission
 script_directory = os.path.dirname(os.path.abspath(__file__))
 CONFIG = {"working_dir": script_directory, "executable": "/workspace/isaaclab/isaaclab.sh -p"}
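Before submitting with ``--task_cfg``, it can help to check the loaded YAML locally. The following is a minimal sketch, not part of this PR: `check_task_cfg` and `NODE_PLACEMENTS` are hypothetical names, and the rules are inferred only from the fields listed in the `task_runner.py` docstring (`tasks`, `name`, `py_args`, and the optional `node` block). It takes the dictionary that `yaml.safe_load` would return, so the sketch itself needs no third-party imports.

```python
from typing import Any

# Placement types listed in the task_runner.py docstring.
NODE_PLACEMENTS = {"any", "hostname", "node_id"}


def check_task_cfg(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    tasks = config.get("tasks")
    if not isinstance(tasks, list) or not tasks:
        return ["'tasks' must be a non-empty list"]

    problems: list[str] = []
    for index, task in enumerate(tasks):
        label = f"tasks[{index}]"
        if not isinstance(task, dict):
            problems.append(f"{label}: must be a mapping")
            continue
        # task_runner.py reads task["name"] and task["py_args"] unconditionally.
        for key in ("name", "py_args"):
            if key not in task:
                problems.append(f"{label}: missing '{key}'")
        node = task.get("node") or {}
        if not isinstance(node, dict):
            problems.append(f"{label}: 'node' must be a mapping")
            continue
        specific = node.get("specific")
        if specific is None:
            continue  # no placement constraint requested
        if specific not in NODE_PLACEMENTS:
            problems.append(f"{label}: unknown node.specific '{specific}'")
        elif specific != "any" and not node.get(specific):
            # "hostname" and "node_id" each need the matching field to be set.
            problems.append(f"{label}: node.specific is '{specific}' but node.{specific} is not set")
    return problems


if __name__ == "__main__":
    example = {
        "concurrent": True,
        "tasks": [
            {"name": "ok", "py_args": "script.py", "node": {"specific": "hostname", "hostname": "worker-0"}},
            {"name": "bad", "py_args": "script.py", "node": {"specific": "node_id"}},
        ],
    }
    for problem in check_task_cfg(example):
        print(problem)
```

In practice the input would come from `yaml.safe_load(open("tasks.yaml"))`, and a non-empty result would be a reason to stop before calling `submit_job.py`.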

--- a/scripts/reinforcement_learning/ray/task_runner.py
+++ b/scripts/reinforcement_learning/ray/task_runner.py
+# Copyright (c) 2022-2025, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
+# All rights reserved.
+#
+# SPDX-License-Identifier: BSD-3-Clause
+"""
+This script dispatches one or more user-defined Python tasks to workers in a Ray cluster.
+Each task, along with its resource requirements and execution parameters, is specified in a YAML configuration file.
+Users may define the number of CPUs, GPUs, and the amount of memory to allocate per task via the config file.
+Key features:
+-------------
+- Fine-grained, per-task resource management via config fields (`num_gpus`, `num_cpus`, `memory`).
+- Parallel execution of multiple tasks using available resources across the Ray cluster.
+- Option to specify node affinity for tasks, e.g., by hostname, node ID, or any node.
+- Optional batch (simultaneous) or independent scheduling of tasks.
+Task scheduling and distribution are handled via Ray’s built-in resource manager.
+YAML configuration fields:
+--------------------------
+- `pip`: List of extra pip packages to install before running any tasks.
+- `py_modules`: List of additional Python module paths (directories or files) to include in the runtime environment.
+- `concurrent`: (bool) Determines task dispatch semantics:
+    - If `concurrent: true`, **all tasks are scheduled as a batch**. The script waits until sufficient resources are available for every task in the batch, then launches all tasks together. If resources are insufficient, all tasks remain blocked until the cluster can support the full batch.
+    - If `concurrent: false`, tasks are launched as soon as resources are available for each individual task, and Ray independently schedules them. This may result in non-simultaneous task start times.
+- `tasks`: List of task specifications, each with:
+    - `name`: String identifier for the task.
+    - `py_args`: Arguments to the Python interpreter (e.g., script/module, flags, user arguments).
+    - `num_gpus`: Number of GPUs to allocate (float or string arithmetic, e.g., "2*2").
+    - `num_cpus`: Number of CPUs to allocate (float or string).
+    - `memory`: Amount of RAM in bytes (int or string).
+    - `node` (optional): Node placement constraints.
+        - `specific` (str): Type of node placement; supports `hostname`, `node_id`, or `any`.
+            - `any`: Place the task on any available node.
+            - `hostname`: Place the task on a specific hostname. `hostname` must be specified in the node field.
+            - `node_id`: Place the task on a specific node ID. `node_id` must be specified in the node field.
+        - `hostname` (str): Specific hostname to place the task on.
+        - `node_id` (str): Specific node ID to place the task on.
+Typical usage:
+---------------
+.. code-block:: bash
+    # Print help and argument details:
+    python task_runner.py -h
+    # Submit tasks defined in a YAML file to the Ray cluster (auto-detects Ray head address):
+    python task_runner.py --task_cfg /path/to/tasks.yaml
+YAML configuration example-1:
+-----------------------------
+.. code-block:: yaml
+    pip: ["xxx"]
+    py_modules: ["my_package/my_package"]
+    concurrent: false
+    tasks:
+      - name: "Isaac-Cartpole-v0"
+        py_args: "-m torch.distributed.run --nnodes=1 --nproc_per_node=2  --rdzv_endpoint=localhost:29501 /workspace/isaaclab/scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --max_iterations 200 --headless --distributed"
+        num_gpus: 2
+        num_cpus: 10
+        memory: 10737418240
+      - name: "script need some dependencies"
+        py_args: "script.py --option arg"
+        num_gpus: 0
+        num_cpus: 1
+        memory: 10*1024*1024*1024
+YAML configuration example-2:
+-----------------------------
+.. code-block:: yaml
+    pip: ["xxx"]
+    py_modules: ["my_package/my_package"]
+    concurrent: true
+    tasks:
+      - name: "Isaac-Cartpole-v0-multi-node-train-1"
+        py_args: "-m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 /workspace/isaaclab/scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --max_iterations 1000"
+        num_gpus: 1
+        num_cpus: 10
+        memory: 10*1024*1024*1024
+        node:
+          specific: "hostname"
+          hostname: "xxx"
+      - name: "Isaac-Cartpole-v0-multi-node-train-2"
+        py_args: "-m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=x.x.x.x:5555 /workspace/isaaclab/scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --max_iterations 1000"
+        num_gpus: 1
+        num_cpus: 10
+        memory: 10*1024*1024*1024
+        node:
+          specific: "hostname"
+          hostname: "xxx"
+To stop all tasks early, press Ctrl+C; the script will cancel all running Ray tasks.
+"""
+import argparse
+import yaml
+from datetime import datetime
+import util
+def parse_args() -> argparse.Namespace:
+    """
+    Parse command-line arguments for the Ray task runner.
+    Returns:
+        argparse.Namespace: The namespace containing parsed CLI arguments:
+            - task_cfg (str): Path to the YAML task file.
+            - ray_address (str): Ray cluster address.
+            - test (bool): Whether to run a GPU resource isolation sanity check.
+    """
+    parser = argparse.ArgumentParser(description="Run tasks from a YAML config file.")
+    parser.add_argument("--task_cfg", type=str, required=True, help="Path to the YAML task file.")
+    parser.add_argument("--ray_address", type=str, default="auto", help="the Ray address.")
+    parser.add_argument(
+        "--test",
+        action="store_true",
+        help=(
+            "Run nvidia-smi test instead of the arbitrary job, "
+            "can use as a sanity check prior to any jobs to check "
+            "that GPU resources are correctly isolated."
+        ),
+    )
+    return parser.parse_args()
+def parse_task_resource(task: dict) -> util.JobResource:
+    """
+    Parse task resource requirements from the YAML configuration.
+    Args:
+        task (dict): Dictionary representing a single task's configuration.
+            Keys may include `num_gpus`, `num_cpus`, and `memory`, each either
+            as a number or evaluatable string expression.
+    Returns:
+        util.JobResource: Resource object with the parsed values.
+    """
+    resource = util.JobResource()
+    if "num_gpus" in task:
+        resource.num_gpus = eval(task["num_gpus"]) if isinstance(task["num_gpus"], str) else task["num_gpus"]
+    if "num_cpus" in task:
+        resource.num_cpus = eval(task["num_cpus"]) if isinstance(task["num_cpus"], str) else task["num_cpus"]
+    if "memory" in task:
+        resource.memory = eval(task["memory"]) if isinstance(task["memory"], str) else task["memory"]
+    return resource
+def run_tasks(
+    tasks: list[dict], args: argparse.Namespace, runtime_env: dict | None = None, concurrent: bool = False
+) -> None:
+    """
+    Submit tasks to the Ray cluster for execution.
+    Args:
+        tasks (list[dict]): A list of task configuration dictionaries.
+        args (argparse.Namespace): Parsed command-line arguments.
+        runtime_env (dict | None): Ray runtime environment configuration containing:
+            - pip (list[str] | None): Additional pip packages to install.
+            - py_modules (list[str] | None): Python modules to include in the environment.
+        concurrent (bool): Whether to launch tasks simultaneously as a batch,
+                           or independently as resources become available.
+    Returns:
+        None
+    """
+    job_objs = []
+    util.ray_init(ray_address=args.ray_address, runtime_env=runtime_env, log_to_driver=False)
+    for task in tasks:
+        resource = parse_task_resource(task)
+        print(f"[INFO] Creating job {task['name']} with resource={resource}")
+        job = util.Job(
+            name=task["name"],
+            py_args=task["py_args"],
+            resources=resource,
+            node=util.JobNode(
+                specific=task.get("node", {}).get("specific"),
+                hostname=task.get("node", {}).get("hostname"),
+                node_id=task.get("node", {}).get("node_id"),
+            ),
+        )
+        job_objs.append(job)
+    start = datetime.now()
+    print(f"[INFO] Creating {len(job_objs)} jobs at {start.strftime('%H:%M:%S.%f')} with runtime env={runtime_env}")
+    # submit jobs
+    util.submit_wrapped_jobs(
+        jobs=job_objs,
+        test_mode=args.test,
+        concurrent=concurrent,
+    )
+    end = datetime.now()
+    print(
+        f"[INFO] All jobs completed at {end.strftime('%H:%M:%S.%f')}, took {(end - start).total_seconds():.2f} seconds."
+    )
+def main() -> None:
+    """
+    Main entry point for the Ray task runner script.
+    Reads the YAML task configuration file, parses CLI arguments,
+    and dispatches tasks to the Ray cluster.
+    Returns:
+        None
+    """
+    args = parse_args()
+    with open(args.task_cfg) as f:
+        config = yaml.safe_load(f)
+    tasks = config["tasks"]
+    runtime_env = {
+        "pip": None if not config.get("pip") else config["pip"],
+        "py_modules": None if not config.get("py_modules") else config["py_modules"],
+    }
+    concurrent = config.get("concurrent", False)
+    run_tasks(
+        tasks=tasks,
+        args=args,
+        runtime_env=runtime_env,
+        concurrent=concurrent,
+    )
+if __name__ == "__main__":
+    main()
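`parse_task_resource` above passes string values such as `"10*1024*1024*1024"` to `eval`, which would also execute any other Python expression placed in the YAML file. As a hedged alternative, and not what this PR implements, the sketch below evaluates only plain arithmetic by walking the parsed syntax tree. The name `eval_resource_expr` is hypothetical, and the operator set (`+`, `-`, `*`, `/`, `//`) is an assumption about what resource expressions need.

```python
import ast
import operator

# Binary operators permitted in resource expressions (assumed sufficient).
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def eval_resource_expr(value):
    """Return a number unchanged, or evaluate a purely arithmetic string.

    Raises ValueError for anything that is not a number or simple arithmetic.
    """
    if isinstance(value, bool):
        raise ValueError(f"unsupported resource value: {value!r}")
    if isinstance(value, (int, float)):
        return value
    if not isinstance(value, str):
        raise ValueError(f"unsupported resource value: {value!r}")

    def _eval(node):
        # Only numeric literals and whitelisted operators are accepted.
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {value!r}")

    try:
        tree = ast.parse(value.strip(), mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"invalid expression: {value!r}") from exc
    return _eval(tree.body)


if __name__ == "__main__":
    print(eval_resource_expr("10*1024*1024*1024"))
    print(eval_resource_expr(2))
```

A function like this could stand in for each `eval(...)` call in `parse_task_resource`; names, attribute access, and function calls in the YAML value would then raise `ValueError` instead of being executed.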
--- a/scripts/reinforcement_learning/ray/util.py
+++ b/scripts/reinforcement_learning/ray/util.py
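The body of the `util.py` change is not reproduced here, so the following is not the PR's `submit_wrapped_jobs` implementation. It is a toy, Ray-free model of the two dispatch modes described in the `task_runner.py` docstring: with `concurrent: true` the batch starts only if every task fits at once, and with `concurrent: false` each task starts as soon as it fits on its own. The names `Demand` and `launchable` are hypothetical, and the model assumes one shared pool of CPUs and GPUs, ignoring memory and node placement.

```python
from dataclasses import dataclass


@dataclass
class Demand:
    """Resource request of one task (hypothetical, simplified)."""

    name: str
    num_cpus: float = 0.0
    num_gpus: float = 0.0


def launchable(tasks: list[Demand], free_cpus: float, free_gpus: float, concurrent: bool) -> list[str]:
    """Return the names of the tasks that could start right now."""
    if concurrent:
        # Batch mode: all tasks start together, or none do.
        need_cpus = sum(t.num_cpus for t in tasks)
        need_gpus = sum(t.num_gpus for t in tasks)
        if need_cpus <= free_cpus and need_gpus <= free_gpus:
            return [t.name for t in tasks]
        return []

    # Independent mode: start each task that fits in what is still free.
    started = []
    for t in tasks:
        if t.num_cpus <= free_cpus and t.num_gpus <= free_gpus:
            free_cpus -= t.num_cpus
            free_gpus -= t.num_gpus
            started.append(t.name)
    return started


if __name__ == "__main__":
    pair = [Demand("train-1", num_cpus=10, num_gpus=1), Demand("train-2", num_cpus=10, num_gpus=1)]
    print(launchable(pair, free_cpus=16, free_gpus=2, concurrent=True))
    print(launchable(pair, free_cpus=16, free_gpus=2, concurrent=False))
```

With 16 free CPUs, the batch call returns no tasks because the pair needs 20, while the independent call returns only the first task. This is why the multi-node example sets `concurrent: true`: a partially started `torch.distributed.run` group would otherwise wait on peers that have not launched.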
--- a/scripts/reinforcement_learning/ray/wrap_resources.py
+++ b/scripts/reinforcement_learning/ray/wrap_resources.py
@@ -3,12 +3,6 @@
 #
 # SPDX-License-Identifier: BSD-3-Clause
-import argparse
-import ray
-import util
-from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy
 """
 This script dispatches sub-job(s) (individual jobs, use :file:`tuner.py` for tuning jobs)
 to worker(s) on GPU-enabled node(s) of a specific cluster as part of a resource-wrapped aggregate
@@ -64,6 +58,10 @@ Usage:
    ./isaaclab.sh -p scripts/reinforcement_learning/ray/wrap_resources.py -h
 """
+import argparse
+import util
 def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None:
    """
@@ -75,9 +73,14 @@ def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None:
        args: The arguments for resource allocation
    """
-    if not ray.is_initialized():
-        ray.init(address=args.ray_address, log_to_driver=True)
-    job_results = []
+    job_objs = []
+    util.ray_init(
+        ray_address=args.ray_address,
+        runtime_env={
+            "py_modules": None if not args.py_modules else args.py_modules,
+        },
+        log_to_driver=False,
+    )
    gpu_node_resources = util.get_gpu_node_resources(include_id=True, include_gb_ram=True)
    if any([args.gpu_per_worker, args.cpu_per_worker, args.ram_gb_per_worker]) and args.num_workers:
@@ -97,7 +100,7 @@ def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None:
        jobs = ["nvidia-smi"] * num_nodes
    for i, job in enumerate(jobs):
        gpu_node = gpu_node_resources[i % num_nodes]
-        print(f"[INFO]: Submitting job {i + 1} of {len(jobs)} with job '{job}' to node {gpu_node}")
+        print(f"[INFO]: Creating job {i + 1} of {len(jobs)} with job '{job}' to node {gpu_node}")
        print(
            f"[INFO]: Resource parameters: GPU: {args.gpu_per_worker[i]}"
            f" CPU: {args.cpu_per_worker[i]} RAM {args.ram_gb_per_worker[i]}"
@@ -106,19 +109,19 @@ def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None:
        num_gpus = args.gpu_per_worker[i] / args.num_workers[i]
        num_cpus = args.cpu_per_worker[i] / args.num_workers[i]
        memory = (args.ram_gb_per_worker[i] * 1024**3) / args.num_workers[i]
-        print(f"[INFO]: Requesting {num_gpus=} {num_cpus=} {memory=} id={gpu_node['id']}")
-        job = util.remote_execute_job.options(
-            num_gpus=num_gpus,
-            num_cpus=num_cpus,
-            memory=memory,
-            scheduling_strategy=NodeAffinitySchedulingStrategy(gpu_node["id"], soft=False),
-        ).remote(job, f"Job {i}", args.test)
-        job_results.append(job)
-    results = ray.get(job_results)
-    for i, result in enumerate(results):
-        print(f"[INFO]: Job {i} result: {result}")
-    print("[INFO]: All jobs completed.")
+        job_objs.append(
+            util.Job(
+                cmd=job,
+                name=f"Job-{i + 1}",
+                resources=util.JobResource(num_gpus=num_gpus, num_cpus=num_cpus, memory=memory),
+                node=util.JobNode(
+                    specific="node_id",
+                    node_id=gpu_node["id"],
+                ),
+            )
+        )
+    # submit jobs
+    util.submit_wrapped_jobs(jobs=job_objs, test_mode=args.test, concurrent=False)
 if __name__ == "__main__":
@@ -134,6 +137,15 @@ if __name__ == "__main__":
            "that GPU resources are correctly isolated."
        ),
    )
+    parser.add_argument(
+        "--py_modules",
+        type=str,
+        nargs="*",
+        default=[],
+        help=(
+            "List of python modules or paths to add before running the job. Example: --py_modules my_package/my_package"
+        ),
+    )
    parser.add_argument(
        "--sub_jobs",
        type=str,