Commit dddd51db authored by Willbon, committed by GitHub

Adds YAML Resource Specification To Ray Integration (#2847)

# Description

This PR:
- Adds `task_runner.py`, which supports specifying per-task resources,
`py_modules`, and `pip` packages through a YAML file.

Fixes [#2632](https://github.com/isaac-sim/IsaacLab/issues/2632)
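
The `task_runner.py` docstring in this commit allows resource fields to be given as arithmetic strings (for example `memory: 10*1024*1024*1024`), and the script evaluates them with Python's built-in `eval`. The following is a minimal sketch of one possible alternative that accepts only numeric arithmetic. `eval_resource_expr` is a hypothetical helper name and is not part of this commit.

```python
import ast
import operator

# Operators permitted in a resource expression (assumption: plain arithmetic is enough).
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def eval_resource_expr(value):
    """Return a number from a numeric value or an arithmetic string such as "2*2".

    Hypothetical helper for illustration; task_runner.py itself uses eval().
    """
    if isinstance(value, bool):
        raise ValueError(f"Unsupported resource value: {value!r}")
    if isinstance(value, (int, float)):
        return value

    def _eval(node):
        # Numeric literals are returned as-is (bool is excluded on purpose).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        # Binary operations are evaluated recursively if the operator is allowed.
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, and everything else are rejected.
        raise ValueError(f"Unsupported expression: {value!r}")

    return _eval(ast.parse(value, mode="eval").body)


if __name__ == "__main__":
    print(eval_resource_expr("10*1024*1024*1024"))
```

With this approach, a string that contains a function call or a name raises `ValueError` instead of being executed, at the cost of supporting only the listed operators.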


## Type of change


- New feature (non-breaking change which adds functionality)

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there


---------
Signed-off-by: garylvov <67614381+garylvov@users.noreply.github.com>
Co-authored-by: 松翊 <songyi.wb@alibaba-inc.com>
Co-authored-by: garylvov <67614381+garylvov@users.noreply.github.com>
parent 82b24dd4
...@@ -140,6 +140,7 @@ Guidelines for modifications: ...@@ -140,6 +140,7 @@ Guidelines for modifications:
* Ziqi Fan * Ziqi Fan
* Zoe McCarthy * Zoe McCarthy
* David Leon * David Leon
* Song Yi
## Acknowledgements ## Acknowledgements
......
...@@ -44,22 +44,28 @@ specifying the ``--num_workers`` argument for resource-wrapped jobs, or ``--num_ ...@@ -44,22 +44,28 @@ specifying the ``--num_workers`` argument for resource-wrapped jobs, or ``--num_
for tuning jobs, which is especially critical for parallel aggregate for tuning jobs, which is especially critical for parallel aggregate
job processing on local/virtual multi-GPU machines. Tuning jobs assume homogeneous node resource composition for nodes with GPUs. job processing on local/virtual multi-GPU machines. Tuning jobs assume homogeneous node resource composition for nodes with GPUs.
The two following files contain the core functionality of the Ray integration. The three following files contain the core functionality of the Ray integration.
.. dropdown:: scripts/reinforcement_learning/ray/wrap_resources.py .. dropdown:: scripts/reinforcement_learning/ray/wrap_resources.py
:icon: code :icon: code
.. literalinclude:: ../../../scripts/reinforcement_learning/ray/wrap_resources.py .. literalinclude:: ../../../scripts/reinforcement_learning/ray/wrap_resources.py
:language: python :language: python
:emphasize-lines: 14-66 :emphasize-lines: 10-63
.. dropdown:: scripts/reinforcement_learning/ray/tuner.py .. dropdown:: scripts/reinforcement_learning/ray/tuner.py
:icon: code :icon: code
.. literalinclude:: ../../../scripts/reinforcement_learning/ray/tuner.py .. literalinclude:: ../../../scripts/reinforcement_learning/ray/tuner.py
:language: python :language: python
:emphasize-lines: 18-53 :emphasize-lines: 18-54
.. dropdown:: scripts/reinforcement_learning/ray/task_runner.py
:icon: code
.. literalinclude:: ../../../scripts/reinforcement_learning/ray/task_runner.py
:language: python
:emphasize-lines: 13-105
The following script can be used to submit aggregate The following script can be used to submit aggregate
jobs to one or more Ray cluster(s), which can be used for jobs to one or more Ray cluster(s), which can be used for
...@@ -71,7 +77,7 @@ resource requirements. ...@@ -71,7 +77,7 @@ resource requirements.
.. literalinclude:: ../../../scripts/reinforcement_learning/ray/submit_job.py .. literalinclude:: ../../../scripts/reinforcement_learning/ray/submit_job.py
:language: python :language: python
:emphasize-lines: 12-53 :emphasize-lines: 13-61
The following script can be used to extract KubeRay cluster information for aggregate job submission. The following script can be used to extract KubeRay cluster information for aggregate job submission.
...@@ -89,7 +95,7 @@ The following script can be used to easily create clusters on Google GKE. ...@@ -89,7 +95,7 @@ The following script can be used to easily create clusters on Google GKE.
.. literalinclude:: ../../../scripts/reinforcement_learning/ray/launch.py .. literalinclude:: ../../../scripts/reinforcement_learning/ray/launch.py
:language: python :language: python
:emphasize-lines: 16-37 :emphasize-lines: 15-36
Docker-based Local Quickstart Docker-based Local Quickstart
----------------------------- -----------------------------
...@@ -147,7 +153,26 @@ Submitting resource-wrapped individual jobs instead of automatic tuning runs is ...@@ -147,7 +153,26 @@ Submitting resource-wrapped individual jobs instead of automatic tuning runs is
.. literalinclude:: ../../../scripts/reinforcement_learning/ray/wrap_resources.py .. literalinclude:: ../../../scripts/reinforcement_learning/ray/wrap_resources.py
:language: python :language: python
:emphasize-lines: 14-66 :emphasize-lines: 10-63
The ``task_runner.py`` script dispatches Python tasks to a Ray cluster from a single declarative YAML file. Users can specify additional pip packages and Python modules for each run, and control the number of CPUs, the number of GPUs, and the amount of memory assigned to each task. Tasks can be restricted to specific nodes by hostname or node ID. Two launch modes are supported: tasks can run independently as resources become available, or be grouped into a simultaneous batch (ideal for multi-node training jobs) that launches only when the cluster has sufficient resources for every task.
.. dropdown:: scripts/reinforcement_learning/ray/task_runner.py
:icon: code
.. literalinclude:: ../../../scripts/reinforcement_learning/ray/task_runner.py
:language: python
:emphasize-lines: 13-105
To use this script, run a command similar to the following (replace ``tasks.yaml`` with your actual configuration file):
.. code-block:: bash
python3 scripts/reinforcement_learning/ray/submit_job.py --aggregate_jobs task_runner.py --task_cfg tasks.yaml
For detailed instructions on how to write your ``tasks.yaml`` file, please refer to the comments in ``task_runner.py``.
**Tip:** Place the ``tasks.yaml`` file in the ``scripts/reinforcement_learning/ray`` directory so that it is included when the ``working_dir`` is uploaded. You can then reference it using a relative path in the command.
Transferring files from the running container can be done as follows. Transferring files from the running container can be done as follows.
...@@ -288,7 +313,7 @@ where instructions are included in the following creation file. ...@@ -288,7 +313,7 @@ where instructions are included in the following creation file.
.. literalinclude:: ../../../scripts/reinforcement_learning/ray/launch.py .. literalinclude:: ../../../scripts/reinforcement_learning/ray/launch.py
:language: python :language: python
:emphasize-lines: 15-37 :emphasize-lines: 15-36
For other cloud services, the ``kuberay.yaml.ninja`` will be similar to that of For other cloud services, the ``kuberay.yaml.ninja`` will be similar to that of
Google's. Google's.
...@@ -345,7 +370,7 @@ Dispatching Steps Shared Between KubeRay and Pure Ray Part II ...@@ -345,7 +370,7 @@ Dispatching Steps Shared Between KubeRay and Pure Ray Part II
.. literalinclude:: ../../../scripts/reinforcement_learning/ray/submit_job.py .. literalinclude:: ../../../scripts/reinforcement_learning/ray/submit_job.py
:language: python :language: python
:emphasize-lines: 12-53 :emphasize-lines: 13-61
3.) For tuning jobs, specify the tuning job / hyperparameter sweep as a :class:`JobCfg` . 3.) For tuning jobs, specify the tuning job / hyperparameter sweep as a :class:`JobCfg` .
The included :class:`JobCfg` only supports the ``rl_games`` workflow due to differences in The included :class:`JobCfg` only supports the ``rl_games`` workflow due to differences in
......
...@@ -3,13 +3,6 @@ ...@@ -3,13 +3,6 @@
# #
# SPDX-License-Identifier: BSD-3-Clause # SPDX-License-Identifier: BSD-3-Clause
import argparse
import os
import time
from concurrent.futures import ThreadPoolExecutor
from ray import job_submission
""" """
This script submits aggregate job(s) to cluster(s) described in a This script submits aggregate job(s) to cluster(s) described in a
config file containing ``name: <NAME> address: http://<IP>:<PORT>`` on config file containing ``name: <NAME> address: http://<IP>:<PORT>`` on
...@@ -26,7 +19,11 @@ An aggregate job could be a :file:`../tuner.py` tuning job, which automatically ...@@ -26,7 +19,11 @@ An aggregate job could be a :file:`../tuner.py` tuning job, which automatically
creates several individual jobs when started on a cluster. Alternatively, an aggregate job creates several individual jobs when started on a cluster. Alternatively, an aggregate job
could be a :file:'../wrap_resources.py` resource-wrapped job, could be a :file:'../wrap_resources.py` resource-wrapped job,
which may contain several individual sub-jobs separated by which may contain several individual sub-jobs separated by
the + delimiter. the + delimiter. An aggregate job could also be a :file:`../task_runner.py` multi-task submission job,
where each sub-job and its resource requirements are defined in a YAML configuration file.
In this mode, :file:`../task_runner.py` will read the YAML file (via --task_cfg), and
submit all defined sub-tasks to the Ray cluster, supporting per-job resource specification and
real-time streaming of sub-job outputs.
If there are more aggregate jobs than cluster(s), aggregate jobs will be submitted If there are more aggregate jobs than cluster(s), aggregate jobs will be submitted
as clusters become available via the defined relation above. If there are less aggregate job(s) as clusters become available via the defined relation above. If there are less aggregate job(s)
...@@ -48,9 +45,21 @@ Usage: ...@@ -48,9 +45,21 @@ Usage:
# Example: Submitting resource wrapped job # Example: Submitting resource wrapped job
python3 scripts/reinforcement_learning/ray/submit_job.py --aggregate_jobs wrap_resources.py --test python3 scripts/reinforcement_learning/ray/submit_job.py --aggregate_jobs wrap_resources.py --test
# Example: submitting tasks with specific resources, and supporting pip packages and py_modules
# You may use relative paths for task_cfg and py_modules, placing them in the scripts/reinforcement_learning/ray directory, which will be uploaded to the cluster.
python3 scripts/reinforcement_learning/ray/submit_job.py --aggregate_jobs task_runner.py --task_cfg tasks.yaml
# For all command line arguments # For all command line arguments
python3 scripts/reinforcement_learning/ray/submit_job.py -h python3 scripts/reinforcement_learning/ray/submit_job.py -h
""" """
import argparse
import os
import time
from concurrent.futures import ThreadPoolExecutor
from ray import job_submission
script_directory = os.path.dirname(os.path.abspath(__file__)) script_directory = os.path.dirname(os.path.abspath(__file__))
CONFIG = {"working_dir": script_directory, "executable": "/workspace/isaaclab/isaaclab.sh -p"} CONFIG = {"working_dir": script_directory, "executable": "/workspace/isaaclab/isaaclab.sh -p"}
......
# Copyright (c) 2022-2025, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
"""
This script dispatches one or more user-defined Python tasks to workers in a Ray cluster.
Each task, along with its resource requirements and execution parameters, is specified in a YAML configuration file.
Users may define the number of CPUs, GPUs, and the amount of memory to allocate per task via the config file.
Key features:
-------------
- Fine-grained, per-task resource management via config fields (`num_gpus`, `num_cpus`, `memory`).
- Parallel execution of multiple tasks using available resources across the Ray cluster.
- Option to specify node affinity for tasks, e.g., by hostname, node ID, or any node.
- Optional batch (simultaneous) or independent scheduling of tasks.
Task scheduling and distribution are handled via Ray’s built-in resource manager.
YAML configuration fields:
--------------------------
- `pip`: List of extra pip packages to install before running any tasks.
- `py_modules`: List of additional Python module paths (directories or files) to include in the runtime environment.
- `concurrent`: (bool) It determines task dispatch semantics:
  - If `concurrent: true`, **all tasks are scheduled as a batch**. The script waits until sufficient resources are available for every task in the batch, then launches all tasks together. If resources are insufficient, all tasks remain blocked until the cluster can support the full batch.
  - If `concurrent: false`, tasks are launched as soon as resources are available for each individual task, and Ray independently schedules them. This may result in non-simultaneous task start times.
- `tasks`: List of task specifications, each with:
  - `name`: String identifier for the task.
  - `py_args`: Arguments to the Python interpreter (e.g., script/module, flags, user arguments).
  - `num_gpus`: Number of GPUs to allocate (float or string arithmetic, e.g., "2*2").
  - `num_cpus`: Number of CPUs to allocate (float or string).
  - `memory`: Amount of RAM in bytes (int or string).
  - `node` (optional): Node placement constraints.
    - `specific` (str): Type of node placement, support `hostname`, `node_id`, or `any`.
      - `any`: Place the task on any available node.
      - `hostname`: Place the task on a specific hostname. `hostname` must be specified in the node field.
      - `node_id`: Place the task on a specific node ID. `node_id` must be specified in the node field.
    - `hostname` (str): Specific hostname to place the task on.
    - `node_id` (str): Specific node ID to place the task on.
Typical usage:
---------------
.. code-block:: bash

    # Print help and argument details:
    python task_runner.py -h

    # Submit tasks defined in a YAML file to the Ray cluster (auto-detects Ray head address):
    python task_runner.py --task_cfg /path/to/tasks.yaml

YAML configuration example-1:
---------------------------
.. code-block:: yaml

    pip: ["xxx"]
    py_modules: ["my_package/my_package"]
    concurrent: false
    tasks:
      - name: "Isaac-Cartpole-v0"
        py_args: "-m torch.distributed.run --nnodes=1 --nproc_per_node=2 --rdzv_endpoint=localhost:29501 /workspace/isaaclab/scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --max_iterations 200 --headless --distributed"
        num_gpus: 2
        num_cpus: 10
        memory: 10737418240
      - name: "script need some dependencies"
        py_args: "script.py --option arg"
        num_gpus: 0
        num_cpus: 1
        memory: 10*1024*1024*1024

YAML configuration example-2:
---------------------------
.. code-block:: yaml

    pip: ["xxx"]
    py_modules: ["my_package/my_package"]
    concurrent: true
    tasks:
      - name: "Isaac-Cartpole-v0-multi-node-train-1"
        py_args: "-m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 /workspace/isaaclab/scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --max_iterations 1000"
        num_gpus: 1
        num_cpus: 10
        memory: 10*1024*1024*1024
        node:
          specific: "hostname"
          hostname: "xxx"
      - name: "Isaac-Cartpole-v0-multi-node-train-2"
        py_args: "-m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=1 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=x.x.x.x:5555 /workspace/isaaclab/scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --max_iterations 1000"
        num_gpus: 1
        num_cpus: 10
        memory: 10*1024*1024*1024
        node:
          specific: "hostname"
          hostname: "xxx"

To stop all tasks early, press Ctrl+C; the script will cancel all running Ray tasks.
"""
import argparse
import yaml
from datetime import datetime
import util
def parse_args() -> argparse.Namespace:
    """
    Parse command-line arguments for the Ray task runner.

    Returns:
        argparse.Namespace: The namespace containing parsed CLI arguments:
            - task_cfg (str): Path to the YAML task file.
            - ray_address (str): Ray cluster address.
            - test (bool): Whether to run a GPU resource isolation sanity check.
    """
    parser = argparse.ArgumentParser(description="Run tasks from a YAML config file.")
    parser.add_argument("--task_cfg", type=str, required=True, help="Path to the YAML task file.")
    parser.add_argument("--ray_address", type=str, default="auto", help="the Ray address.")
    parser.add_argument(
        "--test",
        action="store_true",
        help=(
            "Run nvidia-smi test instead of the arbitrary job, "
            "can use as a sanity check prior to any jobs to check "
            "that GPU resources are correctly isolated."
        ),
    )
    return parser.parse_args()
def parse_task_resource(task: dict) -> util.JobResource:
    """
    Parse task resource requirements from the YAML configuration.

    Args:
        task (dict): Dictionary representing a single task's configuration.
            Keys may include `num_gpus`, `num_cpus`, and `memory`, each either
            as a number or evaluatable string expression.

    Returns:
        util.JobResource: Resource object with the parsed values.
    """
    resource = util.JobResource()
    if "num_gpus" in task:
        resource.num_gpus = eval(task["num_gpus"]) if isinstance(task["num_gpus"], str) else task["num_gpus"]
    if "num_cpus" in task:
        resource.num_cpus = eval(task["num_cpus"]) if isinstance(task["num_cpus"], str) else task["num_cpus"]
    if "memory" in task:
        resource.memory = eval(task["memory"]) if isinstance(task["memory"], str) else task["memory"]
    return resource
def run_tasks(
    tasks: list[dict], args: argparse.Namespace, runtime_env: dict | None = None, concurrent: bool = False
) -> None:
    """
    Submit tasks to the Ray cluster for execution.

    Args:
        tasks (list[dict]): A list of task configuration dictionaries.
        args (argparse.Namespace): Parsed command-line arguments.
        runtime_env (dict | None): Ray runtime environment configuration containing:
            - pip (list[str] | None): Additional pip packages to install.
            - py_modules (list[str] | None): Python modules to include in the environment.
        concurrent (bool): Whether to launch tasks simultaneously as a batch,
            or independently as resources become available.

    Returns:
        None
    """
    job_objs = []
    util.ray_init(ray_address=args.ray_address, runtime_env=runtime_env, log_to_driver=False)
    for task in tasks:
        resource = parse_task_resource(task)
        print(f"[INFO] Creating job {task['name']} with resource={resource}")
        job = util.Job(
            name=task["name"],
            py_args=task["py_args"],
            resources=resource,
            node=util.JobNode(
                specific=task.get("node", {}).get("specific"),
                hostname=task.get("node", {}).get("hostname"),
                node_id=task.get("node", {}).get("node_id"),
            ),
        )
        job_objs.append(job)
    start = datetime.now()
    print(f"[INFO] Creating {len(job_objs)} jobs at {start.strftime('%H:%M:%S.%f')} with runtime env={runtime_env}")
    # submit jobs
    util.submit_wrapped_jobs(
        jobs=job_objs,
        test_mode=args.test,
        concurrent=concurrent,
    )
    end = datetime.now()
    print(
        f"[INFO] All jobs completed at {end.strftime('%H:%M:%S.%f')}, took {(end - start).total_seconds():.2f} seconds."
    )
def main() -> None:
    """
    Main entry point for the Ray task runner script.

    Reads the YAML task configuration file, parses CLI arguments,
    and dispatches tasks to the Ray cluster.

    Returns:
        None
    """
    args = parse_args()
    with open(args.task_cfg) as f:
        config = yaml.safe_load(f)
    tasks = config["tasks"]
    runtime_env = {
        "pip": None if not config.get("pip") else config["pip"],
        "py_modules": None if not config.get("py_modules") else config["py_modules"],
    }
    concurrent = config.get("concurrent", False)
    run_tasks(
        tasks=tasks,
        args=args,
        runtime_env=runtime_env,
        concurrent=concurrent,
    )


if __name__ == "__main__":
    main()
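
The docstring of `task_runner.py` states that a task's optional `node` field takes a `specific` value of `hostname`, `node_id`, or `any`, and that the matching `hostname` or `node_id` key must be present. The sketch below shows how such a constraint could be checked before any job is created. `validate_node` is a hypothetical helper that is not part of this commit, and treating a missing `node` field as `any` is an assumption made for illustration.

```python
ALLOWED_PLACEMENTS = ("any", "hostname", "node_id")


def validate_node(node):
    """Return a normalized copy of a task's `node` mapping, or raise ValueError.

    Hypothetical helper; a missing or empty mapping is assumed to mean "any".
    """
    if not node:
        return {"specific": "any"}
    specific = node.get("specific", "any")
    if specific not in ALLOWED_PLACEMENTS:
        raise ValueError(f"node.specific must be one of {ALLOWED_PLACEMENTS}, got {specific!r}")
    if specific == "any":
        return {"specific": "any"}
    # For "hostname" and "node_id", the key of the same name must carry a value.
    target = node.get(specific)
    if not target:
        raise ValueError(f"node.{specific} is required when node.specific is {specific!r}")
    return {"specific": specific, specific: str(target)}


if __name__ == "__main__":
    print(validate_node({"specific": "hostname", "hostname": "worker-1"}))
```

Checking the mapping up front would surface a misconfigured YAML entry as a clear error on the submitting side, rather than as a scheduling problem on the cluster.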
This diff is collapsed.
...@@ -3,12 +3,6 @@ ...@@ -3,12 +3,6 @@
# #
# SPDX-License-Identifier: BSD-3-Clause # SPDX-License-Identifier: BSD-3-Clause
import argparse
import ray
import util
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy
""" """
This script dispatches sub-job(s) (individual jobs, use :file:`tuner.py` for tuning jobs) This script dispatches sub-job(s) (individual jobs, use :file:`tuner.py` for tuning jobs)
to worker(s) on GPU-enabled node(s) of a specific cluster as part of an resource-wrapped aggregate to worker(s) on GPU-enabled node(s) of a specific cluster as part of an resource-wrapped aggregate
...@@ -64,6 +58,10 @@ Usage: ...@@ -64,6 +58,10 @@ Usage:
./isaaclab.sh -p scripts/reinforcement_learning/ray/wrap_resources.py -h ./isaaclab.sh -p scripts/reinforcement_learning/ray/wrap_resources.py -h
""" """
import argparse
import util
def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None: def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None:
""" """
...@@ -75,9 +73,14 @@ def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None: ...@@ -75,9 +73,14 @@ def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None:
args: The arguments for resource allocation args: The arguments for resource allocation
""" """
if not ray.is_initialized(): job_objs = []
ray.init(address=args.ray_address, log_to_driver=True) util.ray_init(
job_results = [] ray_address=args.ray_address,
runtime_env={
"py_modules": None if not args.py_modules else args.py_modules,
},
log_to_driver=False,
)
gpu_node_resources = util.get_gpu_node_resources(include_id=True, include_gb_ram=True) gpu_node_resources = util.get_gpu_node_resources(include_id=True, include_gb_ram=True)
if any([args.gpu_per_worker, args.cpu_per_worker, args.ram_gb_per_worker]) and args.num_workers: if any([args.gpu_per_worker, args.cpu_per_worker, args.ram_gb_per_worker]) and args.num_workers:
...@@ -97,7 +100,7 @@ def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None: ...@@ -97,7 +100,7 @@ def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None:
jobs = ["nvidia-smi"] * num_nodes jobs = ["nvidia-smi"] * num_nodes
for i, job in enumerate(jobs): for i, job in enumerate(jobs):
gpu_node = gpu_node_resources[i % num_nodes] gpu_node = gpu_node_resources[i % num_nodes]
print(f"[INFO]: Submitting job {i + 1} of {len(jobs)} with job '{job}' to node {gpu_node}") print(f"[INFO]: Creating job {i + 1} of {len(jobs)} with job '{job}' to node {gpu_node}")
print( print(
f"[INFO]: Resource parameters: GPU: {args.gpu_per_worker[i]}" f"[INFO]: Resource parameters: GPU: {args.gpu_per_worker[i]}"
f" CPU: {args.cpu_per_worker[i]} RAM {args.ram_gb_per_worker[i]}" f" CPU: {args.cpu_per_worker[i]} RAM {args.ram_gb_per_worker[i]}"
...@@ -106,19 +109,19 @@ def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None: ...@@ -106,19 +109,19 @@ def wrap_resources_to_jobs(jobs: list[str], args: argparse.Namespace) -> None:
num_gpus = args.gpu_per_worker[i] / args.num_workers[i] num_gpus = args.gpu_per_worker[i] / args.num_workers[i]
num_cpus = args.cpu_per_worker[i] / args.num_workers[i] num_cpus = args.cpu_per_worker[i] / args.num_workers[i]
memory = (args.ram_gb_per_worker[i] * 1024**3) / args.num_workers[i] memory = (args.ram_gb_per_worker[i] * 1024**3) / args.num_workers[i]
print(f"[INFO]: Requesting {num_gpus=} {num_cpus=} {memory=} id={gpu_node['id']}") job_objs.append(
job = util.remote_execute_job.options( util.Job(
num_gpus=num_gpus, cmd=job,
num_cpus=num_cpus, name=f"Job-{i + 1}",
memory=memory, resources=util.JobResource(num_gpus=num_gpus, num_cpus=num_cpus, memory=memory),
scheduling_strategy=NodeAffinitySchedulingStrategy(gpu_node["id"], soft=False), node=util.JobNode(
).remote(job, f"Job {i}", args.test) specific="node_id",
job_results.append(job) node_id=gpu_node["id"],
),
results = ray.get(job_results) )
for i, result in enumerate(results): )
print(f"[INFO]: Job {i} result: {result}") # submit jobs
print("[INFO]: All jobs completed.") util.submit_wrapped_jobs(jobs=job_objs, test_mode=args.test, concurrent=False)
if __name__ == "__main__": if __name__ == "__main__":
...@@ -134,6 +137,15 @@ if __name__ == "__main__": ...@@ -134,6 +137,15 @@ if __name__ == "__main__":
"that GPU resources are correctly isolated." "that GPU resources are correctly isolated."
), ),
) )
parser.add_argument(
"--py_modules",
type=str,
nargs="*",
default=[],
help=(
"List of python modules or paths to add before running the job. Example: --py_modules my_package/my_package"
),
)
parser.add_argument( parser.add_argument(
"--sub_jobs", "--sub_jobs",
type=str, type=str,
......