Commit 08ebac7a authored by Shafeef Omar, committed by GitHub

Adds support for PBS job scheduler based clusters (#605)

# Description

This pull request adds support for running IsaacLab on clusters that use
the PBS job scheduler (e.g. Franklin@IIT). Until now, only SLURM was
supported. The job submission scripts have been modified to choose between
the SLURM and PBS job schedulers. The user selects the required job
scheduler in `docker/cluster/.env.base`, under the cluster-specific
settings.
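
The scheduler selection described above can be sketched as a small dispatch function. This is a simplified illustration, not the code from this PR: `resolve_scheduler` is a hypothetical helper name, and it only prints the command and script names that the `submit_job()` function in the diff chooses for each value of `CLUSTER_JOB_SCHEDULER`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): maps a scheduler name to the
# command and job script that would be used for submission.
resolve_scheduler() {
    case "$1" in
        SLURM) echo "sbatch submit_job_slurm.sh" ;;
        PBS)   echo "bash submit_job_pbs.sh" ;;
        *)
            # Unknown scheduler: report on stderr and signal failure
            echo "[ERROR] Unsupported job scheduler: '$1'" >&2
            return 1
            ;;
    esac
}

# Falls back to SLURM here only for the sake of the demo
resolve_scheduler "${CLUSTER_JOB_SCHEDULER:-SLURM}"
```

The PBS branch uses `bash` rather than `qsub` directly because, in this change, the PBS script generates its own job file and calls `qsub` itself.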

Tested successfully on Franklin@IIT HPC.
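
For readers unfamiliar with the pattern, the PBS submission script in this change writes a temporary job file with an unquoted here-document, so the positional parameters are expanded when the file is generated rather than when the job runs. The sketch below reproduces that behavior under stated assumptions: `write_job_script` and `run.sh` are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes a job file using an unquoted here-document,
# so $1, $2 and ${@:3} are substituted at generation time.
write_job_script() {
    local out_file="$1"
    shift  # remaining args: <repo dir> <container profile> <script args...>
    cat <<EOT > "$out_file"
#!/bin/bash
#PBS -N example
sh "$1/run.sh" "$2" "${@:3}"
EOT
}

job_file="$(mktemp)"
write_job_script "$job_file" /opt/isaaclab isaac-lab-base --task Foo
cat "$job_file"   # the generated file contains literal paths, not \$1 or \$2
rm -f "$job_file"
```

Note that inside a here-document `"${@:3}"` is joined into a single quoted string, so the script arguments reach the generated file as one word.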

Fixes #599 

## Type of change

- New feature (non-breaking change which adds functionality)
- This change requires a documentation update

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have run all the tests with `./isaaclab.sh --test` and they pass
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there
parent 455a1748
......@@ -17,6 +17,9 @@ DOCKER_USER_HOME=/root
# Cluster specific settings
###
# Job scheduler used by cluster.
# Currently supports PBS and SLURM
CLUSTER_JOB_SCHEDULER=SLURM
# Docker cache dir for Isaac Sim (has to end on docker-isaac-sim)
# e.g. /cluster/scratch/$USER/docker-isaac-sim
CLUSTER_ISAAC_SIM_CACHE_DIR=/some/path/on/cluster/docker-isaac-sim
......
#!/usr/bin/env bash
# in the case you need to load specific modules on the cluster, add them here
# e.g., `module load eth_proxy`
# create job script with compute demands
### MODIFY HERE FOR YOUR JOB ###
cat <<EOT > job.sh
#!/bin/bash
#PBS -l select=1:ncpus=8:mpiprocs=1:ngpus=1
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -q gpu
#PBS -N isaaclab
#PBS -m bea -M "user@mail"
# Pass the container profile first to run_singularity.sh, then all arguments intended for the executed script
sh "$1/docker/cluster/run_singularity.sh" "$2" "${@:3}"
EOT
qsub job.sh
rm job.sh
......@@ -231,6 +231,28 @@ x11_cleanup() {
fi
}
submit_job() {
echo "[INFO] Arguments passed to job script ${@}"
case $CLUSTER_JOB_SCHEDULER in
"SLURM")
CMD=sbatch
job_script_file=submit_job_slurm.sh
;;
"PBS")
CMD=bash
job_script_file=submit_job_pbs.sh
;;
*)
echo "[ERROR] Unsupported job scheduler specified: '$CLUSTER_JOB_SCHEDULER'. Supported options are: ['SLURM', 'PBS']"
exit 1
;;
esac
ssh $CLUSTER_LOGIN "cd $CLUSTER_ISAACLAB_DIR && $CMD $CLUSTER_ISAACLAB_DIR/docker/cluster/$job_script_file \"$CLUSTER_ISAACLAB_DIR\" \"isaac-lab-$container_profile\" ${@}"
}
#==
# Main
#==
......@@ -372,12 +394,10 @@ case $mode in
# check whether the second argument is a profile or a job argument
if [ "$profile_arg" == "$container_profile" ] ; then
# if the second argument is a profile, we have to shift the arguments
echo "[INFO] Arguments passed to job script ${@:3}"
ssh $CLUSTER_LOGIN "cd $CLUSTER_ISAACLAB_DIR && sbatch $CLUSTER_ISAACLAB_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ISAACLAB_DIR" "isaac-lab-$container_profile" "${@:3}"
submit_job "${@:3}"
else
# if the second argument is a job argument, we have to shift only one argument
echo "[INFO] Arguments passed to job script ${@:2}"
ssh $CLUSTER_LOGIN "cd $CLUSTER_ISAACLAB_DIR && sbatch $CLUSTER_ISAACLAB_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ISAACLAB_DIR" "isaac-lab-$container_profile" "${@:2}"
submit_job "${@:2}"
fi
;;
config)
......
......@@ -17,7 +17,8 @@ convert the Isaac Lab Docker image into a singularity image and use it to submit
.. attention::
Cluster setup varies across different institutions. The following instructions have been
tested on the `ETH Zurich Euler`_ cluster, which uses the SLURM workload manager.
tested on the `ETH Zurich Euler`_ cluster (which uses the SLURM workload manager) and the
IIT Genoa Franklin cluster (which uses the PBS workload manager).
The instructions may need to be adapted for other clusters. If you have successfully
adapted the instructions for another cluster, please consider contributing to the
......@@ -59,7 +60,9 @@ Configuring the cluster parameters
First, you need to configure the cluster-specific parameters in ``docker/.env.base`` file.
The following describes the parameters that need to be configured:
- ``CLUSTER_JOB_SCHEDULER``:
The job scheduler/workload manager used by your cluster. Currently, we support SLURM and
PBS workload managers [SLURM | PBS].
- ``CLUSTER_ISAAC_SIM_CACHE_DIR``:
The directory on the cluster where the Isaac Sim cache is stored. This directory
has to end on ``docker-isaac-sim``. This directory will be copied to the compute node
......@@ -105,19 +108,25 @@ specified, the default profile ``base`` will be used.
access by removing the flag in ``docker/container.sh``.
Job Submission and Execution
----------------------------
Defining the job parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~
---------------------------
The job parameters need to be defined based on the job scheduler used by your cluster. You only need to update the appropriate script for the scheduler available to you.
The job parameters are defined inside the ``docker/cluster/submit_job.sh``.
- For SLURM, update the parameters in ``docker/cluster/submit_job_slurm.sh``.
- For PBS, update the parameters in ``docker/cluster/submit_job_pbs.sh``.
For SLURM
~~~~~~~~~
The job parameters are defined inside the ``docker/cluster/submit_job_slurm.sh``.
A typical SLURM operation requires specifying the number of CPUs and GPUs, the memory, and
the time limit. For more information, please check the `SLURM documentation`_.
The default configuration is as follows:
.. literalinclude:: ../../../docker/cluster/submit_job.sh
.. literalinclude:: ../../../docker/cluster/submit_job_slurm.sh
:language: bash
:lines: 12-19
:linenos:
......@@ -128,16 +137,32 @@ This is required to load assets from the Nucleus server. For some cluster archit
must be loaded to allow internet access.
For instance, on ETH Zurich Euler cluster, the ``eth_proxy`` module needs to be loaded. This can be done
by adding the following line to the ``submit_job.sh`` script:
by adding the following line to the ``submit_job_slurm.sh`` script:
.. literalinclude:: ../../../docker/cluster/submit_job.sh
.. literalinclude:: ../../../docker/cluster/submit_job_slurm.sh
:language: bash
:lines: 3-5
:linenos:
:lineno-start: 3
For PBS
~~~~~~~
The job parameters are defined inside the ``docker/cluster/submit_job_pbs.sh``.
A typical PBS operation requires specifying the number of CPUs and GPUs, and the time limit. For more
information, please check the `PBS Official Site`_.
The default configuration is as follows:
.. literalinclude:: ../../../docker/cluster/submit_job_pbs.sh
:language: bash
:lines: 11-17
:linenos:
:lineno-start: 11
Submitting a job
~~~~~~~~~~~~~~~~
----------------
To submit a job on the cluster, the following command can be used:
......@@ -173,6 +198,7 @@ The above will, in addition, also render videos of the training progress and sto
.. _Singularity: https://docs.sylabs.io/guides/2.6/user-guide/index.html
.. _ETH Zurich Euler: https://scicomp.ethz.ch/wiki/Euler
.. _PBS Official Site: https://openpbs.org/
.. _apptainer: https://apptainer.org/
.. _documentation: https://www.apptainer.org/docs/admin/main/installation.html#install-ubuntu-packages
.. _SLURM documentation: https://www.slurm.schedmd.com/sbatch.html
......