Unverified commit 36c58ea0, authored by Pascal Roth, committed by GitHub

Adds ability to launch multiple jobs on cluster with different code changes (#832)

# Description

Currently, for multiple runs, the code is pushed to the same location on
the cluster. When runs do not start immediately (i.e., they are waiting
in the queue), the code is overwritten each time a new job is
submitted. This PR creates a new copy of the code for each job to ensure
that each submitted job runs the intended version.

## Type of change

- New feature (non-breaking change which adds functionality)

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have run all the tests with `./isaaclab.sh --test` and they pass
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there
parent bf068c83
@@ -16,5 +16,7 @@ CLUSTER_LOGIN=username@cluster_ip
 # Cluster scratch directory to store the SIF file
 # e.g. /cluster/scratch/$USER
 CLUSTER_SIF_PATH=/some/path/on/cluster/
+# Remove the temporary isaaclab code copy after the job is done
+REMOVE_CODE_COPY_AFTER_JOB=false
 # Python executable within Isaac Lab directory to run with the submitted job
 CLUSTER_PYTHON_EXECUTABLE=source/standalone/workflows/rsl_rl/train.py
@@ -168,6 +168,10 @@ case $command in
 [ -n "$profile" ] && echo "Using profile: $profile"
 [ -n "$job_args" ] && echo "Job arguments: $job_args"
 source $SCRIPT_DIR/.env.cluster
+# Get current date and time
+current_datetime=$(date +"%Y%m%d_%H%M%S")
+# Append current date and time to CLUSTER_ISAACLAB_DIR
+CLUSTER_ISAACLAB_DIR="${CLUSTER_ISAACLAB_DIR}_${current_datetime}"
 # Check if singularity image exists on the remote host
 check_singularity_image_exists isaac-lab-$profile
 # make sure target directory exists
......
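The hunk above appends a timestamp to `CLUSTER_ISAACLAB_DIR` so that each submission gets its own code directory. A minimal sketch of that naming scheme follows; `job_code_dir` and the example path are hypothetical and not part of the PR. Because the timestamp has one-second resolution, two submissions within the same second would map to the same directory name under this scheme.

```shell
#!/usr/bin/env bash
# Sketch only: job_code_dir is a hypothetical helper, not part of the PR.
# It derives a per-submission directory name from a base path and a timestamp.
job_code_dir() {
    local base_dir="$1"
    # An optional second argument injects a fixed timestamp (handy for testing);
    # otherwise the current date and time are used, in the same format as the PR.
    local stamp="${2:-$(date +"%Y%m%d_%H%M%S")}"
    # Strip one trailing slash so the suffix attaches to the directory name itself.
    printf '%s_%s\n' "${base_dir%/}" "$stamp"
}
```

For example, `job_code_dir "/cluster/home/user/isaaclab" "20240101_120000"` prints `/cluster/home/user/isaaclab_20240101_120000`.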
 #!/usr/bin/env bash
-echo "(run_singularity.py): Called on compute node with container profile $1 and arguments ${@:2}"
+echo "(run_singularity.py): Called on compute node from current isaaclab directory $1 with container profile $2 and arguments ${@:3}"
 #==
 # Helper functions
@@ -42,13 +42,17 @@ setup_directories
 # copy all cache files
 cp -r $CLUSTER_ISAAC_SIM_CACHE_DIR $TMPDIR
-# copy Isaac Lab source code
+# make sure logs directory exists (in the permanent isaaclab directory)
 mkdir -p "$CLUSTER_ISAACLAB_DIR/logs"
 touch "$CLUSTER_ISAACLAB_DIR/logs/.keep"
-cp -r $CLUSTER_ISAACLAB_DIR $TMPDIR
+# copy the temporary isaaclab directory with the latest changes to the compute node
+cp -r $1 $TMPDIR
+# Get the directory name
+dir_name=$(basename "$1")
 # copy container to the compute node
-tar -xf $CLUSTER_SIF_PATH/$1.tar -C $TMPDIR
+tar -xf $CLUSTER_SIF_PATH/$2.tar -C $TMPDIR
 # execute command in singularity container
 # NOTE: ISAACLAB_PATH is normally set in `isaaclab.sh` but we directly call the isaac-sim python because we sync the entire
@@ -62,12 +66,17 @@ singularity exec \
 -B $TMPDIR/docker-isaac-sim/logs:${DOCKER_USER_HOME}/.nvidia-omniverse/logs:rw \
 -B $TMPDIR/docker-isaac-sim/data:${DOCKER_USER_HOME}/.local/share/ov/data:rw \
 -B $TMPDIR/docker-isaac-sim/documents:${DOCKER_USER_HOME}/Documents:rw \
--B $TMPDIR/isaaclab:/workspace/isaaclab:rw \
+-B $TMPDIR/$dir_name:/workspace/isaaclab:rw \
 -B $CLUSTER_ISAACLAB_DIR/logs:/workspace/isaaclab/logs:rw \
---nv --writable --containall $TMPDIR/$1.sif \
-bash -c "export ISAACLAB_PATH=/workspace/isaaclab && cd /workspace/isaaclab && /isaac-sim/python.sh ${CLUSTER_PYTHON_EXECUTABLE} ${@:2}"
+--nv --writable --containall $TMPDIR/$2.sif \
+bash -c "export ISAACLAB_PATH=/workspace/isaaclab && cd /workspace/isaaclab && /isaac-sim/python.sh ${CLUSTER_PYTHON_EXECUTABLE} ${@:3}"
 # copy resulting cache files back to host
-cp -r $TMPDIR/docker-isaac-sim $CLUSTER_ISAAC_SIM_CACHE_DIR/..
+rsync -azPv $TMPDIR/docker-isaac-sim $CLUSTER_ISAAC_SIM_CACHE_DIR/..
+# if defined, remove the temporary isaaclab directory pushed when the job was submitted
+if $REMOVE_CODE_COPY_AFTER_JOB; then
+rm -rf $1
+fi
 echo "(run_singularity.py): Return"
@@ -16,7 +16,7 @@ cat <<EOT > job.sh
 #PBS -m bea -M "user@mail"
 # Pass the container profile first to run_singularity.sh, then all arguments intended for the executed script
-sh "$1/docker/cluster/run_singularity.sh" "$2" "${@:3}"
+bash "$1/docker/cluster/run_singularity.sh" "$1" "$2" "${@:3}"
 EOT
 qsub job.sh
......
@@ -18,7 +18,7 @@ cat <<EOT > job.sh
 #SBATCH --job-name="training-$(date +"%Y-%m-%dT%H:%M")"
 # Pass the container profile first to run_singularity.sh, then all arguments intended for the executed script
-bash "$1/docker/cluster/run_singularity.sh" "$2" "${@:3}"
+bash "$1/docker/cluster/run_singularity.sh" "$1" "$2" "${@:3}"
 EOT
 sbatch < job.sh
......
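After this change, both submit scripts pass the per-job code directory as the first argument to `run_singularity.sh`, the container profile as the second, and the job arguments after that. The following sketch illustrates that positional layout; `describe_args` and the sample values are hypothetical and only mirror the argument order shown in the hunks above.

```shell
#!/usr/bin/env bash
# Sketch only: describe_args is a hypothetical helper, not part of the PR.
# It mirrors the argument order used after this change:
#   $1 = per-job code directory, $2 = container profile, rest = job arguments.
describe_args() {
    local code_dir="$1"
    local profile="$2"
    # Drop the first two parameters so "$@" holds only the job arguments.
    shift 2
    echo "code_dir=$code_dir"
    echo "profile=$profile"
    # basename yields the directory name that gets bind-mounted from $TMPDIR.
    echo "dir_name=$(basename "$code_dir")"
    echo "num_job_args=$#"
}
```

Calling `describe_args "/cluster/isaaclab_20240101_120000" "base" --headless --video` reports the code directory, the profile `base`, the directory name `isaaclab_20240101_120000`, and two job arguments.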
@@ -76,16 +76,22 @@ The following describes the parameters that need to be configured:
 and mounted into the singularity container. This should increase the speed of starting
 the simulation.
 * - CLUSTER_ISAACLAB_DIR
-- The directory on the cluster where the Isaac Lab code is stored. This directory has to
+- The directory on the cluster where the Isaac Lab logs are stored. This directory has to
 end on ``isaaclab``. It will be copied to the compute node and mounted into
 the singularity container. When a job is submitted, the latest local changes will
-be copied to the cluster.
+be copied to the cluster to a new directory in the format ``${CLUSTER_ISAACLAB_DIR}_${datetime}``
+with the date and time of the job submission. This allows to run multiple jobs with different code versions at
+the same time.
 * - CLUSTER_LOGIN
 - The login to the cluster. Typically, this is the user and cluster names,
 e.g., ``your_user@euler.ethz.ch``.
 * - CLUSTER_SIF_PATH
 - The path on the cluster where the singularity image will be stored. The image will be
 copied to the compute node but not uploaded again to the cluster when a job is submitted.
+* - REMOVE_CODE_COPY_AFTER_JOB
+- Whether the copied code should be removed after the job is finished or not. The logs from the job will not be deleted
+as these are saved under the permanent ``CLUSTER_ISAACLAB_DIR``. This feature is useful
+to save disk space on the cluster. If set to ``true``, the code copy will be removed.
 * - CLUSTER_PYTHON_EXECUTABLE
 - The path within Isaac Lab to the Python executable that should be executed in the submitted job.
@@ -122,8 +128,8 @@ specified, the default profile ``base`` will be used.
 Defining the job parameters
 ---------------------------
-The job parameters need to be defined based on the job scheduler used by your cluster.
+The job parameters need to be defined based on the job scheduler used by your cluster.
 You only need to update the appropriate script for the scheduler available to you.
 - For SLURM, update the parameters in ``docker/cluster/submit_job_slurm.sh``.
 - For PBS, update the parameters in ``docker/cluster/submit_job_pbs.sh``.
@@ -182,8 +188,8 @@ To submit a job on the cluster, the following command can be used:
 ./docker/cluster/cluster_interface.sh job [profile] "argument1" "argument2" ...
 This command will copy the latest changes in your code to the cluster and submit a job. Please ensure that
-your Python executable's output is stored under ``isaaclab/logs`` as this directory will be copied again
-from the compute node to ``CLUSTER_ISAACLAB_DIR``.
+your Python executable's output is stored under ``isaaclab/logs`` as this directory is synced between the compute
+node and ``CLUSTER_ISAACLAB_DIR``.
 ``[profile]`` is an optional argument that specifies which singularity image corresponding to the container profile
 will be used. If no profile is specified, the default profile ``base`` will be used. The profile has be defined
@@ -199,14 +205,6 @@ ANYmal rough terrain locomotion training can be executed with the following comm
 The above will, in addition, also render videos of the training progress and store them under ``isaaclab/logs`` directory.
-.. note::
-The ``./docker/cluster/cluster_interface.sh job`` command will copy the latest changes in your code to the cluster. However,
-it will not delete any files that have been deleted locally. These files will still exist on the cluster
-which can lead to issues. In this case, we recommend removing the ``CLUSTER_ISAACLAB_DIR`` directory on
-the cluster and re-run the command.
 .. _Singularity: https://docs.sylabs.io/guides/2.6/user-guide/index.html
 .. _ETH Zurich Euler: https://scicomp.ethz.ch/wiki/Euler
 .. _PBS Official Site: https://openpbs.org/
......
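The `REMOVE_CODE_COPY_AFTER_JOB` check in `run_singularity.sh` runs the variable's value as a command, which behaves as intended for `true` and `false`. If the variable were unset or empty, the condition would expand to an empty command, which Bash treats as success, so the removal would run. One possible alternative, sketched here under that assumption, compares the value as a string and guards the path; `is_true` and `cleanup_code_copy` are hypothetical helpers and not part of the PR.

```shell
#!/usr/bin/env bash
# Sketch only: is_true and cleanup_code_copy are hypothetical helpers, not part of the PR.

# Succeeds only for an explicit truthy value; unset, empty, or anything else is false.
is_true() {
    case "${1:-}" in
        true|True|TRUE|1) return 0 ;;
        *) return 1 ;;
    esac
}

# Removes the per-job code copy only when the flag is enabled and the path
# is a non-empty, existing directory.
cleanup_code_copy() {
    local code_dir="$1"
    if is_true "${REMOVE_CODE_COPY_AFTER_JOB:-false}" \
        && [ -n "$code_dir" ] && [ -d "$code_dir" ]; then
        # "--" stops option parsing in case the path starts with a dash.
        rm -rf -- "$code_dir"
    fi
}
```

With this sketch, a job script would call `cleanup_code_copy "$1"` after the results have been synced back, and a missing `.env.cluster` entry would leave the code copy in place.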