Unverified Commit c3c7f6bd authored by Pascal Roth, committed by GitHub

Adds steps for deployment on SLURM clusters (#273)

# Description

Adds two options to the `docker/container.sh` file:

- `push`: exports the latest Docker image to a Singularity
image and copies it to the compute cluster via `scp`
- `job`: automatically launches a job on a SLURM compute
cluster

Had to make minimal changes to the `Dockerfile`, the
`docker-compose.yaml`, and the `.env` file, which at the moment contains the
parameters for these operations. The job settings can be configured
in `docker/cluster/submit_job.sh`.

Currently, this setup has only been tested on the ETH Zurich Euler cluster.

Fixes #146

## Type of change

- New feature (non-breaking change which adds functionality)

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./orbit.sh --format`
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [x] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there
parent aa74b6f5
...@@ -12,7 +12,7 @@ docs/ ...@@ -12,7 +12,7 @@ docs/
**/videos/* **/videos/*
*.tmp *.tmp
# ignore docker # ignore docker
docker/ docker/exports/
# ignore __pycache__ # ignore __pycache__
**/__pycache__/ **/__pycache__/
**/*.egg-info/ **/*.egg-info/
......
...@@ -21,6 +21,10 @@ ...@@ -21,6 +21,10 @@
**/*.pyc **/*.pyc
**/*.pb **/*.pb
# Docker/Singularity
**/*.sif
docker/exports/
# IDE # IDE
**/.idea/ **/.idea/
**/.vscode/ **/.vscode/
......
...@@ -6,3 +6,21 @@ ISAACSIM_VERSION=2023.1.0-hotfix.1 ...@@ -6,3 +6,21 @@ ISAACSIM_VERSION=2023.1.0-hotfix.1
DOCKER_ISAACSIM_PATH=/isaac-sim DOCKER_ISAACSIM_PATH=/isaac-sim
# Docker user directory - by default this is the root user's home directory # Docker user directory - by default this is the root user's home directory
DOCKER_USER_HOME=/root DOCKER_USER_HOME=/root
###
# Cluster specific settings
###
# Docker cache dir for Isaac Sim (has to end on docker-isaac-sim)
# e.g. /cluster/scratch/$USER/docker-isaac-sim
CLUSTER_ISAAC_SIM_CACHE_DIR=/some/path/on/cluster/docker-isaac-sim
# Orbit directory on the cluster (has to end on orbit)
# e.g. /cluster/home/$USER/orbit
CLUSTER_ORBIT_DIR=/some/path/on/cluster/orbit
# Cluster login
CLUSTER_LOGIN=username@cluster_ip
# Cluster scratch directory to store the SIF file
# e.g. /cluster/scratch/$USER
CLUSTER_SIF_PATH=/some/path/on/cluster/
# Python executable within orbit directory to run with the submitted job
CLUSTER_PYTHON_EXECUTABLE=source/standalone/workflows/rsl_rl/train.py
...@@ -20,7 +20,8 @@ LABEL description="Dockerfile for building and running the Orbit framework insid ...@@ -20,7 +20,8 @@ LABEL description="Dockerfile for building and running the Orbit framework insid
# Arguments # Arguments
# Path to Isaac Sim root folder # Path to Isaac Sim root folder
ARG ISAACSIM_PATH ARG ISAACSIM_PATH
ARG ISAACSIM_VERSION # Path to the Docker User Home
ARG DOCKER_USER_HOME
# Set environment variables # Set environment variables
ENV LANG=C.UTF-8 ENV LANG=C.UTF-8
...@@ -36,21 +37,30 @@ RUN apt-get update && apt-get install -y --no-install-recommends \ ...@@ -36,21 +37,30 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
apt -y autoremove && apt clean autoclean && \ apt -y autoremove && apt clean autoclean && \
rm -rf /var/lib/apt/lists/* rm -rf /var/lib/apt/lists/*
# FIXME: Only necessary for streaming until this fix is properly rolled out by NVIDIA after Isaac Sim2023.1 release # Copy the orbit directory (files to exclude are defined in .dockerignore)
# Ref: https://forums.developer.nvidia.com/t/running-a-standalone-example-with-gui-in-docker-container/248147/3
RUN if [[ "${ISAACSIM_VERSION}" == "2022.2"* ]]; then \
sed -i 's/\("omni.isaac.quadruped"\s=\s{}\)/"omni.isaac.quadruped" = {order = 10}/g' \
${ISAACSIM_PATH}/apps/omni.isaac.sim.python.kit; \
fi
# Copy the orbit directory
COPY ../ ${ORBIT_PATH} COPY ../ ${ORBIT_PATH}
# Delete the logs directory
RUN rm -rf ${ORBIT_PATH}/logs
# Set up a symbolic link between the installed Isaac Sim root folder and _isaac_sim in the orbit directory # Set up a symbolic link between the installed Isaac Sim root folder and _isaac_sim in the orbit directory
RUN ln -sf ${ISAACSIM_PATH} ${ORBIT_PATH}/_isaac_sim RUN ln -sf ${ISAACSIM_PATH} ${ORBIT_PATH}/_isaac_sim
# for singularity usage, create the directories that will be bound
RUN mkdir -p ${ISAACSIM_PATH}/kit/cache && \
mkdir -p ${DOCKER_USER_HOME}/.cache/ov && \
mkdir -p ${DOCKER_USER_HOME}/.cache/pip && \
mkdir -p ${DOCKER_USER_HOME}/.cache/nvidia/GLCache && \
mkdir -p ${DOCKER_USER_HOME}/.nv/ComputeCache && \
mkdir -p ${DOCKER_USER_HOME}/.nvidia-omniverse/logs && \
mkdir -p ${DOCKER_USER_HOME}/.local/share/ov/data && \
mkdir -p ${DOCKER_USER_HOME}/Documents
# for singularity usage, create NVIDIA binary placeholders
RUN touch /bin/nvidia-smi && \
touch /bin/nvidia-debugdump && \
touch /bin/nvidia-persistenced && \
touch /bin/nvidia-cuda-mps-control && \
touch /bin/nvidia-cuda-mps-server && \
touch /etc/localtime
# installing Orbit dependencies # installing Orbit dependencies
RUN ${ORBIT_PATH}/orbit.sh --install --extra RUN ${ORBIT_PATH}/orbit.sh --install --extra
# aliasing orbit.sh and python for convenience # aliasing orbit.sh and python for convenience
......
#!/bin/bash
echo "(run_singularity.sh): Called on compute node with arguments $@"
#==
# Helper functions
#==
setup_directories() {
# Check and create directories
for dir in \
"${CLUSTER_ISAAC_SIM_CACHE_DIR}/cache/kit" \
"${CLUSTER_ISAAC_SIM_CACHE_DIR}/cache/ov" \
"${CLUSTER_ISAAC_SIM_CACHE_DIR}/cache/pip" \
"${CLUSTER_ISAAC_SIM_CACHE_DIR}/cache/glcache" \
"${CLUSTER_ISAAC_SIM_CACHE_DIR}/cache/computecache" \
"${CLUSTER_ISAAC_SIM_CACHE_DIR}/logs" \
"${CLUSTER_ISAAC_SIM_CACHE_DIR}/data" \
"${CLUSTER_ISAAC_SIM_CACHE_DIR}/documents"; do
if [ ! -d "$dir" ]; then
mkdir -p "$dir"
echo "Created directory: $dir"
fi
done
}
#==
# Main
#==
# get script directory
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
# load variables to set the orbit path on the cluster
source $SCRIPT_DIR/../.env
# copy singularity image to the compute node
# (the tar archive created by `container.sh push` unpacks to orbit.sif)
folder="$TMPDIR/orbit.sif"
# Check if the folder exists
if [ -d "$folder" ]; then
echo "(run_singularity.sh): Folder was already copied to local SSD."
else
tar -xf $CLUSTER_SIF_PATH/orbit.tar -C $TMPDIR
fi
# make sure that all directories exists in cache directory
setup_directories
# copy all cache files
cp -r $CLUSTER_ISAAC_SIM_CACHE_DIR $TMPDIR
# copy orbit source code
cp -r $CLUSTER_ORBIT_DIR $TMPDIR
# execute command in singularity container
singularity exec \
-B $TMPDIR/docker-isaac-sim/cache/kit:${DOCKER_ISAACSIM_PATH}/kit/cache:rw \
-B $TMPDIR/docker-isaac-sim/cache/ov:${DOCKER_USER_HOME}/.cache/ov:rw \
-B $TMPDIR/docker-isaac-sim/cache/pip:${DOCKER_USER_HOME}/.cache/pip:rw \
-B $TMPDIR/docker-isaac-sim/cache/glcache:${DOCKER_USER_HOME}/.cache/nvidia/GLCache:rw \
-B $TMPDIR/docker-isaac-sim/cache/computecache:${DOCKER_USER_HOME}/.nv/ComputeCache:rw \
-B $TMPDIR/docker-isaac-sim/logs:${DOCKER_USER_HOME}/.nvidia-omniverse/logs:rw \
-B $TMPDIR/docker-isaac-sim/data:${DOCKER_USER_HOME}/.local/share/ov/data:rw \
-B $TMPDIR/docker-isaac-sim/documents:${DOCKER_USER_HOME}/Documents:rw \
-B $TMPDIR/orbit:/workspace/orbit:rw \
--nv --writable --containall $TMPDIR/orbit.sif \
bash -c "cd /workspace/orbit && /isaac-sim/python.sh ${CLUSTER_PYTHON_EXECUTABLE} $@"
# copy orbit logs back to host
cp -r $TMPDIR/orbit/logs $CLUSTER_ORBIT_DIR
# copy resulting cache files back to host
cp -r $TMPDIR/docker-isaac-sim $CLUSTER_ISAAC_SIM_CACHE_DIR/..
echo "(run_singularity.sh): Return"
#!/bin/bash
# in the case you need to load specific modules on the cluster, add them here
# e.g., `module load eth_proxy`
# create job script with compute demands
### MODIFY HERE FOR YOUR JOB ###
cat <<EOT > job.sh
#!/bin/bash
#SBATCH -n 1
#SBATCH --cpus-per-task=8
#SBATCH --gpus=rtx_3090:1
#SBATCH --time=23:00:00
#SBATCH --mem-per-cpu=4048
#SBATCH --mail-type=END
#SBATCH --mail-user=name@mail
#SBATCH --job-name="training-$(date +"%Y-%m-%dT%H:%M")"
bash "$1/docker/cluster/run_singularity.sh" "${@:2}"
EOT
sbatch < job.sh
rm job.sh
...@@ -26,9 +26,25 @@ print_help () { ...@@ -26,9 +26,25 @@ print_help () {
echo -e "\tenter Begin a new bash process within an existing orbit container." echo -e "\tenter Begin a new bash process within an existing orbit container."
echo -e "\tcopy Copy build and logs artifacts from the container to the host machine." echo -e "\tcopy Copy build and logs artifacts from the container to the host machine."
echo -e "\tstop Stop the docker container and remove it." echo -e "\tstop Stop the docker container and remove it."
echo -e "\tpush Push the docker image to the cluster."
echo -e "\tjob Submit a job to the cluster."
echo -e "\n" >&2 echo -e "\n" >&2
} }
install_apptainer() {
# Installation procedure from here: https://apptainer.org/docs/admin/main/installation.html#install-ubuntu-packages
read -p "[INFO] Required 'apptainer' package could not be found. Would you like to install it via apt? (y/N)" app_answer
if [ "$app_answer" != "${app_answer#[Yy]}" ]; then
sudo apt update && sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:apptainer/ppa
sudo apt update && sudo apt install -y apptainer
else
echo "[INFO] Exiting because apptainer was not installed"
exit 1
fi
}
#== #==
# Main # Main
#== #==
...@@ -96,6 +112,45 @@ case $mode in ...@@ -96,6 +112,45 @@ case $mode in
docker compose --file docker-compose.yaml down docker compose --file docker-compose.yaml down
popd > /dev/null 2>&1 popd > /dev/null 2>&1
;; ;;
push)
if ! command -v apptainer &> /dev/null; then
install_apptainer
fi
# Check if .env file exists
if [ -f $SCRIPT_DIR/.env ]; then
# source env file to get cluster login and path information
source $SCRIPT_DIR/.env
# clear old exports
sudo rm -r -f /$SCRIPT_DIR/exports
mkdir -p /$SCRIPT_DIR/exports
# create singularity image
cd /$SCRIPT_DIR/exports
SINGULARITY_NOHTTPS=1 apptainer build --sandbox orbit.sif docker-daemon://orbit:latest
# tar image and send to cluster
tar -cvf /$SCRIPT_DIR/exports/orbit.tar orbit.sif
scp /$SCRIPT_DIR/exports/orbit.tar $CLUSTER_LOGIN:$CLUSTER_SIF_PATH/orbit.tar
else
echo "[Error]: '.env' file not found."
fi
;;
job)
# Check if .env file exists
if [ -f $SCRIPT_DIR/.env ]; then
# Sync orbit code
echo "[INFO] Syncing orbit code..."
source $SCRIPT_DIR/.env
rsync -rh --exclude="*.git*" --filter=':- .dockerignore' /$SCRIPT_DIR/.. $CLUSTER_LOGIN:$CLUSTER_ORBIT_DIR
# Explicitly also sync orbit_assets as long as it is still used
if [ -d /$SCRIPT_DIR/../source/extensions/omni.isaac.orbit_assets ]; then
rsync -rh --exclude="*.git*" /$SCRIPT_DIR/../source/extensions/omni.isaac.orbit_assets $CLUSTER_LOGIN:$CLUSTER_ORBIT_DIR/source/extensions
fi
# execute job script
echo "[INFO] Executing job script..."
ssh $CLUSTER_LOGIN "cd $CLUSTER_ORBIT_DIR && sbatch $CLUSTER_ORBIT_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ORBIT_DIR" "${@:2}"
else
echo "[Error]: '.env' file not found."
fi
;;
*) *)
echo "[Error] Invalid argument provided: $1" echo "[Error] Invalid argument provided: $1"
print_help print_help
......
...@@ -8,6 +8,7 @@ services: ...@@ -8,6 +8,7 @@ services:
args: args:
- ISAACSIM_VERSION=${ISAACSIM_VERSION} - ISAACSIM_VERSION=${ISAACSIM_VERSION}
- ISAACSIM_PATH=${DOCKER_ISAACSIM_PATH} - ISAACSIM_PATH=${DOCKER_ISAACSIM_PATH}
- DOCKER_USER_HOME=${DOCKER_USER_HOME}
image: orbit image: orbit
container_name: orbit container_name: orbit
env_file: env_file:
......
...@@ -27,6 +27,7 @@ For more information about the framework, please refer to the `paper <https://ar ...@@ -27,6 +27,7 @@ For more information about the framework, please refer to the `paper <https://ar
source/setup/installation source/setup/installation
source/setup/developer source/setup/developer
source/setup/docker source/setup/docker
source/setup/cluster
source/setup/sample source/setup/sample
.. toctree:: .. toctree::
......
Cluster Setup
=============
Clusters are a great way to speed up training and evaluation of learning algorithms.
While the Orbit Docker image can be used to run jobs on a cluster, many clusters only
support singularity images. This is because `singularity`_ is designed for
ease-of-use on shared multi-user systems and high performance computing (HPC) environments.
It does not require root privileges to run containers and can be used to run user-defined
containers.
Singularity is compatible with all Docker images. In this section, we describe how to
convert the Orbit Docker image into a singularity image and use it to submit jobs to a cluster.
.. attention::
Cluster setup varies across different institutions. The following instructions have been
tested on the `ETH Zurich Euler`_ cluster, which uses the SLURM workload manager.
The instructions may need to be adapted for other clusters. If you have successfully
adapted the instructions for another cluster, please consider contributing to the
documentation.
Setup Instructions
------------------
In order to export the Docker Image to a singularity image, `apptainer`_ is required.
A detailed overview of the installation procedure for ``apptainer`` can be found in its
`documentation`_. For convenience, we summarize the steps here for a local installation:
.. code:: bash
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:apptainer/ppa
sudo apt update
sudo apt install -y apptainer
For simplicity, we recommend that an SSH connection is set up between the local
development machine and the cluster. Such a connection will simplify the file transfer and prevent
the user cluster password from being requested multiple times.
Configuring the cluster parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
First, you need to configure the cluster-specific parameters in the ``docker/.env`` file.
The following describes the parameters that need to be configured:
- ``CLUSTER_ISAAC_SIM_CACHE_DIR``:
The directory on the cluster where the Isaac Sim cache is stored. This directory
has to end on ``docker-isaac-sim``. This directory will be copied to the compute node
and mounted into the singularity container. It should increase the speed of starting
the simulation.
- ``CLUSTER_ORBIT_DIR``:
The directory on the cluster where the orbit code is stored. This directory has to
end on ``orbit``. This directory will be copied to the compute node and mounted into
the singularity container. When a job is submitted, the latest local changes will
be copied to the cluster.
- ``CLUSTER_LOGIN``:
The login to the cluster. Typically, this is the user and cluster names,
e.g., ``your_user@euler.ethz.ch``.
- ``CLUSTER_SIF_PATH``:
The path on the cluster where the singularity image will be stored. The image will be
copied to the compute node but not uploaded again to the cluster when a job is submitted.
- ``CLUSTER_PYTHON_EXECUTABLE``:
The path within orbit to the Python executable that should be executed in the submitted job.
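
The two "has to end on" requirements above follow from how ``run_singularity.sh`` copies both
directories into ``$TMPDIR`` and then binds ``$TMPDIR/docker-isaac-sim`` and ``$TMPDIR/orbit``.
The following is a minimal sketch of a local sanity check, not part of the repository: the helper
name ``ends_on`` is hypothetical, and the example paths are the illustrative ones from the
comments in ``docker/.env``.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the last path component of $1 equals $2.
ends_on() {
    [ "$(basename "$1")" = "$2" ]
}

# Illustrative values; replace them with the ones from your docker/.env
CLUSTER_ISAAC_SIM_CACHE_DIR="/cluster/scratch/user/docker-isaac-sim"
CLUSTER_ORBIT_DIR="/cluster/home/user/orbit"

# Report any variable whose last path component does not match the expected name
ends_on "$CLUSTER_ISAAC_SIM_CACHE_DIR" "docker-isaac-sim" \
    || echo "[Error] CLUSTER_ISAAC_SIM_CACHE_DIR must end on docker-isaac-sim"
ends_on "$CLUSTER_ORBIT_DIR" "orbit" \
    || echo "[Error] CLUSTER_ORBIT_DIR must end on orbit"
```

Running such a check before ``push`` or ``job`` can catch a misnamed directory early, before a
job fails on the compute node because a bind path does not exist.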
Exporting to singularity image
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Next, we need to export the Docker image to a singularity image and upload
it to the cluster. This step is only required once when the first job is submitted
or when the Docker image is updated, for instance, due to an upgrade of the Isaac Sim
version or additional requirements for your project.
To export to a singularity image, execute the following command:
.. code:: bash
./docker/container.sh push
This command will create a singularity image under the ``docker/exports`` directory and
upload it to the defined location on the cluster. Be aware that creating the singularity
image can take a while.
Job Submission and Execution
----------------------------
Defining the job parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The job parameters are defined inside the ``docker/cluster/submit_job.sh`` script.
A typical SLURM operation requires specifying the number of CPUs and GPUs, the memory, and
the time limit. For more information, please check the `SLURM documentation`_.
The default configuration is as follows:
.. literalinclude:: ../../../docker/cluster/submit_job.sh
:language: bash
:lines: 12-19
:linenos:
:lineno-start: 12
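
When adjusting these values, note that ``--mem-per-cpu`` is a per-CPU figure, so the total memory
requested scales with ``--cpus-per-task``. As a rough sketch, using the defaults from
``submit_job.sh`` (the variable names below are only illustrative):

```shell
#!/bin/sh
# Values copied from the default #SBATCH lines in submit_job.sh
cpus_per_task=8      # --cpus-per-task
mem_per_cpu_mb=4048  # --mem-per-cpu (SLURM reads a bare number as megabytes)

# With a single task (-n 1), the total is simply CPUs times memory per CPU
total_mem_mb=$((cpus_per_task * mem_per_cpu_mb))
echo "Total memory requested: ${total_mem_mb} MB"
```

With the defaults this amounts to 32384 MB for the job.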
An essential requirement for the cluster is that the compute node has access to the internet at all times.
This is required to load assets from the Nucleus server. For some cluster architectures, extra modules
must be loaded to allow internet access.
For instance, on the ETH Zurich Euler cluster, the ``eth_proxy`` module needs to be loaded. This can be done
by adding the following line to the ``submit_job.sh`` script:
.. literalinclude:: ../../../docker/cluster/submit_job.sh
:language: bash
:lines: 3-5
:linenos:
:lineno-start: 3
Submitting a job
~~~~~~~~~~~~~~~~
To submit a job on the cluster, the following command can be used:
.. code:: bash
./docker/container.sh job "argument1" "argument2" ...
This command will copy the latest changes in your code to the cluster and submit a job. Please ensure that
your Python executable's output is stored under ``orbit/logs`` as this directory will be copied again
from the compute node to ``CLUSTER_ORBIT_DIR``.
The training arguments above are passed to the Python executable. As an example, the standard
ANYmal rough terrain locomotion training can be executed with the following command:
.. code:: bash
./docker/container.sh job --task Isaac-Velocity-Rough-Anymal-C-v0 --headless --video --offscreen_render
In addition to training, the above command renders videos of the training progress and stores them under the ``orbit/logs`` directory.
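
Everything after the ``job`` keyword is forwarded unchanged to the Python executable:
``container.sh`` passes ``"${@:2}"`` on to the cluster scripts. The sketch below illustrates the
same idea with the POSIX ``shift`` builtin instead of the bash slice. The function name
``forward_job_args`` is hypothetical and only stands in for ``container.sh``.

```shell
#!/bin/sh
# Hypothetical stand-in for container.sh: $1 is the mode, the rest is forwarded.
forward_job_args() {
    mode="$1"
    shift  # drop the mode; "$@" now holds only the training arguments
    if [ "$mode" = "job" ]; then
        # print one forwarded argument per line; quoting keeps spaces intact
        for arg in "$@"; do
            printf '%s\n' "$arg"
        done
    fi
}

forward_job_args job --task Isaac-Velocity-Rough-Anymal-C-v0 --headless
```

Quoting ``"$@"`` keeps arguments that contain spaces as single arguments, which is why the
arguments in the generic example above are quoted individually.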
.. note::
The ``./docker/container.sh job`` command will copy the latest changes in your code to the cluster. However,
it will not delete any files that have been deleted locally. These files will still exist on the cluster,
which can lead to issues. In this case, we recommend removing the ``CLUSTER_ORBIT_DIR`` directory on
the cluster and re-running the command.
.. _Singularity: https://docs.sylabs.io/guides/2.6/user-guide/index.html
.. _ETH Zurich Euler: https://scicomp.ethz.ch/wiki/Euler
.. _apptainer: https://apptainer.org/
.. _documentation: https://apptainer.org/docs/admin/main/installation.html#install-ubuntu-packages
.. _SLURM documentation: https://slurm.schedmd.com/sbatch.html
...@@ -176,8 +176,14 @@ update_vscode_settings() { ...@@ -176,8 +176,14 @@ update_vscode_settings() {
echo "[INFO] Setting up vscode settings..." echo "[INFO] Setting up vscode settings..."
# retrieve the python executable # retrieve the python executable
python_exe=$(extract_python_exe) python_exe=$(extract_python_exe)
# run the setup script # path to setup_vscode.py
${python_exe} ${ORBIT_PATH}/.vscode/tools/setup_vscode.py setup_vscode_script="${ORBIT_PATH}/.vscode/tools/setup_vscode.py"
# check if the file exists before attempting to run it
if [ -f "${setup_vscode_script}" ]; then
${python_exe} "${setup_vscode_script}"
else
echo "[WARNING] setup_vscode.py not found. Aborting vscode settings setup."
fi
} }
# print the usage description # print the usage description
......