Commit 6ef5930d (unverified), authored by Pascal Roth, committed by GitHub

Fixes cluster workflow to work with different container profiles (#486)

# Description

The cluster workflow did not work with the different container profiles and
the newly introduced names. This PR fixes the workflow and adds checks that
the requested profile can be selected. In detail:

- checks whether a profile can be selected depending on whether a
`.env.$container_profile` exists
- allows `job` to take multiple arguments, with or without a profile;
for all other commands, the second argument has to be the profile
- check if a docker image exists before building the singularity image
- check if the path for the singularity image exists on the cluster,
otherwise create it
- check if the path for orbit exists on the cluster, otherwise create it
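
The profile handling for `job` described in the list above can be sketched as follows. This is a minimal illustration, not the code from `container.sh`: the helper name `resolve_job_profile`, the temporary directory, the profile name `myprofile`, and the job arguments are all hypothetical and chosen only for the demo.

```shell
# Hypothetical helper (not part of container.sh): decide whether the first
# argument after `job` names a container profile.
# $1 = directory holding the .env.* files, $2 = candidate profile (may be empty)
# Prints the resolved profile name on stdout.
resolve_job_profile() {
    env_dir="$1"
    candidate="${2:-base}"
    # "orbit" is treated as an alias for the default profile
    if [ "$candidate" = "orbit" ]; then
        candidate="base"
    fi
    if [ -f "$env_dir/.env.$candidate" ]; then
        echo "$candidate"
    else
        # no matching env file: assume the argument is a job argument
        echo "base"
    fi
}

# Throwaway directory with two env files (names are assumptions for the demo)
demo_dir="$(mktemp -d)"
touch "$demo_dir/.env.base" "$demo_dir/.env.myprofile"

# Arguments as they might follow `job` on the command line
set -- myprofile --task Demo-Task --headless
profile="$(resolve_job_profile "$demo_dir" "$1")"
if [ "$profile" = "$1" ]; then
    shift    # the first argument was the profile, so drop it from the job args
fi
echo "profile=$profile args=$*"
```

If the first argument does not match an existing `.env.<name>` file, the sketch falls back to `base` and leaves all arguments untouched, which mirrors the behaviour described for `job` without a profile.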


## Type of change

- Bug fix (non-breaking change which fixes an issue)

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./orbit.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have run all the tests with `./orbit.sh --test` and they pass
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there

---------
Co-authored-by: Leul Tesfaye <lst26@cornell.edu>
Co-authored-by: Mayank Mittal <12863862+Mayankm96@users.noreply.github.com>
parent 3ab2acef
#!/bin/bash #!/bin/bash
echo "(run_singularity.py): Called on compute node with arguments $@" echo "(run_singularity.py): Called on compute node with container profile $1 and arguments ${@:2}"
#== #==
# Helper functions # Helper functions
...@@ -46,19 +46,14 @@ mkdir -p "$CLUSTER_ORBIT_DIR/logs" ...@@ -46,19 +46,14 @@ mkdir -p "$CLUSTER_ORBIT_DIR/logs"
touch "$CLUSTER_ORBIT_DIR/logs/.keep" touch "$CLUSTER_ORBIT_DIR/logs/.keep"
cp -r $CLUSTER_ORBIT_DIR $TMPDIR cp -r $CLUSTER_ORBIT_DIR $TMPDIR
# copy singulary image to the compute node # copy container to the compute node
folder="$TMPDIR/isaac-sim.sif" tar -xf $CLUSTER_SIF_PATH/$1.tar -C $TMPDIR
# Check if the folder exists
if [ -d "$folder" ]; then
echo "1 (run_singularity.py): Folder was already copied to local SSD."
else
tar -xf $CLUSTER_SIF_PATH/orbit.tar -C $TMPDIR
fi
# execute command in singularity container # execute command in singularity container
# NOTE: ORBIT_PATH is normally set in `orbit.sh` but we directly call the isaac-sim python because we sync the entire
# orbit directory to the compute node and remote the symbolic link to isaac-sim
singularity exec \ singularity exec \
-B $TMPDIR/docker-isaac-sim/cache/kit:${DOCKER_ISAACSIM_PATH}/kit/cache:rw \ -B $TMPDIR/docker-isaac-sim/cache/kit:${DOCKER_ISAACSIM_ROOT_PATH}/kit/cache:rw \
-B $TMPDIR/docker-isaac-sim/cache/ov:${DOCKER_USER_HOME}/.cache/ov:rw \ -B $TMPDIR/docker-isaac-sim/cache/ov:${DOCKER_USER_HOME}/.cache/ov:rw \
-B $TMPDIR/docker-isaac-sim/cache/pip:${DOCKER_USER_HOME}/.cache/pip:rw \ -B $TMPDIR/docker-isaac-sim/cache/pip:${DOCKER_USER_HOME}/.cache/pip:rw \
-B $TMPDIR/docker-isaac-sim/cache/glcache:${DOCKER_USER_HOME}/.cache/nvidia/GLCache:rw \ -B $TMPDIR/docker-isaac-sim/cache/glcache:${DOCKER_USER_HOME}/.cache/nvidia/GLCache:rw \
...@@ -68,8 +63,8 @@ singularity exec \ ...@@ -68,8 +63,8 @@ singularity exec \
-B $TMPDIR/docker-isaac-sim/documents:${DOCKER_USER_HOME}/Documents:rw \ -B $TMPDIR/docker-isaac-sim/documents:${DOCKER_USER_HOME}/Documents:rw \
-B $TMPDIR/orbit:/workspace/orbit:rw \ -B $TMPDIR/orbit:/workspace/orbit:rw \
-B $CLUSTER_ORBIT_DIR/logs:/workspace/orbit/logs:rw \ -B $CLUSTER_ORBIT_DIR/logs:/workspace/orbit/logs:rw \
--nv --writable --containall $TMPDIR/orbit.sif \ --nv --writable --containall $TMPDIR/$1.sif \
bash -c "cd /workspace/orbit && /isaac-sim/python.sh ${CLUSTER_PYTHON_EXECUTABLE} $@" bash -c "export ORBIT_PATH=/workspace/orbit && cd /workspace/orbit && /isaac-sim/python.sh ${CLUSTER_PYTHON_EXECUTABLE} ${@:2}"
# copy resulting cache files back to host # copy resulting cache files back to host
cp -r $TMPDIR/docker-isaac-sim $CLUSTER_ISAAC_SIM_CACHE_DIR/.. cp -r $TMPDIR/docker-isaac-sim $CLUSTER_ISAAC_SIM_CACHE_DIR/..
......
...@@ -17,7 +17,8 @@ cat <<EOT > job.sh ...@@ -17,7 +17,8 @@ cat <<EOT > job.sh
#SBATCH --mail-user=name@mail #SBATCH --mail-user=name@mail
#SBATCH --job-name="training-$(date +"%Y-%m-%dT%H:%M")" #SBATCH --job-name="training-$(date +"%Y-%m-%dT%H:%M")"
sh "$1/docker/cluster/run_singularity.sh" "${@:2}" # Pass the container profile first to run_singularity.sh, then all arguments intended for the executed script
sh "$1/docker/cluster/run_singularity.sh" "$2" "${@:3}"
EOT EOT
sbatch < job.sh sbatch < job.sh
......
...@@ -21,13 +21,16 @@ SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )" ...@@ -21,13 +21,16 @@ SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
print_help () { print_help () {
echo -e "\nusage: $(basename "$0") [-h] [run] [start] [stop] -- Utility for handling docker in Orbit." echo -e "\nusage: $(basename "$0") [-h] [run] [start] [stop] -- Utility for handling docker in Orbit."
echo -e "\noptional arguments:" echo -e "\noptional arguments:"
echo -e "\t-h, --help Display the help content." echo -e "\t-h, --help Display the help content."
echo -e "\tstart Build the docker image and create the container in detached mode." echo -e "\tstart [profile] Build the docker image and create the container in detached mode."
echo -e "\tenter Begin a new bash process within an existing orbit container." echo -e "\tenter [profile] Begin a new bash process within an existing orbit container."
echo -e "\tcopy Copy build and logs artifacts from the container to the host machine." echo -e "\tcopy [profile] Copy build and logs artifacts from the container to the host machine."
echo -e "\tstop Stop the docker container and remove it." echo -e "\tstop [profile] Stop the docker container and remove it."
echo -e "\tpush Push the docker image to the cluster." echo -e "\tpush [profile] Push the docker image to the cluster."
echo -e "\tjob Submit a job to the cluster." echo -e "\tjob [profile] [job_args] Submit a job to the cluster."
echo -e "\n"
echo -e "[profile] is the optional container profile specification and [job_args] optional arguments specific"
echo -e "to the executed script"
echo -e "\n" >&2 echo -e "\n" >&2
} }
...@@ -64,12 +67,24 @@ check_docker_version() { ...@@ -64,12 +67,24 @@ check_docker_version() {
resolve_image_extension() { resolve_image_extension() {
# If no profile was passed, we default to 'base' # If no profile was passed, we default to 'base'
container_profile=${1:-"base"} container_profile=${1:-"base"}
# check if the second argument has to be a profile or can be a job argument instead
necessary_profile=${2:-true}
# We also default to 'base' if "orbit" is passed # We also default to 'base' if "orbit" is passed
if [ "$1" == "orbit" ]; then if [ "$1" == "orbit" ]; then
container_profile="base" container_profile="base"
fi fi
# check if a .env.$container_profile file exists
# if the argument must be a profile, the file has to exist; otherwise an info message is printed
if [ "$necessary_profile" = true ] && [ ! -f $SCRIPT_DIR/.env.$container_profile ]; then
echo "[Error] The profile '$container_profile' has no .env.$container_profile file!" >&2;
exit 1
elif [ ! -f $SCRIPT_DIR/.env.$container_profile ]; then
echo "[INFO] No .env.$container_profile found, assume second argument is no profile! Will use default container!" >&2;
container_profile="base"
fi
add_profiles="--profile $container_profile" add_profiles="--profile $container_profile"
# We will need .env.base regardless of profile # We will need .env.base regardless of profile
add_envs="--env-file .env.base" add_envs="--env-file .env.base"
...@@ -92,6 +107,24 @@ is_container_running() { ...@@ -92,6 +107,24 @@ is_container_running() {
fi fi
} }
# Checks if a docker image exists, otherwise prints an error and exits
check_image_exists() {
image_name="$1"
if ! docker image inspect $image_name &> /dev/null; then
echo "[Error] The '$image_name' image does not exist!" >&2;
exit 1
fi
}
# Check if the singularity image exists on the remote host, otherwise print warning and exit
check_singularity_image_exists() {
image_name="$1"
if ! ssh "$CLUSTER_LOGIN" "[ -f $CLUSTER_SIF_PATH/$image_name.tar ]"; then
echo "[Error] The '$image_name' image does not exist on the remote host $CLUSTER_LOGIN!" >&2;
exit 1
fi
}
#== #==
# Main # Main
#== #==
...@@ -111,7 +144,27 @@ fi ...@@ -111,7 +144,27 @@ fi
# parse arguments # parse arguments
mode="$1" mode="$1"
resolve_image_extension $2 profile_arg="$2" # Capture the second argument as the potential profile argument
# Check mode argument and resolve the container profile
case $mode in
build|start|enter|copy|stop|push)
resolve_image_extension "$profile_arg" true
;;
job)
resolve_image_extension "$profile_arg" false
;;
*)
# Not recognized mode
echo "[Error] Invalid command provided: $mode"
print_help
exit 1
;;
esac
# Produces a nice print statement stating which container profile is being used
echo "[INFO] Using container profile: $container_profile"
# resolve mode # resolve mode
case $mode in case $mode in
start) start)
...@@ -169,43 +222,53 @@ case $mode in ...@@ -169,43 +222,53 @@ case $mode in
if ! command -v apptainer &> /dev/null; then if ! command -v apptainer &> /dev/null; then
install_apptainer install_apptainer
fi fi
# Check if Docker image exists
check_image_exists orbit-$container_profile:latest
# Check if Docker version is greater than 25 # Check if Docker version is greater than 25
check_docker_version check_docker_version
# Check if .env.base file exists # source env file to get cluster login and path information
if [ -f $SCRIPT_DIR/.env.base ]; then source $SCRIPT_DIR/.env.base
# source env file to get cluster login and path information # make sure exports directory exists
source $SCRIPT_DIR/.env.base mkdir -p /$SCRIPT_DIR/exports
# clear old exports # clear old exports for selected profile
rm -rf /$SCRIPT_DIR/exports rm -rf /$SCRIPT_DIR/exports/orbit-$container_profile*
mkdir -p /$SCRIPT_DIR/exports # create singularity image
# create singularity image # NOTE: we create the singularity image as non-root user to allow for more flexibility. If this causes
# NOTE: we create the singularity image as non-root user to allow for more flexibility. If this causes # issues, remove the --fakeroot flag and open an issue on the orbit repository.
# issues, remove the --fakeroot flag and open an issue on the orbit repository. cd /$SCRIPT_DIR/exports
cd /$SCRIPT_DIR/exports APPTAINER_NOHTTPS=1 apptainer build --sandbox --fakeroot orbit-$container_profile.sif docker-daemon://orbit-$container_profile:latest
APPTAINER_NOHTTPS=1 apptainer build --sandbox --fakeroot orbit.sif docker-daemon://orbit:latest # tar image (faster to send single file as opposed to directory with many files)
# tar image and send to cluster tar -cvf /$SCRIPT_DIR/exports/orbit-$container_profile.tar orbit-$container_profile.sif
tar -cvf /$SCRIPT_DIR/exports/orbit.tar orbit.sif # make sure target directory exists
scp /$SCRIPT_DIR/exports/orbit.tar $CLUSTER_LOGIN:$CLUSTER_SIF_PATH/orbit.tar ssh $CLUSTER_LOGIN "mkdir -p $CLUSTER_SIF_PATH"
else # send image to cluster
echo "[Error]: ".env.base" file not found." scp $SCRIPT_DIR/exports/orbit-$container_profile.tar $CLUSTER_LOGIN:$CLUSTER_SIF_PATH/orbit-$container_profile.tar
fi
;; ;;
job) job)
# Check if .env file exists source $SCRIPT_DIR/.env.base
if [ -f $SCRIPT_DIR/.env.base ]; then # Check if singularity image exists on the remote host
# Sync orbit code check_singularity_image_exists orbit-$container_profile
echo "[INFO] Syncing orbit code..." # make sure target directory exists
source $SCRIPT_DIR/.env.base ssh $CLUSTER_LOGIN "mkdir -p $CLUSTER_ORBIT_DIR"
rsync -rh --exclude="*.git*" --filter=':- .dockerignore' /$SCRIPT_DIR/.. $CLUSTER_LOGIN:$CLUSTER_ORBIT_DIR # Sync orbit code
# execute job script echo "[INFO] Syncing orbit code..."
echo "[INFO] Executing job script..." rsync -rh --exclude="*.git*" --filter=':- .dockerignore' /$SCRIPT_DIR/.. $CLUSTER_LOGIN:$CLUSTER_ORBIT_DIR
ssh $CLUSTER_LOGIN "cd $CLUSTER_ORBIT_DIR && sbatch $CLUSTER_ORBIT_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ORBIT_DIR" "${@:2}" # execute job script
echo "[INFO] Executing job script..."
# check whether the second argument is a profile or a job argument
if [ "$profile_arg" == "$container_profile" ] ; then
# if the second argument is a profile, we have to shift the arguments
echo "[INFO] Arguments passed to job script ${@:3}"
ssh $CLUSTER_LOGIN "cd $CLUSTER_ORBIT_DIR && sbatch $CLUSTER_ORBIT_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ORBIT_DIR" "orbit-$container_profile" "${@:3}"
else else
echo "[Error]: ".env.base" file not found." # if the second argument is a job argument, we have to shift only one argument
echo "[INFO] Arguments passed to job script ${@:2}"
ssh $CLUSTER_LOGIN "cd $CLUSTER_ORBIT_DIR && sbatch $CLUSTER_ORBIT_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ORBIT_DIR" "orbit-$container_profile" "${@:2}"
fi fi
;; ;;
*) *)
echo "[Error] Invalid argument provided: $1" # Not recognized mode
echo "[Error] Invalid command provided: $mode"
print_help print_help
exit 1 exit 1
;; ;;
......
...@@ -91,11 +91,13 @@ To export to a singularity image, execute the following command: ...@@ -91,11 +91,13 @@ To export to a singularity image, execute the following command:
.. code:: bash .. code:: bash
./docker/container.sh push ./docker/container.sh push [profile]
This command will create a singularity image under ``docker/exports`` directory and This command will create a singularity image under ``docker/exports`` directory and
upload it to the defined location on the cluster. Be aware that creating the singularity upload it to the defined location on the cluster. Be aware that creating the singularity
image can take a while. image can take a while.
``[profile]`` is an optional argument that specifies the container profile to be used. If no profile is
specified, the default profile ``base`` will be used.
.. note:: .. note::
By default, the singularity image is created without root access by providing the ``--fakeroot`` flag to By default, the singularity image is created without root access by providing the ``--fakeroot`` flag to
...@@ -141,13 +143,18 @@ To submit a job on the cluster, the following command can be used: ...@@ -141,13 +143,18 @@ To submit a job on the cluster, the following command can be used:
.. code:: bash .. code:: bash
./docker/container.sh job "argument1" "argument2" ... ./docker/container.sh job [profile] "argument1" "argument2" ...
This command will copy the latest changes in your code to the cluster and submit a job. Please ensure that This command will copy the latest changes in your code to the cluster and submit a job. Please ensure that
your Python executable's output is stored under ``orbit/logs`` as this directory will be copied again your Python executable's output is stored under ``orbit/logs`` as this directory will be copied again
from the compute node to ``CLUSTER_ORBIT_DIR``. from the compute node to ``CLUSTER_ORBIT_DIR``.
The training arguments anove are passed to the Python executable. As an example, the standard ``[profile]`` is an optional argument that specifies which singularity image corresponding to the container profile
will be used. If no profile is specified, the default profile ``base`` will be used. The profile has to be defined
directly after the ``job`` command. All other arguments are passed to the Python executable. If no profile is
defined, all arguments are passed to the Python executable.
The training arguments are passed to the Python executable. As an example, the standard
ANYmal rough terrain locomotion training can be executed with the following command: ANYmal rough terrain locomotion training can be executed with the following command:
.. code:: bash .. code:: bash
......