Adds the multi-agent RL environment (#93)

This PR adds the interface and configuration for creating multi-agent tasks using the direct workflow.

- New feature (non-breaking change which adds functionality)
- This change requires a documentation update

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Commit a46f9348 authored Aug 17, 2024 by Toni-SM Committed by David Hoeller Sep 20, 2024
12 changed files
--- a/docs/source/api/lab/omni.isaac.lab.envs.rst
+++ b/docs/source/api/lab/omni.isaac.lab.envs.rst
@@ -20,6 +20,8 @@
    ManagerBasedRLEnvCfg
    DirectRLEnv
    DirectRLEnvCfg
+    DirectMARLEnv
+    DirectMARLEnvCfg
    ViewerCfg

 Manager Based Environment
@@ -60,6 +62,20 @@ Direct RL Environment
    :show-inheritance:
    :exclude-members: __init__, class_type

+Direct Multi-Agent RL Environment
+---------------------------------
+
+.. autoclass:: DirectMARLEnv
+    :members:
+    :inherited-members:
+    :show-inheritance:
+
+.. autoclass:: DirectMARLEnvCfg
+    :members:
+    :inherited-members:
+    :show-inheritance:
+    :exclude-members: __init__, class_type
+
 Common
 ------


--- a/docs/source/features/task_workflows.rst
+++ b/docs/source/features/task_workflows.rst
@@ -91,12 +91,13 @@ Direct Environments
 The direct-style environment aligns more closely with traditional implementations of environments,
 where a single script directly implements the reward function, observation function, resets, and all the other components
 of the environment. This approach does not require the manager classes. Instead, users are provided the complete freedom
-to implement their task through the APIs from the base class :class:`envs.DirectRLEnv`. For users migrating from the `IsaacGymEnvs`_
-and `OmniIsaacGymEnvs`_ framework, this workflow may be more familiar.
+to implement their task through the APIs from the base classes :class:`envs.DirectRLEnv` or :class:`envs.DirectMARLEnv`.
+For users migrating from the `IsaacGymEnvs`_ and `OmniIsaacGymEnvs`_ framework, this workflow may be more familiar.

 When defining an environment with the direct-style implementation, we expect the user define a single class that
-implements the entire environment. The task class should inherit from the base :class:`envs.DirectRLEnv` class and should
-have its corresponding configuration class that inherits from :class:`envs.DirectRLEnvCfg`. The task class is responsible
+implements the entire environment. The task class should inherit from the base classes :class:`envs.DirectRLEnv` or
+:class:`envs.DirectMARLEnv` and should have its corresponding configuration class that inherits from
+:class:`envs.DirectRLEnvCfg` or :class:`envs.DirectMARLEnvCfg` respectively. The task class is responsible
 for setting up the scene, processing the actions, computing the rewards, observations, resets, and termination signals.

 .. dropdown:: Example for defining the reward function for the Cartpole task using the direct-style

--- a/docs/source/tutorials/03_envs/register_rl_env_gym.rst
+++ b/docs/source/tutorials/03_envs/register_rl_env_gym.rst
@@ -53,7 +53,8 @@ are running simultaneously in the same process, and all the data is returned in
 fashion.

 Similarly, the :class:`envs.DirectRLEnv` class also inherits from the :class:`gymnasium.Env` class
-for the direct workflow.
+for the direct workflow. Although :class:`envs.DirectMARLEnv` does not inherit
+from Gymnasium, it can be registered and created in the same way.

 Using the gym registry
 ----------------------

--- a/source/extensions/omni.isaac.lab/config/extension.toml
+++ b/source/extensions/omni.isaac.lab/config/extension.toml
 [package]

 # Note: Semantic Versioning is used: https://semver.org/
-version = "0.22.12"
+version = "0.23.10"

 # Description
 title = "Isaac Lab framework for Robot Learning"

--- a/source/extensions/omni.isaac.lab/docs/CHANGELOG.rst
+++ b/source/extensions/omni.isaac.lab/docs/CHANGELOG.rst
 Changelog
 ---------

-0.22.12 (2024-09-08)
-~~~~~~~~~~~~~~~~~~~~
-
-Changed
-^^^^^^^
-
-* Moved the configuration of visualization markers for the command terms to their respective configuration classes.
-  This allows users to modify the markers for the command terms without having to modify the command term classes.
-
-
-0.22.11 (2024-09-10)
+0.23.10 (2024-09-10)
 ~~~~~~~~~~~~~~~~~~~~

 Added
@@ -20,7 +10,7 @@ Added
 * Added config class, support, and tests for MJCF conversion via standalone python scripts.


-0.22.10 (2024-09-09)
+0.23.9 (2024-09-09)
 ~~~~~~~~~~~~~~~~~~~~

 Added
@@ -32,7 +22,7 @@ Added
  file or the command line argument. This ensures that the simulation results are reproducible across different runs.


-0.22.9 (2024-09-08)
+0.23.8 (2024-09-08)
 ~~~~~~~~~~~~~~~~~~~

 Changed
@@ -42,7 +32,7 @@ Changed
  for faster processing of high dimensional input tensors.


-0.22.8 (2024-09-06)
+0.23.7 (2024-09-06)
 ~~~~~~~~~~~~~~~~~~~

 Added
@@ -53,7 +43,7 @@ Added
  instance variables instead.


-0.22.7 (2024-09-05)
+0.23.6 (2024-09-05)
 ~~~~~~~~~~~~~~~~~~~

 Fixed
@@ -63,7 +53,7 @@ Fixed
  more-intuitive to control the y-axis motion based on the right-hand rule.


-0.22.6 (2024-08-29)
+0.23.5 (2024-08-29)
 ~~~~~~~~~~~~~~~~~~~

 Added
@@ -73,7 +63,7 @@ Added
  consistent with all other cameras (equal to type "depth").


-0.22.5 (2024-08-29)
+0.23.4 (2024-09-02)
 ~~~~~~~~~~~~~~~~~~~

 Fixed
@@ -84,7 +74,7 @@ Fixed
 * Added test to check :attr:`omni.isaac.lab.sensors.RayCasterCamera.set_intrinsic_matrices`


-0.22.4 (2024-08-29)
+0.23.3 (2024-08-29)
 ~~~~~~~~~~~~~~~~~~~

 Fixed
@@ -95,7 +85,7 @@ Fixed
  which required initialization of the class to call the class-methods.


-0.22.3 (2024-08-28)
+0.23.2 (2024-08-28)
 ~~~~~~~~~~~~~~~~~~~

 Added
@@ -116,7 +106,7 @@ Fixed
  the behavior equal to the USD Camera.


-0.22.2 (2024-08-21)
+0.23.1 (2024-08-21)
 ~~~~~~~~~~~~~~~~~~~

 Changed
@@ -125,6 +115,15 @@ Changed
 * Disabled default viewport in certain headless scenarios for better performance.


+0.23.0 (2024-08-17)
+~~~~~~~~~~~~~~~~~~~
+
+Added
+^^^^^
+
+* Added direct workflow base class :class:`omni.isaac.lab.envs.DirectMARLEnv` for multi-agent environments.
+
+
 0.22.1 (2024-08-17)
 ~~~~~~~~~~~~~~~~~~~

@@ -140,7 +139,7 @@ Added
 ~~~~~~~~~~~~~~~~~~~

 Added
-^^^^^^^
+^^^^^

 * Added :mod:`~omni.isaac.lab.utils.modifiers` module to provide framework for configurable and custom
  observation data modifiers.

--- a/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/__init__.py
+++ b/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/__init__.py
@@ -20,7 +20,9 @@ There are two types of environment designing workflows:
 * **Direct**: The user implements all the necessary functionality directly into a single class
  directly without the need for additional managers.

-Based on these workflows, there are the following environment classes:
+Based on these workflows, there are the following environment classes for single and multi-agent RL:
+
+**Single-Agent RL:**

 * :class:`ManagerBasedEnv`: The manager-based workflow base environment which only provides the
  agent with the current observations and executes the actions provided by the agent.
@@ -30,6 +32,11 @@ Based on these workflows, there are the following environment classes:
 * :class:`DirectRLEnv`: The direct workflow RL task environment which provides implementations for
  implementing scene setup, computing dones, performing resets, and computing reward and observation.

+**Multi-Agent RL (MARL):**
+
+* :class:`DirectMARLEnv`: The direct workflow MARL task environment which provides implementations for
+  implementing scene setup, computing dones, performing resets, and computing reward and observation.
+
 For more information about the workflow design patterns, see the `Task Design Workflows`_ section.

 .. _`Task Design Workflows`: https://isaac-sim.github.io/IsaacLab/source/features/task_workflows.html
@@ -37,9 +44,12 @@ For more information about the workflow design patterns, see the `Task Design Wo

 from . import mdp, ui
 from .common import VecEnvObs, VecEnvStepReturn, ViewerCfg
+from .direct_marl_env import DirectMARLEnv
+from .direct_marl_env_cfg import DirectMARLEnvCfg
 from .direct_rl_env import DirectRLEnv
 from .direct_rl_env_cfg import DirectRLEnvCfg
 from .manager_based_env import ManagerBasedEnv
 from .manager_based_env_cfg import ManagerBasedEnvCfg
 from .manager_based_rl_env import ManagerBasedRLEnv
 from .manager_based_rl_env_cfg import ManagerBasedRLEnvCfg
+from .utils import multi_agent_to_single_agent, multi_agent_with_one_agent
--- a/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/common.py
+++ b/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/common.py
@@ -6,7 +6,7 @@
 from __future__ import annotations

 import torch
-from typing import Dict, Literal
+from typing import Dict, Literal, TypeVar

 from omni.isaac.lab.utils import configclass

@@ -96,3 +96,40 @@ The tuple contains batched information for each sub-environment. The information
 4. **Timeout Dones**: Whether the environment reached a timeout state, such as end of max episode length.
 5. **Extras**: A dictionary containing additional information from the environment.
 """
+
+AgentID = TypeVar("AgentID")
+"""Unique identifier for an agent within a multi-agent environment.
+
+The identifier has to be an immutable object, typically a string (e.g.: ``"agent_0"``).
+"""
+
+ObsType = TypeVar("ObsType", torch.Tensor, Dict[str, torch.Tensor])
+"""Type variable for the data type of the observation.
+"""
+
+ActionType = TypeVar("ActionType", torch.Tensor, Dict[str, torch.Tensor])
+"""Type variable for the data type of the action.
+"""
+
+StateType = TypeVar("StateType", torch.Tensor, dict)
+"""Type variable for the data type of the state.
+"""
+
+EnvStepReturn = tuple[
+    Dict[AgentID, ObsType],
+    Dict[AgentID, torch.Tensor],
+    Dict[AgentID, torch.Tensor],
+    Dict[AgentID, torch.Tensor],
+    Dict[AgentID, dict],
+]
+"""The environment signals processed at the end of each step.
+
+The tuple contains batched information for each sub-environment (keyed by the agent ID).
+The information is stored in the following order:
+
+1. **Observations**: The observations from the environment.
+2. **Rewards**: The rewards from the environment.
+3. **Terminated Dones**: Whether the environment reached a terminal state, such as task success or robot falling etc.
+4. **Timeout Dones**: Whether the environment reached a timeout state, such as end of max episode length.
+5. **Extras**: A dictionary containing additional information from the environment.
+"""
--- a/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/direct_marl_env.py
+++ b/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/direct_marl_env.py
--- a/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/direct_marl_env_cfg.py
+++ b/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/direct_marl_env_cfg.py
+# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
+# All rights reserved.
+#
+# SPDX-License-Identifier: BSD-3-Clause
+
+from dataclasses import MISSING
+
+from omni.isaac.lab.scene import InteractiveSceneCfg
+from omni.isaac.lab.sim import SimulationCfg
+from omni.isaac.lab.utils import configclass
+from omni.isaac.lab.utils.noise import NoiseModelCfg
+
+from .common import AgentID, ViewerCfg
+from .ui import BaseEnvWindow
+
+
+@configclass
+class DirectMARLEnvCfg:
+    """Configuration for a MARL environment defined with the direct workflow.
+
+    Please refer to the :class:`omni.isaac.lab.envs.direct_marl_env.DirectMARLEnv` class for more details.
+    """
+
+    # simulation settings
+    viewer: ViewerCfg = ViewerCfg()
+    """Viewer configuration. Default is ViewerCfg()."""
+
+    sim: SimulationCfg = SimulationCfg()
+    """Physics simulation configuration. Default is SimulationCfg()."""
+
+    # ui settings
+    ui_window_class_type: type | None = BaseEnvWindow
+    """The class type of the UI window. Default is :class:`BaseEnvWindow`.
+
+    If None, then no UI window is created.
+
+    Note:
+        If you want to make your own UI window, you can create a class that inherits from
+        :class:`omni.isaac.lab.envs.ui.base_env_window.BaseEnvWindow`. Then, you can set
+        this attribute to your class type.
+    """
+
+    # general settings
+    decimation: int = MISSING
+    """Number of control action updates @ sim dt per policy dt.
+
+    For instance, if the simulation dt is 0.01s and the policy dt is 0.1s, then the decimation is 10.
+    This means that the control action is updated every 10 simulation steps.
+    """
+
+    is_finite_horizon: bool = False
+    """Whether the learning task is treated as a finite or infinite horizon problem for the agent.
+    Defaults to False, which means the task is treated as an infinite horizon problem.
+
+    This flag handles the subtleties of finite and infinite horizon tasks:
+
+    * **Finite horizon**: no penalty or bootstrapping value is required by the agent for
+      running out of time. However, the environment still needs to terminate the episode after the
+      time limit is reached.
+    * **Infinite horizon**: the agent needs to bootstrap the value of the state at the end of the episode.
+      This is done by sending a time-limit (or truncated) done signal to the agent, which triggers this
+      bootstrapping calculation.
+
+    If True, then the environment is treated as a finite horizon problem and no time-out (or truncated) done signal
+    is sent to the agent. If False, then the environment is treated as an infinite horizon problem and a time-out
+    (or truncated) done signal is sent to the agent.
+
+    Note:
+        The base :class:`DirectMARLEnv` class does not use this flag directly. It is used by the environment
+        wrappers to determine what type of done signal to send to the corresponding learning agent.
+    """
+
+    episode_length_s: float = MISSING
+    """Duration of an episode (in seconds).
+
+    Based on the decimation rate and physics time step, the episode length is calculated as:
+
+    .. code-block:: python
+
+        episode_length_steps = ceil(episode_length_s / (decimation_rate * physics_time_step))
+
+    For example, if the decimation rate is 10, the physics time step is 0.01, and the episode length is 10 seconds,
+    then the episode length in steps is 100.
+    """
+
+    # environment settings
+    scene: InteractiveSceneCfg = MISSING
+    """Scene settings.
+
+    Please refer to the :class:`omni.isaac.lab.scene.InteractiveSceneCfg` class for more details.
+    """
+
+    events: object = None
+    """Event settings. Defaults to None, in which case no events are applied through the event manager.
+
+    Please refer to the :class:`omni.isaac.lab.managers.EventManager` class for more details.
+    """
+
+    num_observations: dict[AgentID, int] = MISSING
+    """The dimension of the observation space for each agent."""
+
+    num_states: int = MISSING
+    """The dimension of the state space for each environment instance.
+
+    The following values are supported:
+
+    * -1: All the observations from the different agents are automatically concatenated.
+    * 0: No state-space will be constructed (``state_space`` is None).
+      This is useful to save computational resources when the algorithm to be trained does not need it.
+    * greater than 0: Custom state-space dimension to be provided by the task implementation.
+    """
+
+    observation_noise_model: dict[AgentID, NoiseModelCfg | None] | None = None
+    """The noise model to apply to the computed observations from the environment. Default is None, which means no noise is added.
+
+    Please refer to the :class:`omni.isaac.lab.utils.noise.NoiseModel` class for more details.
+    """
+
+    num_actions: dict[AgentID, int] = MISSING
+    """The dimension of the action space for each agent."""
+
+    action_noise_model: dict[AgentID, NoiseModelCfg | None] | None = None
+    """The noise model applied to the actions provided to the environment. Default is None, which means no noise is added.
+
+    Please refer to the :class:`omni.isaac.lab.utils.noise.NoiseModel` class for more details.
+    """
+
+    possible_agents: list[AgentID] = MISSING
+    """A list of all possible agents the environment could generate.
+
+    The contents of the list cannot be modified during the entire training process.
+    """
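The episode-length formula documented for `episode_length_s` above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic: the helper name `episode_length_steps` and its parameter names are hypothetical and not part of Isaac Lab.

```python
import math


def episode_length_steps(episode_length_s: float, decimation: int, physics_dt: float) -> int:
    """Hypothetical helper: number of policy steps in one episode."""
    # One policy step spans `decimation` physics steps of `physics_dt` seconds each.
    policy_dt = decimation * physics_dt
    return math.ceil(episode_length_s / policy_dt)


# Values from the docstring: decimation 10, physics time step 0.01 s, 10 s episode.
steps = episode_length_steps(episode_length_s=10.0, decimation=10, physics_dt=0.01)
print(steps)
```

With the docstring's values this yields 100 steps, matching the documented example. Because the division is done in floating point, other combinations of values may round up by one step when the quotient lands just above an integer.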
--- a/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/utils.py
+++ b/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/utils.py
--- a/source/extensions/omni.isaac.lab/setup.py
+++ b/source/extensions/omni.isaac.lab/setup.py
@@ -25,7 +25,7 @@ INSTALL_REQUIRES = [
    "toml",
    # devices
    "hidapi",
-    # gym
+    # reinforcement learning
    "gymnasium==0.29.0",
    # procedural-generation
    "trimesh",

--- a/source/extensions/omni.isaac.lab/test/envs/test_direct_marl_env.py
+++ b/source/extensions/omni.isaac.lab/test/envs/test_direct_marl_env.py
+# Copyright (c) 2022-2024, The Isaac Lab Project Developers.
+# All rights reserved.
+#
+# SPDX-License-Identifier: BSD-3-Clause
+
+# ignore private usage of variables warning
+# pyright: reportPrivateUsage=none
+
+from __future__ import annotations
+
+"""Launch Isaac Sim Simulator first."""
+
+from omni.isaac.lab.app import AppLauncher, run_tests
+
+# Can set this to False to see the GUI for debugging
+HEADLESS = True
+
+# launch omniverse app
+app_launcher = AppLauncher(headless=HEADLESS)
+simulation_app = app_launcher.app
+
+"""Rest everything follows."""
+
+import torch
+import unittest
+
+import omni.usd
+
+from omni.isaac.lab.envs import DirectMARLEnv, DirectMARLEnvCfg
+from omni.isaac.lab.scene import InteractiveSceneCfg
+from omni.isaac.lab.utils import configclass
+
+
+@configclass
+class EmptySceneCfg(InteractiveSceneCfg):
+    """Configuration for an empty scene."""
+
+    pass
+
+
+def get_empty_base_env_cfg(device: str = "cuda:0", num_envs: int = 1, env_spacing: float = 1.0):
+    """Generate base environment config based on device"""
+
+    @configclass
+    class EmptyEnvCfg(DirectMARLEnvCfg):
+        """Configuration for the empty test environment."""
+
+        # Scene settings
+        scene: EmptySceneCfg = EmptySceneCfg(num_envs=num_envs, env_spacing=env_spacing)
+        # Basic settings
+        decimation = 1
+        possible_agents = ["agent_0", "agent_1"]
+        num_actions = {"agent_0": 1, "agent_1": 2}
+        num_observations = {"agent_0": 3, "agent_1": 4}
+        num_states = -1
+
+    return EmptyEnvCfg()
+
+
+class TestDirectMARLEnv(unittest.TestCase):
+    """Test for direct MARL env class"""
+
+    """
+    Tests
+    """
+
+    def test_initialization(self):
+        for device in ("cuda:0", "cpu"):
+            with self.subTest(device=device):
+                # create a new stage
+                omni.usd.get_context().new_stage()
+                # create environment
+                env = DirectMARLEnv(cfg=get_empty_base_env_cfg(device=device))
+                # check multi-agent config
+                self.assertEqual(env.num_agents, 2)
+                self.assertEqual(env.max_num_agents, 2)
+                # check spaces
+                self.assertEqual(env.state_space.shape, (7,))
+                self.assertEqual(len(env.observation_spaces), 2)
+                self.assertEqual(len(env.action_spaces), 2)
+                # step environment to verify setup
+                env.reset()
+                for _ in range(2):
+                    actions = {"agent_0": torch.rand((1, 1)), "agent_1": torch.rand((1, 2))}
+                    obs, reward, terminated, truncate, info = env.step(actions)
+                    env.state()
+                # close the environment
+                env.close()
+
+
+if __name__ == "__main__":
+    run_tests()
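The `(7,)` state-space shape asserted in `test_initialization` follows from `num_states = -1`, which concatenates the per-agent observations (3 + 4). A minimal sketch of that rule, assuming a hypothetical helper `resolve_state_dim` that is not part of Isaac Lab and only mirrors the three cases documented for `num_states`:

```python
from typing import Dict, Optional


def resolve_state_dim(num_states: int, num_observations: Dict[str, int]) -> Optional[int]:
    """Hypothetical helper mirroring the documented `num_states` cases."""
    if num_states == -1:
        # Concatenate the observations of all agents.
        return sum(num_observations.values())
    if num_states == 0:
        # No state space is constructed.
        return None
    if num_states > 0:
        # Custom dimension provided by the task implementation.
        return num_states
    raise ValueError(f"Unsupported num_states value: {num_states}")


# Observation dimensions taken from the test configuration above.
obs_dims = {"agent_0": 3, "agent_1": 4}
state_dim = resolve_state_dim(-1, obs_dims)
print(state_dim)
```

For the test configuration this gives a state dimension of 7. How `DirectMARLEnv` builds the actual space object is not shown in this diff, so the sketch covers only the dimension bookkeeping.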