RLlib: Industry-Grade Reinforcement Learning with TF and Torch
==============================================================

**RLlib** is an open-source library for reinforcement learning (RL), offering support for
production-level, highly distributed RL workloads, while maintaining
unified and simple APIs for a large variety of industry applications.

Whether you would like to train your agents in **multi-agent** setups,
purely from **offline** (historic) datasets, or using **externally
connected simulators**, RLlib offers simple solutions for your decision making needs.

If you either have your problem coded (in Python) as an
`RL environment `_
or own lots of pre-recorded, historic behavioral data to learn from, you will be
up and running in only a few days.

RLlib is already used in production by industry leaders in many different verticals, such as
`climate control `_,
`industrial control `_,
`manufacturing and logistics `_,
`finance `_,
`gaming `_,
`automobile `_,
`robotics `_,
`boat design `_,
and many others.

You can also read about `RLlib Key Concepts `_.


Installation and Setup
----------------------

Install RLlib and run your first experiment on your laptop in seconds:

**TensorFlow:**

.. code-block:: bash

    $ conda create -n rllib python=3.8
    $ conda activate rllib
    $ pip install "ray[rllib]" tensorflow "gym[atari]" "gym[accept-rom-license]" atari_py
    $ # Run a test job:
    $ rllib train --run APPO --env CartPole-v0


**PyTorch:**

.. code-block:: bash

    $ conda create -n rllib python=3.8
    $ conda activate rllib
    $ pip install "ray[rllib]" torch "gym[atari]" "gym[accept-rom-license]" atari_py
    $ # Run a test job:
    $ rllib train --run APPO --env CartPole-v0 --torch


Algorithms Supported
----------------------

Offline RL:

- `Behavior Cloning (BC; derived from MARWIL implementation) `__
- `Conservative Q-Learning (CQL) `__
- `Critic Regularized Regression (CRR) `__
- `Importance Sampling and Weighted Importance Sampling (OPE) `__
- `Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) `__

Model-free On-policy RL:

- `Asynchronous Proximal Policy Optimization (APPO) `__
- `Decentralized Distributed Proximal Policy Optimization (DD-PPO) `__
- `Proximal Policy Optimization (PPO) `__
- `Importance Weighted Actor-Learner Architecture (IMPALA) `__
- `Advantage Actor-Critic (A2C, A3C) `__
- `Vanilla Policy Gradient (PG) `__
- `Model-agnostic Meta-Learning (MAML) `__

Model-free Off-policy RL:

- `Distributed Prioritized Experience Replay (Ape-X DQN, Ape-X DDPG) `__
- `Recurrent Replay Distributed DQN (R2D2) `__
- `Deep Q Networks (DQN, Rainbow, Parametric DQN) `__
- `Deep Deterministic Policy Gradients (DDPG, TD3) `__
- `Soft Actor Critic (SAC) `__

Model-based RL:

- `Image-only Dreamer (Dreamer) `__
- `Model-Based Meta-Policy-Optimization (MB-MPO) `__

Derivative-free algorithms:

- `Augmented Random Search (ARS) `__
- `Evolution Strategies (ES) `__

RL for recommender systems:

- `SlateQ `__

Bandits:

- `Linear Upper Confidence Bound (BanditLinUCB) `__
- `Linear Thompson Sampling (BanditLinTS) `__

Multi-agent:

- `Parameter Sharing `__
- `QMIX Monotonic Value Factorisation (QMIX, VDN, IQN) `__
- `Multi-Agent Deep Deterministic Policy Gradient (MADDPG) `__
- `Shared Critic Methods `__

Others:

- `Single-Player Alpha Zero (AlphaZero) `__
- `Curiosity (ICM: Intrinsic Curiosity Module) `__
- `Random encoders (RE3) `__
- `Fully Independent Learning `__

A list of all the algorithms can be found `here `__.
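
Every algorithm above is exposed through its own config class under ``ray.rllib.algorithms``,
so switching algorithms mostly means swapping which config you build. Below is a minimal
sketch (not part of the list above) that builds DQN for CartPole; the hyperparameter values
are illustrative assumptions, not tuned settings.

.. code-block:: python

    from ray.rllib.algorithms.dqn import DQNConfig

    # Build a DQN Algorithm for CartPole. The values below are illustrative
    # assumptions only, not recommended hyperparameters.
    config = (
        DQNConfig()
        .environment(env="CartPole-v1")
        .training(lr=0.0005, train_batch_size=32)
        .rollouts(num_rollout_workers=2)
    )
    algo = config.build()
    print(algo.train()["episode_reward_mean"])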


Quick First Experiment
----------------------

.. code-block:: python

    import gymnasium as gym
    from ray.rllib.algorithms.ppo import PPOConfig


    # Define your problem using Python and Farama-Foundation's gymnasium API:
    class ParrotEnv(gym.Env):
        """Environment in which an agent must learn to repeat the seen observations.

        Observations are float numbers indicating the to-be-repeated values,
        e.g. -1.0, 5.1, or 3.2.

        The action space is always the same as the observation space.

        Rewards are r=-abs(observation - action), for all steps.
        """

        def __init__(self, config):
            # Make the space (for actions and observations) configurable.
            self.action_space = config.get(
                "parrot_shriek_range", gym.spaces.Box(-1.0, 1.0, shape=(1, )))
            # Since actions should repeat observations, their spaces must be the
            # same.
            self.observation_space = self.action_space
            self.cur_obs = None
            self.episode_len = 0

        def reset(self, *, seed=None, options=None):
            """Resets the episode and returns the initial observation of the new one."""
            # Reset the episode len.
            self.episode_len = 0
            # Sample a random number from our observation space.
            self.cur_obs = self.observation_space.sample()
            # Return initial observation.
            return self.cur_obs, {}

        def step(self, action):
            """Takes a single step in the episode given `action`.

            Returns:
                New observation, reward, terminated flag, truncated flag, info dict (empty).
            """
            # Set `truncated` flag after 10 steps.
            self.episode_len += 1
            terminated = False
            truncated = self.episode_len >= 10
            # r = -abs(obs - action)
            reward = -sum(abs(self.cur_obs - action))
            # Set a new observation (random sample).
            self.cur_obs = self.observation_space.sample()
            return self.cur_obs, reward, terminated, truncated, {}


    # Create an RLlib Algorithm instance from a PPOConfig to learn how to
    # act in the above environment.
    config = (
        PPOConfig()
        .environment(
            # Env class to use (here: our gym.Env sub-class from above).
            env=ParrotEnv,
            # Config dict to be passed to our custom env's constructor.
            env_config={
                "parrot_shriek_range": gym.spaces.Box(-5.0, 5.0, (1, ))
            },
        )
        # Parallelize environment rollouts.
        .rollouts(num_rollout_workers=3)
    )
    # Use the config's `build()` method to construct a PPO object.
    algo = config.build()

    # Train for n iterations and report results (mean episode rewards).
    # Since each episode is 10 steps long and the optimal per-step reward is 0.0
    # (exact match between observation and action value),
    # the best achievable episode reward is also 0.0.
    for i in range(5):
        results = algo.train()
        print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}")


After training, you may want to perform action computations (inference) in your environment.
Below is a minimal example on how to do this. Also
`check out our more detailed examples here `_
(in particular for `normal models `_,
`LSTMs `_,
and `attention nets `_).


.. code-block:: python

    # Perform inference (action computations) based on given env observations.
    # Note that we are using a slightly simpler env here (-3.0 to 3.0, instead
    # of -5.0 to 5.0!), however, this should still work as the agent has
    # (hopefully) learned to "just always repeat the observation!".
    env = ParrotEnv({"parrot_shriek_range": gym.spaces.Box(-3.0, 3.0, (1, ))})
    # Get the initial observation (some value between -3.0 and 3.0).
    obs, info = env.reset()
    terminated = truncated = False
    total_reward = 0.0
    # Play one episode.
    while not terminated and not truncated:
        # Compute a single action, given the current observation
        # from the environment.
        action = algo.compute_single_action(obs)
        # Apply the computed action in the environment.
        obs, reward, terminated, truncated, info = env.step(action)
        # Sum up rewards for reporting purposes.
        total_reward += reward
    # Report results.
    print(f"Shrieked for 1 episode; total-reward={total_reward}")


For a more detailed `"60 second" example, head to our main documentation `_.


Highlighted Features
--------------------

The following is a summary of RLlib's most striking features (for an in-depth overview,
check out our `documentation `_):

The most **popular deep-learning frameworks**: `PyTorch `_ and `TensorFlow
(tf1.x/2.x static-graph/eager/traced) `_.
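
The framework is a per-config choice. A minimal sketch, assuming the ``PPOConfig`` API
shown in the quick-start example above:

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Pick the deep-learning framework per algorithm config:
    # "torch" (PyTorch), "tf2" (TensorFlow 2.x eager), or "tf" (static-graph TF).
    torch_config = PPOConfig().environment("CartPole-v1").framework("torch")
    tf2_config = PPOConfig().environment("CartPole-v1").framework("tf2")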

**Highly distributed learning**: Our RLlib algorithms (such as our "PPO" or "IMPALA")
allow you to set the ``num_rollout_workers`` config parameter, such that your workloads can run
on 100s of CPUs/nodes, thus parallelizing and speeding up learning.
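
For example, scaling out rollout collection is a one-line config change; the worker count
below is an arbitrary illustration, not a recommendation:

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # Ask for 100 parallel rollout workers; Ray schedules them across whatever
    # CPUs/nodes the (local or multi-node) cluster provides.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=100)
    )
    algo = config.build()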

**Vectorized (batched) and remote (parallel) environments**: RLlib auto-vectorizes
your ``gym.Envs`` via the ``num_envs_per_worker`` config. Environment workers can
then batch and thus significantly speed up the action-computing forward pass.
On top of that, RLlib offers the ``remote_worker_envs`` config to create
`single environments (within a vectorized one) as ray Actors `_,
thus parallelizing even the env stepping process.
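
A sketch combining both settings (the worker and env counts are illustrative assumptions):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .rollouts(
            num_rollout_workers=4,
            # Each worker steps 8 env copies in a batched (vectorized) fashion.
            num_envs_per_worker=8,
            # Additionally turn each of those env copies into its own Ray actor.
            remote_worker_envs=True,
        )
    )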

| **Multi-agent RL** (MARL): Convert your (custom) ``gym.Env`` into a multi-agent one
via a few simple steps and start training your agents in any of the following fashions:
| 1) Cooperative with `shared `_ or
`separate `_
policies and/or value functions.
| 2) Adversarial scenarios using `self-play `_
and `league-based training `_.
| 3) `Independent learning `_
of neutral/co-existing agents.
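
Below is a minimal multi-agent sketch. It assumes RLlib's example ``MultiAgentCartPole`` env
(agent ids assumed to be ``0`` and ``1``) and uses illustrative policy names; it only shows
the shape of the ``multi_agent()`` config, not a tuned setup.

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig
    from ray.rllib.examples.env.multi_agent import MultiAgentCartPole

    config = (
        PPOConfig()
        # Two agents stepping the same underlying CartPole.
        .environment(MultiAgentCartPole, env_config={"num_agents": 2})
        .multi_agent(
            # One policy per agent (use a single shared key for parameter sharing).
            policies={"policy_0", "policy_1"},
            # Map agent ids to policy ids.
            policy_mapping_fn=lambda agent_id, episode, worker, **kw: f"policy_{agent_id}",
        )
    )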


**External simulators**: Don't have your simulation running as a gym.Env in Python?
No problem! RLlib supports an external environment API and comes with a pluggable,
off-the-shelf
`client `_/
`server `_
setup that allows you to run 100s of independent simulators on the "outside"
(e.g. a Windows cloud) connecting to a central RLlib Policy-Server that learns
and serves actions. Alternatively, actions can be computed on the client side
to save on network traffic.
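
On the client side (where your external simulator runs), the flow looks roughly like the
sketch below; the server address and the use of a local CartPole as the stand-in "external"
simulator are assumptions based on RLlib's client/server examples.

.. code-block:: python

    import gymnasium as gym
    from ray.rllib.env.policy_client import PolicyClient

    # The "external" simulator -- here just a local CartPole for illustration.
    env = gym.make("CartPole-v1")
    # Connect to a running PolicyServerInput-based RLlib server
    # (address/port are assumptions; they must match the server's config).
    client = PolicyClient("http://localhost:9900", inference_mode="remote")

    obs, info = env.reset()
    episode_id = client.start_episode(training_enabled=True)
    terminated = truncated = False
    while not terminated and not truncated:
        # Ask the central server for an action ...
        action = client.get_action(episode_id, obs)
        # ... apply it locally and report the reward back for learning.
        obs, reward, terminated, truncated, info = env.step(action)
        client.log_returns(episode_id, reward)
    client.end_episode(episode_id, obs)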

**Offline RL and imitation learning/behavior cloning**: You don't have a simulator
for your particular problem, but tons of historic data recorded by a legacy (maybe
non-RL/ML) system? This branch of reinforcement learning is for you!
RLlib comes with several `offline RL `_
algorithms (*CQL*, *MARWIL*, and *DQfD*), allowing you to either purely
`behavior-clone `_
your existing system or learn how to further improve over it.
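
A sketch of a behavior-cloning setup from recorded data; the JSON input path is a
placeholder assumption for data previously written with RLlib's output writer.

.. code-block:: python

    from ray.rllib.algorithms.bc import BCConfig

    config = (
        BCConfig()
        # Path to previously recorded SampleBatch JSON files (placeholder path).
        .offline_data(input_="/tmp/my-offline-data")
        # The env is only used to infer observation/action spaces here; you can
        # alternatively specify the spaces directly in the config.
        .environment(env="CartPole-v1")
    )
    algo = config.build()
    algo.train()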


In-Depth Documentation
----------------------

For an in-depth overview of RLlib and everything it has to offer, including
hands-on tutorials of important industry use cases and workflows, head over to
our `documentation pages `_.


Cite our Paper
--------------

If you've found RLlib useful for your research, please cite our `paper `_ as follows:

.. code-block::

    @inproceedings{liang2018rllib,
        Author = {Eric Liang and
                  Richard Liaw and
                  Robert Nishihara and
                  Philipp Moritz and
                  Roy Fox and
                  Ken Goldberg and
                  Joseph E. Gonzalez and
                  Michael I. Jordan and
                  Ion Stoica},
        Title = {{RLlib}: Abstractions for Distributed Reinforcement Learning},
        Booktitle = {International Conference on Machine Learning ({ICML})},
        Year = {2018}
    }