RLlib: Industry-Grade Reinforcement Learning with TF and Torch
==============================================================

**RLlib** is an open-source library for reinforcement learning (RL), offering support for
production-level, highly distributed RL workloads, while maintaining
unified and simple APIs for a large variety of industry applications.

Whether you would like to train your agents in **multi-agent** setups,
purely from **offline** (historic) datasets, or using **externally
connected simulators**, RLlib offers simple solutions for your decision making needs.

If you either have your problem coded (in Python) as an
`RL environment `_
or own lots of pre-recorded, historic behavioral data to learn from, you will be
up and running in only a few days.

RLlib is already used in production by industry leaders in many different verticals, such as
`climate control `_,
`industrial control `_,
`manufacturing and logistics `_,
`finance `_,
`gaming `_,
`automobile `_,
`robotics `_,
`boat design `_,
and many others.

You can also read about `RLlib Key Concepts `_.


Installation and Setup
----------------------

Install RLlib and run your first experiment on your laptop in seconds:

**PyTorch:**

.. code-block:: bash

    $ conda create -n rllib python=3.11
    $ conda activate rllib
    $ pip install "ray[rllib]" torch "gymnasium[atari]" "gymnasium[accept-rom-license]" atari_py
    $ # Run a test job (assuming you are in the `ray` pip-installed directory):
    $ cd rllib/examples/inference/
    $ python policy_inference_after_training.py --stop-reward=100.0


Algorithms Supported
----------------------

Model-free On-policy RL:

- `Asynchronous Proximal Policy Optimization (APPO) `__
- `Proximal Policy Optimization (PPO) `__
- `Importance Weighted Actor-Learner Architecture (IMPALA) `__

Model-free Off-policy RL:

- `Deep Q Networks (DQN, Rainbow, Parametric DQN) `__
- `Soft Actor Critic (SAC) `__

Model-based RL:

- `DreamerV3 `__

Offline RL:

- `Behavior Cloning (BC; derived from MARWIL implementation) `__
- `Conservative Q-Learning (CQL) `__
- `Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) `__

Multi-agent:

- `Parameter Sharing `__
- `Shared Critic Methods `__

Others:

- `Fully Independent Learning `__

A list of all the algorithms can be found `here `__.
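
All of the above algorithms share the same ``AlgorithmConfig``-based API used in the quick
experiment below. As a minimal, illustrative sketch (the environment name and runner count
are placeholder values, not tuned settings), switching algorithms mostly amounts to swapping
the config class:

.. code-block:: python

    from ray.rllib.algorithms.dqn import DQNConfig

    # Any of the algorithm config classes (PPOConfig, SACConfig, BCConfig, ...)
    # follows the same builder pattern as this DQN example.
    config = (
        DQNConfig()
        .environment("CartPole-v1")
        .env_runners(num_env_runners=2)
    )
    algo = config.build()
    results = algo.train()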


Quick First Experiment
----------------------

.. testcode::

    import gymnasium as gym
    from ray.rllib.algorithms.ppo import PPOConfig


    # Define your problem using Python and Farama-Foundation's gymnasium API:
    class ParrotEnv(gym.Env):
        """Environment in which an agent must learn to repeat the seen observations.

        Observations are float numbers indicating the to-be-repeated values,
        e.g. -1.0, 5.1, or 3.2.

        The action space is always the same as the observation space.

        Rewards are r=-abs(observation - action), for all steps.
        """

        def __init__(self, config):
            # Make the space (for actions and observations) configurable.
            self.action_space = config.get(
                "parrot_shriek_range", gym.spaces.Box(-1.0, 1.0, shape=(1, )))
            # Since actions should repeat observations, their spaces must be the
            # same.
            self.observation_space = self.action_space
            self.cur_obs = None
            self.episode_len = 0

        def reset(self, *, seed=None, options=None):
            """Resets the episode and returns the initial observation of the new one."""
            # Reset the episode len.
            self.episode_len = 0
            # Sample a random number from our observation space.
            self.cur_obs = self.observation_space.sample()
            # Return initial observation.
            return self.cur_obs, {}

        def step(self, action):
            """Takes a single step in the episode, given `action`.

            Returns:
                New observation, reward, terminated-flag, truncated-flag,
                info-dict (empty).
            """
            # Set the `truncated` flag after 10 steps.
            self.episode_len += 1
            terminated = False
            truncated = self.episode_len >= 10
            # r = -abs(obs - action)
            reward = -sum(abs(self.cur_obs - action))
            # Set a new observation (random sample).
            self.cur_obs = self.observation_space.sample()
            return self.cur_obs, reward, terminated, truncated, {}


    # Create an RLlib Algorithm instance from a PPOConfig to learn how to
    # act in the above environment.
    config = (
        PPOConfig()
        .environment(
            # Env class to use (here: our gym.Env sub-class from above).
            env=ParrotEnv,
            # Config dict to be passed to our custom env's constructor.
            env_config={
                "parrot_shriek_range": gym.spaces.Box(-5.0, 5.0, (1, ))
            },
        )
        # Parallelize environment sampling.
        .env_runners(num_env_runners=3)
    )
    # Use the config's `build()` method to construct a PPO object.
    algo = config.build()

    # Train for n iterations and report results (mean episode returns).
    # Each episode lasts 10 steps and the best possible per-step reward is 0.0
    # (exact match between observation and action), so the optimal episode
    # return is 0.0.
    for i in range(1):
        results = algo.train()
        print(f"Iter: {i}; avg. return={results['env_runners']['episode_return_mean']}")

.. testoutput::
    :options: +MOCK

    Iter: 0; avg. return=-41.88662799871655


After training, you may want to perform action computations (inference) in your environment.
Below is a minimal example of how to do this. Also
`check out our more detailed examples here `_
(in particular for `normal models `_,
`LSTMs `_,
and `attention nets `_).


.. testcode::

    # Perform inference (action computations) based on given env observations.
    # Note that we are using a slightly narrower env here (-3.0 to 3.0, instead
    # of -5.0 to 5.0!); however, this should still work as the agent has
    # (hopefully) learned to "just always repeat the observation!".
    env = ParrotEnv({"parrot_shriek_range": gym.spaces.Box(-3.0, 3.0, (1, ))})
    # Get the initial observation (some value between -3.0 and 3.0).
    obs, info = env.reset()
    terminated = truncated = False
    total_reward = 0.0
    # Play one episode.
    while not terminated and not truncated:
        # Compute a single action, given the current observation
        # from the environment.
        action = algo.compute_single_action(obs)
        # Apply the computed action in the environment.
        obs, reward, terminated, truncated, info = env.step(action)
        # Sum up rewards for reporting purposes.
        total_reward += reward
    # Report results.
    print(f"Shrieked for 1 episode; total-reward={total_reward}")

.. testoutput::
    :options: +MOCK

    Shrieked for 1 episode; total-reward=-0.001


For a more detailed `"60 second" example, head to our main documentation `_.


Highlighted Features
--------------------

The following is a summary of RLlib's most striking features (for an in-depth overview,
check out our `documentation `_):

The most **popular deep-learning frameworks**: `PyTorch `_ and `TensorFlow
(tf1.x/2.x static-graph/eager/traced) `_.

**Highly distributed learning**: Our RLlib algorithms (such as "PPO" or "IMPALA")
allow you to set the ``num_env_runners`` config parameter, such that your workloads can run
on hundreds of CPUs/nodes, thus parallelizing and speeding up learning.
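
As a minimal sketch (the environment and the number of runners are placeholders), scaling
out experience collection is a one-line config change:

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        # Spin up 16 EnvRunner actors that collect samples in parallel;
        # scale this number up to spread sampling across many CPUs/nodes.
        .env_runners(num_env_runners=16)
    )
    algo = config.build()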

**Vectorized (batched) and remote (parallel) environments**: RLlib auto-vectorizes
your ``gym.Envs`` via the ``num_envs_per_env_runner`` config. Environment workers can
then batch and thus significantly speed up the action-computing forward pass.
On top of that, RLlib offers the ``remote_worker_envs`` config to create
`single environments (within a vectorized one) as ray Actors `_,
thus parallelizing even the env stepping process.
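
A minimal sketch combining both settings (the values are placeholders; ``remote_worker_envs``
only pays off when a single env step is expensive):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .env_runners(
            # Each EnvRunner steps 8 sub-environments and batches their
            # observations into a single forward pass.
            num_envs_per_env_runner=8,
            # Make each sub-environment its own Ray actor, so that env
            # stepping itself also runs in parallel.
            remote_worker_envs=True,
        )
    )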

| **Multi-agent RL** (MARL): Convert your (custom) ``gym.Env`` into a multi-agent one
  via a few simple steps and start training your agents in any of the following fashions
  (a minimal config sketch follows this list):
| 1) Cooperative with `shared `_ or
  `separate `_
  policies and/or value functions.
| 2) Adversarial scenarios using `self-play `_
  and `league-based training `_.
| 3) `Independent learning `_
  of neutral/co-existing agents.
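
A minimal multi-agent sketch, assuming ``MyMultiAgentEnv`` is a hypothetical stand-in for
your own ``MultiAgentEnv`` subclass with agent IDs ``"agent_0"`` and ``"agent_1"`` (the
policy IDs are likewise illustrative):

.. code-block:: python

    from ray.rllib.algorithms.ppo import PPOConfig

    # `MyMultiAgentEnv` is a placeholder for your own MultiAgentEnv subclass.
    config = (
        PPOConfig()
        .environment(MyMultiAgentEnv)
        .multi_agent(
            # Two separate policies -> independent weights; map both agents
            # to the same policy ID instead to share parameters.
            policies={"policy_0", "policy_1"},
            policy_mapping_fn=lambda agent_id, episode, **kw: (
                "policy_0" if agent_id == "agent_0" else "policy_1"
            ),
        )
    )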


**External simulators**: Don't have your simulation running as a gym.Env in python?
No problem! RLlib supports an external environment API and comes with a pluggable,
off-the-shelf
`client `_/
`server `_
setup that allows you to run 100s of independent simulators on the "outside"
(e.g. a Windows cloud) connecting to a central RLlib Policy-Server that learns
and serves actions. Alternatively, actions can be computed on the client side
to save on network traffic.
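
The client-side loop of this setup can be sketched as follows (the server address and env
are placeholders, and a ``PolicyServerInput``-based policy server is assumed to be running
at that address; with ``inference_mode="local"``, actions are computed on the client to
save network traffic):

.. code-block:: python

    import gymnasium as gym
    from ray.rllib.env.policy_client import PolicyClient

    # Connect to a running RLlib policy server (address is a placeholder).
    client = PolicyClient("http://localhost:9900", inference_mode="local")

    env = gym.make("CartPole-v1")
    obs, info = env.reset()
    episode_id = client.start_episode(training_enabled=True)
    terminated = truncated = False
    while not terminated and not truncated:
        # Query (or locally compute) an action and report the reward back.
        action = client.get_action(episode_id, obs)
        obs, reward, terminated, truncated, info = env.step(action)
        client.log_returns(episode_id, reward)
    client.end_episode(episode_id, obs)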

**Offline RL and imitation learning/behavior cloning**: You don't have a simulator
for your particular problem, but tons of historic data recorded by a legacy (maybe
non-RL/ML) system? This branch of reinforcement learning is for you!
RLlib comes with several `offline RL `_
algorithms (*BC*, *CQL*, and *MARWIL*), allowing you to either purely
`behavior-clone `_
your existing system or learn how to further improve over it.
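
A minimal behavior-cloning sketch (the data path is a placeholder for your own recorded
experiences; the environment here is only used to provide the observation and action spaces):

.. code-block:: python

    from ray.rllib.algorithms.bc import BCConfig

    config = (
        BCConfig()
        # Env only supplies obs/action spaces; no simulator steps are taken.
        .environment("CartPole-v1")
        # Placeholder path to your recorded offline data.
        .offline_data(input_="/tmp/my_offline_data")
    )
    algo = config.build()
    algo.train()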


In-Depth Documentation
----------------------

For an in-depth overview of RLlib and everything it has to offer, including
hands-on tutorials of important industry use cases and workflows, head over to
our `documentation pages `_.


Cite our Paper
--------------

If you've found RLlib useful for your research, please cite our `paper `_ as follows:

.. code-block::

    @inproceedings{liang2018rllib,
        Author = {Eric Liang and
                  Richard Liaw and
                  Robert Nishihara and
                  Philipp Moritz and
                  Roy Fox and
                  Ken Goldberg and
                  Joseph E. Gonzalez and
                  Michael I. Jordan and
                  Ion Stoica},
        Title = {{RLlib}: Abstractions for Distributed Reinforcement Learning},
        Booktitle = {International Conference on Machine Learning ({ICML})},
        Year = {2018}
    }