{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "(lightning_experiment_tracking)=\n", "\n", "# Using Experiment Tracking Tools in LightningTrainer\n", "\n", "W&B, CometML, MLflow, and TensorBoard are all popular tools in the field of machine learning for managing, visualizing, and tracking experiments. The {class}`~ray.train.lightning.LightningTrainer` integration in Ray AIR allows you to continue using these built-in experiment tracking integrations.\n", "\n", "\n", ":::{note}\n", "This guide shows how to use the native [Logger](https://lightning.ai/docs/pytorch/stable/extensions/logging.html) integrations in PyTorch Lightning. Ray AIR also provides {ref}`experiment tracking integrations ` for all the tools mentioned in this example, but when training with LightningTrainer, we recommend sticking with the PyTorch Lightning loggers.\n", ":::\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Define your model and dataloader\n", "\n", "In this example, we create a dummy model and a dummy dataset for demonstration. No changes to your model code are needed. We report three metrics (\"train_loss\", \"metric_1\", and \"metric_2\") in the training loop; Lightning's `Logger`s capture and report them to the corresponding experiment tracking tools."
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "import os\n", "import torch\n", "import torch.nn.functional as F\n", "import pytorch_lightning as pl\n", "from torch.utils.data import TensorDataset, DataLoader\n", "\n", "# create dummy data\n", "X = torch.randn(128, 3) # 128 samples, 3 features\n", "y = torch.randint(0, 2, (128,)) # 128 binary labels\n", "\n", "# create a TensorDataset to wrap the data\n", "dataset = TensorDataset(X, y)\n", "\n", "# create a DataLoader to iterate over the dataset\n", "batch_size = 8\n", "dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Define a dummy model\n", "class DummyModel(pl.LightningModule):\n", " def __init__(self):\n", " super().__init__()\n", " self.layer = torch.nn.Linear(3, 1)\n", "\n", " def forward(self, x):\n", " return self.layer(x)\n", "\n", " def training_step(self, batch, batch_idx):\n", " x, y = batch\n", " y_hat = self(x)\n", " loss = F.binary_cross_entropy_with_logits(y_hat.flatten(), y.float())\n", "\n", " # The metrics below will be reported to Loggers\n", " self.log(\"train_loss\", loss)\n", " self.log_dict({\"metric_1\": 1 / (batch_idx + 1), \"metric_2\": batch_idx * 100})\n", " return loss\n", "\n", " def configure_optimizers(self):\n", " return torch.optim.Adam(self.parameters(), lr=1e-3)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Define your loggers\n", "\n", "For offline loggers, no changes are required in the Logger initialization.\n", "\n", "For online loggers (W&B and CometML), you need to do two things:\n", "- Set up your API keys as environment variables.\n", "- Set `rank_zero_only.rank = None` to avoid Lightning creating a new experiment run on the driver node. 
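\n", "\n", "For example, assuming you already created API keys in your W&B and Comet accounts (the values below are placeholders), you can set them as environment variables before launching training:\n", "\n", "```python\n", "import os\n", "\n", "# Placeholders: substitute your actual keys, or load them from a secrets manager.\n", "os.environ[\"WANDB_API_KEY\"] = \"<your-wandb-api-key>\"\n", "os.environ[\"COMET_API_KEY\"] = \"<your-comet-api-key>\"\n", "```\n", "\n", "The `create_loggers` function below reads these keys from the environment with `os.environ.get`.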
" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "CometLogger will be initialized in online mode\n" ] } ], "source": [ "from pytorch_lightning.loggers.wandb import WandbLogger\n", "from pytorch_lightning.loggers.comet import CometLogger\n", "from pytorch_lightning.loggers.mlflow import MLFlowLogger\n", "from pytorch_lightning.loggers.tensorboard import TensorBoardLogger\n", "from pytorch_lightning.utilities.rank_zero import rank_zero_only\n", "import wandb\n", "\n", "\n", "# A callback that logs in to W&B on each worker\n", "class WandbLoginCallback(pl.Callback):\n", "    def __init__(self, key):\n", "        self.key = key\n", "\n", "    def setup(self, trainer, pl_module, stage) -> None:\n", "        wandb.login(key=self.key)\n", "\n", "\n", "def create_loggers(name, project_name, save_dir=\"./logs\", offline=False):\n", "    # Avoid creating a new experiment run on the driver node.\n", "    rank_zero_only.rank = None\n", "\n", "    # W&B\n", "    wandb_api_key = os.environ.get(\"WANDB_API_KEY\", None)\n", "\n", "    class RayWandbLogger(WandbLogger):\n", "        # wandb.finish() ensures all artifacts get uploaded at the end of training.\n", "        def finalize(self, status):\n", "            super().finalize(status)\n", "            wandb.finish()\n", "\n", "    wandb_logger = RayWandbLogger(\n", "        name=name,\n", "        project=project_name,\n", "        # Specify a unique ID to avoid reporting to a new run after restoration.\n", "        id=\"unique_id\",\n", "        save_dir=f\"{save_dir}/wandb\",\n", "        offline=offline,\n", "    )\n", "    callbacks = [] if offline else [WandbLoginCallback(key=wandb_api_key)]\n", "\n", "    # CometML\n", "    comet_api_key = os.environ.get(\"COMET_API_KEY\", None)\n", "    comet_logger = CometLogger(\n", "        api_key=comet_api_key,\n", "        experiment_name=name,\n", "        project_name=project_name,\n", "        save_dir=f\"{save_dir}/comet\",\n", "        offline=offline,\n", "    )\n", "\n", "    # MLflow\n", "    mlflow_logger = MLFlowLogger(\n", "        run_name=name,\n", "        
experiment_name=project_name,\n", " tracking_uri=f\"file:{save_dir}/mlflow\",\n", " )\n", "\n", " # Tensorboard\n", " tensorboard_logger = TensorBoardLogger(\n", " name=name, save_dir=f\"{save_dir}/tensorboard\"\n", " )\n", "\n", " return [wandb_logger, comet_logger, mlflow_logger, tensorboard_logger], callbacks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "YOUR_SAVE_DIR = \"./logs\"\n", "loggers, callbacks = create_loggers(\n", " name=\"demo-run\", project_name=\"demo-project\", save_dir=YOUR_SAVE_DIR, offline=False\n", ")" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# FOR SMOKE TESTS\n", "loggers, callbacks = create_loggers(\n", " name=\"demo-run\", project_name=\"demo-project\", offline=True\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Train the model and view logged results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ray.air.config import RunConfig, ScalingConfig\n", "from ray.train.lightning import LightningConfigBuilder, LightningTrainer\n", "\n", "builder = LightningConfigBuilder()\n", "builder.module(cls=DummyModel)\n", "builder.trainer(\n", " max_epochs=5,\n", " accelerator=\"cpu\",\n", " logger=loggers,\n", " callbacks=callbacks,\n", " log_every_n_steps=1,\n", ")\n", "builder.fit_params(train_dataloaders=dataloader)\n", "\n", "lightning_config = builder.build()\n", "\n", "scaling_config = ScalingConfig(num_workers=4, use_gpu=False)\n", "\n", "run_config = RunConfig(\n", " name=\"ptl-exp-tracking\",\n", " storage_path=\"/tmp/ray_results\",\n", ")\n", "\n", "trainer = LightningTrainer(\n", " lightning_config=lightning_config,\n", " scaling_config=scaling_config,\n", " run_config=run_config,\n", ")\n", "\n", "trainer.fit()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now let's take a look at 
our experiment results!\n", "\n", "**W&B**\n", "![alt](https://user-images.githubusercontent.com/26745457/235216924-ed27f820-3f2e-4812-bc62-982c3a1748c7.png)\n", "\n", "\n", "**CometML**\n", "![alt](https://user-images.githubusercontent.com/26745457/235216949-72d80d7d-4460-480a-b20d-f154594507fc.png)\n", "\n", "\n", "**TensorBoard**\n", "![](https://user-images.githubusercontent.com/26745457/235227957-7c2ee93b-91ab-494c-a241-7b106cf9a5e6.png)\n", "\n", "**MLflow**\n", "![](https://user-images.githubusercontent.com/26745457/235241099-6850bcae-8843-4bbb-8268-c04b04a09e68.png)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" } }, "nbformat": 4, "nbformat_minor": 4 }