Releases: Lightning-AI/pytorch-lightning
Lightning v2.5.5
Changes in 2.5.5
PyTorch Lightning
Changed
Fixed
- Fixed LightningCLI not using ckpt_path hyperparameters to instantiate classes (#21116)
- Fixed ModelCheckpoint callbacks by deferring step/time-triggered saves until validation metrics are available (#21106)
- Fixed a missing device id for PyTorch 2.8 (#21105)
- Fixed TQDMProgressBar not resetting correctly when using both a finite and an iterable dataloader (#21147)
- Fixed cleanup of temporary files from Tuner on crashes (#21162)
Lightning Fabric
Changed
Fixed
Full commit list: 2.5.4 -> 2.5.5
Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make Lightning ⚡ better for everyone, nice job!
In particular, we would like to thank the authors of the pull-requests above, in no particular order:
@Borda, @KAVYANSHTYAGI, @littlebullGit, @mauvilsa, @SkafteNicki, @taozhiwei
Thank you ❤️ and we hope you'll keep them coming!
Lightning v2.5.4
Changes in 2.5.4
PyTorch Lightning
Fixed
- Fixed AsyncCheckpointIO to snapshot tensors, avoiding a race with parameter mutation (#21079)
- Fixed AsyncCheckpointIO threadpool exception when calling fit or validate more than once (#20952)
- Fixed learning rate not being correctly set after using the LearningRateFinder callback (#21068)
- Fixed column misalignment in the rich model summary when using the DeepSpeed strategy (#21100)
- Fixed RichProgressBar crashing during sanity checking with a zero-length val dataloader (#21108)
Lightning Fabric
Changed
- Added support for NVIDIA H200 GPUs in get_available_flops (#20913); see the sketch below
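For context, here is a minimal, hedged sketch of querying the theoretical peak FLOPs Fabric knows for the current accelerator. It assumes get_available_flops is importable from lightning.fabric.utilities.throughput and accepts a torch.device plus a dtype; treat the import path and signature as assumptions rather than documented API.

# Hedged sketch: ask Fabric for the theoretical peak FLOPs of the current GPU.
# Assumption: get_available_flops(device, dtype) lives in
# lightning.fabric.utilities.throughput and returns None for unknown hardware.
import torch
from lightning.fabric.utilities.throughput import get_available_flops

if torch.cuda.is_available():
    flops = get_available_flops(torch.device("cuda"), torch.bfloat16)
    print(f"Estimated peak bf16 FLOPs: {flops}")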
Full commit list: 2.5.3 -> 2.5.4
Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make Lightning ⚡ better for everyone, nice job!
In particular, we would like to thank the authors of the pull-requests above, in no particular order:
@fnhirwa, @GdoongMathew, @jjh42, @littlebullGit, @SkafteNicki
Thank you ❤️ and we hope you'll keep them coming!
Lightning v2.5.3
Notable changes in this release
PyTorch Lightning
Changed
- Added save_on_exception option to the ModelCheckpoint callback (#20916); see the sketch after this list
- Allow dataloader_idx_ in log names when add_dataloader_idx=False (#20987)
- Allow returning ONNXProgram when calling to_onnx(dynamo=True) (#20811)
- Extended support for general mappings being returned from training_step when using manual optimization (#21011)
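As a rough illustration of the save_on_exception option above, here is a minimal sketch. It assumes the option is a boolean keyword argument on ModelCheckpoint that writes a checkpoint when an exception interrupts training; the directory name is arbitrary.

# Hedged sketch: ask ModelCheckpoint to also save when training crashes.
# Assumption: save_on_exception is a boolean keyword argument (per the changelog entry).
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    dirpath="checkpoints/",   # arbitrary example path
    save_on_exception=True,   # persist a checkpoint if fit() is interrupted by an exception
)
trainer = L.Trainer(max_epochs=10, callbacks=[checkpoint_cb])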
Fixed
- Fixed the Trainer to accept a CUDAAccelerator instance as accelerator with the FSDP strategy (#20964)
- Fixed progress bar console clearing for Rich 14.1+ (#21016)
- Fixed AdvancedProfiler to handle nested profiling actions for Python 3.12+ (#20809)
- Fixed rich progress bar error when resuming training (#21000)
- Fixed double iteration bug when resuming from a checkpoint (#20775)
- Fixed support for more dtypes in ModelSummary (#21034)
- Fixed metrics in RichProgressBar being updated according to the user-provided refresh_rate (#21032)
- Fixed save_last behavior in the absence of validation (#20960)
- Fixed integration between LearningRateFinder and EarlyStopping (#21056)
- Fixed gradient calculation in lr_finder for mode="exponential" (#21055)
- Fixed save_hyperparameters crashing with dataclasses using init=False fields (#21051)
Lightning Fabric
Changed
Fixed
Full commit list: 2.5.2 -> 2.5.3
Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make Lightning ⚡ better for everyone, nice job!
In particular, we would like to thank the authors of the pull-requests above, in no particular order:
@baskrahmer, @bhimrazy, @deependujha, @fnhirwa, @GdoongMathew, @jonathanking, @relativityhd, @rittik9, @SkafteNicki, @sudiptob2, @vsey, @YgLK
Thank you ❤️ and we hope you'll keep them coming!
Lightning v2.5.2
Notable changes in this release
PyTorch Lightning
Changed
- Added enable_autolog_hparams argument to the Trainer (#20593)
- Added toggled_optimizer(optimizer) method to the LightningModule, a context manager version of toggle_optimizer and untoggle_optimizer (#20771); see the sketch after this list
- For cross-device local checkpoints, instruct users to install fsspec>=2025.5.0 if unavailable (#20780)
- Check that the param is of nn.Parameter type for pruning sanitization (#20783)
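To make the toggled_optimizer addition above concrete, here is a minimal sketch of a manual-optimization module with two optimizers. The module layout and loss are illustrative only; the assumption is that toggled_optimizer accepts an optimizer returned by self.optimizers() and restores the previous requires_grad state on exit, as the changelog entry describes.

# Hedged sketch: use toggled_optimizer as a context manager during manual optimization.
import lightning as L
import torch
import torch.nn as nn


class TwoOptimizerModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False
        self.gen = nn.Linear(32, 32)   # illustrative submodules
        self.disc = nn.Linear(32, 1)

    def training_step(self, batch):
        opt_g, opt_d = self.optimizers()
        # Only the generator's parameters require grad inside this block;
        # the previous requires_grad state is restored when the block exits.
        with self.toggled_optimizer(opt_g):
            loss_g = self.gen(batch).pow(2).mean()
            self.manual_backward(loss_g)
            opt_g.step()
            opt_g.zero_grad()

    def configure_optimizers(self):
        return (
            torch.optim.SGD(self.gen.parameters(), lr=0.01),
            torch.optim.SGD(self.disc.parameters(), lr=0.01),
        )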
Fixed
- Fixed save_hyperparameters not working correctly with LightningCLI when parsing links are applied on instantiation (#20777)
- Fixed an edge case in logger_connector where the step could be a float (#20692)
- Fixed SIGTERM handling in DDP by synchronizing it to prevent deadlocks (#20825)
- Fixed case-sensitive model name (#20661)
- CLI: resolved a jsonargparse deprecation warning (#20802)
- Fixed moving check_inputs to the target device, if available, during to_torchscript (#20873)
- Fixed progress bar display to correctly handle iterable datasets and max_steps during training (#20869)
- Fixed a problem with silently supporting jsonnet (#20899)
Lightning Fabric
Changed
- Ensure correct device is used for autocast when mps is selected as Fabric accelerator (#20876)
Fixed
- Fixed TransformerEnginePrecision conversion for layers with bias=False (#20805)
Full commit list: 2.5.1 -> 2.5.2
Contributors
We thank all folks who submitted issues, features, fixes, and doc changes. It's the only way we can collectively make Lightning ⚡ better for everyone, nice job!
In particular, we would like to thank the authors of the pull-requests above, in no particular order:
@adamjstewart, @Armannas, @bandpooja, @Borda, @chanokin, @duydl, @GdoongMathew, @KAVYANSHTYAGI, @mauvilsa, @muthissar, @rustamzh, @siemdejong
Thank you ❤️ and we hope you'll keep them coming!
Lightning v2.5.1.post
Full Changelog: 2.5.1...2.5.1.post0
Lightning v2.5.1
Changes
PyTorch Lightning
Changed
- Allow LightningCLI to use a customized argument parser class (#20596)
- Changed wandb default x-axis to tensorboard's global_step when sync_tensorboard=True (#20611)
- Added a new checkpoint_path_prefix parameter to the MLflow logger, which controls the path where the MLflow artifacts for model checkpoints are stored (#20538); see the sketch after this list
- Updated the CometML logger to support the recent Comet SDK (#20275)
- Bumped testing to the latest torch 2.6 (#20509)
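As a rough illustration of the checkpoint_path_prefix parameter above, here is a minimal sketch. It assumes the parameter is a plain string prefix under which checkpoint artifacts are logged; the experiment name and prefix shown are arbitrary.

# Hedged sketch: prefix the artifact path used for model checkpoints in MLflow.
# Assumption: checkpoint_path_prefix is a plain string prefix (per the changelog entry).
import lightning as L
from lightning.pytorch.loggers import MLFlowLogger

logger = MLFlowLogger(
    experiment_name="my-experiment",            # arbitrary example name
    checkpoint_path_prefix="prod/checkpoints",  # checkpoint artifacts land under this prefix
)
trainer = L.Trainer(logger=logger, max_epochs=5)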
Fixed
- Fixed CSVLogger logging hyperparameters at every write, which increased latency (#20594)
- Fixed OverflowError when resuming from a checkpoint with an iterable dataset (#20565)
- Fixed swapped _R_co and _P to prevent a type error (#20508)
- Always call WandbLogger.experiment first in _call_setup_hook to ensure tensorboard logs can sync to wandb (#20610)
- Fixed the TBPTT example (#20528)
- Fixed test compatibility as AdamW became a subclass of Adam (#20574)
- Fixed the file extension of model checkpoints uploaded by NeptuneLogger (#20581)
- Reset the trainer variable should_stop when fit is called (#19177)
- Fixed WandbLogger to upload models from all ModelCheckpoint callbacks, not just one (#20191)
- Fixed an error when logging to a deleted MLFlow experiment (#20556)
Lightning Fabric
Changed
Removed
- Removed legacy support for lightning run model; use fabric run instead (#20588)
Full commit list: 2.5.0 -> 2.5.1
Contributors
We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make Lightning ⚡ better for everyone, nice job!
In particular, we would like to thank the authors of the pull-requests above, in no particular order:
@benglewis, @Borda, @cgebbe, @duydl, @haifeng-jin, @japdubengsub, @justusschock, @lantiga, @mauvilsa, @millskyle, @ringohoffman, @ryan597, @senarvi, @TresYap
Thank you ❤️ and we hope you'll keep them coming!
Lightning v2.5 post0
Full Changelog: 2.5.0...2.5.0.post0
Lightning v2.5
Lightning AI ⚡ is excited to announce the release of Lightning 2.5.
Lightning 2.5 comes with improvements on several fronts, with zero API changes. Our users love it stable, we keep it stable 😄.
Talking about love ❤️, the lightning, pytorch-lightning and lightning-fabric packages are collectively getting more than 10M downloads per month 😮, for a total of over 180M downloads 🤯 since the early days. It's incredible to see PyTorch Lightning getting such strong adoption across industry and the sciences.
Release 2.5 embraces PyTorch 2.5, and it marks some of its more recent directions as officially supported, namely tensor subclass-based APIs like Distributed Tensors and TorchAO, in combination with torch.compile.
Here are a couple of examples:
Distributed FP8 transformer with PyTorch Lightning
Full example here
import lightning as L
import torch
import torch.nn as nn
import torch.nn.functional as F
from lightning.pytorch.demos import Transformer, WikiText2
from lightning.pytorch.strategies import ModelParallelStrategy
from torch.distributed._composable.fsdp.fully_shard import fully_shard
from torch.utils.data import DataLoader
from torchao.float8 import Float8LinearConfig, convert_to_float8_training


class LanguageModel(L.LightningModule):
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        self.model = None

    def configure_model(self):
        if self.model is not None:
            return

        with torch.device("meta"):
            model = Transformer(
                vocab_size=self.vocab_size,
                nlayers=16,
                nhid=4096,
                ninp=1024,
                nhead=32,
            )

        float8_config = Float8LinearConfig(
            # pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly  # noqa
            pad_inner_dim=True,
        )

        def module_filter_fn(mod: torch.nn.Module, fqn: str):
            # skip the decoder: its vocabulary size is typically
            # not divisible by 16 as required by float8
            return fqn != "decoder"

        convert_to_float8_training(model, config=float8_config, module_filter_fn=module_filter_fn)

        for module in model.modules():
            if isinstance(module, (nn.TransformerEncoderLayer, nn.TransformerDecoderLayer)):
                fully_shard(module, mesh=self.device_mesh)

        fully_shard(model, mesh=self.device_mesh)

        self.model = torch.compile(model)

    def training_step(self, batch):
        input, target = batch
        output = self.model(input, target)
        loss = F.nll_loss(output, target.view(-1))
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)


def train():
    L.seed_everything(42)

    dataset = WikiText2()
    train_dataloader = DataLoader(dataset, num_workers=8, batch_size=1)

    model = LanguageModel(vocab_size=dataset.vocab_size)

    mp_strategy = ModelParallelStrategy(
        data_parallel_size=4,
        tensor_parallel_size=1,
    )

    trainer = L.Trainer(strategy=mp_strategy, max_steps=100, precision="bf16-true", accumulate_grad_batches=8)

    trainer.fit(model, train_dataloader)

    trainer.print(torch.cuda.memory_summary())


if __name__ == "__main__":
    torch.set_float32_matmul_precision("high")

    train()

Distributed FP8 transformer with Fabric
Full example here
import lightning as L
import torch
import torch.nn as nn
import torch.nn.functional as F
from lightning.fabric.strategies import ModelParallelStrategy
from lightning.pytorch.demos import Transformer, WikiText2
from torch.distributed._composable.fsdp.fully_shard import fully_shard
from torch.distributed.device_mesh import DeviceMesh
from torch.utils.data import DataLoader
from torchao.float8 import Float8LinearConfig, convert_to_float8_training
from tqdm import tqdm


def configure_model(model: nn.Module, device_mesh: DeviceMesh) -> nn.Module:
    float8_config = Float8LinearConfig(
        # pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly  # noqa
        pad_inner_dim=True,
    )

    def module_filter_fn(mod: torch.nn.Module, fqn: str):
        # skip the decoder: its vocabulary size is typically
        # not divisible by 16 as required by float8
        return fqn != "decoder"

    convert_to_float8_training(model, config=float8_config, module_filter_fn=module_filter_fn)

    for module in model.modules():
        if isinstance(module, (torch.nn.TransformerEncoderLayer, torch.nn.TransformerDecoderLayer)):
            fully_shard(module, mesh=device_mesh)

    fully_shard(model, mesh=device_mesh)

    return torch.compile(model)


def train():
    L.seed_everything(42)

    batch_size = 8
    micro_batch_size = 1
    max_steps = 100

    dataset = WikiText2()
    dataloader = DataLoader(dataset, num_workers=8, batch_size=micro_batch_size)

    with torch.device("meta"):
        model = Transformer(
            vocab_size=dataset.vocab_size,
            nlayers=16,
            nhid=4096,
            ninp=1024,
            nhead=32,
        )

    strategy = ModelParallelStrategy(data_parallel_size=4, tensor_parallel_size=1, parallelize_fn=configure_model)

    fabric = L.Fabric(precision="bf16-true", strategy=strategy)
    fabric.launch()

    model = fabric.setup(model)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    optimizer = fabric.setup_optimizers(optimizer)

    dataloader = fabric.setup_dataloaders(dataloader)

    iterable = tqdm(enumerate(dataloader), total=len(dataloader)) if fabric.is_global_zero else enumerate(dataloader)

    steps = 0

    for i, batch in iterable:
        input, target = batch

        is_accumulating = i % (batch_size // micro_batch_size) != 0

        with fabric.no_backward_sync(model, enabled=is_accumulating):
            output = model(input, target)
            loss = F.nll_loss(output, target.view(-1))
            fabric.backward(loss)

        if not is_accumulating:
            fabric.clip_gradients(model, optimizer, max_norm=1.0)
            optimizer.step()
            optimizer.zero_grad()
            steps += 1

        if fabric.is_global_zero:
            iterable.set_postfix_str(f"train_loss={loss.item():.2f}")

        if steps == max_steps:
            break

    fabric.print(torch.cuda.memory_summary())


if __name__ == "__main__":
    torch.set_float32_matmul_precision("high")

    train()

As these examples show, it's now easier than ever to take your PyTorch Lightning module and run it with FSDP2 and/or tensor parallelism in FP8 precision, using the ModelParallelStrategy we introduced in 2.4.
Also note the use of distributed tensor APIs, TorchAO APIs, and torch.compile directly in the configure_model hook (or in the parallelize function in Fabric's ModelParallelStrategy), as opposed to applying them to the LightningModule as a whole. The advantage of this approach is that you can copy-paste the parallelize functions that come with native PyTorch models directly into configure_model and get the same effect, no head-scratching involved 🤓.
Talking about head scratching, we also made a pass over the PyTorch Lightning internals and hardened the parts that keep track of progress counters during training, validation, and testing, as well as learning rate scheduling, in relation to resuming from checkpoints. We made sure there are (to the best of our knowledge) no edge cases where stopping and resuming from checkpoints can change the sequence of loops or other internal states. Fault tolerance for the win 🥳!
Alright! Feel free to take a look at the full changelog below.
And of course: the best way to use PyTorch Lightning and Fabric is through Lightning Studio ⚡. Access GPUs, train models, deploy and more with zero setup. Focus on data and models - not infrastructure.
Changes
PyTorch Lightning
Added
- Added step parameter to TensorBoardLogger.log_hyperparams to visualize changes during training (#20176); see the sketch after this list
- Added str method to datamodule (#20301)
- Added timeout to DeepSpeedStrategy (#20474)
- Added doc for Truncated Back-Propagation Through Time (#20422)
- Added FP8 + FSDP2 + torch.compile examples for PyTorch Lightning (#20440)
- Added profiling to Trainer.save_checkpoint (#20405)
- Added after_instantiate_classes hook to CLI (#20401)
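As a rough illustration of the step parameter above, here is a minimal sketch. It assumes step is an optional integer keyword argument on TensorBoardLogger.log_hyperparams, as the changelog entry suggests; the hyperparameter values are arbitrary.

# Hedged sketch: log hyperparameters at specific global steps to visualize changes over training.
# Assumption: step is an optional integer keyword argument (per the changelog entry).
from lightning.pytorch.loggers import TensorBoardLogger

logger = TensorBoardLogger("tb_logs", name="demo")
logger.log_hyperparams({"lr": 1e-3, "batch_size": 32}, step=0)     # initial values
logger.log_hyperparams({"lr": 5e-4, "batch_size": 32}, step=1000)  # after a schedule change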
Lightning 2.5 RC
2.5.0rc0 Bump to 2.5.0rc0 (#20493)
Lightning v2.4
Lightning AI ⚡ is excited to announce the release of Lightning 2.4. This is mainly a compatibility upgrade for PyTorch 2.4 and Python 3.12, with a sprinkle of a few features and bug fixes.
Did you know? The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you Lightning Studio. Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.
Changes
PyTorch Lightning
Added
- Made saving non-distributed checkpoints fully atomic (#20011)
- Added dump_stats flag to AdvancedProfiler (#19703); see the sketch after this list
- Added a verbose flag to the seed_everything() function (#20108)
- Added support for PyTorch 2.4 (#20010)
- Added support for Python 3.12 (#20078)
- The TQDMProgressBar now provides an option to retain prior training epoch bars (#19578)
- Added the count of modules in train and eval mode to the printed ModelSummary table (#20159)
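To make the dump_stats and verbose additions above concrete, here is a minimal sketch. It assumes both are boolean flags, per the changelog entries; the profiler output paths are arbitrary.

# Hedged sketch: suppress the seed printout and dump AdvancedProfiler stats to disk.
# Assumption: verbose and dump_stats are boolean flags (per the changelog entries).
import lightning as L
from lightning.pytorch.profilers import AdvancedProfiler

L.seed_everything(42, workers=True, verbose=False)  # no "Seed set to 42" log line

profiler = AdvancedProfiler(dirpath="profiler_logs", filename="perf", dump_stats=True)
trainer = L.Trainer(profiler=profiler, max_epochs=1)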
Changed
- Triggering KeyboardInterrupt (Ctrl+C) during .fit(), .evaluate(), .test() or .predict() now terminates all processes launched by the Trainer and exits the program (#19976)
- Changed the implementation of how seeds are chosen for dataloader workers when using seed_everything(..., workers=True) (#20055)
- NumPy is no longer a required dependency (#20090)
Fixed
- Avoid LightningCLI saving hyperparameters with class_path and init_args since this would be a breaking change (#20068)
- Fixed an issue that would cause too many printouts of the seed info when using seed_everything() (#20108)
- Fixed _LoggerConnector's _ResultMetric to move all registered keys to the device of the logged value if needed (#19814)
- Fixed _optimizer_to_device logic for the special 'step' key in optimizer state causing a performance regression (#20019)
- Fixed parameter counts in ModelSummary when the model has distributed parameters (DTensor) (#20163)
Lightning Fabric
Added
Changed
Fixed
Full commit list: 2.3.0 -> 2.4.0
Contributors
We thank all our contributors who submitted pull requests for features, bug fixes and documentation updates.
New Contributors
- @SamuelLarkin made their first contribution in #19969
- @liambsmith made their first contribution in #19986
- @EtayLivne made their first contribution in #19915
- @elmuz made their first contribution in #19998
- @swyo made their first contribution in #19982
- @corwinjoy made their first contribution in #20011
- @omahs made their first contribution in #19979
- @linbo0518 made their first contribution in #20040
- @01AbhiSingh made their first contribution in #20055
- @K-H-Ismail made their first contribution in #20099
- @adosar made their first contribution in #20146
- @jojje made their first contribution in #19578
Did you know?
Chuck Norris can solve NP-hard problems in polynomial time. In fact, any problem is easy when Chuck Norris solves it.