
Conversation


@NikhilNayak-debug commented Jul 31, 2025

Summary

This PR adds a new parameter-efficient fine-tuning method called Orthogonal Subspace Fine-Tuning (OSF) to the PEFT library. OSF enables continual learning in LLMs by freezing the high-rank subspace of weight matrices and fine-tuning only the low-rank directions. This approach constrains updates to be orthogonal to previously important directions, thereby mitigating catastrophic forgetting without increasing parameter count.
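
As a rough illustration of the mechanics (a hedged sketch, not the PR's actual implementation; the helper name and variable names are hypothetical), each targeted weight matrix is split via SVD into a frozen high-rank part and a trainable low-rank remainder:

    import torch

    def split_weight(weight: torch.Tensor, k: int):
        """Illustrative only: split W into a frozen top-k subspace and a trainable remainder."""
        U, S, Vt = torch.linalg.svd(weight, full_matrices=False)
        frozen = (U[:, :k], S[:k], Vt[:k, :])      # top-k directions: kept fixed
        trainable = (U[:, k:], S[k:], Vt[k:, :])   # low-rank remainder: fine-tuned
        return frozen, trainable

The effective weight is reconstructed as W = U_high diag(S_high) V_high + U_low diag(S_low) V_low; only the second term is updated during training, with gradients projected so that updates stay orthogonal to the frozen directions.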


Issue for this PR on PEFT repository

Tracked in PEFT Issue #2648


Key Features

  • Implements a new OSFConfig, OSFModel, and tuner class under src/peft/tuners/osf/ following PEFT's standard API

  • Integrates seamlessly with the get_peft_model API:

    from peft import OSFConfig, get_peft_model
    peft_model = get_peft_model(base_model, OSFConfig(target_modules=[...]))
  • Adds utility functions for:

    • Weight matrix decomposition using SVD
    • Gradient projection onto the low-rank subspace via backward gradient hooks (see the sketch after this list)
  • Automatically enforces orthogonality constraints during training without requiring optimizer wrapping

  • Will include tests for saving, loading, and applying the OSF adapter in tests/test_custom_models.py

  • Exports relevant modules at the package level for easier use with other PEFT components
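
A minimal sketch of how such a backward hook could enforce the orthogonality constraint (hedged; the PR's actual hook management may differ, and U_high/V_high here refer to the frozen factors from the SVD split above):

    import torch

    def make_orthogonal_grad_hook(U_high: torch.Tensor, V_high: torch.Tensor):
        """Return a hook that removes gradient components lying in the frozen high-rank subspace."""
        def hook(grad: torch.Tensor) -> torch.Tensor:
            # Remove the part of the gradient spanned by the frozen left/right singular vectors,
            # so the remaining update is orthogonal to previously important directions.
            grad = grad - U_high @ (U_high.T @ grad)
            grad = grad - (grad @ V_high.T) @ V_high
            return grad
        return hook

    # Usage (illustrative): weight.register_hook(make_orthogonal_grad_hook(U_high, V_high))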


Notes

  • The current implementation does not include layerwise importance-based rank estimation (e.g., cosine similarity of inputs and activations), but can be added in future iterations
  • Merging/unmerging is not supported, as the original weights are decomposed and modified in-place
  • Compared to LoRA, OSF performs a constrained update over the original weight matrix without introducing new trainable parameters, maintaining exact model architecture post-training

Background

This implementation is based on the method described in our paper:
Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning
Paper on arXiv · Project Repository


@githubnemo (Collaborator) left a comment


Nice! Thanks for the thorough update, that's a good step forward.
A minor nit: Several files are missing the copyright notice, please make sure to include them in new source files (also make sure that they are not outdated, i.e. include the current year).

I like that you already implemented several (custom) tests, I think that's super helpful. Let's also add some tests to test_decoder_models.py and test_encoder_decoder_models.py similar to the test in test_custom_models.py when you think the implementation can move forward in testing. Let's move the skips for convolutions to testing_common.py, there are already similar exceptions in place.
Two bigger topics:

  1. ModelWithOSF seems to re-invent PEFT functionality inside PEFT, specifically the layer targeting + replacement portion. Let's streamline OSF with other tuners, i.e. have implementations for specific layers and implement inject_adapter, _create_new_module and _create_and_replace to make it easier to branch out to other layer types / quantizations. The LoRA implementation may be helpful, e.g. peft.tuners.lora.layers.LoraLayer contains specific layers for Linear and Conv*d specifics (no need to implement Conv now, of course). I can see that this conflicts with using a dict for specifying the top-k ranks per module. How about using target_modules and a single value for the topk rank (e.g., config.topk_r) which can default to None (-> uses 50% of min(shape))? Every targeted module gets that topk rank or an automatic 50% one. We could also add something like rank_pattern from LoRA to define exceptions (see lora.model.py -> _create_and_replace). WDYT?
    Example config:
OSFConfig(
  target_modules='all-linear',
  topk_r=None,
  rank_pattern={
    'q_proj': 10,
  }
)
  2. It's not possible to use more than one adapter of OSF since the base model is modified and we therefore cannot switch between adapters (could be handy in pipeline scenarios where one model is used at several places with different adapters, for example). I left a comment at decompose_weight_matrix to discuss this.

Once we're done with the general implementation I think it'd be super if we could add an experiment to the MetaMathQA comparison suite so that we can compare OSF directly to other implementations.

@NikhilNayak-debug (Author)

Once we're done with the general implementation I think it'd be super if we could add an experiment to the MetaMathQA comparison suite so that we can compare OSF directly to other implementations.

Awesome, we will definitely evaluate our method once the implementation is complete to benchmark OSF against other methods in PEFT.

@NikhilNayak-debug (Author)

  1. ModelWithOSF seems to re-invent PEFT functionality inside PEFT, specifically the layer targeting + replacement portion. Let's streamline OSF with other tuners, i.e. have implementations for specific layers and implement inject_adapter, _create_new_module and _create_and_replace to make it easier to branch out to other layer types / quantizations. The LoRA implementation may be helpful, e.g. peft.tuners.lora.layers.LoraLayer contains specific layers for Linear and Conv*d specifics (no need to implement Conv now, of course). I can see that this conflicts with using a dict for specifying the top-k ranks per module. How about using target_modules and a single value for the topk rank (e.g., config.topk_r) which can default to None (-> uses 50% of min(shape))? Every targeted module gets that topk rank or an automatic 50% one. We could also add something like rank_pattern from LoRA to define exceptions (see lora.model.py -> _create_and_replace). WDYT?
    Example config:
OSFConfig(
  target_modules='all-linear',
  topk_r=None,
  rank_pattern={
    'q_proj': 10,
  }
)

@githubnemo great suggestion! In response to the first bigger topic raised, I have implemented the minimal PEFT integration changes:

What we implemented:

  • ✅ OSF layer classes (OSFLayer, Linear) similar to LoRA's structure
  • ✅ _create_and_replace method for proper layer replacement following PEFT patterns
  • ✅ Updated config to use target_modules and effective_rank (renamed from topk_r)
  • ✅ Added rank_pattern support for per-module rank exceptions, just like LoRA

Scope decisions we made:

  • Only implemented _create_and_replace (not inject_adapter or _create_new_module) since OSF's use case only requires layer replacement as of now
  • Kept existing functionality intact - all SVD decomposition, gradient projection, and hook management preserved as is

Key files changed:

  • src/peft/tuners/osf/layer.py - New OSF layer classes
  • src/peft/tuners/osf/model.py - Added _create_and_replace method
  • src/peft/tuners/osf/config.py - Updated config format
  • src/peft/utils/constants.py - Added TRANSFORMERS_MODELS_TO_OSF_TARGET_MODULES_MAPPING

These changes integrate the OSF method modularly into PEFT.
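
For illustration, usage with the updated config might look like the following (a hedged sketch based on the parameter names described above; base_model stands for any already-loaded transformers model):

    from peft import OSFConfig, get_peft_model

    config = OSFConfig(
        target_modules=["q_proj", "v_proj"],  # modules whose weights get decomposed
        effective_rank=None,                  # None -> automatic default (half of the smaller weight dimension)
        rank_pattern={"q_proj": 10},          # per-module exceptions, analogous to LoRA's rank_pattern
    )
    peft_model = get_peft_model(base_model, config)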

@githubnemo (Collaborator) left a comment


Thanks for the detailed feedback and your changes.

I think that the re-structuring of OSFModel is almost complete and most of the comments are rather minor. As far as I can see the adhoc ModelWithOSF is replaced by OSFModel and OSFLayer and can be removed - good progress!

I think this is a good time to remove outdated code, merge with main, run make style and run the tests to see if there's still something going horribly wrong.

Let's discuss whether we want to implement the importance score now or leave it up for implementation later. If I'm not mistaken, the importance score can technically be added later since it would compute the effective rank of layers based on two new hyper-parameters, so in that sense it is modular. Since it is quite a crucial part of the paper and is touted to improve multi-task learning (arguably one of the big selling points of OSF), I wonder if it should be included from the get-go. What's your opinion on that?

Regardless, I think we can add a MetaMathQA experiment rather soon and check if there are major problems with memory consumption or runtime.


# OSF (Orthogonal Subspace Fine-tuning)

Orthogonal Subspace Fine-tuning ([OSF](https://arxiv.org/abs/2504.07097)) is a PEFT method designed for continual learning that constrains parameter updates to be orthogonal to previously important directions. This approach enables full fine-tuning while preventing catastrophic forgetting without requiring additional parameters or storing previous gradients.

Suggested change
Orthogonal Subspace Fine-tuning ([OSF](https://arxiv.org/abs/2504.07097)) is a PEFT method designed for continual learning that constrains parameter updates to be orthogonal to previously important directions. This approach enables full fine-tuning while preventing catastrophic forgetting without requiring additional parameters or storing previous gradients.
Orthogonal Subspace Fine-tuning ([OSF](https://huggingface.co/papers/2504.07097)) is a PEFT method designed for continual learning that constrains parameter updates to be orthogonal to previously important directions. This approach enables full fine-tuning while preventing catastrophic forgetting without requiring additional parameters or storing previous gradients.


### Best Practices

1. **Effective Rank Selection**: Start with `effective_rank=None` (automatic 50% rank) and adjust based on task complexity

I find "50% automatic rank" misleading since we're using 50% of the smallest weight dimension which is not necessarily equal to 50% of the rank, right?
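
For reference, the default presumably boils down to something like this (a hedged sketch; the helper name is made up):

    # Half of the smaller weight dimension; this is an upper bound on, not necessarily
    # equal to, half of the matrix rank.
    def default_effective_rank(weight):
        return min(weight.shape) // 2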

# Use with gradient checkpointing for memory efficiency
model.gradient_checkpointing_enable()

# Apply weight decay selectively

Let's add a short explanation of what the intended effect of adding weight decay to the low-rank projections is, similar to gradient checkpointing ("memory efficiency").

- Complete continual learning scenario with multiple tasks
- Demonstration of OSF's catastrophic forgetting prevention
- Configuration examples (target_modules, effective_rank, rank_pattern)
- Performance comparison with baseline methods

I think the performance comparison with baseline methods - at least for single tasks - is best done in the PEFT method comparison (MetaMathQA). Of course, feel free to provide a comparison with methods that support multi-task learning if it fits into the example without too much effort.

Comment on lines +54 to +68
if isinstance(base_layer, nn.Linear):
    in_features, out_features = base_layer.in_features, base_layer.out_features
elif hasattr(base_layer, "infeatures") and hasattr(base_layer, "outfeatures"):
    # QuantLinear
    in_features, out_features = base_layer.infeatures, base_layer.outfeatures
elif hasattr(base_layer, "input_size") and hasattr(base_layer, "output_size"):
    # Megatron ColumnParallelLinear, RowParallelLinear
    in_features, out_features = base_layer.input_size, base_layer.output_size
elif hasattr(base_layer, "in_features") and hasattr(base_layer, "out_features"):
    in_features, out_features = base_layer.in_features, base_layer.out_features
else:
    in_features, out_features = None, None
    warnings.warn(
        f"Unsupported layer type '{type(base_layer)}' encountered, proceed at your own risk.", UserWarning
    )

I haven't checked for Megatron, but isn't the weight parameter a common (and possibly more general) attribute whose shape we could use?
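
Something along these lines could serve as a more general fallback (a sketch meant to slot into the branch quoted above; it assumes the layer stores a 2D weight of shape (out_features, in_features), as nn.Linear does, which may not hold for every quantized or parallel layer):

    weight = getattr(base_layer, "weight", None)
    if weight is not None and weight.dim() == 2:
        out_features, in_features = weight.shape
    else:
        in_features, out_features = None, None
        warnings.warn(
            f"Unsupported layer type '{type(base_layer)}' encountered, proceed at your own risk.", UserWarning
        )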

Comment on lines +96 to +106
def unload(self):
    raise NotImplementedError("OSF models cannot be unloaded yet")

def merge_adapter(self, *args, **kwargs):
    raise NotImplementedError("OSF models do not support merging")

def unmerge_adapter(self, *args, **kwargs):
    raise NotImplementedError("OSF models do not support merging")

def merge_and_unload(self, *args, **kwargs):
    raise NotImplementedError("OSF models do not support merging")

{merge_and_}unload and {un}merge_adapter are still open, commenting so I don't forget :)


def _mark_only_adapters_as_trainable(self, model: nn.Module) -> None:
    for n, p in model.named_parameters():
        if "svd_params" not in n and not n.endswith(("_U_low", "_S_low", "_V_low")):

Let's also check if self.prefix is in the parameter name, to reduce the risk of overriding similarly named parameters.
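
A sketch of what that check could look like (hedged; it assumes self.prefix is defined on the OSF tuner, as it is for other PEFT tuners):

    def _mark_only_adapters_as_trainable(self, model: nn.Module) -> None:
        for n, p in model.named_parameters():
            # Treat a parameter as trainable only if it carries the OSF prefix and matches
            # one of the known low-rank factor names; freeze everything else.
            is_osf_param = self.prefix in n and ("svd_params" in n or n.endswith(("_U_low", "_S_low", "_V_low")))
            if not is_osf_param:
                p.requires_grad = False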


def __init__(self, base_layer: nn.Module, **kwargs) -> None:
    self.base_layer = base_layer
    self.effective_rank = {}

Just for my understanding (no change necessary): we diverge in naming from LoRA's r parameter here because there's still the option of adding the importance weighting, and if we added that, then:

  • effective_rank overrides importance metric, layer-wise rank
  • target and minimum rank as additional hyper params to compute the effective rank of layers according to their importance

Do I understand this correctly?

Comment on lines +118 to +127
model = get_peft_model(base_model, OSFConfig(effective_rank=8))
train_task(model, task_1_data)

# Task 2: Continue training on domain B
# OSF automatically preserves Task 1 knowledge
train_task(model, task_2_data)

# Task 3: Continue with domain C
train_task(model, task_3_data)
```

This suggests that I can train on task 2 immediately after task 1, is that true? I would have imagined that you'd need to recompute the SVD so as not to 'override' the previous task.

Comment on lines 44 to 52
svd = {
    "U_high": U[:, :k].contiguous().detach().to(device=device_local, dtype=orig_dtype),
    "S_high": S[:k].contiguous().detach().to(device=device_local, dtype=orig_dtype),
    "V_high": Vt[:k, :].contiguous().detach().to(device=device_local, dtype=orig_dtype),
    "U_low": nn.Parameter(U[:, k:].contiguous().detach().to(device=device_local, dtype=orig_dtype)),
    "S_low": nn.Parameter(S[k:].contiguous().detach().to(device=device_local, dtype=orig_dtype)),
    "V_low": nn.Parameter(Vt[k:, :].contiguous().detach().to(device=device_local, dtype=orig_dtype)),
    "rank_high": k,
}

Thank you for the detailed explanation!

The sequential dependency of later-added adapters on previous adapters removes a lot of the convenience gained by being able to remove individual adapters, I agree.

I'm OK with not implementing this.
