diff --git a/docs/source/best_of_n.md b/docs/source/best_of_n.md index 3c6fb984c1..2935c472dd 100644 --- a/docs/source/best_of_n.md +++ b/docs/source/best_of_n.md @@ -44,7 +44,7 @@ best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=o ``` There is the option of setting the generation settings (like `temperature`, `pad_token_id`) at the time of instance creation as opposed to when calling the `generate` method. -This is done by passing a `GenerationConfig` from the `transformers` library at the time of initialization +This is done by passing a [`~transformers.GenerationConfig`] from the `transformers` library at the time of initialization ```python diff --git a/docs/source/customization.md b/docs/source/customization.md index c24f5e73e6..5989858122 100644 --- a/docs/source/customization.md +++ b/docs/source/customization.md @@ -112,7 +112,7 @@ trainer.train() ## Use the accelerator cache optimizer -When training large models, you should better handle the accelerator cache by iteratively clearing it. To do so, simply pass `optimize_device_cache=True` to `DPOConfig`: +When training large models, you should better handle the accelerator cache by iteratively clearing it. To do so, simply pass `optimize_device_cache=True` to [`DPOConfig`]: ```python training_args = DPOConfig(..., optimize_device_cache=True) diff --git a/docs/source/judges.md b/docs/source/judges.md index 1f3d0a0ab2..be7fc140ce 100644 --- a/docs/source/judges.md +++ b/docs/source/judges.md @@ -13,7 +13,7 @@ pip install trl[judges] ## Using the provided judges -TRL provides several judges out of the box. For example, you can use the `HfPairwiseJudge` to compare two completions using a pre-trained model from the Hugging Face model hub: +TRL provides several judges out of the box. For example, you can use the [`HfPairwiseJudge`] to compare two completions using a pre-trained model from the Hugging Face model hub: ```python from trl import HfPairwiseJudge diff --git a/docs/source/logging.md b/docs/source/logging.md index 1a7a573dbc..af59efb962 100644 --- a/docs/source/logging.md +++ b/docs/source/logging.md @@ -3,7 +3,7 @@ As reinforcement learning algorithms are historically challenging to debug, it's important to pay careful attention to logging. By default, TRL trainers like [`PPOTrainer`] and [`GRPOTrainer`] save a lot of relevant information to supported experiment trackers like Trackio, Weights & Biases (wandb) or TensorBoard. -Upon initialization, pass the `report_to` argument to the respective configuration object (e.g., [`PPOConfig`] for `PPOTrainer`, or [`GRPOConfig`] for `GRPOTrainer`): +Upon initialization, pass the `report_to` argument to the respective configuration object (e.g., [`PPOConfig`] for [`PPOTrainer`], or [`GRPOConfig`] for [`GRPOTrainer`]): ```python # For PPOTrainer @@ -19,7 +19,7 @@ grpo_config = GRPOConfig( ) ``` -If you want to log with TensorBoard, you might also need to specify logging directories, for example, by adding `logging_dir=PATH_TO_LOGS` to the configuration object (e.g., `PPOConfig` or `GRPOConfig`). +If you want to log with TensorBoard, you might also need to specify logging directories, for example, by adding `logging_dir=PATH_TO_LOGS` to the configuration object (e.g., [`PPOConfig`] or [`GRPOConfig`]). ## PPO Logging @@ -83,9 +83,9 @@ Here's a brief explanation for the logged metrics provided in the data for the G ### Policy and Loss Metrics -* `kl`: The mean Kullback-Leibler (KL) divergence between the current policy and the reference policy. 
This is logged only if `beta` (the KL coefficient in `GRPOConfig`) is non-zero. +* `kl`: The mean Kullback-Leibler (KL) divergence between the current policy and the reference policy. This is logged only if `beta` (the KL coefficient in [`GRPOConfig`]) is non-zero. * `entropy`: Average entropy of token predictions across generated completions. -* If Liger GRPOLoss is used (`use_liger_loss: True` in `GRPOConfig`): +* If Liger GRPOLoss is used (`use_liger_loss: True` in [`GRPOConfig`]): * `clip_ratio`: The fraction of policy updates where the probability ratio was clipped according to the GRPO loss's epsilon bounds. * If standard GRPOLoss is used (`use_liger_loss: False`): * `clip_ratio/low_mean`: The mean fraction of instances where the probability ratio `r_t(θ)` was clipped at the lower bound `1 - epsilon_low` (occurs when advantage is negative and ratio is below the bound). diff --git a/docs/source/paper_index.md b/docs/source/paper_index.md index f46c372a4b..2160425edb 100644 --- a/docs/source/paper_index.md +++ b/docs/source/paper_index.md @@ -338,7 +338,7 @@ training_args = DPOConfig( ) ``` -For the unpaired version, the user should utilize `BCOConfig` and `BCOTrainer`. +For the unpaired version, the user should utilize [`BCOConfig`] and [`BCOTrainer`]. ### Self-Play Preference Optimization for Language Model Alignment diff --git a/docs/source/peft_integration.md b/docs/source/peft_integration.md index 27355669f8..8e8709a2df 100644 --- a/docs/source/peft_integration.md +++ b/docs/source/peft_integration.md @@ -20,7 +20,7 @@ Note: if you don't want to log with `wandb` remove `log_with="wandb"` in the scr ## How to use it? -Simply declare a `PeftConfig` object in your script and pass it through `.from_pretrained` to load the TRL+PEFT model. +Simply declare a [`~peft.PeftConfig`] object in your script and pass it through `.from_pretrained` to load the TRL+PEFT model. ```python from peft import LoraConfig diff --git a/docs/source/reducing_memory_usage.md b/docs/source/reducing_memory_usage.md index 6d97eed525..42221c1281 100644 --- a/docs/source/reducing_memory_usage.md +++ b/docs/source/reducing_memory_usage.md @@ -77,7 +77,7 @@ Packing, introduced in [Raffel et al., 2020](https://huggingface.co/papers/1910. Packing reduces padding by merging several sequences in one row when possible. We use an advanced method to be near-optimal in the way we pack the dataset. To enable packing, use `packing=True` in the [`SFTConfig`]. > [!TIP] -> In TRL 0.18 and earlier, packing used a more aggressive method that reduced padding to almost nothing, but had the downside of breaking sequence continuity for a large fraction of the dataset. To revert to this strategy, use `packing_strategy="wrapped"` in `SFTConfig`. +> In TRL 0.18 and earlier, packing used a more aggressive method that reduced padding to almost nothing, but had the downside of breaking sequence continuity for a large fraction of the dataset. To revert to this strategy, use `packing_strategy="wrapped"` in [`SFTConfig`]. ```python from trl import SFTConfig diff --git a/trl/data_utils.py b/trl/data_utils.py index 454dd24af1..e80311c211 100644 --- a/trl/data_utils.py +++ b/trl/data_utils.py @@ -271,7 +271,7 @@ def maybe_apply_chat_template( messages, where each message is a dictionary with keys `"role"` and `"content"`. Additionally, the example may contain a `"chat_template_kwargs"` key, which is a dictionary of additional keyword arguments to pass to the chat template renderer. 
- tokenizer (`PreTrainedTokenizerBase`):
+ tokenizer ([`~transformers.PreTrainedTokenizerBase`]):
Tokenizer to apply the chat template with.
tools (`list[Union[dict, Callable]]`, *optional*):
A list of tools (callable functions) that will be accessible to the model. If the template does not support
@@ -328,7 +328,7 @@ def unpair_preference_dataset(
Unpair a preference dataset.
Args:
- dataset (`Dataset` or `DatasetDict`):
+ dataset ([`~datasets.Dataset`] or [`~datasets.DatasetDict`]):
Preference dataset to unpair. The dataset must have columns `"chosen"`, `"rejected"` and optionally
`"prompt"`.
num_proc (`int`, *optional*):
@@ -337,7 +337,7 @@ def unpair_preference_dataset(
Meaningful description to be displayed alongside the progress bar while mapping examples.
Returns:
- `Dataset`: The unpaired preference dataset.
+ [`~datasets.Dataset`]: The unpaired preference dataset.
Example:
@@ -371,7 +371,7 @@ def maybe_unpair_preference_dataset(
Unpair a preference dataset if it is paired.
Args:
- dataset (`Dataset` or `DatasetDict`):
+ dataset ([`~datasets.Dataset`] or [`~datasets.DatasetDict`]):
Preference dataset to unpair. The dataset must have columns `"chosen"`, `"rejected"` and optionally
`"prompt"`.
num_proc (`int`, *optional*):
@@ -380,7 +380,8 @@ def maybe_unpair_preference_dataset(
Meaningful description to be displayed alongside the progress bar while mapping examples.
Returns:
- `Dataset` or `DatasetDict`: The unpaired preference dataset if it was paired, otherwise the original dataset.
+ [`~datasets.Dataset`] or [`~datasets.DatasetDict`]: The unpaired preference dataset if it was paired, otherwise
+ the original dataset.
Example:
@@ -473,7 +474,7 @@ def maybe_extract_prompt(example: dict[str, list]) -> dict[str, list]:
'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```
- Or, with the `map` method of `datasets.Dataset`:
+ Or, with the `map` method of [`~datasets.Dataset`]:
```python
>>> from trl import extract_prompt
@@ -664,7 +665,7 @@ def pack_dataset(
Pack sequences in a dataset into chunks of size `seq_length`.
Args:
- dataset (`Dataset` or `DatasetDict`):
+ dataset ([`~datasets.Dataset`] or [`~datasets.DatasetDict`]):
Dataset to pack.
seq_length (`int`):
Target sequence length to pack to.
@@ -679,8 +680,8 @@ def pack_dataset(
Additional keyword arguments to pass to the dataset's map method when packing examples.
Returns:
- `Dataset` or `DatasetDict`: The dataset with packed sequences. The number of examples may decrease as sequences
- are combined.
+ [`~datasets.Dataset`] or [`~datasets.DatasetDict`]: The dataset with packed sequences. The number of examples
+ may decrease as sequences are combined.
Example:
```python
@@ -720,7 +721,7 @@ def truncate_dataset(
Truncate sequences in a dataset to a specified `max_length`.
Args:
- dataset (`Dataset` or `DatasetDict`):
+ dataset ([`~datasets.Dataset`] or [`~datasets.DatasetDict`]):
Dataset to truncate.
max_length (`int`):
Maximum sequence length to truncate to.
@@ -728,7 +729,7 @@ def truncate_dataset(
Additional keyword arguments to pass to the dataset's map method when truncating examples.
Returns:
- `Dataset` or `DatasetDict`: The dataset with truncated sequences.
+ [`~datasets.Dataset`] or [`~datasets.DatasetDict`]: The dataset with truncated sequences.
Example: ```python diff --git a/trl/mergekit_utils.py b/trl/mergekit_utils.py index 7a64bfadf8..fc9787b8f6 100644 --- a/trl/mergekit_utils.py +++ b/trl/mergekit_utils.py @@ -264,7 +264,7 @@ def merge_models(config: MergeConfig, out_path: str): Merge two models using mergekit Args: - config (`MergeConfig`): The merge configuration. + config ([`MergeConfig`]): The merge configuration. out_path (`str`): The output path for the merged model. """ if not is_mergekit_available(): diff --git a/trl/models/modeling_base.py b/trl/models/modeling_base.py index d73f361add..9dd831d019 100644 --- a/trl/models/modeling_base.py +++ b/trl/models/modeling_base.py @@ -57,14 +57,17 @@ class PreTrainedModelWrapper(nn.Module): - r""" - A wrapper class around a (`transformers.PreTrainedModel`) to be compatible with the (`~transformers.PreTrained`) - class in order to keep some attributes and methods of the (`~transformers.PreTrainedModel`) class. + """ + Wrapper for a [`~transformers.PreTrainedModel`] implemented as a standard PyTorch [`torch.nn.Module`]. + + This class provides a compatibility layer that preserves the key attributes and methods of the original + [`~transformers.PreTrainedModel`], while exposing a uniform interface consistent with PyTorch modules. It enables + seamless integration of pretrained Transformer models into custom training, evaluation, or inference workflows. Attributes: - pretrained_model (`transformers.PreTrainedModel`): + pretrained_model ([`~transformers.PreTrainedModel`]): The model to be wrapped. - parent_class (`transformers.PreTrainedModel`): + parent_class ([`~transformers.PreTrainedModel`]): The parent class of the model to be wrapped. supported_args (`list`): The list of arguments that are supported by the wrapper class. @@ -111,19 +114,20 @@ def __init__( def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): r""" Instantiates a new model from a pretrained model from `transformers`. The pretrained model is loaded using the - `from_pretrained` method of the `transformers.PreTrainedModel` class. The arguments that are specific to the - `transformers.PreTrainedModel` class are passed along this method and filtered out from the `kwargs` argument. + `from_pretrained` method of the [`~transformers.PreTrainedModel`] class. The arguments that are specific to the + [`~transformers.PreTrainedModel`] class are passed along this method and filtered out from the `kwargs` + argument. Args: - pretrained_model_name_or_path (`str` or `transformers.PreTrainedModel`): + pretrained_model_name_or_path (`str` or [`~transformers.PreTrainedModel`]): The path to the pretrained model or its name. - *model_args (`list`, *optional*)): + *model_args (`list`, *optional*): Additional positional arguments passed along to the underlying model's `from_pretrained` method. **kwargs (`dict`, *optional*): Additional keyword arguments passed along to the underlying model's `from_pretrained` method. We also - pre-process the kwargs to extract the arguments that are specific to the `transformers.PreTrainedModel` - class and the arguments that are specific to trl models. The kwargs also support - `prepare_model_for_kbit_training` arguments from `peft` library. + pre-process the kwargs to extract the arguments that are specific to the + [`~transformers.PreTrainedModel`] class and the arguments that are specific to trl models. The kwargs + also support `prepare_model_for_kbit_training` arguments from `peft` library. 
""" if kwargs is not None: peft_config = kwargs.pop("peft_config", None) @@ -507,8 +511,8 @@ def add_and_load_reward_modeling_adapter( def push_to_hub(self, *args, **kwargs): r""" Push the pretrained model to the hub. This method is a wrapper around - `transformers.PreTrainedModel.push_to_hub`. Please refer to the documentation of - `transformers.PreTrainedModel.push_to_hub` for more information. + [`~transformers.PreTrainedModel.push_to_hub`]. Please refer to the documentation of + [`~transformers.PreTrainedModel.push_to_hub`] for more information. Args: *args (`list`, *optional*): @@ -521,8 +525,8 @@ def push_to_hub(self, *args, **kwargs): def save_pretrained(self, *args, **kwargs): r""" Save the pretrained model to a directory. This method is a wrapper around - `transformers.PreTrainedModel.save_pretrained`. Please refer to the documentation of - `transformers.PreTrainedModel.save_pretrained` for more information. + [`~transformers.PreTrainedModel.save_pretrained`]. Please refer to the documentation of + [`~transformers.PreTrainedModel.save_pretrained`] for more information. Args: *args (`list`, *optional*): @@ -596,14 +600,14 @@ def create_reference_model( Creates a static reference copy of a model. Note that model will be in `.eval()` mode. Args: - model (`PreTrainedModelWrapper`): The model to be copied. + model ([`PreTrainedModelWrapper`]): The model to be copied. num_shared_layers (`int`, *optional*): The number of initial layers that are shared between both models and kept frozen. pattern (`str`, *optional*): The shared layers are selected with a string pattern (e.g. "transformer.h.{layer}" for GPT2) and if a custom pattern is necessary it can be passed here. Returns: - `PreTrainedModelWrapper` + [`PreTrainedModelWrapper`] """ if is_deepspeed_zero3_enabled(): raise ValueError( @@ -665,13 +669,13 @@ def create_reference_model( class GeometricMixtureWrapper(GenerationMixin): - r""" + """ Geometric Mixture generation wrapper that samples from the logits of two model's geometric mixture. Args: - model (`PreTrainedModel`): The model to be wrapped. - ref_model (`PreTrainedModel`): The reference model. - generation_config (`GenerationConfig`): The generation config. + model ([`~transformers.PreTrainedModel`]): The model to be wrapped. + ref_model ([`~transformers.PreTrainedModel`]): The reference model. + generation_config ([`~transformers.GenerationConfig`]): The generation config. mixture_coef (`float`, *optional* - default: 0.5): The mixture coefficient. """ diff --git a/trl/models/modeling_value_head.py b/trl/models/modeling_value_head.py index f32953be41..7808b16419 100644 --- a/trl/models/modeling_value_head.py +++ b/trl/models/modeling_value_head.py @@ -60,26 +60,27 @@ def forward(self, hidden_states): class AutoModelForCausalLMWithValueHead(PreTrainedModelWrapper): - r""" + """ An autoregressive model with a value head in addition to the language model head. This class inherits from - `~trl.PreTrainedModelWrapper` and wraps a `transformers.PreTrainedModel` class. The wrapper class supports classic + [`PreTrainedModelWrapper`] and wraps a [`~transformers.PreTrainedModel`] class. The wrapper class supports classic functions such as `from_pretrained`, `push_to_hub` and `generate`. To call a method of the wrapped model, simply manipulate the `pretrained_model` attribute of this class. Class attributes: - - **transformers_parent_class** (`transformers.PreTrainedModel`) -- The parent class of the wrapped model. 
This
+ - **transformers_parent_class** ([`~transformers.PreTrainedModel`]) -- The parent class of the wrapped model.
+ This
should be set to `transformers.AutoModelForCausalLM` for this class.
- **supported_args** (`tuple`) -- A tuple of strings that are used to identify the arguments that are supported
- by the `ValueHead` class. Currently, the supported args are:
+ by the [`ValueHead`] class. Currently, the supported args are:
- **summary_dropout_prob** (`float`, `optional`, defaults to `None`) -- The dropout probability for the
- `ValueHead` class.
+ [`ValueHead`] class.
- **v_head_initializer_range** (`float`, `optional`, defaults to `0.2`) -- The initializer range for the
- `ValueHead` if a specific initialization strategy is selected.
+ [`ValueHead`] if a specific initialization strategy is selected.
- **v_head_init_strategy** (`str`, `optional`, defaults to `None`) -- The initialization strategy for the
- `ValueHead`. Currently, the supported strategies are:
+ [`ValueHead`]. Currently, the supported strategies are:
- - **`None`** -- Initializes the weights of the `ValueHead` with a random distribution. This is the
+ - **`None`** -- Initializes the weights of the [`ValueHead`] with a random distribution. This is the
default strategy.
- - **"normal"** -- Initializes the weights of the `ValueHead` with a normal distribution.
+ - **"normal"** -- Initializes the weights of the [`ValueHead`] with a normal distribution.
"""
transformers_parent_class = AutoModelForCausalLM
@@ -90,15 +91,15 @@ class AutoModelForCausalLMWithValueHead(PreTrainedModelWrapper):
)
def __init__(self, pretrained_model, **kwargs):
- r"""
+ """
Initializes the model.
Args:
- pretrained_model (`transformers.PreTrainedModel`):
+ pretrained_model ([`~transformers.PreTrainedModel`]):
The model to wrap. It should be a causal language model such as GPT2, or any model mapped inside the
`AutoModelForCausalLM` class.
kwargs (`dict`, `optional`):
- Additional keyword arguments, that are passed to the `ValueHead` class.
+ Additional keyword arguments that are passed to the [`ValueHead`] class.
"""
super().__init__(pretrained_model, **kwargs)
v_head_kwargs, _, _ = self._split_kwargs(kwargs)
@@ -114,8 +115,8 @@ def _init_weights(self, **kwargs):
Args:
**kwargs (`dict`, `optional`):
- Additional keyword arguments, that are passed to the `ValueHead` class. These arguments can contain the
- `v_head_init_strategy` argument as well as the `v_head_initializer_range` argument.
+ Additional keyword arguments that are passed to the [`ValueHead`] class. These arguments can contain
+ the `v_head_init_strategy` argument as well as the `v_head_initializer_range` argument.
@@ -263,18 +264,18 @@ def set_device_hook(module, input, outputs):
class AutoModelForSeq2SeqLMWithValueHead(PreTrainedModelWrapper):
- r"""
+ """
A seq2seq model with a value head in addition to the language model head. This class inherits from
- `~trl.PreTrainedModelWrapper` and wraps a `transformers.PreTrainedModel` class. The wrapper class supports classic
+ [`PreTrainedModelWrapper`] and wraps a [`~transformers.PreTrainedModel`] class. The wrapper class supports classic
functions such as `from_pretrained` and `push_to_hub` and also provides some additional functionalities such as
`generate`.
Args:
- pretrained_model (`transformers.PreTrainedModel`):
+ pretrained_model ([`~transformers.PreTrainedModel`]):
The model to wrap. It should be a causal language model such as GPT2,
or any model mapped inside the - `AutoModelForSeq2SeqLM` class. + [`~transformers.AutoModelForSeq2SeqLM`] class. kwargs: - Additional keyword arguments passed along to the `ValueHead` class. + Additional keyword arguments passed along to the [`ValueHead`] class. """ transformers_parent_class = AutoModelForSeq2SeqLM diff --git a/trl/models/utils.py b/trl/models/utils.py index 1bdaad82e8..1f58c38425 100644 --- a/trl/models/utils.py +++ b/trl/models/utils.py @@ -102,15 +102,15 @@ def setup_chat_format( `tokenizer.chat_template` to `None`. Args: - model (`~transformers.PreTrainedModel`): The model to be modified. - tokenizer (`~transformers.PreTrainedTokenizer`): The tokenizer to be modified. + model ([`~transformers.PreTrainedModel`]): The model to be modified. + tokenizer ([`~transformers.PreTrainedTokenizer`]): The tokenizer to be modified. format (`Optional[Literal["chatml"]]`): The format to be set. Defaults to "chatml". resize_to_multiple_of (`int` or `None`): Number to resize the embedding layer to. Defaults to None. Returns: - model (`~transformers.PreTrainedModel`): + model ([`~transformers.PreTrainedModel`]): The modified model. - tokenizer (`~transformers.PreTrainedTokenizer`): + tokenizer ([`~transformers.PreTrainedTokenizer`]): The modified tokenizer. """ warnings.warn( @@ -178,9 +178,9 @@ def clone_chat_template( the embedding dimensions. Args: - model (`PreTrainedModel`): + model ([`~transformers.PreTrainedModel`]): Model to update. - tokenizer (`PreTrainedTokenizer`): + tokenizer ([`~transformers.PreTrainedTokenizer`]): Tokenizer to update. source_tokenizer_path (`str`): Path or identifier of the pretrained tokenizer to clone from. @@ -189,9 +189,9 @@ def clone_chat_template( new vocabulary size to the nearest multiple of this value. Returns: - model (`PreTrainedModel`): + model ([`~transformers.PreTrainedModel`]): Updated model with resized token embeddings and EOS token configured. - tokenizer (`~transformers.PreTrainedTokenizer`): + tokenizer ([`~transformers.PreTrainedTokenizer`]): Updated tokenizer with the chat template and special tokens applied. added_tokens (`list[int]`): List of tokens that were added to the tokenizer from the source tokenizer. @@ -316,7 +316,7 @@ def unwrap_model_for_generation( Args: model (`Union[DistributedDataParallel, DeepSpeedEngine]`): Model to be unwrapped. - accelerator (`~accelerate.Accelerator`): + accelerator ([`~accelerate.Accelerator`]): Accelerator instance managing the model. gather_deepspeed3_params (`bool`, *optional*, defaults to `True`): Whether to gather weights for DeepSpeed ZeRO Stage 3 models. If `False`, skips parameter gathering, which diff --git a/trl/scripts/utils.py b/trl/scripts/utils.py index 982dd88921..1d92d91bb4 100644 --- a/trl/scripts/utils.py +++ b/trl/scripts/utils.py @@ -60,7 +60,8 @@ class DatasetConfig: Configuration for a dataset. This class matches the signature of [`~datasets.load_dataset`] and the arguments are used directly in the - `datasets.load_dataset` function. You can refer to the `datasets.load_dataset` documentation for more details. + [`~datasets.load_dataset`] function. You can refer to the [`~datasets.load_dataset`] documentation for more + details. Parameters: path (`str`): @@ -422,11 +423,11 @@ def get_dataset(mixture_config: DatasetMixtureConfig) -> DatasetDict: Load a mixture of datasets based on the configuration. Args: - mixture_config (`DatasetMixtureConfig`): + mixture_config ([`DatasetMixtureConfig`]): Script arguments containing dataset configuration. 
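For context on how these pieces fit together, here is a minimal sketch of loading a dataset mixture with the utilities documented above. The import path follows `trl/scripts/utils.py`, but the exact `DatasetMixtureConfig` fields (a `datasets` list of `DatasetConfig` entries) are an assumption, and the dataset names are purely illustrative:

```python
from trl.scripts.utils import DatasetConfig, DatasetMixtureConfig, get_dataset

# Hypothetical mixture of two Hub datasets; `path` mirrors the documented
# `load_dataset`-style signature of `DatasetConfig`.
mixture_config = DatasetMixtureConfig(
    datasets=[
        DatasetConfig(path="trl-lib/tldr"),
        DatasetConfig(path="trl-lib/ultrafeedback-prompt"),
    ],
)

dataset = get_dataset(mixture_config)  # combined `datasets.DatasetDict`
```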
Returns:
- `DatasetDict`:
+ [`~datasets.DatasetDict`]:
Combined dataset(s) from the mixture configuration, with optional train/test split if `test_split_size` is
set.
diff --git a/trl/trainer/bco_trainer.py b/trl/trainer/bco_trainer.py
index b0aebb6fbc..e1bba2ebc8 100644
--- a/trl/trainer/bco_trainer.py
+++ b/trl/trainer/bco_trainer.py
@@ -283,25 +283,25 @@ class BCOTrainer(BaseTrainer):
Initialize BCOTrainer from the [BCO](https://huggingface.co/papers/2404.04656) paper.
Args:
- model (`transformers.PreTrainedModel`):
- The model to train, preferably an `AutoModelForSequenceClassification`.
- ref_model (`PreTrainedModelWrapper`):
+ model ([`~transformers.PreTrainedModel`]):
+ The model to train, preferably an [`~transformers.AutoModelForSequenceClassification`].
+ ref_model ([`PreTrainedModelWrapper`]):
Hugging Face transformer model with a causal language modelling head. Used for implicit reward computation
and loss. If no reference model is provided, the trainer will create a reference model with the same
architecture as the model to be optimized.
- args (`BCOConfig`):
+ args ([`BCOConfig`]):
The arguments to use for training.
- train_dataset (`datasets.Dataset`):
+ train_dataset ([`~datasets.Dataset`]):
The dataset to use for training.
- eval_dataset (`datasets.Dataset`):
+ eval_dataset ([`~datasets.Dataset`]):
The dataset to use for evaluation.
processing_class ([`~transformers.PreTrainedTokenizerBase`], [`~transformers.BaseImageProcessor`], [`~transformers.FeatureExtractionMixin`] or [`~transformers.ProcessorMixin`], *optional*):
Processing class used to process the data. If provided, will be used to automatically process the inputs
for the model, and it will be saved along the model to make it easier to rerun an interrupted training or
reuse the fine-tuned model.
- data_collator (`transformers.DataCollator`, *optional*):
+ data_collator ([`~transformers.DataCollator`], *optional*):
The data collator to use for training. If None is specified, the default data collator
- (`DPODataCollatorWithPadding`) will be used which will pad the sequences to the maximum length of the
+ ([`DPODataCollatorWithPadding`]) will be used which will pad the sequences to the maximum length of the
sequences in the batch, given a dataset of paired sequences.
model_init (`Callable[[], transformers.PreTrainedModel]`):
The model initializer to use for training. If None is specified, the default model initializer will be
diff --git a/trl/trainer/callbacks.py b/trl/trainer/callbacks.py
index 2c1240b0fc..5bf5cf60ce 100644
--- a/trl/trainer/callbacks.py
+++ b/trl/trainer/callbacks.py
@@ -251,14 +251,14 @@ class WinRateCallback(TrainerCallback):
```
Args:
- judge (`BasePairwiseJudge`):
+ judge ([`BasePairwiseJudge`]):
The judge to use for comparing completions.
trainer (`Trainer`):
Trainer to which the callback will be attached. The trainer's evaluation dataset must include a `"prompt"`
column containing the prompts for generating completions. If the `Trainer` has a reference model (via the
`ref_model` attribute), it will use this reference model for generating the reference completions;
otherwise, it defaults to using the initial model.
- generation_config (`GenerationConfig`, *optional*):
+ generation_config ([`~transformers.GenerationConfig`], *optional*):
The generation config to use for generating completions.
num_prompts (`int`, *optional*):
The number of prompts to generate completions for.
If not provided, defaults to the number of examples in
@@ -439,7 +439,7 @@ class LogCompletionsCallback(TrainerCallback):
trainer (`Trainer`):
Trainer to which the callback will be attached. The trainer's evaluation dataset must include a `"prompt"`
column containing the prompts for generating completions.
- generation_config (`GenerationConfig`, *optional*):
+ generation_config ([`~transformers.GenerationConfig`], *optional*):
The generation config to use for generating completions.
num_prompts (`int`, *optional*):
The number of prompts to generate completions for. If not provided, defaults to the number of examples in
@@ -569,7 +569,7 @@ def accuracy_scorer(prompt: str, completion: str) -> float:
Dictionary mapping scorer names to scorer functions. If `None`, operates in tracing mode (predictions
only). If provided, operates in evaluation mode (predictions + scores + summary). Scorer functions should
have signature: `scorer(prompt: str, completion: str) -> Union[float, int]`
- generation_config (`GenerationConfig`, *optional*):
+ generation_config ([`~transformers.GenerationConfig`], *optional*):
Generation config to use for generating completions.
num_prompts (`int` or `None`, *optional*):
Number of prompts to generate completions for. If not provided, defaults to the number of examples in the
diff --git a/trl/trainer/cpo_trainer.py b/trl/trainer/cpo_trainer.py
index 1cd46ae792..5c483cfb6b 100644
--- a/trl/trainer/cpo_trainer.py
+++ b/trl/trainer/cpo_trainer.py
@@ -77,17 +77,17 @@ class CPOTrainer(BaseTrainer):
Initialize CPOTrainer.
Args:
- model (`transformers.PreTrainedModel`):
- The model to train, preferably an `AutoModelForSequenceClassification`.
- args (`CPOConfig`):
+ model ([`~transformers.PreTrainedModel`]):
+ The model to train, preferably an [`~transformers.AutoModelForSequenceClassification`].
+ args ([`CPOConfig`]):
The CPO config arguments to use for training.
- data_collator (`transformers.DataCollator`):
+ data_collator ([`~transformers.DataCollator`]):
The data collator to use for training. If None is specified, the default data collator
- (`DPODataCollatorWithPadding`) will be used which will pad the sequences to the maximum length of the
+ ([`DPODataCollatorWithPadding`]) will be used which will pad the sequences to the maximum length of the
sequences in the batch, given a dataset of paired sequences.
- train_dataset (`datasets.Dataset`):
+ train_dataset ([`~datasets.Dataset`]):
The dataset to use for training.
- eval_dataset (`datasets.Dataset`):
+ eval_dataset ([`~datasets.Dataset`]):
The dataset to use for evaluation.
processing_class ([`~transformers.PreTrainedTokenizerBase`], [`~transformers.BaseImageProcessor`], [`~transformers.FeatureExtractionMixin`] or [`~transformers.ProcessorMixin`], *optional*):
Processing class used to process the data. If provided, will be used to automatically process the inputs
diff --git a/trl/trainer/dpo_trainer.py b/trl/trainer/dpo_trainer.py
index bfcc4b4c53..f82ba4a0b6 100644
--- a/trl/trainer/dpo_trainer.py
+++ b/trl/trainer/dpo_trainer.py
@@ -197,13 +197,13 @@ class DPOTrainer(BaseTrainer):
using [`~transformers.AutoModelForCausalLM.from_pretrained`] with the keyword arguments in
`args.model_init_kwargs`.
- A [`~transformers.PreTrainedModel`] object. Only causal language models are supported.
- ref_model (`PreTrainedModelWrapper`):
+ ref_model ([`PreTrainedModelWrapper`]):
Hugging Face transformer model with a causal language modelling head. Used for implicit reward computation
and loss.
If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized. args ([`DPOConfig`], *optional*): Configuration for this trainer. If `None`, a default configuration is used. - data_collator (`DataCollator`, *optional*): + data_collator ([`~transformers.DataCollator`], *optional*): Function to use to form a batch from a list of elements of the processed `train_dataset` or `eval_dataset`. Will default to [`DataCollatorForPreference`]. train_dataset ([`~datasets.Dataset`] or [`~datasets.IterableDataset`]): @@ -689,7 +689,7 @@ def tokenize_row( Args: features (`dict[str, str]`): Row of the dataset, should contain the keys `"prompt"`, `"chosen"`, and `"rejected"`. - processing_class (`PreTrainedTokenizerBase`): + processing_class ([`~transformers.PreTrainedTokenizerBase`]): Processing class used to process the data. max_prompt_length (`int` or `None`): Maximum length of the prompt sequence. If `None`, the prompt sequence is not truncated. diff --git a/trl/trainer/grpo_config.py b/trl/trainer/grpo_config.py index 9c4f4b3a68..f6f3c6e346 100644 --- a/trl/trainer/grpo_config.py +++ b/trl/trainer/grpo_config.py @@ -92,10 +92,10 @@ class GRPOConfig(TrainingArguments): cache_implementation (`str`, *optional*): Implementation of the cache method for faster generation when `use_vllm` is set to `False`. generation_kwargs (`dict[str, Any]`, *optional*): - Additional keyword arguments to pass to `GenerationConfig` (if using transformers) or `SamplingParams` (if - using vLLM) when sampling completions. This can be used to further customize the generation behavior, such - as setting `suppress_tokens`, `num_beams`, etc. If it contains keys that conflict with the other generation - parameters (like `min_p`, `top_p`, etc.), they will override them. + Additional keyword arguments to pass to [`~transformers.GenerationConfig`] (if using transformers) or + `SamplingParams` (if using vLLM) when sampling completions. This can be used to further customize the + generation behavior, such as setting `suppress_tokens`, `num_beams`, etc. If it contains keys that conflict + with the other generation parameters (like `min_p`, `top_p`, etc.), they will override them. > Parameters that control generation acceleration powered by vLLM diff --git a/trl/trainer/grpo_trainer.py b/trl/trainer/grpo_trainer.py index 86c464190d..352a0144ef 100644 --- a/trl/trainer/grpo_trainer.py +++ b/trl/trainer/grpo_trainer.py @@ -174,7 +174,7 @@ def reward_func(completions, **kwargs): processing class is loaded from the model's name with [`~transformers.AutoProcessor.from_pretrained`]. A padding token, `tokenizer.pad_token`, must be set. If the processing class has not set a padding token, `tokenizer.eos_token` will be used as the default. - reward_processing_classes (`Union[PreTrainedTokenizerBase, list[PreTrainedTokenizerBase]]`, *optional*): + reward_processing_classes ([`~transformers.PreTrainedTokenizerBase`] or `list[PreTrainedTokenizerBase]`, *optional*): Processing classes corresponding to the reward functions specified in `reward_funcs`. Can be either: - A single processing class: Used when `reward_funcs` contains only one reward function. diff --git a/trl/trainer/kto_trainer.py b/trl/trainer/kto_trainer.py index ef65a45d77..3d3045598c 100644 --- a/trl/trainer/kto_trainer.py +++ b/trl/trainer/kto_trainer.py @@ -279,25 +279,25 @@ class KTOTrainer(BaseTrainer): Initialize KTOTrainer. 
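Before the argument-by-argument changes below, a minimal sketch of how the documented KTOTrainer arguments are typically wired together may help; the model and dataset names are illustrative assumptions, not part of the diff:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

# Illustrative model and unpaired preference dataset.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/kto-mix-14k", split="train")

trainer = KTOTrainer(
    model=model,
    args=KTOConfig(output_dir="kto-model"),  # hypothetical output path
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```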
Args:
- model (`transformers.PreTrainedModel`):
- The model to train, preferably an `AutoModelForSequenceClassification`.
- ref_model (`PreTrainedModelWrapper`):
+ model ([`~transformers.PreTrainedModel`]):
+ The model to train, preferably an [`~transformers.AutoModelForSequenceClassification`].
+ ref_model ([`PreTrainedModelWrapper`]):
Hugging Face transformer model with a causal language modelling head. Used for implicit reward computation
and loss. If no reference model is provided, the trainer will create a reference model with the same
architecture as the model to be optimized.
- args (`KTOConfig`):
+ args ([`KTOConfig`]):
The arguments to use for training.
- train_dataset (`datasets.Dataset`):
+ train_dataset ([`~datasets.Dataset`]):
The dataset to use for training.
- eval_dataset (`datasets.Dataset`):
+ eval_dataset ([`~datasets.Dataset`]):
The dataset to use for evaluation.
processing_class ([`~transformers.PreTrainedTokenizerBase`], [`~transformers.BaseImageProcessor`], [`~transformers.FeatureExtractionMixin`] or [`~transformers.ProcessorMixin`], *optional*):
Processing class used to process the data. If provided, will be used to automatically process the inputs
for the model, and it will be saved along the model to make it easier to rerun an interrupted training or
reuse the fine-tuned model.
- data_collator (`transformers.DataCollator`, *optional*):
+ data_collator ([`~transformers.DataCollator`], *optional*):
The data collator to use for training. If None is specified, the default data collator
- (`DPODataCollatorWithPadding`) will be used which will pad the sequences to the maximum length of the
+ ([`DPODataCollatorWithPadding`]) will be used which will pad the sequences to the maximum length of the
sequences in the batch, given a dataset of paired sequences.
model_init (`Callable[[], transformers.PreTrainedModel]`):
The model initializer to use for training. If None is specified, the default model initializer will be
diff --git a/trl/trainer/nash_md_trainer.py b/trl/trainer/nash_md_trainer.py
index 184d1f95f1..ad2c20be7a 100644
--- a/trl/trainer/nash_md_trainer.py
+++ b/trl/trainer/nash_md_trainer.py
@@ -58,25 +58,26 @@ class NashMDTrainer(OnlineDPOTrainer):
It is implemented as a subclass of [`OnlineDPOTrainer`].
Args:
- model (`transformers.PreTrainedModel`):
+ model ([`~transformers.PreTrainedModel`]):
The model to train, preferably an `AutoModelForCausalLM`.
- ref_model (`PreTrainedModelWrapper`):
+ ref_model ([`PreTrainedModelWrapper`]):
Hugging Face transformer model with a causal language modelling head. Used for implicit reward computation
and loss. If no reference model is provided, the trainer will create a reference model with the same
architecture as the model to be optimized.
- reward_funcs (`transformers.PreTrainedModel`):
- The reward model to score completions with, preferably an `AutoModelForSequenceClassification`.
- judge (`BasePairwiseJudge`):
+ reward_funcs ([`~transformers.PreTrainedModel`]):
+ The reward model to score completions with, preferably an
+ [`~transformers.AutoModelForSequenceClassification`].
+ judge ([`BasePairwiseJudge`]):
The judge to use for pairwise comparison of model completions.
- args (`NashMDConfig`):
+ args ([`NashMDConfig`]):
The NashMD config arguments to use for training.
- data_collator (`transformers.DataCollator`):
+ data_collator ([`~transformers.DataCollator`]):
The data collator to use for training.
If None is specified, the default data collator - (`DPODataCollatorWithPadding`) will be used which will pad the sequences to the maximum length of the + ([`DPODataCollatorWithPadding`]) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences. - train_dataset (`datasets.Dataset`): + train_dataset ([`~datasets.Dataset`]): The dataset to use for training. - eval_dataset (`datasets.Dataset`): + eval_dataset ([`~datasets.Dataset`]): The dataset to use for evaluation. processing_class ([`~transformers.PreTrainedTokenizerBase`], [`~transformers.BaseImageProcessor`], [`~transformers.FeatureExtractionMixin`] or [`~transformers.ProcessorMixin`], *optional*): Processing class used to process the data. If provided, will be used to automatically process the inputs diff --git a/trl/trainer/online_dpo_config.py b/trl/trainer/online_dpo_config.py index 8b4a9472ca..08ed1a6700 100644 --- a/trl/trainer/online_dpo_config.py +++ b/trl/trainer/online_dpo_config.py @@ -95,10 +95,10 @@ class may differ from those in [`~transformers.TrainingArguments`]. cache_implementation (`str`, *optional*): Implementation of the cache method for faster generation when `use_vllm` is set to `False`. generation_kwargs (`dict[str, Any]`, *optional*): - Additional keyword arguments to pass to `GenerationConfig` (if using transformers) or `SamplingParams` (if - using vLLM) when sampling completions. This can be used to further customize the generation behavior, such - as setting `suppress_tokens`, `num_beams`, etc. If it contains keys that conflict with the other generation - parameters (like `min_p`, `top_p`, etc.), they will override them. + Additional keyword arguments to pass to [`~transformers.GenerationConfig`] (if using transformers) or + `SamplingParams` (if using vLLM) when sampling completions. This can be used to further customize the + generation behavior, such as setting `suppress_tokens`, `num_beams`, etc. If it contains keys that conflict + with the other generation parameters (like `min_p`, `top_p`, etc.), they will override them. > Parameters that control generation acceleration powered by vLLM diff --git a/trl/trainer/online_dpo_trainer.py b/trl/trainer/online_dpo_trainer.py index 581cd9fed3..44d0119974 100644 --- a/trl/trainer/online_dpo_trainer.py +++ b/trl/trainer/online_dpo_trainer.py @@ -117,10 +117,10 @@ class OnlineDPOTrainer(BaseTrainer): using [`~transformers.AutoModelForCausalLM.from_pretrained`] with the keyword arguments in `args.model_init_kwargs`. - A [`~transformers.PreTrainedModel`] object. Only causal language models are supported. - ref_model (`transformers.PreTrainedModel` or `torch.nn.Module` or `None`): + ref_model ([`~transformers.PreTrainedModel`] or `torch.nn.Module` or `None`): The reference model to use for training. If None is specified, the reference model will be created from the model. - judge (`BasePairwiseJudge`): + judge ([`BasePairwiseJudge`]): The judge to use for pairwise comparison of model completions. reward_funcs (`Union[RewardFunc, list[RewardFunc]]`, *optional*): Reward functions to be used for computing the rewards. To compute the rewards, we call all the reward @@ -131,11 +131,11 @@ class OnlineDPOTrainer(BaseTrainer): - A list of reward functions: Must all be of compatible types. Note: Only one of `judge`, or `reward_funcs` should be provided. - args (`OnlineDPOConfig`): + args ([`OnlineDPOConfig`]): The online DPO config arguments to use for training. 
- data_collator (`transformers.DataCollator`): + data_collator ([`~transformers.DataCollator`]): The data collator to use for training. If None is specified, the default data collator - (`DPODataCollatorWithPadding`) will be used which will pad the sequences to the maximum length of the + ([`DPODataCollatorWithPadding`]) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences. train_dataset ([`~datasets.Dataset`] or [`~datasets.IterableDataset`]): The dataset to use for training. @@ -145,7 +145,7 @@ class OnlineDPOTrainer(BaseTrainer): Processing class used to process the data. If provided, will be used to automatically process the inputs for the model, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model. - reward_processing_classes (`Union[PreTrainedTokenizerBase, list[PreTrainedTokenizerBase]]`, *optional*): + reward_processing_classes ([`~transformers.PreTrainedTokenizerBase`] or `list[PreTrainedTokenizerBase]`, *optional*): Processing classes corresponding to the reward functions specified in `reward_funcs`. Can be either: - A single processing class: Used when `reward_funcs` contains only one reward function. diff --git a/trl/trainer/orpo_trainer.py b/trl/trainer/orpo_trainer.py index 450f14168c..65ea1b5e50 100644 --- a/trl/trainer/orpo_trainer.py +++ b/trl/trainer/orpo_trainer.py @@ -81,17 +81,17 @@ class ORPOTrainer(BaseTrainer): Initialize ORPOTrainer. Args: - model (`transformers.PreTrainedModel`): - The model to train, preferably an `AutoModelForSequenceClassification`. - args (`ORPOConfig`): + model ([`~transformers.PreTrainedModel`]): + The model to train, preferably an [`~transformers.AutoModelForSequenceClassification`]. + args ([`ORPOConfig`]): The ORPO config arguments to use for training. - data_collator (`transformers.DataCollator`): + data_collator ([`~transformers.DataCollator`]): The data collator to use for training. If None is specified, the default data collator - (`DPODataCollatorWithPadding`) will be used which will pad the sequences to the maximum length of the + ([`DPODataCollatorWithPadding`]) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences. - train_dataset (`datasets.Dataset`): + train_dataset ([`~datasets.Dataset`]): The dataset to use for training. - eval_dataset (`datasets.Dataset`): + eval_dataset ([`~datasets.Dataset`]): The dataset to use for evaluation. processing_class ([`~transformers.PreTrainedTokenizerBase`], [`~transformers.BaseImageProcessor`], [`~transformers.FeatureExtractionMixin`] or [`~transformers.ProcessorMixin`], *optional*): Processing class used to process the data. If provided, will be used to automatically process the inputs diff --git a/trl/trainer/prm_trainer.py b/trl/trainer/prm_trainer.py index 7189e7575c..73d54ec53b 100644 --- a/trl/trainer/prm_trainer.py +++ b/trl/trainer/prm_trainer.py @@ -51,17 +51,17 @@ class PRMTrainer(BaseTrainer): Initialize PRMTrainer. Args: - model (`transformers.PreTrainedModel`): + model ([`~transformers.PreTrainedModel`]): The model to train, preferably an `AutoModelForTokenClassification`. - args (`PRMConfig`): + args ([`PRMConfig`]): The arguments to use for training. - data_collator (`transformers.DataCollator`): + data_collator ([`~transformers.DataCollator`]): The data collator to use for training. 
If None is specified, the default data collator - (`DataCollatorForTokenClassification`) will be used which will pad the sequences to the maximum length of - the sequences in the batch, given a dataset of paired sequences. - train_dataset (`datasets.Dataset`): + ([`~transformers.DataCollatorForTokenClassification`]) will be used which will pad the sequences to the + maximum length of the sequences in the batch, given a dataset of paired sequences. + train_dataset ([`~datasets.Dataset`]): The dataset to use for training. - eval_dataset (`datasets.Dataset`): + eval_dataset ([`~datasets.Dataset`]): The dataset to use for evaluation. processing_class ([`~transformers.PreTrainedTokenizerBase`], [`~transformers.BaseImageProcessor`], [`~transformers.FeatureExtractionMixin`] or [`~transformers.ProcessorMixin`], *optional*): Processing class used to process the data. If provided, will be used to automatically process the inputs @@ -219,7 +219,7 @@ def tokenize_row( Args: features (`dict[str, str]`): Row of the dataset, should contain the keys `"prompt"`, `"completions"`, and `"labels"`. - tokenizer (`PreTrainedTokenizerBase`): + tokenizer ([`~transformers.PreTrainedTokenizerBase`]): Tokenizer used to process the data. step_separator (`str`): Separator between steps in the completion. diff --git a/trl/trainer/rloo_config.py b/trl/trainer/rloo_config.py index 0f90a8e198..eb27ca1f9a 100644 --- a/trl/trainer/rloo_config.py +++ b/trl/trainer/rloo_config.py @@ -93,10 +93,10 @@ class RLOOConfig(TrainingArguments): cache_implementation (`str`, *optional*): Implementation of the cache method for faster generation when `use_vllm` is set to `False`. generation_kwargs (`dict[str, Any]`, *optional*): - Additional keyword arguments to pass to `GenerationConfig` (if using transformers) or `SamplingParams` (if - using vLLM) when sampling completions. This can be used to further customize the generation behavior, such - as setting `suppress_tokens`, `num_beams`, etc. If it contains keys that conflict with the other generation - parameters (like `min_p`, `top_p`, etc.), they will override them. + Additional keyword arguments to pass to [`~transformers.GenerationConfig`] (if using transformers) or + `SamplingParams` (if using vLLM) when sampling completions. This can be used to further customize the + generation behavior, such as setting `suppress_tokens`, `num_beams`, etc. If it contains keys that conflict + with the other generation parameters (like `min_p`, `top_p`, etc.), they will override them. > Parameters that control generation acceleration powered by vLLM diff --git a/trl/trainer/rloo_trainer.py b/trl/trainer/rloo_trainer.py index 62b6f77404..2ab9cef69f 100644 --- a/trl/trainer/rloo_trainer.py +++ b/trl/trainer/rloo_trainer.py @@ -171,7 +171,7 @@ def reward_func(completions, **kwargs): processing class is loaded from the model's name with [`~transformers.AutoProcessor.from_pretrained`]. A padding token, `tokenizer.pad_token`, must be set. If the processing class has not set a padding token, `tokenizer.eos_token` will be used as the default. - reward_processing_classes (`Union[PreTrainedTokenizerBase, list[PreTrainedTokenizerBase]]`, *optional*): + reward_processing_classes ([`~transformers.PreTrainedTokenizerBase`] or `list[PreTrainedTokenizerBase]`, *optional*): Processing classes corresponding to the reward functions specified in `reward_funcs`. Can be either: - A single processing class: Used when `reward_funcs` contains only one reward function. 
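The `generation_kwargs` passage above reads the same in [`GRPOConfig`], [`OnlineDPOConfig`], and [`RLOOConfig`], so a single sketch covers all three; the output directory and the specific keys are illustrative:

```python
from trl import RLOOConfig

# Extra sampling options forwarded at generation time. With `use_vllm=False`
# they reach `transformers.GenerationConfig`; with vLLM they reach
# `SamplingParams`. Conflicting keys (e.g. `top_p`) override the built-in
# generation parameters.
training_args = RLOOConfig(
    output_dir="rloo-model",  # hypothetical output path
    generation_kwargs={"suppress_tokens": [0], "num_beams": 1},
)
```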
diff --git a/trl/trainer/sft_trainer.py b/trl/trainer/sft_trainer.py
index cee2fb82ed..9014470b55 100644
--- a/trl/trainer/sft_trainer.py
+++ b/trl/trainer/sft_trainer.py
@@ -273,9 +273,9 @@ class DataCollatorForVisionLanguageModeling(DataCollatorMixin):
Additional keys may be present depending on the processor, such as `"image_grid_thw"`.
Args:
- processor (`ProcessorMixin`):
- The processor used to tokenize text and process images. It must be a subclass of `ProcessorMixin` and
- include a `tokenizer` with a defined `pad_token_id`.
+ processor ([`~transformers.ProcessorMixin`]):
+ The processor used to tokenize text and process images. It must be a subclass of
+ [`~transformers.ProcessorMixin`] and include a `tokenizer` with a defined `pad_token_id`.
max_length (`int` or `None`, optional, defaults to `None`):
Maximum sequence length for input tokens. If `None`, no truncation is applied.
completion_only_loss (`bool`, *optional*, defaults to `False`):
diff --git a/trl/trainer/utils.py b/trl/trainer/utils.py
index 00c71c8c4c..dc4d9ac71b 100644
--- a/trl/trainer/utils.py
+++ b/trl/trainer/utils.py
@@ -226,7 +226,7 @@ class RewardDataCollatorWithPadding:
`trl.trainer.reward_trainer.DataCollatorForPreference` instead.
Args:
- tokenizer (`PreTrainedTokenizerBase`):
+ tokenizer ([`~transformers.PreTrainedTokenizerBase`]):
The tokenizer used for encoding the data.
padding (`Union[bool, str, PaddingStrategy]`, `optional`, defaults to `True`):
Padding strategy to pass to the tokenizer.
@@ -1111,7 +1111,7 @@ def generate(
The tensor containing the input queries.
pad_token_id (`int`):
The token ID representing the pad token.
- generation_config (`GenerationConfig`):
+ generation_config ([`~transformers.GenerationConfig`]):
The configuration for the generation process.
Returns:
@@ -1263,7 +1263,7 @@ def decode_and_strip_padding(inputs: torch.Tensor, tokenizer: PreTrainedTokenize
Args:
inputs (`torch.Tensor`):
The input tensor to be decoded.
- tokenizer (`transformers.PreTrainedTokenizerBase`):
+ tokenizer ([`~transformers.PreTrainedTokenizerBase`]):
The tokenizer used to decode the input tensor.
Returns:
@@ -1294,7 +1294,7 @@ def generate_model_card(
comet_url: Optional[str] = None,
) -> ModelCard:
"""
- Generate a `ModelCard` from a template.
+ Generate a [`~huggingface_hub.ModelCard`] from a template.
Args:
base_model (`str` or `None`):
@@ -1323,7 +1323,7 @@ def generate_model_card(
ArXiv paper ID as `YYMM.NNNNN`.
Returns:
- `ModelCard`:
+ [`~huggingface_hub.ModelCard`]:
A ModelCard object.
"""
card_data = ModelCardData(
@@ -1377,7 +1377,7 @@ def log_table_to_comet_experiment(name: str, table: pd.DataFrame) -> None:
Args:
name (`str`):
Table name.
- table (`pd.DataFrame`):
+ table (`pandas.DataFrame`):
The Pandas DataFrame containing the table to log.
"""
if not is_comet_available():
diff --git a/trl/trainer/xpo_trainer.py b/trl/trainer/xpo_trainer.py
index 1a576f36b4..e289bce5bb 100644
--- a/trl/trainer/xpo_trainer.py
+++ b/trl/trainer/xpo_trainer.py
@@ -57,25 +57,26 @@ class XPOTrainer(OnlineDPOTrainer):
It is implemented as a subclass of [`OnlineDPOTrainer`].
Args:
- model (`transformers.PreTrainedModel`):
+ model ([`~transformers.PreTrainedModel`]):
The model to train, preferably an `AutoModelForCausalLM`.
- ref_model (`PreTrainedModelWrapper`):
+ ref_model ([`PreTrainedModelWrapper`]):
Hugging Face transformer model with a causal language modelling head. Used for implicit reward computation
and loss.
If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized. - reward_funcs (`transformers.PreTrainedModel`): - The reward model to score completions with, preferably an `AutoModelForSequenceClassification`. - judge (`BasePairwiseJudge`): + reward_funcs ([`~transformers.PreTrainedModel`]): + The reward model to score completions with, preferably an + [`~transformers.AutoModelForSequenceClassification`]. + judge ([`BasePairwiseJudge`]): The judge to use for pairwise comparison of model completions. - args (`XPOConfig`): + args ([`XPOConfig`]): The XPO config arguments to use for training. - data_collator (`transformers.DataCollator`): + data_collator ([`~transformers.DataCollator`]): The data collator to use for training. If None is specified, the default data collator - (`DPODataCollatorWithPadding`) will be used which will pad the sequences to the maximum length of the + ([`DPODataCollatorWithPadding`]) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences. - train_dataset (`datasets.Dataset`): + train_dataset ([`~datasets.Dataset`]): The dataset to use for training. - eval_dataset (`datasets.Dataset`): + eval_dataset ([`~datasets.Dataset`]): The dataset to use for evaluation. processing_class ([`~transformers.PreTrainedTokenizerBase`], [`~transformers.BaseImageProcessor`], [`~transformers.FeatureExtractionMixin`] or [`~transformers.ProcessorMixin`], *optional*): Processing class used to process the data. If provided, will be used to automatically process the inputs
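To close the loop on the XPOTrainer docstring above, here is a minimal sketch of the documented arguments in use; the judge choice, model, and dataset names are illustrative assumptions, not taken from the diff:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import PairRMJudge, XPOConfig, XPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

trainer = XPOTrainer(
    model=model,
    judge=PairRMJudge(),  # a pairwise judge; reward functions could be used instead
    args=XPOConfig(output_dir="xpo-model"),  # hypothetical output path
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```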