No “emergent behavior / aha moment” when retraining GPT-2 on FineWeb; warmup / “warm training” guidance requested #889
Replies: 3 comments
[Screenshot] TensorBoard metrics showing the training loss, validation loss, learning rate, and tokens seen.
[Screenshot] Sample responses from the model after certain steps, completing a sentence from the given start context.
Thanks for sharing this very interesting discussion! Regarding your points:

A. This looks like a reasonable mod. I have a suggestion below regarding QK-Norm that might additionally help.

B. I would stick with it for now, but maybe in the next run you could print the large-gradient/high-loss samples to investigate further.

C. I am not sure if you'd see it with this small model, but it should match the published GPT-2, I'd say.

D. Usually this is done with re-warming. I briefly wrote about it here [https://magazine.sebastianraschka.com/p/tips-for-llm-pretraining-and-evaluating-rms] based on the "Simple and Scalable Strategies to Continually Pre-train Large Language Models" paper.

E. Yes, it would be reasonable to expect the same loss. What I would do is take a sample from a news article that wasn't in the training data, calculate the loss or perplexity for the base GPT-2 127M, and then do this periodically for your trained model: something that we know could not have been in the training data. For example, from a WSJ article today:
This would maybe help to more fairly compare the two models to each other.

PS: For reference, I am getting the following for:
- GPT-2 127M
- gpt2-medium (355M)
- Qwen 0.6B Base

(A minimal sketch of such a check follows below.)
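For reference, here is a minimal sketch of how such a held-out loss/perplexity check could look with Hugging Face transformers. The sample text is a placeholder, the Hub names are the public GPT-2 checkpoints, and you would run the same function on your own checkpoint for the comparison:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def held_out_loss_and_ppl(model_name, text):
    """Mean next-token cross-entropy and perplexity of a Hub model on a text snippet."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    # With labels=input_ids, the model returns the mean shifted cross-entropy loss
    inputs = tokenizer(text, return_tensors="pt")
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return loss.item(), torch.exp(loss).item()


# Placeholder text: use a paragraph from an article published after training
sample = "Replace this with a paragraph from a news article published today."

for name in ("openai-community/gpt2", "openai-community/gpt2-medium"):
    loss, ppl = held_out_loss_and_ppl(name, sample)
    print(f"{name}: loss={loss:.3f}, perplexity={ppl:.1f}")
```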
I have a few questions too if you don't mind:
Regarding tips: I agree that these spikes could come from the data. But that being said, there are maybe some improvements I would try. The important thing, if you have the budget and time, is to try one thing at a time so you can see where the differences come from. Some suggestions are:

F. I would probably remove dropout (or set it to 0.0); in my experience it doesn't help and may make things even worse.

G. Another one would be to add QK-Norm like in Qwen3 here: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/11_qwen3/standalone-qwen3.ipynb

```python
class MultiHeadAttention(nn.Module):
    def __init__(...):
        if qk_norm:
            self.q_norm = LayerNorm(head_dim)
            self.k_norm = LayerNorm(head_dim)

    def forward(...):
        # ...
        queries = self.q_norm(queries)
        keys = self.k_norm(keys)
        attn_scores = queries @ keys.transpose(2, 3)
        # ...
```

Usually, QK-Norm is nowadays implemented with RMSNorm, but for consistency, I would try LayerNorm first. Maybe it gets rid of the spikes (a fuller, self-contained sketch is included after this list).

H. I would also be curious how Qwen3 0.6B performs in terms of smoothness if it is not too large to run. You could technically shrink it by reducing the number of layers. The code there in the notebook should work as a drop-in replacement for the GPT model.

If all that doesn't work, it might be related to the dataset or to the optimizer and learning-rate schedule. But out of curiosity, I would try the things above first. I'd be curious what the results are.
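To make the QK-Norm suggestion in point G concrete, here is a self-contained sketch of a causal multi-head attention module with optional QK-Norm. The constructor arguments, masking, and scaling details are placeholder choices of mine, not the exact code from the Qwen3 notebook:

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Causal multi-head attention with optional QK-Norm (LayerNorm over head_dim)."""

    def __init__(self, d_in, d_out, num_heads, qk_norm=True):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.out_proj = nn.Linear(d_out, d_out)

        # QK-Norm: normalize queries and keys per head before the dot product
        if qk_norm:
            self.q_norm = nn.LayerNorm(self.head_dim)
            self.k_norm = nn.LayerNorm(self.head_dim)
        else:
            self.q_norm = nn.Identity()
            self.k_norm = nn.Identity()

    def forward(self, x):
        b, num_tokens, _ = x.shape

        # Project and reshape to (b, num_heads, num_tokens, head_dim)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # QK-Norm keeps attention logits in a bounded range, which can help with loss spikes
        queries = self.q_norm(queries)
        keys = self.k_norm(keys)

        attn_scores = queries @ keys.transpose(2, 3)

        # Causal mask: each position attends only to itself and earlier positions
        mask = torch.triu(
            torch.ones(num_tokens, num_tokens, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_scores = attn_scores.masked_fill(mask, float("-inf"))

        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)


# Example usage:
# attn = MultiHeadAttention(d_in=768, d_out=768, num_heads=12)
# out = attn(torch.randn(2, 16, 768))   # -> (2, 16, 768)
```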



Summary
I re-trained a GPT-2-style model from random initialization on the HuggingFaceFW/FineWeb dataset. Training and validation loss plateau around ~4 and don't exhibit a sudden drop, and evaluation generations repeat heavily. I'm looking for guidance on "warm training" parameters and optimization suggestions for training the model from scratch.
Environment & Model
- GPT-2-style model trained from scratch (GPT-2 tokenizer with <|endoftext|>)
Data
- HuggingFaceFW/fineweb (streaming)
- language_score ≥ 0.9
- <|endoftext|> inserted between documents
Dataloader (key settings)
- num_workers=4, pin_memory=True
- val_mod=100, fixed eval loaders for stable loss
Run stats (this run)
What I observe
Reproduction (minimal)
(Happy to post a full runnable script if helpful.)
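In the meantime, here is a rough sketch of the data pipeline as described above (FineWeb streaming, language_score ≥ 0.9 filter, <|endoftext|> between documents). The context length, batch size, and class name are assumptions rather than the actual reproduction script:

```python
import tiktoken
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, IterableDataset


class FineWebTokenDataset(IterableDataset):
    """Streams FineWeb, filters by language_score, and yields (input, target) blocks."""

    def __init__(self, context_length=1024, min_language_score=0.9):
        self.enc = tiktoken.get_encoding("gpt2")
        self.context_length = context_length
        self.min_language_score = min_language_score

    def __iter__(self):
        ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
        ds = ds.filter(lambda ex: ex["language_score"] >= self.min_language_score)

        buffer = []
        for example in ds:
            # Tokenize the document and append <|endoftext|> as a separator
            buffer.extend(self.enc.encode_ordinary(example["text"]))
            buffer.append(self.enc.eot_token)

            # Emit fixed-length blocks, with targets shifted by one token
            while len(buffer) > self.context_length:
                chunk = buffer[: self.context_length + 1]
                buffer = buffer[self.context_length:]
                yield torch.tensor(chunk[:-1]), torch.tensor(chunk[1:])


# num_workers=0 here for simplicity; the run above used num_workers=4, which
# additionally requires sharding the stream per worker (omitted for brevity)
train_loader = DataLoader(FineWebTokenDataset(), batch_size=8, num_workers=0, pin_memory=True)
```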
Questions / requests for guidance
A. Gradient normalization
I still couldn't avoid the sudden loss spikes; do you recommend other methods? For example, I could try to drop the update entirely if the training loss is too large (see the sketch below for what I mean).
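As a sketch of the skip idea (the spike_factor, EMA decay, and the surrounding model/optimizer/train_loader objects are placeholders, not my actual script):

```python
import torch
import torch.nn.functional as F

# model, optimizer, and train_loader are assumed to come from the existing training script
loss_ema = None      # running average of recent training losses
spike_factor = 3.0   # arbitrary threshold; tune for the run
ema_decay = 0.99

for input_batch, target_batch in train_loader:
    optimizer.zero_grad()
    logits = model(input_batch)
    loss = F.cross_entropy(logits.flatten(0, 1), target_batch.flatten())

    # Drop the update entirely if the loss spikes far above the running average
    if loss_ema is not None and loss.item() > spike_factor * loss_ema:
        continue

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    # Update the running average used as the spike reference
    loss_ema = loss.item() if loss_ema is None else ema_decay * loss_ema + (1 - ema_decay) * loss.item()
```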
B. Data
C. “Emergence” expectations
D. Warm training suggestions
I would pick up an existing pretrained model and train on top of it. Do you have learning-rate suggestions when doing this kind of warm training?
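For context, the kind of schedule I have in mind would re-warm the learning rate to a peak below the original pretraining peak and then decay it with cosine; all concrete numbers below are placeholders:

```python
import math

# Placeholder values; the original pretraining peak LR and the training budget determine these
peak_lr = 1e-4
min_lr = 1e-5
warmup_steps = 500
total_steps = 20_000


def rewarmed_lr(step):
    """Linear re-warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# In the training loop, before optimizer.step():
# for param_group in optimizer.param_groups:
#     param_group["lr"] = rewarmed_lr(global_step)
```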
E. Cross-entropy loss
What would be a good expectation for the loss?
- Random baseline: ≈ ln(50304) ≈ 10.83 nats
- GPT-2 ~124–128M (well trained on WebText-like data): ≈ 3.4–3.8 (PPL ≈ 30–45)
Given the above, is ~3.5–3.8 a reasonable validation-loss target for GPT-2-124M on FineWeb?
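A quick arithmetic check of those numbers (just the loss-to-perplexity relation, not a claim about what this run should reach):

```python
import math

vocab_size = 50304
print(math.log(vocab_size))      # ≈ 10.83 nats: loss of a uniform random predictor

for loss in (3.4, 3.5, 3.8):
    print(loss, math.exp(loss))  # perplexity ≈ 30.0, 33.1, 44.7
```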
Thank you!