
Conversation

@jlamypoirier
Collaborator

@jlamypoirier jlamypoirier commented Oct 15, 2025

✨ Description

Part of the data rework:

  • Drop the generic / gpt specialization of indexed dataset types. Instead, use generics to specify a sample type.
  • Add Sample and Batch constructs (placeholders for now)
  • Remove the tokenizer from GPTData, since it's rarely used, and never by the data itself. Instead, use separate tokenizers where needed (Fim, Preparator (already present), lm eval).
  • Start moving away from numpy in favor of torch so things are more uniform (ex. Sample and Batch both use torch tensors.)
  • Other minor tweaks.
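A minimal sketch of what the first two bullets might look like; the class and field names here (`Sample`, `Batch`, `IndexedDataset`, `token_ids`) are illustrative placeholders, not the actual Fast-LLM definitions:

```python
import dataclasses
import typing

import torch


@dataclasses.dataclass
class Sample:
    # Placeholder sample construct; holds torch tensors rather than numpy arrays.
    token_ids: torch.Tensor  # shape: (sequence_length,)


@dataclasses.dataclass
class Batch:
    # Placeholder batch construct; stacks samples along a new batch dimension.
    token_ids: torch.Tensor  # shape: (batch_size, sequence_length)

    @classmethod
    def from_samples(cls, samples: list[Sample]) -> "Batch":
        return cls(token_ids=torch.stack([s.token_ids for s in samples]))


SampleType = typing.TypeVar("SampleType", bound=Sample)


class IndexedDataset(typing.Generic[SampleType]):
    # Generic base replacing the gpt-specific specialization: the sample type
    # is a type parameter instead of a parallel subclass hierarchy.
    def __len__(self) -> int:
        raise NotImplementedError

    def get(self, index: int) -> SampleType:
        raise NotImplementedError
```

A concrete dataset would then subclass `IndexedDataset[LanguageModelSample]` (or similar) rather than a dedicated gpt dataset type.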

Note: since this is part of a bigger set of changes, it does contain changes that don't immediately make sense but will be useful later, as well as messy temporary solutions. (See #376 for more info on where this is going.)

@tscholak
Collaborator

Great to carve this out!

I'm wondering how to stage these changes. Switching to the new binary format is a breaking change, and we'd need to reprocess all currently used training data. This needs to be properly timed and announced. Do we have backwards compatibility for already processed data?

@jlamypoirier
Collaborator Author

> Great to carve this out!
>
> I'm wondering how to stage these changes. Switching to the new binary format is a breaking change, and we'd need to reprocess all currently used training data. This needs to be properly timed and announced. Do we have backwards compatibility for already processed data?

I'm hoping to have the new format ready this week and make a big announcement. I'm not currently planning backward compatibility (time issue), but could if that's a necessity.

@tscholak
Collaborator

Can we convert existing data to the new format? I could work on a simple converter tool. Old binary in, new binary out

@jlamypoirier jlamypoirier marked this pull request as ready for review October 16, 2025 03:52
@jlamypoirier
Collaborator Author

> Can we convert existing data to the new format? I could work on a simple converter tool. Old binary in, new binary out

I don't really think it's worth it; might as well just redo the preparation. To help with the transition, I'm noticing that the intermediate memmap dataset I'm making in #378 will essentially support the old binary format with the updated code, so I could just keep it for a while as a backward-compatibility backup. (Except for vision datasets, of course.)
