
Conversation

@jlamypoirier
Collaborator

@jlamypoirier jlamypoirier commented Oct 15, 2025

✨ Description

Part of the data rework:

  • Drop the generic / gpt specialization of indexed dataset types. Instead, use generics to specify a sample type.
  • Add Sample and Batch constructs (placeholders for now)
  • Remove the tokenizer from GPTData, since it's rarely used, and never by the data itself. Instead, use separate tokenizers where needed (Fim, Preparator (already present), lm eval).
  • Start moving away from numpy in favor of torch so things are more uniform (ex. Sample and Batch both use torch tensors.)
  • Other minor tweaks.
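A minimal sketch of what the first two bullets might look like; the class and field names here (`Sample`, `Batch`, `IndexedDataset`, `token_ids`) are illustrative placeholders, not the actual Fast-LLM definitions:

```python
import dataclasses
import typing

import torch


@dataclasses.dataclass
class Sample:
    # Placeholder sample construct; holds torch tensors rather than numpy arrays.
    token_ids: torch.Tensor  # shape: (sequence_length,)


@dataclasses.dataclass
class Batch:
    # Placeholder batch construct; stacks samples along a new batch dimension.
    token_ids: torch.Tensor  # shape: (batch_size, sequence_length)

    @classmethod
    def from_samples(cls, samples: list[Sample]) -> "Batch":
        return cls(token_ids=torch.stack([s.token_ids for s in samples]))


SampleType = typing.TypeVar("SampleType", bound=Sample)


class IndexedDataset(typing.Generic[SampleType]):
    # Generic base replacing the gpt-specific specialization: the sample type
    # is a type parameter instead of a parallel subclass hierarchy.
    def __len__(self) -> int:
        raise NotImplementedError

    def get(self, index: int) -> SampleType:
        raise NotImplementedError
```

A concrete dataset would then subclass `IndexedDataset[LanguageModelSample]` (or similar) rather than a dedicated gpt dataset type.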

Note: since this is part of a bigger set of changes, it does contain changes that don't immediately make sense but will be useful later, as well as messy temporary solutions. (See #376 for more info on where this is going.)

@tscholak
Collaborator

Great to carve this out!

I'm wondering how to stage these changes. Switching to the new binary format is a breaking change, and we'd need to reprocess all currently used training data. This needs to be properly timed and announced. Do we have backwards compatibility for already processed data?

@jlamypoirier
Collaborator Author

> Great to carve this out!
>
> I'm wondering how to stage these changes. Switching to the new binary format is a breaking change, and we'd need to reprocess all currently used training data. This needs to be properly timed and announced. Do we have backwards compatibility for already processed data?

I'm hoping to have the new format ready this week and make a big announcement. I'm not currently planning backward compatibility (time issue), but could if that's a necessity.

@tscholak
Collaborator

Can we convert existing data to the new format? I could work on a simple converter tool. Old binary in, new binary out

@jlamypoirier jlamypoirier marked this pull request as ready for review October 16, 2025 03:52
@jlamypoirier
Collaborator Author

> Can we convert existing data to the new format? I could work on a simple converter tool. Old binary in, new binary out

I don't really think it's worth it; might as well just redo the preparation. To help with the transition, I'm noticing that the intermediate memmap dataset I'm making in #378 will essentially support the old binary format with the updated code, so I could just keep it for a while as a backward-compatibility backup. (Except for vision datasets, of course.)
