
Commit 8e61377

Docs and more methods for IterableDataset: push_to_hub, to_parquet... (#7604)
docs and more methods
1 parent 784607d commit 8e61377

5 files changed: +343 −25 lines

docs/source/package_reference/main_classes.mdx

Lines changed: 5 additions & 0 deletions

@@ -175,7 +175,12 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
 - take
 - shard
 - repeat
+- to_csv
+- to_pandas
+- to_dict
+- to_json
 - to_parquet
+- to_sql
 - push_to_hub
 - load_state_dict
 - state_dict
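
For reference, the entries added above correspond to export and upload methods this commit makes available on [`IterableDataset`]. A minimal sketch of how they could be used on a streamed dataset follows; the repository id and file paths are placeholders, and it assumes the `IterableDataset` methods accept a path or repo id like their `Dataset` counterparts:

```python
from datasets import load_dataset

# Stream a dataset and keep only a small bounded sample
streamed = load_dataset("allenai/c4", "en", split="train", streaming=True).take(100)

# Export the streamed rows to local files (paths are placeholders)
streamed.to_json("c4_sample.jsonl")
streamed.to_parquet("c4_sample.parquet")

# Or progressively upload to the Hugging Face Hub (placeholder repo id)
streamed.push_to_hub("username/c4-sample")
```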

docs/source/process.mdx

Lines changed: 16 additions & 5 deletions

@@ -808,17 +808,28 @@ The example below uses the [`pydub`](http://pydub.com/) package to open an audio

 ## Save

-Once you are done processing your dataset, you can save and reuse it later with [`~Dataset.save_to_disk`].
+Once your dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].

-Save your dataset by providing the path to the directory you wish to save it to:
+Save your dataset by providing [`~IterableDataset.push_to_hub`] with the name of the dataset repository on Hugging Face you wish to save it to:

-```py
->>> encoded_dataset.save_to_disk("path/of/my/dataset/directory")
+```python
+encoded_dataset.push_to_hub("username/my_dataset")
 ```

-Use the [`load_from_disk`] function to reload the dataset:
+Use the [`load_dataset`] function to reload the dataset (in streaming mode or not):
+
+```python
+from datasets import load_dataset
+reloaded_dataset = load_dataset("username/my_dataset", streaming=True)
+```
+
+Alternatively, you can save it locally in Arrow format on disk. Compared to Parquet, Arrow is uncompressed, which makes it much faster to reload; this is great for local use on disk and ephemeral caching. But since it is larger and carries less metadata, it is slower to upload/download/query than Parquet and less suited for long-term storage.
+
+Use the [`~Dataset.save_to_disk`] and [`load_from_disk`] functions to save and reload the dataset from disk:

 ```py
+>>> encoded_dataset.save_to_disk("path/of/my/dataset/directory")
+>>> # later
 >>> from datasets import load_from_disk
 >>> reloaded_dataset = load_from_disk("path/of/my/dataset/directory")
 ```
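
The reworked Save section above notes that the pushed dataset can be reloaded "in streaming mode or not". As a small complementary sketch, using the same placeholder repo name and assuming the default `train` split created by `push_to_hub`:

```python
from datasets import load_dataset

# Regular (non-streaming) reload: data is downloaded and memory-mapped locally,
# so random access works on the resulting Dataset
reloaded = load_dataset("username/my_dataset", split="train")
print(reloaded[0])

# Streaming reload: iterate over rows without downloading the full dataset first
reloaded_stream = load_dataset("username/my_dataset", split="train", streaming=True)
print(next(iter(reloaded_stream)))
```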

docs/source/stream.mdx

Lines changed: 53 additions & 7 deletions

@@ -51,6 +51,17 @@ You can find more details in the [Dataset vs. IterableDataset guide](./about_map

 </Tip>

+
+## Column indexing
+
+Sometimes it is convenient to iterate over values of a specific column. Fortunately, an [`IterableDataset`] supports column indexing:
+```python
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("allenai/c4", "en", streaming=True, split="train")
+>>> print(next(iter(dataset["text"])))
+Beginners BBQ Class Taking Place in Missoula!...
+```
+
 ## Convert from a Dataset

 If you have an existing [`Dataset`] object, you can convert it to an [`IterableDataset`] with the [`~Dataset.to_iterable_dataset`] function. This is actually faster than setting the `streaming=True` argument in [`load_dataset`] because the data is streamed from local files.

@@ -495,12 +506,47 @@ Resuming returns exactly where the checkpoint was saved except if `.shuffle()` i

 </Tip>

-## Column indexing

-Sometimes it is convenient to iterate over values of a specific column. Fortunately, an [`IterableDataset`] supports column indexing:
+## Save
+
+Once your iterable dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].
+
+Save your dataset by providing [`~Dataset.push_to_hub`] with the name of the dataset repository on Hugging Face you wish to save it to. This iterates over the dataset and progressively uploads the data to Hugging Face:
+
 ```python
->>> from datasets import load_dataset
->>> dataset = load_dataset("allenai/c4", "en", streaming=True, split="train")
->>> print(next(iter(dataset["text"])))
-Beginners BBQ Class Taking Place in Missoula!...
-```
+dataset.push_to_hub("username/my_dataset")
+```
+
+Use the [`load_dataset`] function to reload the dataset:
+
+```python
+from datasets import load_dataset
+reloaded_dataset = load_dataset("username/my_dataset")
+```
+
+## Export
+
+🤗 Datasets supports exporting as well so you can work with your dataset in other applications. The following table shows currently supported file formats you can export to:
+
+| File type               | Export method                                                                                |
+|-------------------------|----------------------------------------------------------------------------------------------|
+| CSV                     | [`IterableDataset.to_csv`]                                                                   |
+| JSON                    | [`IterableDataset.to_json`]                                                                  |
+| Parquet                 | [`IterableDataset.to_parquet`]                                                               |
+| SQL                     | [`IterableDataset.to_sql`]                                                                   |
+| In-memory Python object | [`IterableDataset.to_pandas`], [`IterableDataset.to_polars`] or [`IterableDataset.to_dict`]  |
+
+For example, export your dataset to a CSV file like this:
+
+```py
+>>> dataset.to_csv("path/of/my/dataset.csv")
+```
+
+If you have a large dataset, you can save one file per shard, e.g.:
+
+```py
+>>> num_shards = dataset.num_shards
+>>> for index in range(num_shards):
+...     shard = dataset.shard(num_shards=num_shards, index=index)
+...     shard.to_parquet(f"path/of/my/dataset/data-{index:05d}.parquet")
+```
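
The export table above lists the in-memory conversions (`to_pandas`, `to_polars`, `to_dict`) without an example. A hedged sketch, assuming these `IterableDataset` methods mirror their `Dataset` counterparts and that the streamed dataset is first bounded with `.take()` so it fits in memory:

```python
# `dataset` is the streamed allenai/c4 dataset from earlier in this guide
small = dataset.take(1_000)  # bound the stream so materializing it is cheap

df = small.to_pandas()   # pandas.DataFrame holding the 1,000 rows
table = small.to_dict()  # dict mapping column names to lists of values

print(df.shape, list(table))
```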

src/datasets/arrow_dataset.py

Lines changed: 10 additions & 7 deletions

@@ -4937,12 +4937,15 @@ def to_csv(
             **to_csv_kwargs,
         ).write()

-    def to_dict(self, batch_size: Optional[int] = None) -> Union[dict, Iterator[dict]]:
+    def to_dict(self, batch_size: Optional[int] = None, batched: bool = False) -> Union[dict, Iterator[dict]]:
         """Returns the dataset as a Python dict. Can also return a generator for large datasets.

         Args:
             batch_size (`int`, *optional*): The size (number of rows) of the batches if `batched` is `True`.
                 Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`.
+            batched (`bool`):
+                Set to `True` to return a generator that yields the dataset as batches
+                of `batch_size` rows. Defaults to `False` (returns the whole datasets once).

         Returns:
             `dict` or `Iterator[dict]`

@@ -5045,12 +5048,12 @@ def to_pandas(
         """Returns the dataset as a `pandas.DataFrame`. Can also return a generator for large datasets.

         Args:
-            batched (`bool`):
-                Set to `True` to return a generator that yields the dataset as batches
-                of `batch_size` rows. Defaults to `False` (returns the whole datasets once).
             batch_size (`int`, *optional*):
                 The size (number of rows) of the batches if `batched` is `True`.
                 Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`.
+            batched (`bool`):
+                Set to `True` to return a generator that yields the dataset as batches
+                of `batch_size` rows. Defaults to `False` (returns the whole datasets once).

         Returns:
             `pandas.DataFrame` or `Iterator[pandas.DataFrame]`

@@ -5088,12 +5091,12 @@ def to_polars(
         """Returns the dataset as a `polars.DataFrame`. Can also return a generator for large datasets.

         Args:
-            batched (`bool`):
-                Set to `True` to return a generator that yields the dataset as batches
-                of `batch_size` rows. Defaults to `False` (returns the whole datasets once).
             batch_size (`int`, *optional*):
                 The size (number of rows) of the batches if `batched` is `True`.
                 Defaults to `genomicsml.datasets.config.DEFAULT_MAX_BATCH_SIZE`.
+            batched (`bool`):
+                Set to `True` to return a generator that yields the dataset as batches
+                of `batch_size` rows. Defaults to `False` (returns the whole datasets once).
             schema_overrides (`dict`, *optional*):
                 Support type specification or override of one or more columns; note that
                 any dtypes inferred from the schema param will be overridden.
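
The docstring changes above document the `batched` flag for `Dataset.to_dict` and align its position for `to_pandas`/`to_polars`. A short sketch of the generator behaviour they describe, using an arbitrary public dataset as a stand-in:

```python
from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train")  # any Dataset works here

# batched=False (default): one dict for the whole dataset
whole = ds.to_dict()

# batched=True: a generator yielding one dict per batch of `batch_size` rows
for batch in ds.to_dict(batched=True, batch_size=1000):
    first_column = next(iter(batch.values()))
    print(len(first_column))  # at most 1000 values per batch

# the same flag exists on to_pandas / to_polars, yielding one DataFrame per batch
for df in ds.to_pandas(batched=True, batch_size=1000):
    print(df.shape)
```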
