
Commit 8e61377

Docs and more methods for IterableDataset: push_to_hub, to_parquet... (#7604)
docs and more methods
1 parent 784607d commit 8e61377

5 files changed: +343 −25 lines

docs/source/package_reference/main_classes.mdx

Lines changed: 5 additions & 0 deletions

@@ -175,7 +175,12 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
 - take
 - shard
 - repeat
+- to_csv
+- to_pandas
+- to_dict
+- to_json
 - to_parquet
+- to_sql
 - push_to_hub
 - load_state_dict
 - state_dict
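
For reference, the entries added above correspond to export and upload methods this commit makes available on [`IterableDataset`]. A minimal sketch of how they could be used on a streamed dataset follows; the repository id and file paths are placeholders, and it assumes the `IterableDataset` methods accept a path or repo id like their `Dataset` counterparts:

```python
from datasets import load_dataset

# Stream a dataset and keep only a small bounded sample
streamed = load_dataset("allenai/c4", "en", split="train", streaming=True).take(100)

# Export the streamed rows to local files (paths are placeholders)
streamed.to_json("c4_sample.jsonl")
streamed.to_parquet("c4_sample.parquet")

# Or progressively upload to the Hugging Face Hub (placeholder repo id)
streamed.push_to_hub("username/c4-sample")
```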

docs/source/process.mdx

Lines changed: 16 additions & 5 deletions

@@ -808,17 +808,28 @@ The example below uses the [`pydub`](http://pydub.com/) package to open an audio

 ## Save

-Once you are done processing your dataset, you can save and reuse it later with [`~Dataset.save_to_disk`].
+Once your dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].

-Save your dataset by providing the path to the directory you wish to save it to:
+Save your dataset by providing [`~IterableDataset.push_to_hub`] with the name of the dataset repository on Hugging Face you wish to save it to:

-```py
->>> encoded_dataset.save_to_disk("path/of/my/dataset/directory")
+```python
+encoded_dataset.push_to_hub("username/my_dataset")
 ```

-Use the [`load_from_disk`] function to reload the dataset:
+Use the [`load_dataset`] function to reload the dataset (in streaming mode or not):
+
+```python
+from datasets import load_dataset
+reloaded_dataset = load_dataset("username/my_dataset", streaming=True)
+```
+
+Alternatively, you can save it locally in Arrow format on disk. Compared to Parquet, Arrow is uncompressed, which makes it much faster to reload; this is great for local use on disk and ephemeral caching. But since it is larger and carries less metadata, it is slower to upload/download/query than Parquet and less suited for long-term storage.
+
+Use the [`~Dataset.save_to_disk`] and [`load_from_disk`] functions to save and reload the dataset from disk:

 ```py
+>>> encoded_dataset.save_to_disk("path/of/my/dataset/directory")
+>>> # later
 >>> from datasets import load_from_disk
 >>> reloaded_dataset = load_from_disk("path/of/my/dataset/directory")
 ```
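
The reworked Save section above notes that the pushed dataset can be reloaded "in streaming mode or not". As a small complementary sketch, using the same placeholder repo name and assuming the default `train` split created by `push_to_hub`:

```python
from datasets import load_dataset

# Regular (non-streaming) reload: data is downloaded and memory-mapped locally,
# so random access works on the resulting Dataset
reloaded = load_dataset("username/my_dataset", split="train")
print(reloaded[0])

# Streaming reload: iterate over rows without downloading the full dataset first
reloaded_stream = load_dataset("username/my_dataset", split="train", streaming=True)
print(next(iter(reloaded_stream)))
```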

docs/source/stream.mdx

Lines changed: 53 additions & 7 deletions

@@ -51,6 +51,17 @@ You can find more details in the [Dataset vs. IterableDataset guide](./about_map

 </Tip>

+
+## Column indexing
+
+Sometimes it is convenient to iterate over values of a specific column. Fortunately, an [`IterableDataset`] supports column indexing:
+```python
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("allenai/c4", "en", streaming=True, split="train")
+>>> print(next(iter(dataset["text"])))
+Beginners BBQ Class Taking Place in Missoula!...
+```
+
 ## Convert from a Dataset

 If you have an existing [`Dataset`] object, you can convert it to an [`IterableDataset`] with the [`~Dataset.to_iterable_dataset`] function. This is actually faster than setting the `streaming=True` argument in [`load_dataset`] because the data is streamed from local files.

@@ -495,12 +506,47 @@ Resuming returns exactly where the checkpoint was saved except if `.shuffle()` i

 </Tip>

-## Column indexing

-Sometimes it is convenient to iterate over values of a specific column. Fortunately, an [`IterableDataset`] supports column indexing:
+## Save
+
+Once your iterable dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].
+
+Save your dataset by providing [`~Dataset.push_to_hub`] with the name of the dataset repository on Hugging Face you wish to save it to. This iterates over the dataset and progressively uploads the data to Hugging Face:
+
 ```python
->>> from datasets import load_dataset
->>> dataset = load_dataset("allenai/c4", "en", streaming=True, split="train")
->>> print(next(iter(dataset["text"])))
-Beginners BBQ Class Taking Place in Missoula!...
-```
+dataset.push_to_hub("username/my_dataset")
+```
+
+Use the [`load_dataset`] function to reload the dataset:
+
+```python
+from datasets import load_dataset
+reloaded_dataset = load_dataset("username/my_dataset")
+```
+
+## Export
+
+🤗 Datasets supports exporting as well so you can work with your dataset in other applications. The following table shows currently supported file formats you can export to:
+
+| File type               | Export method                                                                                |
+|-------------------------|----------------------------------------------------------------------------------------------|
+| CSV                     | [`IterableDataset.to_csv`]                                                                   |
+| JSON                    | [`IterableDataset.to_json`]                                                                  |
+| Parquet                 | [`IterableDataset.to_parquet`]                                                               |
+| SQL                     | [`IterableDataset.to_sql`]                                                                   |
+| In-memory Python object | [`IterableDataset.to_pandas`], [`IterableDataset.to_polars`] or [`IterableDataset.to_dict`]  |
+
+For example, export your dataset to a CSV file like this:
+
+```py
+>>> dataset.to_csv("path/of/my/dataset.csv")
+```
+
+If you have a large dataset, you can save one file per shard, e.g.:
+
+```py
+>>> num_shards = dataset.num_shards
+>>> for index in range(num_shards):
+...     shard = dataset.shard(num_shards=num_shards, index=index)
+...     shard.to_parquet(f"path/of/my/dataset/data-{index:05d}.parquet")
+```
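
The export table above lists the in-memory conversions (`to_pandas`, `to_polars`, `to_dict`) without an example. A hedged sketch, assuming these `IterableDataset` methods mirror their `Dataset` counterparts and that the streamed dataset is first bounded with `.take()` so it fits in memory:

```python
# `dataset` is the streamed allenai/c4 dataset from earlier in this guide
small = dataset.take(1_000)  # bound the stream so materializing it is cheap

df = small.to_pandas()   # pandas.DataFrame holding the 1,000 rows
table = small.to_dict()  # dict mapping column names to lists of values

print(df.shape, list(table))
```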

src/datasets/arrow_dataset.py

Lines changed: 10 additions & 7 deletions

@@ -4937,12 +4937,15 @@ def to_csv(
             **to_csv_kwargs,
         ).write()

-    def to_dict(self, batch_size: Optional[int] = None) -> Union[dict, Iterator[dict]]:
+    def to_dict(self, batch_size: Optional[int] = None, batched: bool = False) -> Union[dict, Iterator[dict]]:
         """Returns the dataset as a Python dict. Can also return a generator for large datasets.

         Args:
             batch_size (`int`, *optional*): The size (number of rows) of the batches if `batched` is `True`.
                 Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`.
+            batched (`bool`):
+                Set to `True` to return a generator that yields the dataset as batches
+                of `batch_size` rows. Defaults to `False` (returns the whole datasets once).

         Returns:
             `dict` or `Iterator[dict]`

@@ -5045,12 +5048,12 @@ def to_pandas(
         """Returns the dataset as a `pandas.DataFrame`. Can also return a generator for large datasets.

         Args:
-            batched (`bool`):
-                Set to `True` to return a generator that yields the dataset as batches
-                of `batch_size` rows. Defaults to `False` (returns the whole datasets once).
             batch_size (`int`, *optional*):
                 The size (number of rows) of the batches if `batched` is `True`.
                 Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`.
+            batched (`bool`):
+                Set to `True` to return a generator that yields the dataset as batches
+                of `batch_size` rows. Defaults to `False` (returns the whole datasets once).

         Returns:
             `pandas.DataFrame` or `Iterator[pandas.DataFrame]`

@@ -5088,12 +5091,12 @@ def to_polars(
         """Returns the dataset as a `polars.DataFrame`. Can also return a generator for large datasets.

         Args:
-            batched (`bool`):
-                Set to `True` to return a generator that yields the dataset as batches
-                of `batch_size` rows. Defaults to `False` (returns the whole datasets once).
             batch_size (`int`, *optional*):
                 The size (number of rows) of the batches if `batched` is `True`.
                 Defaults to `genomicsml.datasets.config.DEFAULT_MAX_BATCH_SIZE`.
+            batched (`bool`):
+                Set to `True` to return a generator that yields the dataset as batches
+                of `batch_size` rows. Defaults to `False` (returns the whole datasets once).
             schema_overrides (`dict`, *optional*):
                 Support type specification or override of one or more columns; note that
                 any dtypes inferred from the schema param will be overridden.
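
The docstring changes above document the `batched` flag for `Dataset.to_dict` and align its position for `to_pandas`/`to_polars`. A short sketch of the generator behaviour they describe, using an arbitrary public dataset as a stand-in:

```python
from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train")  # any Dataset works here

# batched=False (default): one dict for the whole dataset
whole = ds.to_dict()

# batched=True: a generator yielding one dict per batch of `batch_size` rows
for batch in ds.to_dict(batched=True, batch_size=1000):
    first_column = next(iter(batch.values()))
    print(len(first_column))  # at most 1000 values per batch

# the same flag exists on to_pandas / to_polars, yielding one DataFrame per batch
for df in ds.to_pandas(batched=True, batch_size=1000):
    print(df.shape)
```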
