
Commit e740409

Merge branch 'main' into fix-build-kwarg-conflict
2 parents 61de0c5 + e199f19 commit e740409

27 files changed: +312 -168 lines changed

docs/source/about_dataset_features.mdx

Lines changed: 9 additions & 9 deletions
@@ -10,10 +10,10 @@ Let's have a look at the features of the MRPC dataset from the GLUE benchmark:
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train')
 >>> dataset.features
-{'idx': Value(dtype='int32', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
+{'idx': Value(dtype='int32'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
 }
 ```

@@ -38,11 +38,11 @@ If your data type contains a list of objects, then you want to use the [`Sequenc
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('rajpurkar/squad', split='train')
 >>> dataset.features
-{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
- 'context': Value(dtype='string', id=None),
- 'id': Value(dtype='string', id=None),
- 'question': Value(dtype='string', id=None),
- 'title': Value(dtype='string', id=None)}
+{'answers': Sequence(feature={'text': Value(dtype='string'), 'answer_start': Value(dtype='int32')}, length=-1),
+ 'context': Value(dtype='string'),
+ 'id': Value(dtype='string'),
+ 'question': Value(dtype='string'),
+ 'title': Value(dtype='string')}
 ```

 The `answers` field is constructed using the [`Sequence`] feature because it contains two subfields, `text` and `answer_start`, which are lists of `string` and `int32`, respectively.
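For context on the feature types touched by this hunk, here is a minimal sketch of building the same schemas by hand with `Features`, `Value`, `Sequence`, and `ClassLabel` (field names simply mirror the docs examples above; this snippet is not part of the diff):

```python
from datasets import ClassLabel, Features, Sequence, Value

# SQuAD-style schema with a nested Sequence field, as in the second hunk.
squad_features = Features({
    "id": Value("string"),
    "title": Value("string"),
    "context": Value("string"),
    "question": Value("string"),
    "answers": Sequence({"text": Value("string"), "answer_start": Value("int32")}),
})

# MRPC-style schema, as in the first hunk.
mrpc_features = Features({
    "idx": Value("int32"),
    "label": ClassLabel(names=["not_equivalent", "equivalent"]),
    "sentence1": Value("string"),
    "sentence2": Value("string"),
})
```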

docs/source/access.mdx

Lines changed: 2 additions & 2 deletions
@@ -54,7 +54,7 @@ You can combine row and column name indexing to return a specific value at a pos
 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
 ```

-But it is important to remember that indexing order matters, especially when working with large audio and image datasets. Indexing by the column name returns all the values in the column first, then loads the value at that position. For large datasets, it may be slower to index by the column name first.
+Indexing order doesn't matter. Indexing by the column name first returns a [`Column`] object that you can index as usual with row indices:

 ```py
 >>> import time
@@ -69,7 +69,7 @@ Elapsed time: 0.0031 seconds
 >>> text = dataset["text"][0]
 >>> end_time = time.time()
 >>> print(f"Elapsed time: {end_time - start_time:.4f} seconds")
-Elapsed time: 0.0094 seconds
+Elapsed time: 0.0042 seconds
 ```

 ### Slicing
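A hedged illustration of the behavior the new wording describes (not part of the diff; the dataset id is an assumed example with a "text" column):

```python
from datasets import load_dataset

# Any dataset with a "text" column works the same way; this repo id is just an example.
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

column = dataset["text"]    # returns a Column view rather than materializing every row
first_text = column[0]      # row access on the Column, like indexing the dataset directly
```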

docs/source/index.mdx

Lines changed: 2 additions & 2 deletions
@@ -2,9 +2,9 @@

 <img class="float-left !m-0 !border-0 !dark:border-0 !shadow-none !max-w-lg w-[150px]" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/datasets_logo.png"/>

-🤗 Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
+🤗 Datasets is a library for easily accessing and sharing AI datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

-Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the [Hugging Face Hub](https://huggingface.co/datasets), allowing you to easily load and share a dataset with the wider machine learning community.
+Load a dataset in a single line of code, and use our powerful data processing and streaming methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the [Hugging Face Hub](https://huggingface.co/datasets), allowing you to easily load and share a dataset with the wider machine learning community.

 Find your dataset today on the [Hugging Face Hub](https://huggingface.co/datasets), and take an in-depth look inside of it with the live viewer.

docs/source/load_hub.mdx

Lines changed: 2 additions & 2 deletions
@@ -20,8 +20,8 @@ Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 n

 # Inspect dataset features
 >>> ds_builder.info.features
-{'label': ClassLabel(names=['neg', 'pos'], id=None),
- 'text': Value(dtype='string', id=None)}
+{'label': ClassLabel(names=['neg', 'pos']),
+ 'text': Value(dtype='string')}
 ```

 If you're happy with the dataset, then load it with [`load_dataset`]:
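For reference, a minimal sketch of how the `ds_builder.info.features` output above is obtained (the repository id is an assumption based on the Rotten Tomatoes example quoted in the hunk header):

```python
from datasets import load_dataset_builder

# Repo id assumed from the guide's Rotten Tomatoes example.
ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")

# Inspect dataset metadata without downloading the data itself.
print(ds_builder.info.description)
print(ds_builder.info.features)
```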

docs/source/loading.mdx

Lines changed: 2 additions & 2 deletions
@@ -417,6 +417,6 @@ Now when you look at your dataset features, you can see it uses the custom label

 ```py
 >>> dataset['train'].features
-{'text': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}
+{'text': Value(dtype='string'),
+ 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'])}
 ```
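A minimal sketch of how such custom label features are attached at load time (the file name and format here are hypothetical; the label names come from the hunk above):

```python
from datasets import ClassLabel, Features, Value, load_dataset

# Hypothetical local CSV with "text" and "label" columns.
emotion_features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["sadness", "joy", "love", "anger", "fear", "surprise"]),
})

dataset = load_dataset("csv", data_files="train.csv", features=emotion_features)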

docs/source/package_reference/main_classes.mdx

Lines changed: 4 additions & 0 deletions
@@ -112,6 +112,8 @@ The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table.

 [[autodoc]] datasets.is_caching_enabled

+[[autodoc]] datasets.Column
+
 ## DatasetDict

 Dictionary with split names as keys ('train', 'test' for example), and `Dataset` objects as values.
@@ -200,6 +202,8 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
 - supervised_keys
 - version

+[[autodoc]] datasets.IterableColumn
+
 ## IterableDatasetDict

 Dictionary with split names as keys ('train', 'test' for example), and `IterableDataset` objects as values.
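The two new autodoc entries document the lazy column views. A hedged sketch of how they typically surface (exact behavior depends on the `datasets` version this commit targets):

```python
from datasets import load_dataset

# Map-style dataset: column access yields a datasets.Column view.
ds = load_dataset("nyu-mll/glue", "mrpc", split="train")
sentences = ds["sentence1"]
print(sentences[0])

# Streaming dataset: column access yields a datasets.IterableColumn you can iterate over.
ids = load_dataset("nyu-mll/glue", "mrpc", split="train", streaming=True)["idx"]
for idx in ids:
    print(idx)
    break
```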

docs/source/process.mdx

Lines changed: 22 additions & 16 deletions
@@ -223,21 +223,21 @@ The [`~Dataset.cast`] function transforms the feature type of one or more column

 ```py
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'idx': Value(dtype='int32', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'idx': Value(dtype='int32')}

 >>> from datasets import ClassLabel, Value
 >>> new_features = dataset.features.copy()
 >>> new_features["label"] = ClassLabel(names=["negative", "positive"])
 >>> new_features["idx"] = Value("int64")
 >>> dataset = dataset.cast(new_features)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['negative', 'positive'], id=None),
- 'idx': Value(dtype='int64', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['negative', 'positive']),
+ 'idx': Value(dtype='int64')}
 ```

 <Tip>
@@ -250,11 +250,11 @@ Use the [`~Dataset.cast_column`] function to change the feature type of a single

 ```py
 >>> dataset.features
-{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
+{'audio': Audio(sampling_rate=44100, mono=True)}

 >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
 >>> dataset.features
-{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
+{'audio': Audio(sampling_rate=16000, mono=True)}
 ```

 ### Flatten
@@ -265,11 +265,11 @@ Sometimes a column can be a nested structure of several types. Take a look at th
 >>> from datasets import load_dataset
 >>> dataset = load_dataset("rajpurkar/squad", split="train")
 >>> dataset.features
-{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
- 'context': Value(dtype='string', id=None),
- 'id': Value(dtype='string', id=None),
- 'question': Value(dtype='string', id=None),
- 'title': Value(dtype='string', id=None)}
+{'answers': Sequence(feature={'text': Value(dtype='string'), 'answer_start': Value(dtype='int32')}, length=-1),
+ 'context': Value(dtype='string'),
+ 'id': Value(dtype='string'),
+ 'question': Value(dtype='string'),
+ 'title': Value(dtype='string')}
 ```

 The `answers` field contains two subfields: `text` and `answer_start`. Use the [`~Dataset.flatten`] function to extract the subfields into their own separate columns:
@@ -810,12 +810,18 @@ The example below uses the [`pydub`](http://pydub.com/) package to open an audio

 Once your dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].

-Save your dataset by providing the name of the dataset repository on Hugging Face you wish to save it to to [`~IterableDataset.push_to_hub`]:
+Save your dataset by providing the name of the dataset repository on Hugging Face you wish to save it to to [`~Dataset.push_to_hub`]:

 ```python
 encoded_dataset.push_to_hub("username/my_dataset")
 ```

+You can use multiple processes to upload it in parallel. This is especially useful if you want to speed up the process:
+
+```python
+dataset.push_to_hub("username/my_dataset", num_proc=8)
+```
+
 Use the [`load_dataset`] function to reload the dataset (in streaming mode or not):

 ```python
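Not part of the diff, but a short sketch of the `flatten` step referenced in the hunk above, assuming the same SQuAD dataset:

```python
from datasets import load_dataset

dataset = load_dataset("rajpurkar/squad", split="train")

# Nested 'answers' subfields become top-level 'answers.text' and 'answers.answer_start' columns.
flat_dataset = dataset.flatten()
print(flat_dataset.column_names)
```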

docs/source/quickstart.mdx

Lines changed: 6 additions & 5 deletions
@@ -312,9 +312,9 @@ Use the [`~Dataset.map`] function to speed up processing by applying your tokeni
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0,
-'input_ids': array([ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102]),
-'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
-'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
+'input_ids': [ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 0, 0, ...],
+'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...],
+'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...]}
 ```

 **4**. Rename the `label` column to `labels`, which is the expected input name in [BertForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForSequenceClassification):
@@ -327,12 +327,13 @@ Use the [`~Dataset.map`] function to speed up processing by applying your tokeni

 <frameworkcontent>
 <pt>
-Use the [`~Dataset.set_format`] function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):
+Use the [`~Dataset.with_format`] function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):

 ```py
 >>> import torch

->>> dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
+>>> dataset = dataset.select_columns(["input_ids", "token_type_ids", "attention_mask", "labels"])
+>>> dataset = dataset.with_format(type="torch")
 >>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
 ```
 </pt>
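A hedged end-to-end sketch of the updated PyTorch flow (the checkpoint and batch size follow the quickstart's MRPC example; the tokenizer arguments are assumptions):

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

# Checkpoint assumed from the quickstart guide.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

dataset = load_dataset("nyu-mll/glue", "mrpc", split="train")
dataset = dataset.map(
    lambda examples: tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True),
    batched=True,
)
dataset = dataset.rename_column("label", "labels")

# Updated flow: select the model inputs, then format them as torch tensors on the fly.
dataset = dataset.select_columns(["input_ids", "token_type_ids", "attention_mask", "labels"])
dataset = dataset.with_format("torch")

dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
```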

docs/source/stream.mdx

Lines changed: 16 additions & 10 deletions
@@ -241,21 +241,21 @@ When you need to remove one or more columns, give [`IterableDataset.remove_colum
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train', streaming=True)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'idx': Value(dtype='int32', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'idx': Value(dtype='int32')}

 >>> from datasets import ClassLabel, Value
 >>> new_features = dataset.features.copy()
 >>> new_features["label"] = ClassLabel(names=['negative', 'positive'])
 >>> new_features["idx"] = Value('int64')
 >>> dataset = dataset.cast(new_features)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['negative', 'positive'], id=None),
- 'idx': Value(dtype='int64', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['negative', 'positive']),
+ 'idx': Value(dtype='int64')}
 ```

 <Tip>
@@ -268,11 +268,11 @@ Use [`IterableDataset.cast_column`] to change the feature type of just one colum

 ```py
 >>> dataset.features
-{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
+{'audio': Audio(sampling_rate=44100, mono=True)}

 >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
 >>> dataset.features
-{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
+{'audio': Audio(sampling_rate=16000, mono=True)}
 ```

 ## Map
@@ -517,6 +517,12 @@ Save your dataset by providing the name of the dataset repository on Hugging Fac
 dataset.push_to_hub("username/my_dataset")
 ```

+If the dataset consists of multiple shards (`dataset.num_shards > 1`), you can use multiple processes to upload it in parallel. This is especially useful if you applied `map()` or `filter()` steps since they will run faster in parallel:
+
+```python
+dataset.push_to_hub("username/my_dataset", num_proc=8)
+```
+
 Use the [`load_dataset`] function to reload the dataset:

 ```python
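As a hedged sketch of the streaming `cast_column` step above (the audio dataset id is an assumption, not part of this commit):

```python
from datasets import Audio, load_dataset

# Assumed audio dataset; any streaming dataset with an "audio" column works the same way.
dataset = load_dataset("PolyAI/minds14", "en-US", split="train", streaming=True)

# Resample on the fly without downloading the full dataset.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
print(dataset.features)
```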

docs/source/use_dataset.mdx

Lines changed: 47 additions & 8 deletions
@@ -175,22 +175,61 @@ Most image models expect the image to be in the RGB mode. The Beans images are a
 >>> dataset = dataset.cast_column("image", Image(mode="RGB"))
 ```

-**3**. Now, you can apply some transforms to the image. Feel free to take a look at the [various transforms available](https://pytorch.org/vision/stable/auto_examples/plot_transforms.html#sphx-glr-auto-examples-plot-transforms-py) in torchvision and choose one you'd like to experiment with. This example applies a transform that randomly rotates the image:
+**3**. Now let's apply data augmentations to your images. 🤗 Datasets works with any augmentation library, and in this example we'll use Albumentations.
+
+[Albumentations](https://albumentations.ai) is a popular image augmentation library that provides a [rich set of transforms](https://albumentations.ai/docs/reference/supported-targets-by-transform/) including spatial-level transforms, pixel-level transforms, and mixing-level transforms.
+
+Install Albumentations:
+
+```bash
+pip install albumentations
+```
+
+**4**. Create a typical augmentation pipeline with Albumentations:

 ```py
->>> from torchvision.transforms import RandomRotation
+>>> import albumentations as A
+>>> import numpy as np
+>>> from PIL import Image
+
+>>> transform = A.Compose([
+...     A.RandomCrop(height=256, width=256, pad_if_needed=True, p=1),
+...     A.HorizontalFlip(p=0.5),
+...     A.ColorJitter(p=0.5)
+... ])
+```
+
+**5**. Since 🤗 Datasets uses PIL images but Albumentations expects NumPy arrays, you need to convert between formats:

->>> rotate = RandomRotation(degrees=(0, 90))
->>> def transforms(examples):
-...     examples["pixel_values"] = [rotate(image) for image in examples["image"]]
+```py
+>>> def albumentations_transforms(examples):
+...     # Apply Albumentations transforms
+...     transformed_images = []
+...     for image in examples["image"]:
+...         # Convert PIL to numpy array (OpenCV format)
+...         image_np = np.array(image.convert("RGB"))
+...
+...         # Apply Albumentations transforms
+...         transformed_image = transform(image=image_np)["image"]
+...
+...         # Convert back to PIL Image
+...         pil_image = Image.fromarray(transformed_image)
+...         transformed_images.append(pil_image)
+...
+...     examples["pixel_values"] = transformed_images
 ...     return examples
 ```

-**4**. Use the [`~Dataset.set_transform`] function to apply the transform on-the-fly. When you index into the image `pixel_values`, the transform is applied, and your image gets rotated.
+**6**. Apply the transform using [`~Dataset.with_transform`]:

 ```py
->>> dataset.set_transform(transforms)
+>>> dataset = dataset.with_transform(albumentations_transforms)
 >>> dataset[0]["pixel_values"]
 ```

-**5**. The dataset is now ready for training with your machine learning framework!
+**Key points when using Albumentations with 🤗 Datasets:**
+- Convert PIL images to NumPy arrays before applying transforms
+- Albumentations returns a dictionary with the transformed image under the "image" key
+- Convert the result back to PIL format after transformation
+
+**7**. The dataset is now ready for training with your machine learning framework!
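To round off the new Albumentations walkthrough, a hedged sketch of batching the augmented images for training (the `labels` column name comes from the Beans dataset used in this guide; the scaling choice is an assumption, and `dataset` is the transformed dataset from the steps above):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def collate_fn(batch):
    # Stack the augmented PIL images into a float tensor batch (CHW, scaled to [0, 1]).
    pixel_values = torch.stack([
        torch.from_numpy(np.array(example["pixel_values"])).permute(2, 0, 1).float() / 255.0
        for example in batch
    ])
    labels = torch.tensor([example["labels"] for example in batch])
    return {"pixel_values": pixel_values, "labels": labels}

# `dataset` is the Beans dataset with the Albumentations transform applied via with_transform.
dataloader = DataLoader(dataset, batch_size=8, collate_fn=collate_fn)
```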
