
Commit e740409

Merge branch 'main' into fix-build-kwarg-conflict
2 parents 61de0c5 + e199f19 commit e740409

27 files changed: +312 -168 lines changed

docs/source/about_dataset_features.mdx

Lines changed: 9 additions & 9 deletions
@@ -10,10 +10,10 @@ Let's have a look at the features of the MRPC dataset from the GLUE benchmark:
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train')
 >>> dataset.features
-{'idx': Value(dtype='int32', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
+{'idx': Value(dtype='int32'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
 }
 ```

@@ -38,11 +38,11 @@ If your data type contains a list of objects, then you want to use the [`Sequenc
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('rajpurkar/squad', split='train')
 >>> dataset.features
-{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
- 'context': Value(dtype='string', id=None),
- 'id': Value(dtype='string', id=None),
- 'question': Value(dtype='string', id=None),
- 'title': Value(dtype='string', id=None)}
+{'answers': Sequence(feature={'text': Value(dtype='string'), 'answer_start': Value(dtype='int32')}, length=-1),
+ 'context': Value(dtype='string'),
+ 'id': Value(dtype='string'),
+ 'question': Value(dtype='string'),
+ 'title': Value(dtype='string')}
 ```

 The `answers` field is constructed using the [`Sequence`] feature because it contains two subfields, `text` and `answer_start`, which are lists of `string` and `int32`, respectively.
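For context on the feature types touched by this hunk, here is a minimal sketch of building the same schemas by hand with `Features`, `Value`, `Sequence`, and `ClassLabel` (field names simply mirror the docs examples above; this snippet is not part of the diff):

```python
from datasets import ClassLabel, Features, Sequence, Value

# SQuAD-style schema with a nested Sequence field, as in the second hunk.
squad_features = Features({
    "id": Value("string"),
    "title": Value("string"),
    "context": Value("string"),
    "question": Value("string"),
    "answers": Sequence({"text": Value("string"), "answer_start": Value("int32")}),
})

# MRPC-style schema, as in the first hunk.
mrpc_features = Features({
    "idx": Value("int32"),
    "label": ClassLabel(names=["not_equivalent", "equivalent"]),
    "sentence1": Value("string"),
    "sentence2": Value("string"),
})
```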

docs/source/access.mdx

Lines changed: 2 additions & 2 deletions
@@ -54,7 +54,7 @@ You can combine row and column name indexing to return a specific value at a pos
 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
 ```

-But it is important to remember that indexing order matters, especially when working with large audio and image datasets. Indexing by the column name returns all the values in the column first, then loads the value at that position. For large datasets, it may be slower to index by the column name first.
+Indexing order doesn't matter. Indexing by the column name first returns a [`Column`] object that you can index as usual with row indices:

 ```py
 >>> import time
@@ -69,7 +69,7 @@ Elapsed time: 0.0031 seconds
 >>> text = dataset["text"][0]
 >>> end_time = time.time()
 >>> print(f"Elapsed time: {end_time - start_time:.4f} seconds")
-Elapsed time: 0.0094 seconds
+Elapsed time: 0.0042 seconds
 ```

 ### Slicing
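A hedged illustration of the behavior the new wording describes (not part of the diff; the dataset id is an assumed example with a "text" column):

```python
from datasets import load_dataset

# Any dataset with a "text" column works the same way; this repo id is just an example.
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

column = dataset["text"]    # returns a Column view rather than materializing every row
first_text = column[0]      # row access on the Column, like indexing the dataset directly
```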

docs/source/index.mdx

Lines changed: 2 additions & 2 deletions
@@ -2,9 +2,9 @@

 <img class="float-left !m-0 !border-0 !dark:border-0 !shadow-none !max-w-lg w-[150px]" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/datasets_logo.png"/>

-🤗 Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
+🤗 Datasets is a library for easily accessing and sharing AI datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

-Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the [Hugging Face Hub](https://huggingface.co/datasets), allowing you to easily load and share a dataset with the wider machine learning community.
+Load a dataset in a single line of code, and use our powerful data processing and streaming methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the [Hugging Face Hub](https://huggingface.co/datasets), allowing you to easily load and share a dataset with the wider machine learning community.

 Find your dataset today on the [Hugging Face Hub](https://huggingface.co/datasets), and take an in-depth look inside of it with the live viewer.

docs/source/load_hub.mdx

Lines changed: 2 additions & 2 deletions
@@ -20,8 +20,8 @@ Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 n

 # Inspect dataset features
 >>> ds_builder.info.features
-{'label': ClassLabel(names=['neg', 'pos'], id=None),
- 'text': Value(dtype='string', id=None)}
+{'label': ClassLabel(names=['neg', 'pos']),
+ 'text': Value(dtype='string')}
 ```

 If you're happy with the dataset, then load it with [`load_dataset`]:
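For reference, a minimal sketch of how the `ds_builder.info.features` output above is obtained (the repository id is an assumption based on the Rotten Tomatoes example quoted in the hunk header):

```python
from datasets import load_dataset_builder

# Repo id assumed from the guide's Rotten Tomatoes example.
ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")

# Inspect dataset metadata without downloading the data itself.
print(ds_builder.info.description)
print(ds_builder.info.features)
```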

docs/source/loading.mdx

Lines changed: 2 additions & 2 deletions
@@ -417,6 +417,6 @@ Now when you look at your dataset features, you can see it uses the custom label

 ```py
 >>> dataset['train'].features
-{'text': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}
+{'text': Value(dtype='string'),
+ 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'])}
 ```
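A minimal sketch of how such custom label features are attached at load time (the file name and format here are hypothetical; the label names come from the hunk above):

```python
from datasets import ClassLabel, Features, Value, load_dataset

# Hypothetical local CSV with "text" and "label" columns.
emotion_features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["sadness", "joy", "love", "anger", "fear", "surprise"]),
})

dataset = load_dataset("csv", data_files="train.csv", features=emotion_features)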

docs/source/package_reference/main_classes.mdx

Lines changed: 4 additions & 0 deletions
@@ -112,6 +112,8 @@ The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table.

 [[autodoc]] datasets.is_caching_enabled

+[[autodoc]] datasets.Column
+
 ## DatasetDict

 Dictionary with split names as keys ('train', 'test' for example), and `Dataset` objects as values.
@@ -200,6 +202,8 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
 - supervised_keys
 - version

+[[autodoc]] datasets.IterableColumn
+
 ## IterableDatasetDict

 Dictionary with split names as keys ('train', 'test' for example), and `IterableDataset` objects as values.
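The two new autodoc entries document the lazy column views. A hedged sketch of how they typically surface (exact behavior depends on the `datasets` version this commit targets):

```python
from datasets import load_dataset

# Map-style dataset: column access yields a datasets.Column view.
ds = load_dataset("nyu-mll/glue", "mrpc", split="train")
sentences = ds["sentence1"]
print(sentences[0])

# Streaming dataset: column access yields a datasets.IterableColumn you can iterate over.
ids = load_dataset("nyu-mll/glue", "mrpc", split="train", streaming=True)["idx"]
for idx in ids:
    print(idx)
    break
```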

docs/source/process.mdx

Lines changed: 22 additions & 16 deletions
@@ -223,21 +223,21 @@ The [`~Dataset.cast`] function transforms the feature type of one or more column

 ```py
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'idx': Value(dtype='int32', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'idx': Value(dtype='int32')}

 >>> from datasets import ClassLabel, Value
 >>> new_features = dataset.features.copy()
 >>> new_features["label"] = ClassLabel(names=["negative", "positive"])
 >>> new_features["idx"] = Value("int64")
 >>> dataset = dataset.cast(new_features)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['negative', 'positive'], id=None),
- 'idx': Value(dtype='int64', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['negative', 'positive']),
+ 'idx': Value(dtype='int64')}
 ```

 <Tip>
@@ -250,11 +250,11 @@ Use the [`~Dataset.cast_column`] function to change the feature type of a single

 ```py
 >>> dataset.features
-{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
+{'audio': Audio(sampling_rate=44100, mono=True)}

 >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
 >>> dataset.features
-{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
+{'audio': Audio(sampling_rate=16000, mono=True)}
 ```

 ### Flatten
@@ -265,11 +265,11 @@ Sometimes a column can be a nested structure of several types. Take a look at th
 >>> from datasets import load_dataset
 >>> dataset = load_dataset("rajpurkar/squad", split="train")
 >>> dataset.features
-{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
- 'context': Value(dtype='string', id=None),
- 'id': Value(dtype='string', id=None),
- 'question': Value(dtype='string', id=None),
- 'title': Value(dtype='string', id=None)}
+{'answers': Sequence(feature={'text': Value(dtype='string'), 'answer_start': Value(dtype='int32')}, length=-1),
+ 'context': Value(dtype='string'),
+ 'id': Value(dtype='string'),
+ 'question': Value(dtype='string'),
+ 'title': Value(dtype='string')}
 ```

 The `answers` field contains two subfields: `text` and `answer_start`. Use the [`~Dataset.flatten`] function to extract the subfields into their own separate columns:
@@ -810,12 +810,18 @@ The example below uses the [`pydub`](http://pydub.com/) package to open an audio

 Once your dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].

-Save your dataset by providing the name of the dataset repository on Hugging Face you wish to save it to to [`~IterableDataset.push_to_hub`]:
+Save your dataset by providing the name of the dataset repository on Hugging Face you wish to save it to to [`~Dataset.push_to_hub`]:

 ```python
 encoded_dataset.push_to_hub("username/my_dataset")
 ```

+You can use multiple processes to upload it in parallel. This is especially useful if you want to speed up the process:
+
+```python
+dataset.push_to_hub("username/my_dataset", num_proc=8)
+```
+
 Use the [`load_dataset`] function to reload the dataset (in streaming mode or not):

 ```python
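Not part of the diff, but a short sketch of the `flatten` step referenced in the hunk above, assuming the same SQuAD dataset:

```python
from datasets import load_dataset

dataset = load_dataset("rajpurkar/squad", split="train")

# Nested 'answers' subfields become top-level 'answers.text' and 'answers.answer_start' columns.
flat_dataset = dataset.flatten()
print(flat_dataset.column_names)
```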

docs/source/quickstart.mdx

Lines changed: 6 additions & 5 deletions
@@ -312,9 +312,9 @@ Use the [`~Dataset.map`] function to speed up processing by applying your tokeni
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0,
-'input_ids': array([ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102]),
-'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
-'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
+'input_ids': [ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 0, 0, ...],
+'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...],
+'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...]}
 ```

 **4**. Rename the `label` column to `labels`, which is the expected input name in [BertForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForSequenceClassification):
@@ -327,12 +327,13 @@ Use the [`~Dataset.map`] function to speed up processing by applying your tokeni

 <frameworkcontent>
 <pt>
-Use the [`~Dataset.set_format`] function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):
+Use the [`~Dataset.with_format`] function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):

 ```py
 >>> import torch

->>> dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
+>>> dataset = dataset.select_columns(["input_ids", "token_type_ids", "attention_mask", "labels"])
+>>> dataset = dataset.with_format(type="torch")
 >>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
 ```
 </pt>
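A hedged end-to-end sketch of the updated PyTorch flow (the checkpoint and batch size follow the quickstart's MRPC example; the tokenizer arguments are assumptions):

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

# Checkpoint assumed from the quickstart guide.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

dataset = load_dataset("nyu-mll/glue", "mrpc", split="train")
dataset = dataset.map(
    lambda examples: tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True),
    batched=True,
)
dataset = dataset.rename_column("label", "labels")

# Updated flow: select the model inputs, then format them as torch tensors on the fly.
dataset = dataset.select_columns(["input_ids", "token_type_ids", "attention_mask", "labels"])
dataset = dataset.with_format("torch")

dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
```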

docs/source/stream.mdx

Lines changed: 16 additions & 10 deletions
@@ -241,21 +241,21 @@ When you need to remove one or more columns, give [`IterableDataset.remove_colum
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train', streaming=True)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
- 'idx': Value(dtype='int32', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+ 'idx': Value(dtype='int32')}

 >>> from datasets import ClassLabel, Value
 >>> new_features = dataset.features.copy()
 >>> new_features["label"] = ClassLabel(names=['negative', 'positive'])
 >>> new_features["idx"] = Value('int64')
 >>> dataset = dataset.cast(new_features)
 >>> dataset.features
-{'sentence1': Value(dtype='string', id=None),
- 'sentence2': Value(dtype='string', id=None),
- 'label': ClassLabel(names=['negative', 'positive'], id=None),
- 'idx': Value(dtype='int64', id=None)}
+{'sentence1': Value(dtype='string'),
+ 'sentence2': Value(dtype='string'),
+ 'label': ClassLabel(names=['negative', 'positive']),
+ 'idx': Value(dtype='int64')}
 ```

 <Tip>
@@ -268,11 +268,11 @@ Use [`IterableDataset.cast_column`] to change the feature type of just one colum

 ```py
 >>> dataset.features
-{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
+{'audio': Audio(sampling_rate=44100, mono=True)}

 >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
 >>> dataset.features
-{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
+{'audio': Audio(sampling_rate=16000, mono=True)}
 ```

 ## Map
@@ -517,6 +517,12 @@ Save your dataset by providing the name of the dataset repository on Hugging Fac
 dataset.push_to_hub("username/my_dataset")
 ```

+If the dataset consists of multiple shards (`dataset.num_shards > 1`), you can use multiple processes to upload it in parallel. This is especially useful if you applied `map()` or `filter()` steps since they will run faster in parallel:
+
+```python
+dataset.push_to_hub("username/my_dataset", num_proc=8)
+```
+
 Use the [`load_dataset`] function to reload the dataset:

 ```python
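As a hedged sketch of the streaming `cast_column` step above (the audio dataset id is an assumption, not part of this commit):

```python
from datasets import Audio, load_dataset

# Assumed audio dataset; any streaming dataset with an "audio" column works the same way.
dataset = load_dataset("PolyAI/minds14", "en-US", split="train", streaming=True)

# Resample on the fly without downloading the full dataset.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
print(dataset.features)
```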

docs/source/use_dataset.mdx

Lines changed: 47 additions & 8 deletions
@@ -175,22 +175,61 @@ Most image models expect the image to be in the RGB mode. The Beans images are a
 >>> dataset = dataset.cast_column("image", Image(mode="RGB"))
 ```

-**3**. Now, you can apply some transforms to the image. Feel free to take a look at the [various transforms available](https://pytorch.org/vision/stable/auto_examples/plot_transforms.html#sphx-glr-auto-examples-plot-transforms-py) in torchvision and choose one you'd like to experiment with. This example applies a transform that randomly rotates the image:
+**3**. Now let's apply data augmentations to your images. 🤗 Datasets works with any augmentation library, and in this example we'll use Albumentations.
+
+[Albumentations](https://albumentations.ai) is a popular image augmentation library that provides a [rich set of transforms](https://albumentations.ai/docs/reference/supported-targets-by-transform/) including spatial-level transforms, pixel-level transforms, and mixing-level transforms.
+
+Install Albumentations:
+
+```bash
+pip install albumentations
+```
+
+**4**. Create a typical augmentation pipeline with Albumentations:

 ```py
->>> from torchvision.transforms import RandomRotation
+>>> import albumentations as A
+>>> import numpy as np
+>>> from PIL import Image
+
+>>> transform = A.Compose([
+...     A.RandomCrop(height=256, width=256, pad_if_needed=True, p=1),
+...     A.HorizontalFlip(p=0.5),
+...     A.ColorJitter(p=0.5)
+... ])
+```
+
+**5**. Since 🤗 Datasets uses PIL images but Albumentations expects NumPy arrays, you need to convert between formats:

->>> rotate = RandomRotation(degrees=(0, 90))
->>> def transforms(examples):
-...     examples["pixel_values"] = [rotate(image) for image in examples["image"]]
+```py
+>>> def albumentations_transforms(examples):
+...     # Apply Albumentations transforms
+...     transformed_images = []
+...     for image in examples["image"]:
+...         # Convert PIL to numpy array (OpenCV format)
+...         image_np = np.array(image.convert("RGB"))
+...
+...         # Apply Albumentations transforms
+...         transformed_image = transform(image=image_np)["image"]
+...
+...         # Convert back to PIL Image
+...         pil_image = Image.fromarray(transformed_image)
+...         transformed_images.append(pil_image)
+...
+...     examples["pixel_values"] = transformed_images
 ...     return examples
 ```

-**4**. Use the [`~Dataset.set_transform`] function to apply the transform on-the-fly. When you index into the image `pixel_values`, the transform is applied, and your image gets rotated.
+**6**. Apply the transform using [`~Dataset.with_transform`]:

 ```py
->>> dataset.set_transform(transforms)
+>>> dataset = dataset.with_transform(albumentations_transforms)
 >>> dataset[0]["pixel_values"]
 ```

-**5**. The dataset is now ready for training with your machine learning framework!
+**Key points when using Albumentations with 🤗 Datasets:**
+- Convert PIL images to NumPy arrays before applying transforms
+- Albumentations returns a dictionary with the transformed image under the "image" key
+- Convert the result back to PIL format after transformation
+
+**7**. The dataset is now ready for training with your machine learning framework!
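To round off the new Albumentations walkthrough, a hedged sketch of batching the augmented images for training (the `labels` column name comes from the Beans dataset used in this guide; the scaling choice is an assumption, and `dataset` is the transformed dataset from the steps above):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def collate_fn(batch):
    # Stack the augmented PIL images into a float tensor batch (CHW, scaled to [0, 1]).
    pixel_values = torch.stack([
        torch.from_numpy(np.array(example["pixel_values"])).permute(2, 0, 1).float() / 255.0
        for example in batch
    ])
    labels = torch.tensor([example["labels"] for example in batch])
    return {"pixel_values": pixel_values, "labels": labels}

# `dataset` is the Beans dataset with the Albumentations transform applied via with_transform.
dataloader = DataLoader(dataset, batch_size=8, collate_fn=collate_fn)
```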
