Commit 735fbca

Merge branch 'main' into cdc
2 parents ef901ea + 34e850b commit 735fbca

51 files changed, with 1021 additions and 3582 deletions.

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ on:
 # - ci-*
 
 env:
-HF_ALLOW_CODE_EVAL: 1
+CI_HEADERS: ${{ secrets.CI_HEADERS }}
 
 jobs:
 
README.md

Lines changed: 1 addition & 3 deletions
@@ -130,9 +130,7 @@ You can find:
 
 # Disclaimers
 
-You can use 🤗 Datasets to load datasets based on Python code defined by the dataset authors to parse certain data formats or structures. For security reasons, this feature is disabled by default and requires passing `trust_remote_code=True`. In this case we also ask users that want to load such datasets to:
-- check the dataset scripts they're going to run beforehand and
-- pin the `revision` of the repositories they use.
+You can use 🤗 Datasets to load datasets based on versioned git repositories maintained by the dataset authors. For reproducibility reasons, we ask users to pin the `revision` of the repositories they use.
 
 If you're a dataset owner and wish to update any part of it (description, citation, license, etc.), or do not want your dataset to be included in the Hugging Face Hub, please get in touch by opening a discussion or a pull request in the Community tab of the dataset page. Thanks for your contribution to the ML community!
 
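Note on the README.md change above: the new disclaimer asks users to pin the `revision` of the repositories they load. A minimal sketch of what that could look like, assuming the `datasets` library is installed. `is_full_commit_sha` and `load_pinned` are hypothetical helpers written for illustration; they are not part of this commit or of the library. Treating only a 40-character commit SHA as "pinned" is also an assumption of this sketch (a tag can serve as a pin too, although tags can be moved).

```python
import re

# Assumption: Hub repositories are git repositories, so a full commit SHA
# is 40 lowercase hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_full_commit_sha(revision):
    """Return True if `revision` looks like a full git commit SHA."""
    return isinstance(revision, str) and _FULL_SHA.fullmatch(revision) is not None


def load_pinned(repo_id, revision, **kwargs):
    """Load a dataset at an exact commit (hypothetical helper).

    Branch names such as "main" are rejected because they move over time.
    """
    if not is_full_commit_sha(revision):
        raise ValueError(f"expected a full commit SHA, got {revision!r}")
    # Imported lazily so the validation above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, revision=revision, **kwargs)
```

With these helpers, `load_pinned("username/dataset_name", "main")` raises `ValueError`, while a call with a full commit SHA is forwarded to `load_dataset(..., revision=...)`.
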
docs/source/_toctree.yml

Lines changed: 0 additions & 2 deletions
@@ -106,8 +106,6 @@
 title: Create a dataset card
 - local: repository_structure
 title: Structure your repository
-- local: dataset_script
-title: Create a dataset loading script
 title: "Dataset repository"
 title: "How-to guides"
 - sections:

docs/source/about_dataset_load.mdx

Lines changed: 2 additions & 16 deletions
@@ -10,7 +10,6 @@ A dataset is a directory that contains:
 
 - Some data files in generic formats (JSON, CSV, Parquet, text, etc.)
 - A dataset card named `README.md` that contains documentation about the dataset as well as a YAML header to define the datasets tags and configurations
-- An optional dataset script if it requires some code to read the data files. This is sometimes used to load files of specific formats and structures.
 
 The [`load_dataset`] function fetches the requested dataset locally or from the Hugging Face Hub.
 The Hub is a central repository where all the Hugging Face datasets and models are stored.
@@ -27,12 +26,9 @@ Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based o
 * [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
 * [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders
 
-If the dataset has a dataset script, then it downloads and imports it from the Hugging Face Hub.
-Code in the dataset script defines a custom [`DatasetBuilder`] the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.
-
 <Tip>
 
-Read the [Share](./upload_dataset) section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!
+Read the [Share](./upload_dataset) section to learn more about how to share a dataset.
 
 </Tip>
 
@@ -99,13 +95,7 @@ To ensure a dataset is complete, [`load_dataset`] will perform a series of tests
 - The list of downloaded files.
 - The SHA256 checksums of the downloaded files (disabled by default).
 
-If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files.
-
-<Tip>
-
-If it is your own dataset, you'll need to recompute the information above and update the `README.md` file in your dataset repository. Take a look at this [section](dataset_script#optional-generate-dataset-metadata) to learn how to generate and update this metadata.
-
-</Tip>
+If the dataset doesn't pass the verifications, it is likely that the dataset author made some changes in the data files.
 
 In this case, an error is raised to alert that the dataset has changed.
 To ignore the error, one needs to specify `verification_mode="no_checks"` in [`load_dataset`].
@@ -114,7 +104,3 @@ Anytime you see a verification error, feel free to open a discussion or pull req
 ## Security
 
 The dataset repositories on the Hub are scanned for malware, see more information [here](https://huggingface.co/docs/hub/security#malware-scanning).
-
-Moreover the datasets without a namespace (originally contributed on our GitHub repository) have all been reviewed by our maintainers.
-The code of these datasets is considered **safe**.
-It concerns datasets that are not under a namespace, e.g. "rajpurkar/squad" or "nyu-mll/glue", unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
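
Note on the verification hunk above (`@@ -99,13 +95,7 @@`): the unchanged context lines say that a verification error can be ignored by passing `verification_mode="no_checks"` to `load_dataset`. A minimal sketch of that call, assuming the `datasets` library is installed. `build_load_kwargs` and `load_skipping_verification` are hypothetical names used only in this note.

```python
def build_load_kwargs(skip_verification=False, **kwargs):
    """Return keyword arguments for `load_dataset` (hypothetical helper)."""
    merged = dict(kwargs)
    if skip_verification:
        # "no_checks" is the value named in the documentation text above.
        merged["verification_mode"] = "no_checks"
    return merged


def load_skipping_verification(repo_id, **kwargs):
    """Load a dataset while ignoring verification errors (hypothetical helper)."""
    # Imported lazily so `build_load_kwargs` works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, **build_load_kwargs(skip_verification=True, **kwargs))
```

Skipping the checks hides the signal that the data files changed, so it probably makes sense only after confirming the change is expected.
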
