README.md (1 addition & 3 deletions)
@@ -130,9 +130,7 @@ You can find:

 # Disclaimers

-You can use 🤗 Datasets to load datasets based on Python code defined by the dataset authors to parse certain data formats or structures. For security reasons, this feature is disabled by default and requires passing `trust_remote_code=True`. In this case we also ask users that want to load such datasets to:
-- check the dataset scripts they're going to run beforehand and
-- pin the `revision` of the repositories they use.
+You can use 🤗 Datasets to load datasets based on versioned git repositories maintained by the dataset authors. For reproducibility reasons, we ask users to pin the `revision` of the repositories they use.

 If you're a dataset owner and wish to update any part of it (description, citation, license, etc.), or do not want your dataset to be included in the Hugging Face Hub, please get in touch by opening a discussion or a pull request in the Community tab of the dataset page. Thanks for your contribution to the ML community!
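As an illustration of the pinning advice in the new README wording, here is a minimal Python sketch. `load_dataset` and its `revision` parameter come from 🤗 Datasets; the helper names `is_pinned` and `load_pinned` are hypothetical, and treating only a full 40-character commit hash as "pinned" is an assumption of this sketch (the README only asks users to pin the `revision`).

```python
import re
from typing import Optional

# Assumption of this sketch: a revision counts as pinned only if it is a
# full 40-character hexadecimal commit hash. Branch names such as "main"
# can move over time, so they are rejected here.
_FULL_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def is_pinned(revision: Optional[str]) -> bool:
    """Return True if `revision` looks like a full commit hash."""
    return bool(revision) and _FULL_COMMIT_HASH.fullmatch(revision) is not None


def load_pinned(repo_id: str, revision: str, **kwargs):
    """Hypothetical wrapper: refuse to load a dataset at an unpinned revision."""
    if not is_pinned(revision):
        raise ValueError(
            f"revision {revision!r} is not a full commit hash; "
            "pin the dataset repository to a specific commit"
        )
    # Imported lazily so the check above works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, revision=revision, **kwargs)
```

A call would then look like `load_pinned("rajpurkar/squad", revision=commit_hash)`, where `commit_hash` is a commit copied from the dataset repository's history on the Hub.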
docs/source/about_dataset_load.mdx (2 additions & 16 deletions)
@@ -10,7 +10,6 @@ A dataset is a directory that contains:

 - Some data files in generic formats (JSON, CSV, Parquet, text, etc.)
 - A dataset card named `README.md` that contains documentation about the dataset as well as a YAML header to define the datasets tags and configurations
-- An optional dataset script if it requires some code to read the data files. This is sometimes used to load files of specific formats and structures.

 The [`load_dataset`] function fetches the requested dataset locally or from the Hugging Face Hub.
 The Hub is a central repository where all the Hugging Face datasets and models are stored.
@@ -27,12 +26,9 @@ Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based o
 *[`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
 *[`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders

-If the dataset has a dataset script, then it downloads and imports it from the Hugging Face Hub.
-Code in the dataset script defines a custom [`DatasetBuilder`] the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.
-
 <Tip>

-Read the [Share](./upload_dataset) section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!
+Read the [Share](./upload_dataset) section to learn more about how to share a dataset.

 </Tip>

@@ -99,13 +95,7 @@ To ensure a dataset is complete, [`load_dataset`] will perform a series of tests
 - The list of downloaded files.
 - The SHA256 checksums of the downloaded files (disabled by default).

-If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files.
-
-<Tip>
-
-If it is your own dataset, you'll need to recompute the information above and update the `README.md` file in your dataset repository. Take a look at this [section](dataset_script#optional-generate-dataset-metadata) to learn how to generate and update this metadata.
-
-</Tip>
+If the dataset doesn't pass the verifications, it is likely that the dataset author made some changes in the data files.

 In this case, an error is raised to alert that the dataset has changed.
 To ignore the error, one needs to specify `verification_mode="no_checks"` in [`load_dataset`].
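The `verification_mode="no_checks"` option named in the context lines above could be passed as in the following sketch. The `verification_mode` argument and its `"no_checks"` value are taken from the documentation text; the names `verification_kwargs` and `load_ignoring_changes` are hypothetical, and `datasets` is imported lazily so the helper can be inspected without the library installed.

```python
def verification_kwargs(skip_checks: bool = False) -> dict:
    """Build the extra keyword arguments to forward to `load_dataset`."""
    # "no_checks" is the value the documentation gives for ignoring
    # verification errors; by default no extra argument is passed.
    return {"verification_mode": "no_checks"} if skip_checks else {}


def load_ignoring_changes(repo_id: str, skip_checks: bool = False):
    """Hypothetical wrapper around `load_dataset` with optional checks."""
    # Imported lazily: requires the third-party `datasets` package.
    from datasets import load_dataset

    return load_dataset(repo_id, **verification_kwargs(skip_checks))
```

Skipping the checks silences the error but does not undo the change in the data files, so it is probably best reserved for cases where the change is known and expected.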
@@ -114,7 +104,3 @@ Anytime you see a verification error, feel free to open a discussion or pull req
 ## Security

 The dataset repositories on the Hub are scanned for malware, see more information [here](https://huggingface.co/docs/hub/security#malware-scanning).
-
-Moreover the datasets without a namespace (originally contributed on our GitHub repository) have all been reviewed by our maintainers.
-The code of these datasets is considered **safe**.
-It concerns datasets that are not under a namespace, e.g. "rajpurkar/squad" or "nyu-mll/glue", unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".