This repository was archived by the owner on Jan 15, 2024. It is now read-only.
We provide a series of shared scripts for downloading/preparing the text corpus for pretraining NLP models.
This helps create a unified text corpus for studying the performance of different pretraining algorithms.
When releasing the datasets, we follow the [FAIR principle](https://www.go-fair.org/fair-principles/),
i.e., the dataset needs to be findable, accessible, interoperable, and reusable.

## BookCorpus
Unfortunately, we are unable to provide the original [Toronto BookCorpus dataset](https://yknzhu.wixsite.com/mbweb) due to licensing issues.

There are some open source efforts for reproducing the dataset, e.g.,
using [soskek/bookcorpus](https://github.com/soskek/bookcorpus) or directly downloading the [preprocessed version](https://drive.google.com/file/d/16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z/view).

Nevertheless, we use [Project Gutenberg](https://www.gutenberg.org/) as an alternative to the Toronto BookCorpus.

You can use the following command to download and prepare the Gutenberg dataset.

```bash
python prepare_bookcorpus.py --dataset gutenberg
```
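Once the script finishes, a quick sanity check over the output can help confirm the download worked. A minimal sketch, assuming the prepared corpus is written as plain `.txt` files (the output directory and file naming here are assumptions; adjust them to what `prepare_bookcorpus.py` actually produces):

```python
import glob
import os


def corpus_stats(directory, pattern="*.txt"):
    """Return (file_count, total_bytes) for text files under `directory`.

    The `*.txt` pattern is an assumption about the output naming,
    not something guaranteed by the preparation script.
    """
    paths = glob.glob(os.path.join(directory, pattern))
    return len(paths), sum(os.path.getsize(p) for p in paths)


# Example: inspect the current directory after running the script.
n_files, n_bytes = corpus_stats(".")
print(f"{n_files} files, {n_bytes} bytes")
```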

Also, you should follow the [license](https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License) for using the data.

## Wikipedia

Please install [attardi/wikiextractor](https://github.com/attardi/wikiextractor) for preparing the data.

```bash
# Download
python prepare_wikipedia.py --mode download --lang en --date latest -o ./

# Properly format the text files
python prepare_wikipedia.py --mode format -i [path-to-wiki.xml.bz2] -o ./

```
Downloading and formatting are time consuming, so we also provide an alternative: downloading the prepared raw text file from an S3 bucket. This raw text file is in English; it was dumped on 2020-06-20 and formatted by the above process (`--lang en --date 20200620`).
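The prepared file is raw text, so iterating over it article by article is straightforward. A minimal sketch, assuming (this is an assumption about the format, not documented behavior) that articles are separated by blank lines:

```python
def iter_documents(text):
    """Yield non-empty documents from raw corpus text, assuming
    blank-line separation between articles (an assumed convention)."""
    doc = []
    for line in text.splitlines():
        if line.strip():
            doc.append(line)
        elif doc:
            yield "\n".join(doc)
            doc = []
    if doc:
        yield "\n".join(doc)


sample = "First article.\n\nSecond article, line 1.\nSecond article, line 2.\n"
docs = list(iter_documents(sample))
print(len(docs))  # 2 documents in this sample
```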