Merged
180 commits
d2c35fc
added train script but with prefix manually declared
May 7, 2022
f977b85
made new dataset
May 9, 2022
fcfbf17
minor adjustments
May 9, 2022
870dfd8
added capabilities for padding and prefix lm index
lintangsutawika May 9, 2022
791bbd0
added finetune script
lintangsutawika May 9, 2022
0f44b92
removed script
lintangsutawika May 9, 2022
2ff0815
added adjustments and new dataset
May 9, 2022
f0a79f6
try mlm dataset
May 9, 2022
eb416c7
minor changes
May 9, 2022
c0bc21b
minor addition of import packages
May 9, 2022
82e824c
minor error fix
May 9, 2022
7bb17ec
minor error fix
May 9, 2022
9929766
samples follow how gpt dataset is loaded
May 9, 2022
861c41f
added masked_lm_prob
May 9, 2022
fe95115
fixed tokenizer abstractions for HF tokenizer
May 9, 2022
8ea5943
added mask id
May 9, 2022
aa0d146
added mask id
May 9, 2022
215e8cc
added mask id
May 9, 2022
b6eef43
added mask id
May 9, 2022
bfc73a5
added fix
May 9, 2022
1890f87
added bos and eos token id
May 9, 2022
01392a9
no need for sentinal token
May 9, 2022
923decb
add aux functions
May 9, 2022
4611d67
add aux functions
May 9, 2022
4356de3
add aux functions
May 9, 2022
f31c686
add pad_id
May 9, 2022
a3951e8
changed lm predictions to t5
May 18, 2022
97b9a92
changed lm predictions to t5
May 18, 2022
fe73a73
changed lm predictions to t5
May 18, 2022
6a9cb75
changed lm predictions to t5
May 18, 2022
469848f
changed lm predictions to t5
May 18, 2022
e68283f
tokenizer add mask, cls, sep tokens
May 18, 2022
476ae94
commit latest changes
May 21, 2022
72ff575
commit latest changes
May 21, 2022
3647291
added sentinal tokens
May 21, 2022
fcdc987
added sentinal tokens
May 21, 2022
d6fbe78
added sentinal tokens
May 21, 2022
c44daba
added additional_special_tokens
May 21, 2022
a2725d8
added additional_special_tokens
May 21, 2022
0e94245
check t5_input and output
May 21, 2022
b599ab6
check decoder in and decoder out
May 21, 2022
626b0ae
made into input and output tokens
May 22, 2022
6008937
made into input and output tokens
May 22, 2022
c1524db
made into input and output tokens
May 22, 2022
c59c061
made into input and output tokens
May 22, 2022
e677e16
made into input and output tokens
May 22, 2022
9ffaeb9
made into input and output tokens
May 22, 2022
d0a6a2f
made into input and output tokens
May 22, 2022
47fd987
made into input and output tokens
May 23, 2022
4f377e8
made into input and output tokens
May 23, 2022
5c0bf76
added eos
May 23, 2022
7c63e4b
added eos
May 23, 2022
871124c
test text_token
May 24, 2022
55a593d
test text_token
May 24, 2022
adb59ca
test text_token
May 24, 2022
d71afb4
test text_token
May 24, 2022
7b99bb7
test text_token
May 24, 2022
922b09d
assigned array
May 24, 2022
469a02d
assigned array
May 24, 2022
15cb6a0
assigned array
May 24, 2022
5b0bc17
hardcoded sequence length
May 24, 2022
0671c79
check again
May 28, 2022
6db5c9b
show sentinal tokens
lintangsutawika May 28, 2022
8a58007
show sentinal tokens
lintangsutawika May 28, 2022
8b0bbc2
show sentinal tokens
lintangsutawika May 28, 2022
3d1b256
show sentinal tokens
lintangsutawika May 28, 2022
ce00fd9
add more special tokens
lintangsutawika May 28, 2022
3bcc50c
changed how mlm data is loaded
lintangsutawika May 28, 2022
76960f7
changed how mlm data is loaded
lintangsutawika May 28, 2022
229d661
changed how mlm data is loaded
lintangsutawika May 28, 2022
55e3df7
changed how mlm data is loaded
lintangsutawika May 28, 2022
05dea6d
changed how mlm data is loaded
lintangsutawika May 28, 2022
661c8bb
added new script
lintangsutawika May 28, 2022
97d3810
added new script
lintangsutawika May 28, 2022
71388ee
added new script
lintangsutawika May 28, 2022
b0f04d5
try t5 dataset
lintangsutawika May 28, 2022
cd43a54
try t5 dataset
lintangsutawika May 28, 2022
e0dc666
try t5 dataset
lintangsutawika May 28, 2022
866cee1
try t5 dataset
lintangsutawika May 28, 2022
0b56a7d
try t5 dataset
lintangsutawika May 28, 2022
5bb512b
try t5 dataset
lintangsutawika May 28, 2022
31d844f
try t5 dataset
lintangsutawika May 28, 2022
1d21963
try t5 dataset
lintangsutawika May 28, 2022
1429645
try t5 dataset
lintangsutawika May 28, 2022
f5341f8
try t5 dataset
lintangsutawika May 28, 2022
b05b175
try t5 dataset
lintangsutawika May 28, 2022
59a6e32
try t5 dataset
lintangsutawika May 28, 2022
ab76d49
developing
lintangsutawika May 28, 2022
0d8dfac
developing
lintangsutawika May 28, 2022
e629224
developing
lintangsutawika May 28, 2022
efcf50f
developing
lintangsutawika May 28, 2022
e5eb615
developing
lintangsutawika May 28, 2022
2eee807
developing
lintangsutawika May 28, 2022
5840a11
developing
lintangsutawika May 28, 2022
6d38f73
test to see output of get_ltor_masks_and_position_ids
lintangsutawika May 29, 2022
430fa6f
test to see output of get_ltor_masks_and_position_ids
lintangsutawika May 29, 2022
444314f
add new script
May 29, 2022
26c837d
add new script
May 29, 2022
feb023c
add new script
May 29, 2022
f30b9b1
changed settings
May 30, 2022
0a9203a
changed settings
May 30, 2022
672a866
tidy up
May 31, 2022
3780e61
changed tokenizer and position embedding
May 31, 2022
2130c31
modifying mlm to reflect original implementation
Jun 2, 2022
26afe43
minor fix
Jun 2, 2022
c1b9816
minor fix
Jun 2, 2022
453822f
minor fix
Jun 2, 2022
a62266a
minor fix
Jun 2, 2022
02dda79
minor fix
Jun 2, 2022
80331cb
minor fix
Jun 2, 2022
350227d
minor fix
Jun 2, 2022
d0eecd4
minor fix
Jun 2, 2022
243cebe
minor fix
Jun 2, 2022
da22e0b
minor fix
Jun 2, 2022
083dce7
minor fix
Jun 2, 2022
541e9d6
minor fix
Jun 2, 2022
86bfc8a
minor fix
Jun 2, 2022
e21a448
minor fix
Jun 2, 2022
f47d678
minor fix
Jun 2, 2022
415b8bc
minor fix
Jun 2, 2022
79bd6f8
minor fix
Jun 2, 2022
ba19fdf
minor fix
Jun 2, 2022
d200f4d
minor fix
Jun 2, 2022
102a461
minor fix
Jun 2, 2022
e530440
minor fix
Jun 2, 2022
2568039
minor fix
Jun 2, 2022
e6b4120
minor fix
Jun 2, 2022
fd7fe97
minor fix
Jun 2, 2022
861fc7b
minor fix
Jun 2, 2022
21c1984
minor fix
Jun 2, 2022
14e8d0f
minor fix
Jun 2, 2022
920343f
minor fix
Jun 2, 2022
a68873d
minor fix
Jun 2, 2022
5d43986
minor fix
Jun 2, 2022
79e8c1a
set correct seq len
Jun 2, 2022
786d252
refined sampling method
Jun 8, 2022
9110520
refined sampling method
Jun 8, 2022
7db34b9
refined sampling method
Jun 8, 2022
d946515
refined sampling method
Jun 8, 2022
bb4e656
refined sampling method
Jun 8, 2022
2e7161d
refined sampling method
Jun 8, 2022
00473e4
first commit, adding non causal mlm dataset
Jun 8, 2022
5992776
fixed mlm dataset
Jun 8, 2022
83f5dee
fixed mlm dataset
Jun 8, 2022
3235c2d
fixed mlm dataset
Jun 8, 2022
5449978
fixed mlm dataset
Jun 8, 2022
95c9851
fixed mlm dataset
Jun 8, 2022
9ff6172
Merge branch 'bigscience-workshop:main' into mt0
Jun 12, 2022
451318f
minor changes
Jun 14, 2022
edfaa19
Merge branch 'mt0' of https://github.com/lintangsutawika/Megatron-Dee…
Jun 14, 2022
5657083
removed multitask finetuning related scripts
Jun 22, 2022
1cee345
Merge branch 'bigscience-workshop:main' into mlm-adaptation
Jun 22, 2022
b4b87fc
remove any unrelated to dataset, revert arguments.py
Jun 22, 2022
5e80cc1
revert tokenizer
Jun 22, 2022
253e81f
Improve MLM
thomasw21 Jun 23, 2022
1d8a5c0
Woops
thomasw21 Jun 23, 2022
e6036a0
Remove a bunch of attributes
thomasw21 Jun 23, 2022
408f16a
Fix naming
thomasw21 Jun 23, 2022
ae87552
Woops
thomasw21 Jun 23, 2022
e79c9a2
Use GPTDataset as underlying implementation
thomasw21 Jun 23, 2022
62ee550
Fix sep tokens
thomasw21 Jun 23, 2022
a2e9ba8
Change attribute naming
thomasw21 Jun 23, 2022
64334a4
GPT Dataset doesn't handle slicing
thomasw21 Jun 23, 2022
7a872c2
Remove tokenizer
thomasw21 Jun 23, 2022
b6f02c5
WIP
thomasw21 Jun 23, 2022
86680bc
WIP
thomasw21 Jun 23, 2022
4b2d840
WIP
thomasw21 Jun 23, 2022
b935b85
WIP
thomasw21 Jun 23, 2022
9a74d69
WIP
thomasw21 Jun 23, 2022
64b1515
WIP
thomasw21 Jun 23, 2022
b210364
WIP
thomasw21 Jun 23, 2022
6398d1d
MLM
thomasw21 Jun 23, 2022
e0f7c92
Cleanup
thomasw21 Jun 23, 2022
6b92958
Update megatron/data/mlm_dataset.py
Jun 26, 2022
faf0b9e
Cleanup + fix off by one issue
thomasw21 Jun 27, 2022
0e3ee15
Missing vocab extra ids
thomasw21 Jun 27, 2022
92070ce
Woops
thomasw21 Jun 27, 2022
ea69602
Understanding off by one isse
thomasw21 Jun 27, 2022
4dbe448
Woops
thomasw21 Jun 27, 2022
8f42790
Add
thomasw21 Jun 27, 2022
3 changes: 3 additions & 0 deletions megatron/arguments.py
@@ -925,6 +925,9 @@ def __call__(self, parser, args, values, option_string=None):
'specific positions. This option tries to un-bias the loss by reweighting loss on specific '
'positions based on how frequently we train on that position.'
'This is mostly used for prefix_lm training')
group.add_argument("--noise_density", type=float, default=None, help="Span corruption noise density")
group.add_argument("--mean_noise_span_length", type=int, default=None, help="Span corruption mean noise span length")


return parser
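
The two new flags are only declared in this hunk; how they are consumed is not shown here. As a rough sketch, assuming the dataset follows the usual T5-style span-corruption arithmetic (an assumption, not confirmed by this diff), the flags could translate into noise-token and noise-span counts as below. The helper names `add_span_corruption_args` and `span_corruption_counts` are hypothetical and do not come from the PR.

```python
import argparse


def add_span_corruption_args(parser):
    """Register the two flags added in this hunk on a standalone parser (illustrative)."""
    group = parser.add_argument_group("span corruption (illustrative)")
    group.add_argument("--noise_density", type=float, default=None,
                       help="Span corruption noise density")
    group.add_argument("--mean_noise_span_length", type=int, default=None,
                       help="Span corruption mean noise span length")
    return parser


def span_corruption_counts(seq_length, noise_density, mean_noise_span_length):
    """Assumed T5-style arithmetic: how many tokens to mask and how many spans to use."""
    # Number of tokens replaced by sentinels, clamped so that at least one
    # token is masked and at least one token is left unmasked.
    num_noise_tokens = int(round(seq_length * noise_density))
    num_noise_tokens = min(max(num_noise_tokens, 1), seq_length - 1)
    # Number of contiguous masked spans, at least one.
    num_noise_spans = max(int(round(num_noise_tokens / mean_noise_span_length)), 1)
    return num_noise_tokens, num_noise_spans


# Parse an explicit argument list so the example does not depend on sys.argv.
args = add_span_corruption_args(argparse.ArgumentParser()).parse_args(
    ["--noise_density", "0.15", "--mean_noise_span_length", "3"]
)
counts = span_corruption_counts(512, args.noise_density, args.mean_noise_span_length)
print(counts)
```

Both flags default to `None`, so a caller would need to check that they are set before running span corruption.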

2 changes: 1 addition & 1 deletion megatron/data/gpt_dataset.py
@@ -35,7 +35,7 @@ def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,

# Single dataset.
if len(data_prefix) == 1:
- all_train_datasets, all_valid_datasets, all_test_datasets = _build_train_valid_test_datasets(data_prefix[0],
+ all_train_datasets, all_valid_datasets, all_test_datasets = _build_train_valid_test_datasets(data_prefix[0],
data_impl, splits_string,
train_valid_test_num_samples,
seq_length, seed, skip_warmup)