Changes from all commits (58 commits)
- `bdba86d` Adding support for binary loader proposed in pull request https://git… (mnaumovfb, Jan 26, 2020)
- `58c2806` Adding support for testing and mlperf flags to caffe2 version. (mnaumovfb, Feb 5, 2020)
- `1768658` Fix latent bug in caffe2 version when --max-ind-range is used (mnaumovfb, Feb 9, 2020)
- `11fcf01` Tgrel/minor mlperf fixes (#54) (tgrel, Feb 15, 2020)
- `eb3094c` Update README.md (mnaumovfb, Feb 15, 2020)
- `bda0921` Update README.md (mnaumovfb, Feb 15, 2020)
- `73ac38a` adding back end of epoch check for now. (mnaumovfb, Feb 15, 2020)
- `0e8818e` Adding flexibility in saving/loading model to/from different devices.… (mnaumovfb, Feb 20, 2020)
- `916c0d6` Adjusting the use of --max-ind-range post processing as discussed in … (mnaumovfb, Feb 20, 2020)
- `819ef5f` Switch the binary dataloader to int32 datatype (#60) (tgrel, Feb 27, 2020)
- `fde9723` Adjusting restart from saved model during training. Need to skip earl… (mnaumovfb, Mar 4, 2020)
- `9074b5e` Update README.md (mnaumovfb, Mar 25, 2020)
- `6f7711d` Update README.md (mnaumovfb, Mar 25, 2020)
- `66fe6d8` Update README.md (mnaumovfb, Mar 27, 2020)
- `7f2129e` added visualization of DLRM embeddings (#72) (dkorchevgithub, May 11, 2020)
- `fbabe61` Enable LR warmup and decay policy (#73) (rachithayp, May 23, 2020)
- `cef3b73` added more visualization options (#76) (dkorchevgithub, May 28, 2020)
- `75b02cf` Adjusting ONNX calls to work with large models (more than 2GB in size). (mnaumovfb, May 31, 2020)
- `09017b8` Adjusting the learning rate to freeze at last, when passed the decay … (mnaumovfb, Jun 5, 2020)
- `1f25892` Mixd Bugfixes (#87) (tginart, Jun 8, 2020)
- `236e331` Added gitignore from https://github.com/github/gitignore/blob/master/… (taylanbil, Jun 12, 2020)
- `32181f7` Adjusting parameters for onnx.export to work with any data loader. (mnaumovfb, Jun 12, 2020)
- `6bd3adb` Fixing a typo. (mnaumovfb, Jun 12, 2020)
- `ae23fca` Tgrel/tgrel mlperf fixes (#93) (tgrel, Jun 12, 2020)
- `3ecf641` Trimming trailing whitespaces (#100) (huwan, Jun 25, 2020)
- `d54c813` Adding tqdm package in requirements (#99) (huwan, Jun 25, 2020)
- `ce31eda` latest updates, 2020-06-27 (#103) (dkorchevgithub, Jun 29, 2020)
- `f8bf6ab` Fixing saving of model protobuf with types and shapes in caffe2 version. (mnaumovfb, Jul 7, 2020)
- `2944521` modifications for FAIR cluster (Jul 12, 2020)
- `e6009d4` add projection (Jul 20, 2020)
- `3170d35` change output file size (Jul 28, 2020)
- `18996f6` test (Jul 28, 2020)
- `9951699` bug fix in projection (Jul 29, 2020)
- `cb44674` Add validation checks to arguments using dash separated lists and che… (mneilly-et, Aug 1, 2020)
- `3a9b5cf` add gaussian distribution (Aug 6, 2020)
- `53bf84b` add synthetic data (Aug 6, 2020)
- `2da35a8` Merge branch 'master' into dist_port (Aug 6, 2020)
- `eaee70c` small fix (Aug 11, 2020)
- `d32dfd7` clean files (Aug 11, 2020)
- `f1d301c` add readme for param branch (Aug 17, 2020)
- `2fe7f81` put project into separate file (Aug 18, 2020)
- `788dc43` add fb_synthetic data (Sep 12, 2020)
- `ebef575` Fix hang problem and add README (Sep 25, 2020)
- `b0420b0` remove README.param (Sep 25, 2020)
- `38ecd56` data module cean-up (amirstar, Sep 28, 2020)
- `323a593` data module cean-up (amirstar, Sep 28, 2020)
- `b016326` copy dlrm_data.py from PARAM-Bench (amirstar, Sep 29, 2020)
- `601ad2e` update project file (Oct 27, 2020)
- `e1e2ca5` add boundary check for dlrm_data (Nov 11, 2020)
- `0eb05fd` usse synthetic data (Nov 19, 2020)
- `a755b01` modify time computation method (Nov 30, 2020)
- `40dddb7` change output + turn off nvidia-smi + reuse syn data (Dec 2, 2020)
- `64b7355` start to change input (Dec 10, 2020)
- `c0fba86` add tt.py (Dec 10, 2020)
- `77541e5` tested version on FAIR (Dec 11, 2020)
- `61875e4` hack data access (Dec 14, 2020)
- `4692d0e` modify tt to reuse input (Dec 18, 2020)
- `751a2ca` fix corner case in tt.py (Dec 21, 2020)
138 changes: 138 additions & 0 deletions .gitignore
@@ -0,0 +1,138 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/
2 changes: 2 additions & 0 deletions Dockerfile
@@ -9,5 +9,7 @@ FROM ${FROM_IMAGE_NAME}
ADD requirements.txt .
RUN pip install -r requirements.txt

RUN pip install torch==1.3.1

WORKDIR /code
ADD . .
15 changes: 9 additions & 6 deletions README.md
@@ -311,23 +311,24 @@ Benchmarking
./bench/dlrm_s_criteo_terabyte.sh ["--test-freq=10240 --memory-map --data-sub-sample-rate=0.875"]
```
- Corresponding pre-trained model is available under [CC-BY-NC license](https://creativecommons.org/licenses/by-nc/2.0/) and can be downloaded here
[dlrm_emb64_subsample0.875_maxindrange10M_pretrained.pt](https://dlrm.s3-us-west-1.amazonaws.com/models/tb0875_10M.pt)

[dlrm_emb64_subsample0.875_maxindrange10M_pretrained.pt](https://dlrm.s3-us-west-1.amazonaws.com/models/tb0875_10M.pt)

<img src="./terabyte_0875_loss_accuracy_plots.png" width="900" height="320">

*NOTE: Benchmarking scripts accept extra arguments which will be passed along to the model, such as --num-batches=100 to limit the number of data samples*

4) The code supports interface with [MLPerf benchmark](https://mlperf.org).
4) The code supports interface with [MLPerf benchmark](https://mlperf.org).
- Please refer to the following training parameters
```
--mlperf-logging that keeps track of multiple metrics, including area under the curve (AUC)

--mlperf-acc-threshold that allows early stopping based on accuracy metric

--mlperf-auc-threshold that allows early stopping based on AUC metric

--mlperf-bin-loader that enables preprocessing of data into a single binary file

--mlperf-bin-shuffle that controls whether a random shuffle of mini-batches is performed
```
- The MLPerf training model is completely specified and can be trained using the following script
@@ -367,6 +368,8 @@ pydot (*optional*)

torchviz (*optional*)

tqdm


License
-------
51 changes: 51 additions & 0 deletions README.params.md
@@ -0,0 +1,51 @@

# DLRM Distributed Branch

Extends the PyTorch implementation to run DLRM on multiple nodes on distributed platforms.
The distributed version is needed when the data or model becomes large.

It inherits all the parameters from the master DLRM implementation.
The distributed version adds one more parameter:

**--dist-backend**:
The backend used by the distributed version. As in the torch.distributed package,
it can be "nccl", "mpi", or "gloo".
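
A minimal sketch of how such a backend flag is typically declared and consumed (the flag names and choices come from the text above; the parser wiring here is illustrative, not the repository's exact code):

```python
import argparse

# Illustrative sketch of the --dist-backend flag (not the repository's
# exact parser). In practice the chosen value would be handed to
# torch.distributed.init_process_group(backend=args.dist_backend).
parser = argparse.ArgumentParser()
parser.add_argument(
    "--dist-backend",
    choices=["nccl", "mpi", "gloo"],
    default="nccl",
    help="backend for torch.distributed",
)

args = parser.parse_args(["--dist-backend", "gloo"])
print(args.dist_backend)  # gloo
```

Using `choices` lets argparse reject unsupported backends at parse time rather than failing later inside the distributed runtime.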

In addition, it introduces the following new parameter:

**--arch-project-size**:
Reduces the number of interaction features fed to the dot operation.
A projection operation is applied to the dotted features to reduce their dimension.
This is mainly motivated by memory concerns, since it reduces the memory needed for the top MLP.
A side effect is that it may also improve the model accuracy.
(Reviewer comment, Contributor: Typo: imrpove->improve.)
## Usage

Currently, it is launched with mpirun on multiple nodes. A hostfile needs to be created,
or a host list should be given. The DLRM parameters are given in the same way as in the
single-node master branch.
```bash
mpirun -np 128 -hostfile hostfile python dlrm_s_pytorch.py ...
```

## Example
```bash
python dlrm_s_pytorch.py \
--arch-sparse-feature-size=128 \
--arch-mlp-bot="2000-1024-1024-128" \
--arch-mlp-top="4096-4096-4096-1" \
--arch-embedding-size=$large_arch_emb \
--data-generation=random \
--loss-function=bce \
--round-targets=True \
--learning-rate=0.1 \
--mini-batch-size=2048 \
--print-freq=10240 \
--print-time \
--test-mini-batch-size=16384 \
--test-num-workers=16 \
--num-indices-per-lookup-fixed=1 \
--num-indices-per-lookup=100 \
--arch-projection-size 30 \
--use-gpu
```

2 changes: 1 addition & 1 deletion data_loader_terabyte.py
@@ -325,7 +325,7 @@ def _test_bin():

original_dataset = CriteoDataset(
dataset='terabyte',
max_ind_range=-1,
max_ind_range=10 * 1000 * 1000,
sub_sample_rate=1,
randomize=True,
split=args.split,
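
For context on the `max_ind_range` change above: the flag caps categorical feature indices so embedding-table sizes stay bounded (here at 10M for the Terabyte dataset). A common way to enforce such a cap is to fold indices with a modulo; whether that matches the repository's loaders exactly is an assumption here, and the function name is illustrative.

```python
import numpy as np

# Illustrative sketch of capping categorical indices to max_ind_range
# (an assumption about the flag's effect, not the repository's exact code).
def cap_indices(indices, max_ind_range):
    indices = np.asarray(indices)
    # Fold out-of-range indices back into [0, max_ind_range);
    # a non-positive range means "leave indices untouched".
    return indices % max_ind_range if max_ind_range > 0 else indices

capped = cap_indices([3, 10_000_003, 25_000_000], 10 * 1000 * 1000)
print(list(capped))  # [3, 3, 5000000]
```

Folding distinct raw indices onto the same slot trades a little accuracy for a fixed memory budget, which is why the test above now exercises the 10M-capped path instead of the uncapped (`-1`) one.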