Param dlrm #127
Open
shz0116
wants to merge
58
commits into
facebookresearch:dist_exp
from
shz0116:param_dlrm
bdba86d
Adding support for binary loader proposed in pull request https://git…
mnaumovfb 58c2806
Adding support for testing and mlperf flags to caffe2 version.
mnaumovfb 1768658
Fix latent bug in caffe2 version when --max-ind-range is used
mnaumovfb 11fcf01
Tgrel/minor mlperf fixes (#54)
tgrel eb3094c
Update README.md
mnaumovfb bda0921
Update README.md
mnaumovfb 73ac38a
adding back end of epoch check for now.
mnaumovfb 0e8818e
Adding flexibility in saving/loading model to/from different devices.…
mnaumovfb 916c0d6
Adjusting the use of --max-ind-range post processing as discussed in …
mnaumovfb 819ef5f
Switch the binary dataloader to int32 datatype (#60)
tgrel fde9723
Adjusting restart from saved model during training. Need to skip earl…
mnaumovfb 9074b5e
Update README.md
mnaumovfb 6f7711d
Update README.md
mnaumovfb 66fe6d8
Update README.md
mnaumovfb 7f2129e
added visualization of DLRM embeddings (#72)
dkorchevgithub fbabe61
Enable LR warmup and decay policy (#73)
rachithayp cef3b73
added more visualization options (#76)
dkorchevgithub 75b02cf
Adjusting ONNX calls to work with large models (more than 2GB in size).
mnaumovfb 09017b8
Adjusting the learning rate to freeze at last, when passed the decay …
mnaumovfb 1f25892
Mixd Bugfixes (#87)
tginart 236e331
Added gitignore from https://github.com/github/gitignore/blob/master/…
taylanbil 32181f7
Adjusting parameters for onnx.export to work with any data loader.
mnaumovfb 6bd3adb
Fixing a typo.
mnaumovfb ae23fca
Tgrel/tgrel mlperf fixes (#93)
tgrel 3ecf641
Trimming trailing whitespaces (#100)
huwan d54c813
Adding tqdm package in requirements (#99)
huwan ce31eda
latest updates, 2020-06-27 (#103)
dkorchevgithub f8bf6ab
Fixing saving of model protobuf with types and shapes in caffe2 version.
mnaumovfb 2944521
modifications for FAIR cluster
e6009d4
add projection
3170d35
change output file size
18996f6
test
9951699
bug fix in projection
cb44674
Add validation checks to arguments using dash separated lists and che…
mneilly-et 3a9b5cf
add gaussian distribution
53bf84b
add synthetic data
2da35a8
Merge branch 'master' into dist_port
eaee70c
small fix
d32dfd7
clean files
f1d301c
add readme for param branch
2fe7f81
put project into separate file
788dc43
add fb_synthetic data
ebef575
Fix hang problem and add README
b0420b0
remove README.param
38ecd56
data module cean-up
amirstar 323a593
data module cean-up
amirstar b016326
copy dlrm_data.py from PARAM-Bench
amirstar 601ad2e
update project file
e1e2ca5
add boundary check for dlrm_data
0eb05fd
usse synthetic data
a755b01
modify time computation method
40dddb7
change output + turn off nvidia-smi + reuse syn data
64b7355
start to change input
c0fba86
add tt.py
77541e5
tested version on FAIR
61875e4
hack data access
4692d0e
modify tt to reuse input
751a2ca
fix corner case in tt.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,138 @@ | ||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
|
||
# C extensions | ||
*.so | ||
|
||
# Distribution / packaging | ||
.Python | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
wheels/ | ||
share/python-wheels/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
MANIFEST | ||
|
||
# PyInstaller | ||
# Usually these files are written by a python script from a template | ||
# before PyInstaller builds the exe, so as to inject date/other infos into it. | ||
*.manifest | ||
*.spec | ||
|
||
# Installer logs | ||
pip-log.txt | ||
pip-delete-this-directory.txt | ||
|
||
# Unit test / coverage reports | ||
htmlcov/ | ||
.tox/ | ||
.nox/ | ||
.coverage | ||
.coverage.* | ||
.cache | ||
nosetests.xml | ||
coverage.xml | ||
*.cover | ||
*.py,cover | ||
.hypothesis/ | ||
.pytest_cache/ | ||
cover/ | ||
|
||
# Translations | ||
*.mo | ||
*.pot | ||
|
||
# Django stuff: | ||
*.log | ||
local_settings.py | ||
db.sqlite3 | ||
db.sqlite3-journal | ||
|
||
# Flask stuff: | ||
instance/ | ||
.webassets-cache | ||
|
||
# Scrapy stuff: | ||
.scrapy | ||
|
||
# Sphinx documentation | ||
docs/_build/ | ||
|
||
# PyBuilder | ||
.pybuilder/ | ||
target/ | ||
|
||
# Jupyter Notebook | ||
.ipynb_checkpoints | ||
|
||
# IPython | ||
profile_default/ | ||
ipython_config.py | ||
|
||
# pyenv | ||
# For a library or package, you might want to ignore these files since the code is | ||
# intended to run in multiple environments; otherwise, check them in: | ||
# .python-version | ||
|
||
# pipenv | ||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. | ||
# However, in case of collaboration, if having platform-specific dependencies or dependencies | ||
# having no cross-platform support, pipenv may install dependencies that don't work, or not | ||
# install all needed dependencies. | ||
#Pipfile.lock | ||
|
||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow | ||
__pypackages__/ | ||
|
||
# Celery stuff | ||
celerybeat-schedule | ||
celerybeat.pid | ||
|
||
# SageMath parsed files | ||
*.sage.py | ||
|
||
# Environments | ||
.env | ||
.venv | ||
env/ | ||
venv/ | ||
ENV/ | ||
env.bak/ | ||
venv.bak/ | ||
|
||
# Spyder project settings | ||
.spyderproject | ||
.spyproject | ||
|
||
# Rope project settings | ||
.ropeproject | ||
|
||
# mkdocs documentation | ||
/site | ||
|
||
# mypy | ||
.mypy_cache/ | ||
.dmypy.json | ||
dmypy.json | ||
|
||
# Pyre type checker | ||
.pyre/ | ||
|
||
# pytype static type analyzer | ||
.pytype/ | ||
|
||
# Cython debug symbols | ||
cython_debug/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
|
||
# DLRM Distributed Branch | ||
|
||
This branch extends the PyTorch implementation to run DLRM across multiple nodes on distributed platforms. | ||
The distributed version is needed when the data and model become large. | ||
|
||
It inherits all the parameters from the master DLRM implementation. | ||
The distributed version adds one more parameter: | ||
|
||
**--dist-backend**: | ||
The backend used by the distributed version. As in the torch.distributed package, | ||
it can be "nccl", "mpi", or "gloo". | ||
|
||
In addition, it introduces the following new parameter: | ||
**--arch-project-size**: | ||
Reduces the number of interaction features produced by the dot operation. | ||
A projection is applied to the dot-product features to reduce their dimension. | ||
This is mainly motivated by memory: it reduces the memory needed for the top MLP. | ||
A side effect is that it may also imrpove the model accuracy. | ||
|
||
## Usage | ||
|
||
Currently, it is launched with mpirun on multiple nodes. A hostfile needs to be created, or | ||
a host list must be given. The DLRM parameters are given in the same way as on the single-node | ||
master branch. | ||
```bash | ||
mpirun -np 128 -hostfile hostfile python dlrm_s_pytorch.py ... | ||
``` | ||
|
||
## Example | ||
```bash | ||
python dlrm_s_pytorch.py \ | ||
--arch-sparse-feature-size=128 \ | ||
--arch-mlp-bot="2000-1024-1024-128" \ | ||
--arch-mlp-top="4096-4096-4096-1" \ | ||
--arch-embedding-size=$large_arch_emb \ | ||
--data-generation=random \ | ||
--loss-function=bce \ | ||
--round-targets=True \ | ||
--learning-rate=0.1 \ | ||
--mini-batch-size=2048 \ | ||
--print-freq=10240 \ | ||
--print-time \ | ||
--test-mini-batch-size=16384 \ | ||
--test-num-workers=16 \ | ||
--num-indices-per-lookup-fixed=1 \ | ||
--num-indices-per-lookup=100 \ | ||
--arch-projection-size 30 \ | ||
--use-gpu | ||
``` | ||
|
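The README in this diff names a `--dist-backend` flag and an `mpirun` launch, but the visible changes do not show how each process learns its rank. The following is a minimal sketch of one way this could work, not the code from this pull request. It assumes Open MPI is the launcher (which exports `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE` to every process); the helper names and the `"mpi"` default are hypothetical.

```python
import argparse
import os

# Backend names listed in the README; they mirror torch.distributed.
SUPPORTED_BACKENDS = ("nccl", "mpi", "gloo")


def parse_backend(argv):
    """Read --dist-backend from argv, ignoring all other DLRM flags.

    The default of "mpi" is an assumption made for this sketch.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--dist-backend", choices=SUPPORTED_BACKENDS, default="mpi")
    args, _unknown = parser.parse_known_args(argv)
    return args.dist_backend


def mpi_rank_and_size(env=None):
    """Return (rank, world_size) from Open MPI's environment variables.

    Falls back to a single-process setup when the variables are absent,
    for example when the script is started without mpirun.
    """
    env = os.environ if env is None else env
    rank = int(env.get("OMPI_COMM_WORLD_RANK", "0"))
    world_size = int(env.get("OMPI_COMM_WORLD_SIZE", "1"))
    return rank, world_size


if __name__ == "__main__":
    backend = parse_backend(["--dist-backend=gloo", "--use-gpu"])
    rank, world_size = mpi_rank_and_size()
    # These three values are what torch.distributed.init_process_group
    # accepts as its backend, rank, and world_size arguments.
    print(f"backend={backend} rank={rank} world_size={world_size}")
```

Under `mpirun -np 128`, each of the 128 processes would see a different rank in the range 0 to 127 and the same world size.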
Typo: imrpove->improve.
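For context on the README passage this comment refers to, the memory concern behind the projection can be illustrated with simple arithmetic. The sketch below assumes the standard DLRM dot interaction, where every distinct pair of feature vectors contributes one dot product and the result is concatenated with the dense vector before the top MLP. It does not model how this pull request performs the projection; the table counts are hypothetical, and the function names are introduced here for illustration only.

```python
def pairwise_interactions(num_vectors: int) -> int:
    """Distinct dot products among num_vectors feature vectors (no self-pairs)."""
    return num_vectors * (num_vectors - 1) // 2


def first_layer_weights(dense_dim: int, num_interactions: int, layer_width: int) -> int:
    """Weight count of the first top-MLP layer (bias terms not counted).

    The layer input is the dense vector concatenated with the interaction features.
    """
    return (dense_dim + num_interactions) * layer_width


if __name__ == "__main__":
    dense_dim = 128     # matches --arch-sparse-feature-size=128 in the example
    layer_width = 4096  # first entry of --arch-mlp-top="4096-4096-4096-1"
    for num_tables in (26, 200, 1000):  # hypothetical embedding table counts
        n = num_tables + 1  # embedding vectors plus the bottom-MLP output
        k = pairwise_interactions(n)
        w = first_layer_weights(dense_dim, k, layer_width)
        print(f"tables={num_tables:5d} interactions={k:7d} first-layer weights={w:,}")
```

Because the interaction count grows quadratically with the number of embedding tables, the first top-MLP layer comes to dominate memory for large table counts, which is the cost a smaller projected feature count would cut.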