Add tensorboard to display training and evaluation metrics #3163

Open: wants to merge 2 commits into main
98 changes: 98 additions & 0 deletions torchrec/distributed/benchmark/benchmark_zch/Readme.md
@@ -0,0 +1,98 @@
# Zero Collision Hashing (ZCH) Benchmarking Testbed

This testbed benchmarks ZCH algorithms in terms of efficiency, accuracy, and collision management. Specifically, the testbed collects the following metrics:
- QPS: queries per second, the number of input feature values the model can process per second.
- Collision rate: the percentage of collisions in the hash table. A high collision rate means that many potentially unrelated feature values are mapped to the same hash value, which can lead to information loss and decreased accuracy.
- NE: normalized entropy, a measure of the confidence of models on the prediction results of classification tasks (one common definition is sketched below).
- AUC: area under the curve, a metric used to evaluate the performance of classification models.
- MAE: mean absolute error, a measure of the average magnitude of errors in regression tasks.
- MSE: mean squared error, a measure of the average squared error in regression tasks.
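
For reference, a commonly used definition of normalized entropy (this exact formula is an assumption, not necessarily the one implemented by this testbed) divides the model's average log loss by the log loss of a baseline that always predicts the background positive rate $p$, so lower is better:

$$
\mathrm{NE} = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log \hat{y}_i + (1-y_i)\log\left(1-\hat{y}_i\right)\right]}{-\left[p\log p + (1-p)\log(1-p)\right]}
$$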

## Prerequisites
Before running the benchmark, it is important to ensure that the environment is properly set up. The following steps should be taken:
1. Prepare a Python environment (Python 3.9+).
2. Install the necessary dependencies:
```bash
# Install torch and fbgemm_gpu following instructions in https://docs.pytorch.org/FBGEMM/fbgemm_gpu/development/InstallationInstructions.html
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
pip install --pre fbgemm-gpu --index-url https://download.pytorch.org/whl/nightly/cu126/
# Install torchrec
pip install torchrec --index-url https://download.pytorch.org/whl/nightly/cu126
# Install generative recommenders
git clone https://github.com/meta-recsys/generative-recommenders.git
cd generative-recommenders
pip install -e .
```
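
As an optional sanity check (not part of the original setup steps), verify on a machine with a matching CUDA runtime that the core packages import cleanly:
```bash
# Quick import check for the freshly installed packages
python -c "import torch, fbgemm_gpu, torchrec; print(torch.__version__)"
```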

## Running the benchmark
To run the benchmark, use the following command:
```bash
WORLD_SIZE=1 python benchmark_zch.py -- --profiling_result_folder result_tbsize_10000_nonzch_dlrmv3_kuairand1k --dataset_name kuairand_1k --batch_size 16 --learning_rate 0.001 --dense_optim adam --sparse_optim adagrad --epochs 5 --num_embeddings 10000
```
More options can be found in the [arguments.py](arguments.py) file.

## Repository Structure
- [benchmark_zch.py](benchmark_zch.py): the main script for running the benchmark.
- [arguments.py](arguments.py): contains the arguments for the benchmark.
- [benchmark_zch_utils.py](benchmark_zch_utils.py): utility functions for the benchmark.
- [count_dataset_distributions.py](count_dataset_distributions.py): script for counting the distribution of features in the dataset.
- [data](data): directory containing the dataset used in the benchmark.
- [models](models): directory containing the models used in the benchmark.
- [plots](plots): directory containing the plotting notebooks for the benchmark.
- [figures](figures): directory containing the figures generated by the plotting notebooks.

## To add a new model
To add a new model to the benchmark, follow these steps:
1. Create a new configuration yaml file named `<new_model_name>.yaml` in the [models/configs](models/configs) directory (a hypothetical example is sketched below).
   - Besides basic configurations such as the embedding dimension, number of embeddings, etc., the yaml file must also contain the following two fields:
     - embedding_module_attribute_path: the path to the embedding module in the model, either the EmbeddingCollection or the EmbeddingBagCollection.
     - managed_collision_module_attribute_path: the path to the managed collision module in the model, if applicable. It should be in the following format: "module.<embedding_module_attribute_path>.mc_embedding_collection._managed_collision_collection._managed_collision_modules".
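   A minimal example, assuming a hypothetical model whose embedding collection lives at `sparse_arch.embedding_collection` (all values are illustrative only):
   ```yaml
   # models/configs/<new_model_name>.yaml (hypothetical values)
   embedding_dim: 64
   num_embeddings: 10000
   embedding_module_attribute_path: "sparse_arch.embedding_collection"
   managed_collision_module_attribute_path: "module.sparse_arch.embedding_collection.mc_embedding_collection._managed_collision_collection._managed_collision_modules"
   ```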
2. Create a new model class in the [models/models](models/models) directory, named `<new_model_name>.py`.
   - The model class should act as a wrapper for the new model (a skeleton is sketched below), and it should
     - contain the following attributes
       - eval_flag (bool): whether the model is in evaluation or training mode.
       - table_configs (List[Dict[str, EmbeddingConfig]]): a list of dictionaries containing the configuration of each embedding table.
     - override the following methods
       - forward(self, batch: Dict[str, Any]) -> torch.Tensor: the forward method of the model. It should accept input built from the Batch dataclass and produce output in the format `summed_loss, (prediction_logits, prediction_labels, prediction_weights)`.
       - eval(self) -> None: set the model to evaluation mode.
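   A minimal skeleton of such a wrapper; the class name, the inner model, and its call signature are hypothetical, while the attributes and methods are the ones listed above:
   ```python
   from typing import Any, Dict, List

   import torch
   from torchrec.modules.embedding_configs import EmbeddingConfig


   class NewModelBenchmarkWrapper(torch.nn.Module):  # hypothetical name
       """Wraps the new model so the benchmark loop can drive it."""

       def __init__(
           self,
           inner_model: torch.nn.Module,
           table_configs: List[Dict[str, EmbeddingConfig]],
       ) -> None:
           super().__init__()
           self.model = inner_model
           self.eval_flag: bool = False  # whether the model is in evaluation mode
           self.table_configs: List[Dict[str, EmbeddingConfig]] = table_configs

       def forward(self, batch: Dict[str, Any]):
           # Run the wrapped model on a dict built from the Batch dataclass; the inner
           # model's return signature used here is an assumption.
           logits, labels, weights, summed_loss = self.model(batch)
           return summed_loss, (logits, labels, weights)

       def eval(self) -> None:
           self.eval_flag = True
           self.model.eval()
   ```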
   - Implement the `make_model_<new_model_name>` function in the [models/make_model.py](models/make_model.py) file (sketched below). The function should take three parameters:
     - args: the arguments passed to the benchmark.
     - configs: the configuration of the model and dataset.
     - device: the device to run the model on.

     The function should return an instance of the new model class. It should also contain the code that replaces the model's embedding module with the ZCH embedding module using a `mc_adapter` object.
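   A sketch of what `make_model_<new_model_name>` might look like; the wrapper class and config keys come from the hypothetical example above, and the `mc_adapter` step is only indicated in a comment because its exact API is not reproduced here:
   ```python
   from argparse import Namespace
   from typing import Any, Dict

   import torch


   def make_model_new_model(
       args: Namespace, configs: Dict[str, Any], device: torch.device
   ) -> torch.nn.Module:
       # Build the wrapped model from the yaml configuration (stand-in inner model).
       model = NewModelBenchmarkWrapper(
           inner_model=torch.nn.Linear(configs["embedding_dim"], 1),
           table_configs=configs.get("table_configs", []),
       )
       if args.zch_method:
           # Here the benchmark would replace the plain embedding module with the
           # ZCH embedding module via the `mc_adapter` object.
           pass
       return model.to(device)
   ```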

3. Add the new model to the [models/__init__.py](models/__init__.py) file with `from .<new_model_name> import make_model_<new_model_name>` (the `.py` suffix is not part of the import path).
4. Register the new model in the [models/make_model.py](models/make_model.py) file (a sketch of the dispatch branch follows this list):
   - Add `make_model_<new_model_name>` to the `from .models import` line.
   - Add a conditional branch `elif model_name == "<new_model_name>"` to the `make_model` function, in which you
     - read the model configuration file from `os.path.join(os.path.dirname(__file__), "configs", "<new_model_name>.yaml")`,
     - read the dataset configuration from `os.path.join(os.path.dirname(__file__), "..", "data", "configs", f"{args.dataset_name}.yaml")`, and
     - call the `make_model_<new_model_name>` function with the configuration and dataset configuration.
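
A sketch of the dispatch branch inside `make_model`, reusing `make_model_new_model` from the sketch above; the surrounding function signature, the use of `yaml.safe_load`, and the way the two configurations are merged are all assumptions:

```python
import os
from argparse import Namespace
from typing import Any, Dict

import torch
import yaml


def make_model(model_name: str, args: Namespace, device: torch.device) -> torch.nn.Module:
    if model_name in ("dlrmv2", "dlrmv3"):
        raise NotImplementedError("existing branches elided in this sketch")
    elif model_name == "new_model":  # hypothetical <new_model_name>
        # Read the model configuration file.
        with open(os.path.join(os.path.dirname(__file__), "configs", "new_model.yaml")) as f:
            configs: Dict[str, Any] = yaml.safe_load(f)
        # Read the dataset configuration and merge it in (merge strategy assumed).
        dataset_config_path = os.path.join(
            os.path.dirname(__file__), "..", "data", "configs", f"{args.dataset_name}.yaml"
        )
        with open(dataset_config_path) as f:
            configs.update(yaml.safe_load(f))
        return make_model_new_model(args, configs, device)
    else:
        raise ValueError(f"Unknown model name: {model_name}")
```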

## To add a new dataset
To add a new dataset to the benchmark, follow these steps:
1. Create a new configuration yaml file named `<new_dataset_name>.yaml` in the [data/configs](data/configs) directory (a hypothetical example is sketched below).
   - The yaml file must contain the following fields:
     - dataset_path: the path to the dataset.
     - batch_size: the batch size of the dataset.
     - num_workers: the number of workers used to load the dataset.
   - Besides the three required fields, the yaml file should also contain any fields necessary for loading and ingesting the dataset.
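   A minimal example with illustrative values; only the three required fields are taken from the instructions above:
   ```yaml
   # data/configs/<new_dataset_name>.yaml (hypothetical values)
   dataset_path: /data/new_dataset
   batch_size: 16
   num_workers: 4
   ```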
2. Create a new dataset preprocessing script in the [data/preprocess](data/preprocess) directory, named `<new_dataset_name>.py`.
   - The script should contain a definition of the corresponding Batch dataclass, which holds the necessary attributes and overrides the following methods (a skeleton is sketched below):
     - to(self, device: torch.device, non_blocking: bool = False) -> Batch: move the data to the specified device.
     - pin_memory(self) -> Batch: pin the data in memory.
     - record_stream(self, stream: torch.cuda.streams.Stream) -> None: record the data on the given CUDA stream.
     - get_dict(self) -> Dict[str, Any]: return the data as a dictionary of `{<attribute_name>: <attribute_value>}`.
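   A minimal skeleton with hypothetical fields (`features` and `labels`); the real Batch dataclass will carry whatever attributes the dataset needs:
   ```python
   from dataclasses import dataclass
   from typing import Any, Dict

   import torch
   from torchrec.sparse.jagged_tensor import KeyedJaggedTensor


   @dataclass
   class Batch:
       features: KeyedJaggedTensor  # hypothetical sparse features
       labels: torch.Tensor  # hypothetical labels

       def to(self, device: torch.device, non_blocking: bool = False) -> "Batch":
           return Batch(
               features=self.features.to(device=device, non_blocking=non_blocking),
               labels=self.labels.to(device=device, non_blocking=non_blocking),
           )

       def pin_memory(self) -> "Batch":
           return Batch(features=self.features.pin_memory(), labels=self.labels.pin_memory())

       def record_stream(self, stream: torch.cuda.streams.Stream) -> None:
           self.features.record_stream(stream)
           self.labels.record_stream(stream)

       def get_dict(self) -> Dict[str, Any]:
           return {"features": self.features, "labels": self.labels}
   ```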
   - The script should also include a dataset class. The dataset class should act as a wrapper for the new dataset (a skeleton is sketched below), and it should at least override the following methods:
     - __init__(self, config: Dict[str, Any], device: torch.device) -> None: the constructor of the dataset class. It should take a configuration dictionary and a device as input and initialize the dataset. During initialization it must populate an `items_in_memory` attribute as a list of Batch dataclasses.
     - __len__(self) -> int: the length of the dataset.
     - __getitem__(self, idx: int) -> Dict[str, Any]: get an item from the dataset. It should take an index as input and return the data in the format of the Batch dataclass.
     - load_item(self, idx: int) -> Dict[str, Any]: load an item from the dataset. It should take an index as input and return the data in the format of the Batch dataclass.
     - get_sample(self, idx: int) -> Dict[str, Any]: get a sample from the dataset. It should take an index as input and return the data from the `items_in_memory` list.
     - __getitems__(self, idxs: List[int]) -> List[Dict[str, Any]]: get a list of items from the dataset. It should take a list of indices as input and return the data in the format of a list of Batch dataclasses.
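   A minimal skeleton, reusing the hypothetical Batch dataclass above and assuming an in-memory loading strategy; how raw files are parsed into Batch objects is dataset specific and elided here:
   ```python
   from typing import Any, Dict, List

   import torch
   from torch.utils.data import Dataset


   class NewDataset(Dataset):  # hypothetical name
       def __init__(self, config: Dict[str, Any], device: torch.device) -> None:
           self.config = config
           self.device = device
           # Parse the files under config["dataset_path"] into Batch objects (elided).
           self.items_in_memory: List[Batch] = []

       def __len__(self) -> int:
           return len(self.items_in_memory)

       def load_item(self, idx: int) -> Batch:
           # Everything is already materialized in memory in this sketch.
           return self.items_in_memory[idx]

       def get_sample(self, idx: int) -> Batch:
           return self.items_in_memory[idx]

       def __getitem__(self, idx: int) -> Batch:
           return self.load_item(idx)

       def __getitems__(self, idxs: List[int]) -> List[Batch]:
           return [self.load_item(idx) for idx in idxs]
   ```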
   - The script should include a `collate_fn` that takes a list of Batch dataclasses and returns a single Batch dataclass.
   - Finally, the script should include a `get_<new_dataset_name>_dataloader` function (sketched after this list) that takes three parameters:
     - args: the arguments passed to the benchmark.
     - configs: the configuration of the model and dataset.
     - stage: the stage of the benchmark, either "train" or "val".

     The function should return a dataloader for the new dataset.
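
A sketch of the `collate_fn` and the dataloader factory, reusing the hypothetical Batch and NewDataset classes above; the concatenation logic and dataloader options are assumptions:

```python
from argparse import Namespace
from typing import Any, Dict, List

import torch
from torch.utils.data import DataLoader
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor


def collate_fn(batches: List[Batch]) -> Batch:
    # Concatenate per-sample Batch objects into one Batch.
    return Batch(
        features=KeyedJaggedTensor.concat([b.features for b in batches]),
        labels=torch.cat([b.labels for b in batches]),
    )


def get_new_dataset_dataloader(args: Namespace, configs: Dict[str, Any], stage: str = "train") -> DataLoader:
    # Batches are moved to the target device later via Batch.to(), so load on CPU here.
    dataset = NewDataset(configs, device=torch.device("cpu"))
    return DataLoader(
        dataset,
        batch_size=configs["batch_size"],
        num_workers=configs["num_workers"],
        collate_fn=collate_fn,
        pin_memory=True,
        shuffle=(stage == "train"),
    )
```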
165 changes: 165 additions & 0 deletions torchrec/distributed/benchmark/benchmark_zch/arguments.py
@@ -0,0 +1,165 @@
#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# pyre-strict
import argparse
from typing import List


def parse_args(argv: List[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="torchrec dlrm example trainer")

    # Dataset related arguments
    parser.add_argument(
        "--dataset_name",
        type=str,
        choices=["movielens_1m", "criteo_kaggle", "kuairand_1k"],
        default="movielens_1m",
        help="dataset for experiment, currently supports movielens_1m, criteo_kaggle, kuairand_1k",
    )

    # Model related arguments
    parser.add_argument(
        "--model_name",
        type=str,
        choices=["dlrmv2", "dlrmv3"],
        default="dlrmv3",
        help="model for experiment, currently supports dlrmv2 and dlrmv3; dlrmv3 is the default",
    )
    parser.add_argument(
        "--num_embeddings",  # ratio of feature ids to embedding table size # 3 axes: x-batch_idx; y-collisions; z-embedding table sizes
        type=int,
        default=None,
        help="max_ind_size. The number of embeddings in each embedding table. Defaults"
        " to 100_000 if num_embeddings_per_feature is not supplied.",
    )
    parser.add_argument(
        "--embedding_dim",
        type=int,
        default=64,
        help="Size of each embedding.",
    )
    parser.add_argument(
        "--seed",
        type=int,
        help="Random seed for reproducibility.",
        default=0,
    )

    # Training related arguments
    parser.add_argument(
        "--epochs",
        type=int,
        default=1,
        help="number of epochs to train",
    )
    parser.add_argument(
        "--batch_size",
        type=int,
        default=4096,
        help="batch size to use for training",
    )
    parser.add_argument(
        "--sparse_optim",
        type=str,
        default="adagrad",
        help="The optimizer to use for sparse parameters.",
    )
    parser.add_argument(
        "--dense_optim",
        type=str,
        default="adagrad",
        help="The optimizer to use for dense parameters.",
    )
    parser.add_argument(
        "--learning_rate",
        type=float,
        default=1.0,
        help="Learning rate.",
    )
    parser.add_argument(
        "--eps",
        type=float,
        default=1e-8,
        help="Epsilon for Adagrad optimizer.",
    )
    parser.add_argument(
        "--weight_decay",
        type=float,
        default=0,
        help="Weight decay for Adagrad optimizer.",
    )
    parser.add_argument(
        "--beta1",
        type=float,
        default=0.95,
        help="Beta1 for Adagrad optimizer.",
    )
    parser.add_argument(
        "--beta2",
        type=float,
        default=0.999,
        help="Beta2 for Adagrad optimizer.",
    )
    parser.add_argument(
        "--shuffle_batches",
        dest="shuffle_batches",
        action="store_true",
        help="Shuffle each batch during training.",
    )
    parser.add_argument(
        "--validation_freq_within_epoch",
        type=int,
        default=None,
        help="Frequency at which validation will be run within an epoch.",
    )
    parser.set_defaults(
        pin_memory=None,
        mmap_mode=None,
        drop_last=None,
        shuffle_batches=None,
        shuffle_training_set=None,
    )
    parser.add_argument(
        "--input_hash_size",
        type=int,
        default=100_000,
        help="Input feature value range",
    )
    parser.add_argument(
        "--profiling_result_folder",
        type=str,
        default="profiling_result",
        help="Folder to save profiling results",
    )
    parser.add_argument(
        "--zch_method",
        type=str,
        help="The method to use for zero collision hashing; leave blank for no ZCH",
        default="",
    )
    parser.add_argument(
        "--num_buckets",
        type=int,
        default=4,
        help="Number of buckets for the identity table. Only used for MPZCH. The number of ranks (WORLD_SIZE) must be a factor of num_buckets, and num_buckets must be a factor of input_hash_size",
    )
    parser.add_argument(
        "--max_probe",
        type=int,
        default=None,
        help="Number of probes for the identity table. Only used for MPZCH",
    )

    # testbed related arguments
    parser.add_argument(
        "--log_path",
        type=str,
        default="log",
        help="Path to save the log file, without the suffix",
    )
    return parser.parse_args(argv)