diff --git a/PREPROCESSING-TEMPLATE.md b/PREPROCESSING-TEMPLATE.md
new file mode 100644
index 0000000000..57a28bd80e
--- /dev/null
+++ b/PREPROCESSING-TEMPLATE.md
@@ -0,0 +1,127 @@
+# Dataset Preprocessing Documentation Template
+
+## Purpose
+This template provides a standardized way to document dataset preprocessing steps for MLCommons inference benchmarks, ensuring reproducibility and transparency.
+
+## Template Structure
+
+### Model: [MODEL_NAME]
+**Dataset:** [DATASET_NAME]
+**Evaluation Task:** [TASK_DESCRIPTION]
+
+#### Data Source
+- **Raw Dataset:** [SOURCE_AND_FORMAT]
+- **Download Method:** [HOW_TO_OBTAIN]
+- **License:** [LICENSE_INFO]
+
+#### Preprocessing Pipeline
+
+##### 1. Tokenization
+```python
+# Example based on llama2-70b/processorca.py pattern
+from transformers import [TOKENIZER_CLASS]
+tokenizer = [TOKENIZER_CLASS].from_pretrained(model_dir)
+tokens = tokenizer(text)["input_ids"]
+```
+
+##### 2. Filtering Steps
+- **Language Filter:** [DESCRIPTION]
+- **Length Filter:** [SEQUENCE_LENGTH_LIMITS]
+- **Quality Filter:** [QUALITY_CRITERIA]
+- **Content Filter:** [CONTENT_RESTRICTIONS]
+
+##### 3. Formatting
+- **Input Format:** [INPUT_TEMPLATE]
+- **Output Format:** [OUTPUT_TEMPLATE]
+- **Special Tokens:** [SPECIAL_TOKEN_HANDLING]
+
+##### 4. Sampling Strategy
+- **Total Samples:** [NUMBER]
+- **Sampling Method:** [RANDOM/STRATIFIED/OTHER]
+- **Validation Split:** [IF_APPLICABLE]
+
+#### Adaptation Guide
+**For Different Tokenizers:**
+- Modify tokenizer initialization
+- Adjust sequence length limits
+- Update special token handling
+
+**For Different Models:**
+- Update input/output templates
+- Adjust filtering criteria
+- Modify prompt formatting
+
+#### Files Generated
+- **Main Dataset:** [FILENAME_AND_FORMAT]
+- **Calibration Set:** [FILENAME_AND_FORMAT]
+- **Metadata:** [FILENAME_AND_FORMAT]
+
+#### Verification
+- **Expected Sample Count:** [NUMBER]
+- **Checksum/Hash:** [IF_AVAILABLE]
+- **Quality Metrics:** [ROUGE/BLEU/OTHER]
+
+---
+
+## Example Applications
+
+### Llama3.1-8b (CNN/DailyMail)
+**Dataset:** CNN/DailyMail 3.0.0
+**Evaluation Task:** Text Summarization
+
+#### Data Source
+- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0
+- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0")`
+- **License:** Apache 2.0
+
+#### Preprocessing Pipeline
+##### 1. Tokenization
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
+tokenizer.padding_side = "left"
+tokenizer.pad_token = tokenizer.eos_token
+tokenizer.model_max_length = 8000
+```
+
+##### 2. Formatting
+- **Input Template:**
+```
+Summarize the following news article in 128 tokens. Please output the summary only, without any other text.
+
+Article:
+{article}
+
+Summary:
+```
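+
+The snippet below is a minimal, illustrative sketch that ties steps 1 and 2 together for a single sample; it assumes the Hugging Face `datasets` library and the tokenizer configured above, and is not taken from an existing preprocessing script.
+
+```python
+from datasets import load_dataset
+
+# Illustrative only: build one preprocessed CNN/DailyMail sample
+dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation")
+template = ("Summarize the following news article in 128 tokens. "
+            "Please output the summary only, without any other text."
+            "\n\nArticle:\n{article}\n\nSummary:")
+
+sample = dataset[0]
+prompt = template.format(article=sample["article"])
+tokens = tokenizer.encode(prompt, truncation=True, max_length=8000)  # tokenizer from step 1
+```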
+
+#### Current Gaps
+- ❌ No documented filtering steps
+- ❌ No sampling strategy explanation
+- ❌ No quality control measures
+- ❌ No reproducible preprocessing script
+
+### DeepSeek-r1 (Multi-domain Evaluation)
+**Dataset:** Ensemble of AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench
+**Evaluation Task:** Multi-domain Reasoning
+
+#### Data Source
+- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2
+- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
+- **License:** Various (CC0, MIT, CC BY 4.0)
+
+#### Current Gaps
+- ❌ No documented preprocessing steps
+- ❌ No tokenization details
+- ❌ No filtering or sampling explanation
+- ❌ No adaptation guide for other models
+- ❌ Cannot reproduce from raw sources
+
+---
+
+## Implementation Recommendation
+
+1. **For each model directory**, add `PREPROCESSING.md` following this template
+2. **For models with preprocessing scripts**, document the steps in the README
+3. **For models using preprocessed data**, provide the original preprocessing methodology
+4. **Create common utilities** for preprocessing patterns that can be shared across models
\ No newline at end of file
diff --git a/language/PREPROCESSING_GUIDE.md b/language/PREPROCESSING_GUIDE.md
new file mode 100644
index 0000000000..0fef3d0758
--- /dev/null
+++ b/language/PREPROCESSING_GUIDE.md
@@ -0,0 +1,139 @@
+# MLCommons Inference - General Preprocessing Guide
+
+## Overview
+
+This guide covers common preprocessing patterns across all language models in MLCommons Inference benchmarks. Preprocessing varies by:
+1. Model architecture
+2. Backend choice (PyTorch, vLLM, SGLang)
+3. Task type (summarization, Q&A, etc.)
+
+## Common Tokenizer Setup Pattern
+
+Most models follow this pattern:
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+tokenizer.padding_side = "left"  # Critical for generation
+tokenizer.pad_token = tokenizer.eos_token
+```
+
+## Backend Dependencies
+
+Different backends have different preprocessing requirements:
+
+| Backend | Input Type | Chat Template Support | Use Case |
+|---------|------------|-----------------------|----------|
+| PyTorch | Tokenized | Varies by model | Distributed inference |
+| vLLM | Text | Varies by model | High-throughput serving |
+| SGLang | Text | Usually disabled | Optimized serving |
+
+## Dataset Format
+
+All models expect datasets with these common fields:
+
+```python
+{
+    'text_input': str,       # Raw prompt text (required)
+    'tok_input': List[int],  # Pre-tokenized input (optional)
+    'output': str,           # Expected output for evaluation
+}
+```
+
+## Model-Specific Preprocessing
+
+### Models Using Chat Templates
+- **DeepSeek-R1**: Uses `apply_chat_template` with PyTorch/vLLM
+- **Potential others**: Check `uses_chat_template` in the backend registry
+
+### Models Using Simple Templates
+- **Llama 3.1-8B**: Instruction format for summarization
+- **Llama 2-70B**: Custom format with `[INST]` markers
+- **Mixtral-8x7B**: Simple instruction format
+
+### Models Using Raw Prompts
+- **GPT-J**: Completion-style, no special formatting
+
+## Preprocessing Steps
+
+1. **Load the tokenizer** with appropriate configuration
+2. **Apply model-specific formatting** (chat template or instruction format)
+3. **Tokenize** with proper truncation and max length
+4. **Handle padding** (left-side for generation models)
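+
+Step 4 is not covered by the generic function below, so here is a minimal, illustrative sketch of left-padded batch tokenization (`model_name` follows the setup pattern above; this is not taken from a benchmark script):
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)  # as in the setup pattern above
+tokenizer.padding_side = "left"
+tokenizer.pad_token = tokenizer.eos_token
+
+batch = tokenizer(
+    ["First prompt", "A somewhat longer second prompt"],
+    padding=True,        # pad to the longest prompt in the batch
+    truncation=True,
+    max_length=2048,     # replace with the model's context limit
+    return_tensors="pt",
+)
+# batch["input_ids"] and batch["attention_mask"] are left-padded
+```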
+
+## Example: Generic Preprocessing Function
+
+```python
+from transformers import AutoTokenizer
+
+
+def preprocess_for_model(text, model_name, backend="pytorch"):
+    """Generic preprocessing based on model and backend"""
+
+    # Load tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    tokenizer.padding_side = "left"
+    tokenizer.pad_token = tokenizer.eos_token
+
+    # Check if chat template should be used.
+    # should_use_chat_template, get_max_length and apply_model_template are
+    # placeholders to implement per model/backend (see utils/backend_registry.py).
+    if should_use_chat_template(model_name, backend):
+        tokens = tokenizer.apply_chat_template(
+            [{"role": "user", "content": text}],
+            add_generation_prompt=True,
+            truncation=True,
+            max_length=get_max_length(model_name)
+        )
+    else:
+        # Apply model-specific template or use raw text
+        formatted_text = apply_model_template(text, model_name)
+        tokens = tokenizer.encode(
+            formatted_text,
+            truncation=True,
+            max_length=get_max_length(model_name)
+        )
+
+    return tokens
+```
+
+## Max Context Lengths
+
+| Model | Max Length | Notes |
+|-------|------------|-------|
+| DeepSeek-R1 | 32,768 | 32K context |
+| Llama 3.1-8B | 8,000 | For preprocessing |
+| Llama 2-70B | 1,024 | Limited context |
+| Mixtral-8x7B | 1,024 | From dataset.py |
+| GPT-J | ~2,048 | Standard GPT-J limit |
+
+## Running Inference
+
+```bash
+# Set backend
+export MLPERF_BACKEND=pytorch  # or vllm, sglang
+
+# PyTorch backend (distributed)
+torchrun --nproc_per_node=8 run_eval_mpi.py --input-file data.pkl
+
+# vLLM/SGLang backends
+python run_eval.py --input-file data.pkl
+```
+
+## Common Issues
+
+1. **Wrong padding side**: Always use `padding_side="left"` for generation
+2. **Missing pad token**: Set `pad_token = eos_token`
+3. **Backend mismatch**: Ensure preprocessing matches backend requirements
+4. **Context overflow**: Respect the model's maximum context length
+
+## Validation
+
+To ensure correct preprocessing:
+
+1. Check tokenized length doesn't exceed the maximum
+2. Verify special tokens are properly placed
+3. Test with a few examples before the full dataset
+4. Compare against reference outputs
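+
+A quick, illustrative way to run checks 1-3 on a preprocessed pickle file (column names follow the Dataset Format section above; the model name and length limit are placeholders, and this is a sketch rather than an official validation script):
+
+```python
+import pandas as pd
+from transformers import AutoTokenizer
+
+MAX_LENGTH = 8000  # placeholder: use the benchmark's max context length
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
+
+df = pd.read_pickle("data.pkl")
+for i, row in df.head(5).iterrows():
+    tokens = row["tok_input"]
+    assert len(tokens) <= MAX_LENGTH, f"sample {i} exceeds max length"
+    # Decode a few samples to eyeball formatting and special token placement
+    print(f"--- sample {i} ({len(tokens)} tokens) ---")
+    print(tokenizer.decode(tokens)[:300])
+```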
+
+## References
+
+- Model-specific guides in each model's directory
+- Backend configuration in `utils/backend_registry.py`
+- Tokenization utilities in `utils/tokenization.py`
\ No newline at end of file
diff --git a/language/deepseek-r1/PREPROCESSING.md b/language/deepseek-r1/PREPROCESSING.md
new file mode 100644
index 0000000000..2b5aa2cee9
--- /dev/null
+++ b/language/deepseek-r1/PREPROCESSING.md
@@ -0,0 +1,56 @@
+# DeepSeek-R1 Preprocessing
+
+## Model Configuration
+- **Model**: `deepseek-ai/DeepSeek-R1`
+- **Revision**: `56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad`
+- **Max Length**: 32,768 tokens (32K)
+
+## Tokenization
+```python
+from transformers import AutoTokenizer
+
+# From utils/tokenization.py
+tokenizer = AutoTokenizer.from_pretrained(
+    "deepseek-ai/DeepSeek-R1",
+    revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"
+)
+```
+
+## Preprocessing Method
+
+The preprocessing varies by backend:
+
+### PyTorch/vLLM Backends (Chat Template Enabled)
+```python
+# From utils/tokenization.py
+tokens = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt}],
+    add_generation_prompt=True,
+    max_length=32768,
+    truncation=True
+)
+```
+
+### SGLang Backend (No Chat Template)
+```python
+tokens = tokenizer.encode(
+    prompt,
+    truncation=True,
+    max_length=32768
+)
+```
+
+## Backend Configuration
+| Backend | uses_chat_template | input_type |
+|---------|--------------------|------------|
+| PyTorch | True | tokenized |
+| vLLM | True | text |
+| SGLang | False | text |
+
+## Dataset Format
+Input data should have a `text_input` column containing the prompts.
+
+## Accuracy Target
+```
+"mean-accuracy": 81.3582
+```
\ No newline at end of file
diff --git a/language/llama3.1-8b/PREPROCESSING.md b/language/llama3.1-8b/PREPROCESSING.md
new file mode 100644
index 0000000000..2ff10d7c6e
--- /dev/null
+++ b/language/llama3.1-8b/PREPROCESSING.md
@@ -0,0 +1,47 @@
+# Llama 3.1 8B Preprocessing
+
+## Model Configuration
+- **Model**: `meta-llama/Llama-3.1-8B-Instruct`
+- **Revision**: `be673f326cab4cd22ccfef76109faf68e41aa5f1` (for download)
+- **Max Length**: 8,000 tokens (in preprocessing scripts)
+
+## Tokenization
+```python
+from transformers import AutoTokenizer
+
+# From prepare-calibration.py
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
+tokenizer.padding_side = "left"
+tokenizer.pad_token = tokenizer.eos_token
+tokenizer.model_max_length = 8000
+```
+
+## Prompt Template (CNN/DailyMail Summarization)
+```python
+# From prepare-calibration.py and download_cnndm.py
+instruction_template = "Summarize the following news article in 128 tokens. Please output the summary only, without any other text.\n\nArticle:\n{input}\n\nSummary:"
+
+# Tokenize
+x["tok_input"] = tokenizer.encode(instruction_template.format_map(x))
+```
+
+**Note**: This uses a simple instruction format, NOT the chat template with special tokens.
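+
+For contrast, a chat-template version of the same prompt would look like the sketch below. The benchmark preprocessing does **not** do this; it is shown only to make the distinction concrete (`article_text` is a placeholder):
+
+```python
+article_text = "(any raw CNN/DailyMail article)"  # placeholder input
+
+# NOT what the benchmark does - shown only to contrast with the plain template above
+chat_tokens = tokenizer.apply_chat_template(
+    [{"role": "user", "content": instruction_template.format_map({"input": article_text})}],
+    add_generation_prompt=True,
+)
+plain_tokens = tokenizer.encode(instruction_template.format_map({"input": article_text}))
+# chat_tokens includes the model's chat special tokens; plain_tokens does not
+```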
+
+## Dataset Preparation
+```python
+# Example from prepare-calibration.py
+x = dict()
+x["instruction"] = instruction_template
+x["input"] = calibration_sample["article"]
+x["tok_input"] = tokenizer.encode(instruction_template.format_map(x))
+x["output"] = calibration_sample["highlights"]
+```
+
+## Accuracy Targets (BF16)
+```
+Datacenter:
+- rouge1: 38.7792
+- rouge2: 15.9075
+- rougeL: 24.4957
+- rougeLsum: 35.793
+```
\ No newline at end of file
diff --git a/language/preprocessing_examples.py b/language/preprocessing_examples.py
new file mode 100644
index 0000000000..bf3fa02973
--- /dev/null
+++ b/language/preprocessing_examples.py
@@ -0,0 +1,168 @@
+#!/usr/bin/env python3
+"""
+MLCommons Inference - Preprocessing Examples
+
+This script demonstrates correct preprocessing for different models.
+Based on actual implementations in the codebase.
+"""
+
+from transformers import AutoTokenizer
+import pandas as pd
+
+
+def preprocess_deepseek_r1(prompts, use_chat_template=True):
+    """
+    Preprocess prompts for DeepSeek-R1 model.
+
+    Args:
+        prompts: List of text prompts
+        use_chat_template: Whether to use chat template (depends on backend)
+
+    Returns:
+        List of tokenized prompts
+    """
+    tokenizer = AutoTokenizer.from_pretrained(
+        "deepseek-ai/DeepSeek-R1",
+        revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"
+    )
+
+    tokenized = []
+    for prompt in prompts:
+        if use_chat_template and hasattr(tokenizer, 'apply_chat_template'):
+            tokens = tokenizer.apply_chat_template(
+                [{"role": "user", "content": prompt}],
+                add_generation_prompt=True,
+                max_length=32768,
+                truncation=True
+            )
+        else:
+            tokens = tokenizer.encode(
+                prompt,
+                truncation=True,
+                max_length=32768
+            )
+        tokenized.append(tokens)
+
+    return tokenized
+
+
+def preprocess_llama31_8b(articles):
+    """
+    Preprocess articles for Llama 3.1-8B summarization.
+
+    Args:
+        articles: List of articles to summarize
+
+    Returns:
+        List of tokenized prompts
+    """
+    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
+    tokenizer.padding_side = "left"
+    tokenizer.pad_token = tokenizer.eos_token
+    tokenizer.model_max_length = 8000
+
+    # Template from prepare-calibration.py
+    instruction_template = "Summarize the following news article in 128 tokens. Please output the summary only, without any other text.\n\nArticle:\n{input}\n\nSummary:"
+
+    tokenized = []
+    for article in articles:
+        prompt = instruction_template.format(input=article)
+        tokens = tokenizer.encode(prompt, max_length=8000, truncation=True)
+        tokenized.append(tokens)
+
+    return tokenized
+
+
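+# Added example (sketch): raw-prompt preprocessing for GPT-J, per PREPROCESSING_GUIDE.md.
+def preprocess_gptj(prompts):
+    """
+    Illustrative sketch for GPT-J completion-style preprocessing.
+
+    GPT-J uses raw prompts with no special formatting (see
+    PREPROCESSING_GUIDE.md). The model name and 2048-token limit are
+    assumptions based on the base GPT-J checkpoint, not taken from the
+    benchmark's preprocessing scripts.
+
+    Args:
+        prompts: List of text prompts
+
+    Returns:
+        List of tokenized prompts
+    """
+    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
+    tokenizer.padding_side = "left"
+    tokenizer.pad_token = tokenizer.eos_token
+
+    tokenized = []
+    for prompt in prompts:
+        tokens = tokenizer.encode(prompt, truncation=True, max_length=2048)
+        tokenized.append(tokens)
+
+    return tokenized
+
+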
+def preprocess_llama2_70b(prompts, system_prompts=None):
+    """
+    Preprocess prompts for Llama 2-70B model.
+
+    Args:
+        prompts: List of user prompts
+        system_prompts: Optional list of system prompts
+
+    Returns:
+        List of tokenized prompts
+    """
+    tokenizer = AutoTokenizer.from_pretrained(
+        "meta-llama/Llama-2-70b-chat-hf",
+        use_fast=False
+    )
+    tokenizer.padding_side = "left"
+    tokenizer.pad_token = tokenizer.eos_token
+
+    # Templates from processorca.py
+    llama_prompt_system = "[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]"
+    llama_prompt_no_system = "[INST] {} [/INST]"
+
+    tokenized = []
+    for i, prompt in enumerate(prompts):
+        if system_prompts and system_prompts[i]:
+            formatted = llama_prompt_system.format(system_prompts[i], prompt)
+        else:
+            formatted = llama_prompt_no_system.format(prompt)
+
+        tokens = tokenizer.encode(formatted, max_length=1024, truncation=True)
+        tokenized.append(tokens)
+
+    return tokenized
+
+
+def create_dataset_format(prompts, tokenized_prompts, outputs=None):
+    """
+    Create dataset in expected format for MLCommons.
+
+    Args:
+        prompts: List of text prompts
+        tokenized_prompts: List of tokenized prompts
+        outputs: Optional list of expected outputs
+
+    Returns:
+        DataFrame in expected format
+    """
+    data = {
+        'text_input': prompts,
+        'tok_input': tokenized_prompts,
+    }
+
+    if outputs:
+        data['output'] = outputs
+
+    return pd.DataFrame(data)
+
+
+# Example usage
+if __name__ == "__main__":
+    # Example 1: DeepSeek-R1
+    print("=== DeepSeek-R1 Example ===")
+    deepseek_prompts = [
+        "What is machine learning?",
+        "Explain quantum computing in simple terms."
+    ]
+
+    # With chat template (PyTorch/vLLM)
+    deepseek_tokens = preprocess_deepseek_r1(deepseek_prompts, use_chat_template=True)
+    print(f"Prompt 1 token count: {len(deepseek_tokens[0])}")
+
+    # Without chat template (SGLang)
+    deepseek_tokens_no_chat = preprocess_deepseek_r1(deepseek_prompts, use_chat_template=False)
+    print(f"Prompt 1 token count (no chat): {len(deepseek_tokens_no_chat[0])}")
+
+    # Example 2: Llama 3.1-8B
+    print("\n=== Llama 3.1-8B Example ===")
+    articles = [
+        "The United Nations announced today a new climate initiative aimed at reducing global emissions by 50% by 2030. The plan includes partnerships with major corporations and governments worldwide."
+    ]
+
+    llama_tokens = preprocess_llama31_8b(articles)
+    print(f"Article 1 token count: {len(llama_tokens[0])}")
+
+    # Example 3: Create dataset
+    print("\n=== Dataset Format Example ===")
+    df = create_dataset_format(deepseek_prompts, deepseek_tokens)
+    print(df.head())
+    print(f"\nDataset shape: {df.shape}")
+    print(f"Columns: {list(df.columns)}")
+
+    # Save example
+    # df.to_pickle("preprocessed_data.pkl")
\ No newline at end of file