diff --git a/PREPROCESSING-TEMPLATE.md b/PREPROCESSING-TEMPLATE.md
new file mode 100644
index 0000000000..57a28bd80e
--- /dev/null
+++ b/PREPROCESSING-TEMPLATE.md
@@ -0,0 +1,127 @@
+# Dataset Preprocessing Documentation Template
+
+## Purpose
+This template provides a standardized way to document dataset preprocessing steps for MLCommons inference benchmarks, ensuring reproducibility and transparency.
+
+## Template Structure
+
+### Model: [MODEL_NAME]
+**Dataset:** [DATASET_NAME]
+**Evaluation Task:** [TASK_DESCRIPTION]
+
+#### Data Source
+- **Raw Dataset:** [SOURCE_AND_FORMAT]
+- **Download Method:** [HOW_TO_OBTAIN]
+- **License:** [LICENSE_INFO]
+
+#### Preprocessing Pipeline
+
+##### 1. Tokenization
+```python
+# Example based on llama2-70b/processorca.py pattern
+from transformers import [TOKENIZER_CLASS]
+tokenizer = [TOKENIZER_CLASS].from_pretrained(model_dir)
+tokens = tokenizer(text)["input_ids"]
+```
+
+##### 2. Filtering Steps
+- **Language Filter:** [DESCRIPTION]
+- **Length Filter:** [SEQUENCE_LENGTH_LIMITS]
+- **Quality Filter:** [QUALITY_CRITERIA]
+- **Content Filter:** [CONTENT_RESTRICTIONS]
+
+##### 3. Formatting
+- **Input Format:** [INPUT_TEMPLATE]
+- **Output Format:** [OUTPUT_TEMPLATE]
+- **Special Tokens:** [SPECIAL_TOKEN_HANDLING]
+
+##### 4. Sampling Strategy
+- **Total Samples:** [NUMBER]
+- **Sampling Method:** [RANDOM/STRATIFIED/OTHER]
+- **Validation Split:** [IF_APPLICABLE]
+
+#### Adaptation Guide
+**For Different Tokenizers:**
+- Modify tokenizer initialization
+- Adjust sequence length limits
+- Update special token handling
+
+**For Different Models:**
+- Update input/output templates
+- Adjust filtering criteria
+- Modify prompt formatting
+
+#### Files Generated
+- **Main Dataset:** [FILENAME_AND_FORMAT]
+- **Calibration Set:** [FILENAME_AND_FORMAT]
+- **Metadata:** [FILENAME_AND_FORMAT]
+
+#### Verification
+- **Expected Sample Count:** [NUMBER]
+- **Checksum/Hash:** [IF_AVAILABLE]
+- **Quality Metrics:** [ROUGE/BLEU/OTHER]
+
+---
+
+## Example Applications
+
+### Llama3.1-8b (CNN/DailyMail)
+**Dataset:** CNN/DailyMail 3.0.0
+**Evaluation Task:** Text Summarization
+
+#### Data Source
+- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0
+- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0")`
+- **License:** Apache 2.0
+
+#### Preprocessing Pipeline
+##### 1. Tokenization
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
+tokenizer.padding_side = "left"
+tokenizer.pad_token = tokenizer.eos_token
+tokenizer.model_max_length = 8000
+```
+
+##### 2. Formatting
+- **Input Template:**
+```
+Summarize the following news article in 128 tokens. Please output the summary only, without any other text.
+
+Article:
+{article}
+
+Summary:
+```
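+
+The snippet below is a minimal, illustrative sketch that ties steps 1 and 2 together for a single sample; it assumes the Hugging Face `datasets` library and the tokenizer configured above, and is not taken from an existing preprocessing script.
+
+```python
+from datasets import load_dataset
+
+# Illustrative only: build one preprocessed CNN/DailyMail sample
+dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation")
+template = ("Summarize the following news article in 128 tokens. "
+            "Please output the summary only, without any other text."
+            "\n\nArticle:\n{article}\n\nSummary:")
+
+sample = dataset[0]
+prompt = template.format(article=sample["article"])
+tokens = tokenizer.encode(prompt, truncation=True, max_length=8000)  # tokenizer from step 1
+```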
+
+#### Current Gaps
+- ❌ No documented filtering steps
+- ❌ No sampling strategy explanation
+- ❌ No quality control measures
+- ❌ No reproducible preprocessing script
+
+### DeepSeek-r1 (Multi-domain Evaluation)
+**Dataset:** Ensemble of AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench
+**Evaluation Task:** Multi-domain Reasoning
+
+#### Data Source
+- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2
+- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
+- **License:** Various (CC0, MIT, CC BY 4.0)
+
+#### Current Gaps
+- ❌ No documented preprocessing steps
+- ❌ No tokenization details
+- ❌ No filtering or sampling explanation
+- ❌ No adaptation guide for other models
+- ❌ Cannot reproduce from raw sources
+
+---
+
+## Implementation Recommendation
+
+1. **For each model directory**, add `PREPROCESSING.md` following this template
+2. **For models with preprocessing scripts**, document the steps in the README
+3. **For models using preprocessed data**, provide the original preprocessing methodology
+4. **Create common utilities** for preprocessing patterns that can be shared across models
\ No newline at end of file
diff --git a/language/PREPROCESSING_GUIDE.md b/language/PREPROCESSING_GUIDE.md
new file mode 100644
index 0000000000..0fef3d0758
--- /dev/null
+++ b/language/PREPROCESSING_GUIDE.md
@@ -0,0 +1,139 @@
+# MLCommons Inference - General Preprocessing Guide
+
+## Overview
+
+This guide covers common preprocessing patterns across all language models in MLCommons Inference benchmarks. Preprocessing varies by:
+1. Model architecture
+2. Backend choice (PyTorch, vLLM, SGLang)
+3. Task type (summarization, Q&A, etc.)
+
+## Common Tokenizer Setup Pattern
+
+Most models follow this pattern:
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+tokenizer.padding_side = "left"  # Critical for generation
+tokenizer.pad_token = tokenizer.eos_token
+```
+
+## Backend Dependencies
+
+Different backends have different preprocessing requirements:
+
+| Backend | Input Type | Chat Template Support | Use Case |
+|---------|------------|-----------------------|----------|
+| PyTorch | Tokenized | Varies by model | Distributed inference |
+| vLLM | Text | Varies by model | High-throughput serving |
+| SGLang | Text | Usually disabled | Optimized serving |
+
+## Dataset Format
+
+All models expect datasets with these common fields:
+
+```python
+{
+    'text_input': str,       # Raw prompt text (required)
+    'tok_input': List[int],  # Pre-tokenized input (optional)
+    'output': str,           # Expected output for evaluation
+}
+```
+
+## Model-Specific Preprocessing
+
+### Models Using Chat Templates
+- **DeepSeek-R1**: Uses `apply_chat_template` with PyTorch/vLLM
+- **Potential others**: Check `uses_chat_template` in the backend registry
+
+### Models Using Simple Templates
+- **Llama 3.1-8B**: Instruction format for summarization
+- **Llama 2-70B**: Custom format with `[INST]` markers
+- **Mixtral-8x7B**: Simple instruction format
+
+### Models Using Raw Prompts
+- **GPT-J**: Completion-style, no special formatting
+
+## Preprocessing Steps
+
+1. **Load the tokenizer** with appropriate configuration
+2. **Apply model-specific formatting** (chat template or instruction format)
+3. **Tokenize** with proper truncation and max length
+4. **Handle padding** (left-side for generation models)
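+
+Step 4 is not covered by the generic function below, so here is a minimal, illustrative sketch of left-padded batch tokenization (`model_name` follows the setup pattern above; this is not taken from a benchmark script):
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)  # as in the setup pattern above
+tokenizer.padding_side = "left"
+tokenizer.pad_token = tokenizer.eos_token
+
+batch = tokenizer(
+    ["First prompt", "A somewhat longer second prompt"],
+    padding=True,        # pad to the longest prompt in the batch
+    truncation=True,
+    max_length=2048,     # replace with the model's context limit
+    return_tensors="pt",
+)
+# batch["input_ids"] and batch["attention_mask"] are left-padded
+```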
+
+## Example: Generic Preprocessing Function
+
+```python
+from transformers import AutoTokenizer
+
+
+def preprocess_for_model(text, model_name, backend="pytorch"):
+    """Generic preprocessing based on model and backend"""
+
+    # Load tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    tokenizer.padding_side = "left"
+    tokenizer.pad_token = tokenizer.eos_token
+
+    # Check if chat template should be used.
+    # should_use_chat_template, get_max_length and apply_model_template are
+    # placeholders to implement per model/backend (see utils/backend_registry.py).
+    if should_use_chat_template(model_name, backend):
+        tokens = tokenizer.apply_chat_template(
+            [{"role": "user", "content": text}],
+            add_generation_prompt=True,
+            truncation=True,
+            max_length=get_max_length(model_name)
+        )
+    else:
+        # Apply model-specific template or use raw text
+        formatted_text = apply_model_template(text, model_name)
+        tokens = tokenizer.encode(
+            formatted_text,
+            truncation=True,
+            max_length=get_max_length(model_name)
+        )
+
+    return tokens
+```
+
+## Max Context Lengths
+
+| Model | Max Length | Notes |
+|-------|------------|-------|
+| DeepSeek-R1 | 32,768 | 32K context |
+| Llama 3.1-8B | 8,000 | For preprocessing |
+| Llama 2-70B | 1,024 | Limited context |
+| Mixtral-8x7B | 1,024 | From dataset.py |
+| GPT-J | ~2,048 | Standard GPT-J limit |
+
+## Running Inference
+
+```bash
+# Set backend
+export MLPERF_BACKEND=pytorch  # or vllm, sglang
+
+# PyTorch backend (distributed)
+torchrun --nproc_per_node=8 run_eval_mpi.py --input-file data.pkl
+
+# vLLM/SGLang backends
+python run_eval.py --input-file data.pkl
+```
+
+## Common Issues
+
+1. **Wrong padding side**: Always use `padding_side="left"` for generation
+2. **Missing pad token**: Set `pad_token = eos_token`
+3. **Backend mismatch**: Ensure preprocessing matches backend requirements
+4. **Context overflow**: Respect the model's maximum context length
+
+## Validation
+
+To ensure correct preprocessing:
+
+1. Check tokenized length doesn't exceed the maximum
+2. Verify special tokens are properly placed
+3. Test with a few examples before the full dataset
+4. Compare against reference outputs
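+
+A quick, illustrative way to run checks 1-3 on a preprocessed pickle file (column names follow the Dataset Format section above; the model name and length limit are placeholders, and this is a sketch rather than an official validation script):
+
+```python
+import pandas as pd
+from transformers import AutoTokenizer
+
+MAX_LENGTH = 8000  # placeholder: use the benchmark's max context length
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
+
+df = pd.read_pickle("data.pkl")
+for i, row in df.head(5).iterrows():
+    tokens = row["tok_input"]
+    assert len(tokens) <= MAX_LENGTH, f"sample {i} exceeds max length"
+    # Decode a few samples to eyeball formatting and special token placement
+    print(f"--- sample {i} ({len(tokens)} tokens) ---")
+    print(tokenizer.decode(tokens)[:300])
+```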
+
+## References
+
+- Model-specific guides in each model's directory
+- Backend configuration in `utils/backend_registry.py`
+- Tokenization utilities in `utils/tokenization.py`
\ No newline at end of file
diff --git a/language/deepseek-r1/PREPROCESSING.md b/language/deepseek-r1/PREPROCESSING.md
new file mode 100644
index 0000000000..2b5aa2cee9
--- /dev/null
+++ b/language/deepseek-r1/PREPROCESSING.md
@@ -0,0 +1,56 @@
+# DeepSeek-R1 Preprocessing
+
+## Model Configuration
+- **Model**: `deepseek-ai/DeepSeek-R1`
+- **Revision**: `56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad`
+- **Max Length**: 32,768 tokens (32K)
+
+## Tokenization
+```python
+from transformers import AutoTokenizer
+
+# From utils/tokenization.py
+tokenizer = AutoTokenizer.from_pretrained(
+    "deepseek-ai/DeepSeek-R1",
+    revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"
+)
+```
+
+## Preprocessing Method
+
+The preprocessing varies by backend:
+
+### PyTorch/vLLM Backends (Chat Template Enabled)
+```python
+# From utils/tokenization.py
+tokens = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt}],
+    add_generation_prompt=True,
+    max_length=32768,
+    truncation=True
+)
+```
+
+### SGLang Backend (No Chat Template)
+```python
+tokens = tokenizer.encode(
+    prompt,
+    truncation=True,
+    max_length=32768
+)
+```
+
+## Backend Configuration
+| Backend | uses_chat_template | input_type |
+|---------|--------------------|------------|
+| PyTorch | True | tokenized |
+| vLLM | True | text |
+| SGLang | False | text |
+
+## Dataset Format
+Input data should have a `text_input` column containing the prompts.
+
+## Accuracy Target
+```
+"mean-accuracy": 81.3582
+```
\ No newline at end of file
diff --git a/language/llama3.1-8b/PREPROCESSING.md b/language/llama3.1-8b/PREPROCESSING.md
new file mode 100644
index 0000000000..2ff10d7c6e
--- /dev/null
+++ b/language/llama3.1-8b/PREPROCESSING.md
@@ -0,0 +1,47 @@
+# Llama 3.1 8B Preprocessing
+
+## Model Configuration
+- **Model**: `meta-llama/Llama-3.1-8B-Instruct`
+- **Revision**: `be673f326cab4cd22ccfef76109faf68e41aa5f1` (for download)
+- **Max Length**: 8,000 tokens (in preprocessing scripts)
+
+## Tokenization
+```python
+from transformers import AutoTokenizer
+
+# From prepare-calibration.py
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
+tokenizer.padding_side = "left"
+tokenizer.pad_token = tokenizer.eos_token
+tokenizer.model_max_length = 8000
+```
+
+## Prompt Template (CNN/DailyMail Summarization)
+```python
+# From prepare-calibration.py and download_cnndm.py
+instruction_template = "Summarize the following news article in 128 tokens. Please output the summary only, without any other text.\n\nArticle:\n{input}\n\nSummary:"
+
+# Tokenize
+x["tok_input"] = tokenizer.encode(instruction_template.format_map(x))
+```
+
+**Note**: This uses a simple instruction format, NOT the chat template with special tokens.
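+
+For contrast, a chat-template version of the same prompt would look like the sketch below. The benchmark preprocessing does **not** do this; it is shown only to make the distinction concrete (`article_text` is a placeholder):
+
+```python
+article_text = "(any raw CNN/DailyMail article)"  # placeholder input
+
+# NOT what the benchmark does - shown only to contrast with the plain template above
+chat_tokens = tokenizer.apply_chat_template(
+    [{"role": "user", "content": instruction_template.format_map({"input": article_text})}],
+    add_generation_prompt=True,
+)
+plain_tokens = tokenizer.encode(instruction_template.format_map({"input": article_text}))
+# chat_tokens includes the model's chat special tokens; plain_tokens does not
+```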
+
+## Dataset Preparation
+```python
+# Example from prepare-calibration.py
+x = dict()
+x["instruction"] = instruction_template
+x["input"] = calibration_sample["article"]
+x["tok_input"] = tokenizer.encode(instruction_template.format_map(x))
+x["output"] = calibration_sample["highlights"]
+```
+
+## Accuracy Targets (BF16)
+```
+Datacenter:
+- rouge1: 38.7792
+- rouge2: 15.9075
+- rougeL: 24.4957
+- rougeLsum: 35.793
+```
\ No newline at end of file
diff --git a/language/preprocessing_examples.py b/language/preprocessing_examples.py
new file mode 100644
index 0000000000..bf3fa02973
--- /dev/null
+++ b/language/preprocessing_examples.py
@@ -0,0 +1,168 @@
+#!/usr/bin/env python3
+"""
+MLCommons Inference - Preprocessing Examples
+
+This script demonstrates correct preprocessing for different models.
+Based on actual implementations in the codebase.
+"""
+
+from transformers import AutoTokenizer
+import pandas as pd
+
+
+def preprocess_deepseek_r1(prompts, use_chat_template=True):
+    """
+    Preprocess prompts for DeepSeek-R1 model.
+
+    Args:
+        prompts: List of text prompts
+        use_chat_template: Whether to use chat template (depends on backend)
+
+    Returns:
+        List of tokenized prompts
+    """
+    tokenizer = AutoTokenizer.from_pretrained(
+        "deepseek-ai/DeepSeek-R1",
+        revision="56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad"
+    )
+
+    tokenized = []
+    for prompt in prompts:
+        if use_chat_template and hasattr(tokenizer, 'apply_chat_template'):
+            tokens = tokenizer.apply_chat_template(
+                [{"role": "user", "content": prompt}],
+                add_generation_prompt=True,
+                max_length=32768,
+                truncation=True
+            )
+        else:
+            tokens = tokenizer.encode(
+                prompt,
+                truncation=True,
+                max_length=32768
+            )
+        tokenized.append(tokens)
+
+    return tokenized
+
+
+def preprocess_llama31_8b(articles):
+    """
+    Preprocess articles for Llama 3.1-8B summarization.
+
+    Args:
+        articles: List of articles to summarize
+
+    Returns:
+        List of tokenized prompts
+    """
+    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
+    tokenizer.padding_side = "left"
+    tokenizer.pad_token = tokenizer.eos_token
+    tokenizer.model_max_length = 8000
+
+    # Template from prepare-calibration.py
+    instruction_template = "Summarize the following news article in 128 tokens. Please output the summary only, without any other text.\n\nArticle:\n{input}\n\nSummary:"
+
+    tokenized = []
+    for article in articles:
+        prompt = instruction_template.format(input=article)
+        tokens = tokenizer.encode(prompt, max_length=8000, truncation=True)
+        tokenized.append(tokens)
+
+    return tokenized
+
+
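+# Added example (sketch): raw-prompt preprocessing for GPT-J, per PREPROCESSING_GUIDE.md.
+def preprocess_gptj(prompts):
+    """
+    Illustrative sketch for GPT-J completion-style preprocessing.
+
+    GPT-J uses raw prompts with no special formatting (see
+    PREPROCESSING_GUIDE.md). The model name and 2048-token limit are
+    assumptions based on the base GPT-J checkpoint, not taken from the
+    benchmark's preprocessing scripts.
+
+    Args:
+        prompts: List of text prompts
+
+    Returns:
+        List of tokenized prompts
+    """
+    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
+    tokenizer.padding_side = "left"
+    tokenizer.pad_token = tokenizer.eos_token
+
+    tokenized = []
+    for prompt in prompts:
+        tokens = tokenizer.encode(prompt, truncation=True, max_length=2048)
+        tokenized.append(tokens)
+
+    return tokenized
+
+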
+def preprocess_llama2_70b(prompts, system_prompts=None):
+    """
+    Preprocess prompts for Llama 2-70B model.
+
+    Args:
+        prompts: List of user prompts
+        system_prompts: Optional list of system prompts
+
+    Returns:
+        List of tokenized prompts
+    """
+    tokenizer = AutoTokenizer.from_pretrained(
+        "meta-llama/Llama-2-70b-chat-hf",
+        use_fast=False
+    )
+    tokenizer.padding_side = "left"
+    tokenizer.pad_token = tokenizer.eos_token
+
+    # Templates from processorca.py
+    llama_prompt_system = "[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]"
+    llama_prompt_no_system = "[INST] {} [/INST]"
+
+    tokenized = []
+    for i, prompt in enumerate(prompts):
+        if system_prompts and system_prompts[i]:
+            formatted = llama_prompt_system.format(system_prompts[i], prompt)
+        else:
+            formatted = llama_prompt_no_system.format(prompt)
+
+        tokens = tokenizer.encode(formatted, max_length=1024, truncation=True)
+        tokenized.append(tokens)
+
+    return tokenized
+
+
+def create_dataset_format(prompts, tokenized_prompts, outputs=None):
+    """
+    Create dataset in expected format for MLCommons.
+
+    Args:
+        prompts: List of text prompts
+        tokenized_prompts: List of tokenized prompts
+        outputs: Optional list of expected outputs
+
+    Returns:
+        DataFrame in expected format
+    """
+    data = {
+        'text_input': prompts,
+        'tok_input': tokenized_prompts,
+    }
+
+    if outputs:
+        data['output'] = outputs
+
+    return pd.DataFrame(data)
+
+
+# Example usage
+if __name__ == "__main__":
+    # Example 1: DeepSeek-R1
+    print("=== DeepSeek-R1 Example ===")
+    deepseek_prompts = [
+        "What is machine learning?",
+        "Explain quantum computing in simple terms."
+    ]
+
+    # With chat template (PyTorch/vLLM)
+    deepseek_tokens = preprocess_deepseek_r1(deepseek_prompts, use_chat_template=True)
+    print(f"Prompt 1 token count: {len(deepseek_tokens[0])}")
+
+    # Without chat template (SGLang)
+    deepseek_tokens_no_chat = preprocess_deepseek_r1(deepseek_prompts, use_chat_template=False)
+    print(f"Prompt 1 token count (no chat): {len(deepseek_tokens_no_chat[0])}")
+
+    # Example 2: Llama 3.1-8B
+    print("\n=== Llama 3.1-8B Example ===")
+    articles = [
+        "The United Nations announced today a new climate initiative aimed at reducing global emissions by 50% by 2030. The plan includes partnerships with major corporations and governments worldwide."
+    ]
+
+    llama_tokens = preprocess_llama31_8b(articles)
+    print(f"Article 1 token count: {len(llama_tokens[0])}")
+
+    # Example 3: Create dataset
+    print("\n=== Dataset Format Example ===")
+    df = create_dataset_format(deepseek_prompts, deepseek_tokens)
+    print(df.head())
+    print(f"\nDataset shape: {df.shape}")
+    print(f"Columns: {list(df.columns)}")
+
+    # Save example
+    # df.to_pickle("preprocessed_data.pkl")
\ No newline at end of file