sgl-project · Syspretor · Nov 5, 2025 · Oct 29, 2025 · Oct 31, 2025
diff --git a/keps/8-reduce-yaml-duplication/README.md b/keps/8-reduce-yaml-duplication/README.md
@@ -2,7 +2,7 @@
 
 ## Summary
 
-This KEP introduces `roleTemplates` field in RBG spec, enabling platform teams to define reusable Pod configurations that multiple roles can reference via `templateRef`. Uses Kubernetes Strategic Merge Patch for merging, reducing YAML duplication by approximately 20%.
+This KEP introduces `roleTemplates` field in RBG spec, enabling platform teams to define reusable Pod configurations that multiple roles can reference via `templateRef`. Roles use a `templatePatch` field to overlay role-specific configurations. Uses Kubernetes Strategic Merge Patch for merging, reducing YAML duplication by approximately 20%.
 
 ## Motivation
 
@@ -14,6 +14,7 @@ Managing multi-role RBG deployments requires duplicating base configurations acr
 ### Goals
 - Introduce `spec.roleTemplates` array for reusable Pod configurations
 - Enable `role.templateRef` to reference templates within same RBG
+- Introduce `role.templatePatch` field for overlay configurations when using RoleTemplate
 - Use Kubernetes Strategic Merge Patch for configuration merging
 - No template variables (no `{{.var}}` syntax)
 
@@ -30,20 +31,81 @@ Managing multi-role RBG deployments requires duplicating base configurations acr
 
 **Story 2**: SRE updates `roleTemplates[0].template.spec.containers[0].image`, controller automatically rolls update to all referencing roles.
 
+**Story 3**: User runs `kubectl rbg preview` to visualize merged configurations before applying.
+
 ### Risks and Mitigations
 
+ - **Field naming ambiguity**: Using single field name for both complete config and overlay could cause confusion
+    - **Mitigation**: Design separate fields (`template` for traditional mode, `templatePatch` for overlay mode) with mutual exclusivity, following Kubernetes ecosystem naming conventions (e.g., Argo Workflows `podSpecPatch`)
+
 - **Strategic Merge complexity**: Users may not understand how volumes, env, and other fields merge, leading to configuration errors
   - **Mitigation 1**: Provide clear merge behavior table in documentation, explaining which fields merge by name vs complete replacement
-  - **Mitigation 2**: Add dry-run capability to rbgctl (similar to `kubectl apply --dry-run`) to preview merged configuration before applying
+  - **Mitigation 2**: Add preview/diff tools to visualize and compare merged configurations
 
 - **Template reference errors**: Roles referencing non-existent templates cause deployment failures
   - **Mitigation**: Controller validates templateRef exists during reconciliation and returns explicit error messages
 
-- **Unexpected field overrides**: Users unclear which roleTemplate fields are overridden by role.template
+- **Unexpected field overrides**: Users unclear which roleTemplate fields are overridden by role.templatePatch
   - **Mitigation**: Document Strategic Merge behavior explicitly - resources/command use complete replacement, volumes/env merge by name
 
 ## Design Details
 
+### Naming Convention
+
+KEP-8 introduces the `templatePatch` field, following the `{target}Patch` naming pattern to align with Kubernetes ecosystem conventions:
+
+- **Kubernetes ecosystem precedent**: Argo Workflows uses `podSpecPatch` (not `patchPodSpec`)
+- **Semantic clarity**: `templatePatch` clearly indicates "a patch for the template" (noun + modifier)
+- **API design principle**: Field names should describe content/type (noun), not operations (verb)
+
+**Current naming landscape**:
+```yaml
+# KEP-8 (new, aligned with ecosystem)
+role.templatePatch          # {target}Patch pattern
+
+# Existing LWS fields (retained for backward compatibility)
+leaderWorkerSet.patchLeaderTemplate  # patch{Target} pattern
+leaderWorkerSet.patchWorkerTemplate  # patch{Target} pattern
+```
+
+**Future consideration**: A separate proposal may standardize existing LWS fields to `leaderTemplatePatch` / `workerTemplatePatch` for consistency, but this is deferred to avoid breaking changes in v1alpha1.
+
+### Field Semantics: Template vs TemplatePatch
+
+**Traditional Mode**:
+```yaml
+roles:
+- name: prefill
+  template:              # Complete configuration
+    spec: {...}
+```
+
+**With RoleTemplate**:
+```yaml
+roleTemplates:
+- name: base
+  template:              # Base template
+    spec: {...}
+
+roles:
+- name: prefill
+  templateRef: {name: base}
+  templatePatch:         # Overlay configuration
+    spec: {...}
+```
+
+**Key Points**:
+- `template` and `templatePatch` are mutually exclusive in `role`
+- Naming follows `{target}Patch` convention (aligned with Argo Workflows `podSpecPatch`)
+- Note: Existing LWS fields `patchLeaderTemplate`/`patchWorkerTemplate` retain current naming for backward compatibility
+
+**Validation**:
+
+| templateRef | template | templatePatch |
+|-------------|----------|---------------|
+| Not set | Required | Rejected |
+| Set | Rejected | Required |
+
 ### API Overview
 
 **Basic Example (StatefulSet)**:
@@ -93,7 +155,7 @@ spec:
     replicas: 2
     templateRef:
       name: sglang-base
-    template:
+    templatePatch:
       spec:
         containers:
         - name: sglang
@@ -119,7 +181,7 @@ spec:
     replicas: 1
     templateRef:
       name: sglang-base
-    template:
+    templatePatch:
       spec:
         containers:
         - name: sglang
@@ -182,6 +244,13 @@ spec:
     replicas: 1
     templateRef:
       name: sglang-base
+    templatePatch:
+      spec:
+        containers:
+        - name: sglang
+          env:
+          - name: MODEL
+            value: Qwen3-32B
     workload:
       apiVersion: leaderworkerset.x-k8s.io/v1
       kind: LeaderWorkerSet
@@ -210,60 +279,64 @@ spec:
             - "python3 -m sglang.launch_server --model-path /models/Qwen3-32B --tp 2 --dist-init-addr $(LWS_LEADER_ADDRESS):20000 --nnodes $(LWS_GROUP_SIZE) --node-rank $(LWS_WORKER_INDEX)"
 ```
 
-In LWS scenarios, roleTemplate provides shared configuration (image, volumes, resources), while patchLeaderTemplate/patchWorkerTemplate customize distributed commands using LWS environment variables.
+In LWS scenarios, roleTemplate provides shared configuration (image, volumes, resources), `role.templatePatch` adds role-specific config, and patchLeaderTemplate/patchWorkerTemplate customize distributed commands.
 
 **Merge Behavior**:
 
 | Field | Merge Strategy | Example |
 |-------|---------------|---------|
-| Volumes | Merge by `name` | Template has volume "model", role adds "cache" → Both present |
-| Env | Merge by `name` | Template has env "POD_IP", role adds "ROLE" → Both present |
-| Resources | Replace entirely | Template has 2 GPU, role specifies 4 GPU → 4 GPU used |
-| Command | Replace entirely | Template has base cmd, role specifies full cmd → Role cmd used |
+| Volumes | Merge by `name` | Template has volume "model", templatePatch adds "cache" → Both present |
+| Env | Merge by `name` | Template has env "POD_IP", templatePatch adds "ROLE" → Both present |
+| Resources | Replace entirely | Template has 2 GPU, templatePatch specifies 4 GPU → 4 GPU used |
+| Command | Replace entirely | Template has base cmd, templatePatch specifies full cmd → templatePatch cmd used |
+
+> Uses Kubernetes [Strategic Merge Patch](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/#use-a-strategic-merge-patch-to-update-a-deployment) for native merge semantics.
+
 
 ### Controller Behavior
 
 **StatefulSet Scenario**:
 ```mermaid
 flowchart TD
     A[Reconcile Role] --> B{Has templateRef?}
-    B -->|Yes| C[Fetch roleTemplate by name]
-    C --> D[Strategic Merge: roleTemplate.template + role.template]
-    D --> E[Create StatefulSet]
-    B -->|No| F[Use role.template directly]
-    F --> E
+    B -->|Yes| C[Validate: templatePatch is set]
+    C --> D[Fetch roleTemplate by name]
+    D --> E[Strategic Merge: roleTemplate.template + role.templatePatch]
+    E --> F[Create StatefulSet]
+    B -->|No| G[Validate: template is set]
+    G --> H[Use role.template directly]
+    H --> F
 ```
 
 **LWS Scenario**:
 ```mermaid
 flowchart TD
     A[Reconcile Role] --> B{Has templateRef?}
-    B -->|Yes| C[Fetch roleTemplate by name]
-    C --> D[Strategic Merge: roleTemplate.template + role.template]
-    D --> E[Merged Base Template]
-    B -->|No| E
-
-    E --> F[Apply patchLeaderTemplate to Base]
-    F --> G[Apply patchWorkerTemplate to Base]
-    G --> H[Create LeaderWorkerSet]
+    B -->|Yes| C[Validate: templatePatch is set]
+    C --> D[Fetch roleTemplate by name]
+    D --> E[Strategic Merge: roleTemplate.template + role.templatePatch]
+    E --> F[Merged Base Template]
+    B -->|No| G[Validate: template is set]
+    G --> H[Use role.template]
+    H --> F
+
+    F --> I[Apply patchLeaderTemplate to Base]
+    I --> J[Apply patchWorkerTemplate to Base]
+    J --> K[Create LeaderWorkerSet]
 ```
 
 **Processing Order**:
-1. Validate templateRef exists in roleTemplates array
-2. Fetch roleTemplate by name
-3. Strategic Merge: `roleTemplate.template` + `role.template` → merged base template
+1. Validate: `template` XOR `templatePatch` based on `templateRef`
+2. Fetch roleTemplate by name (if using RoleTemplate)
+3. Strategic Merge: `roleTemplate.template` + `role.templatePatch` → merged base template
 4. For LWS workloads:
    - Apply `patchLeaderTemplate` to merged base → Leader Pod spec
    - Apply `patchWorkerTemplate` to merged base → Worker Pod spec
 5. For StatefulSet workloads:
    - Use merged base template directly
 6. Create workload (StatefulSet or LeaderWorkerSet)
 
-**Important**: The `role.template` field serves as both:
-- Source for Strategic Merge with roleTemplate
-- Base template for LWS patchLeaderTemplate/patchWorkerTemplate
-
-This means patchLeaderTemplate/patchWorkerTemplate apply **after** roleTemplate merging.
+**Important**: The `role.templatePatch` provides overlay for roleTemplate, then the merged result serves as base for LWS patches. This means patchLeaderTemplate/patchWorkerTemplate apply **after** roleTemplate merging.
 
 #### Update Workflow
 
@@ -293,7 +366,7 @@ flowchart TD
 **Update Trigger**:
 - Each workload carries a `role-revision-hash-{roleName}` label
 - Reconciler compares current label value with expected hash from new Revision
-- If different → Resolves template (merging updated roleTemplate + role.template) → Updates workload via Server-Side Apply
+- If different → Resolves template (merging updated `roleTemplate.template` + `role.templatePatch`) → Updates workload via Server-Side Apply
 - Workload controller (StatefulSet/Deployment/LWS) detects PodTemplate change → Triggers rolling update per rolloutStrategy
 
 #### ControllerRevision Evolution During roleTemplate Update
@@ -326,18 +399,82 @@ kubectl apply -f rbg-with-image-v1.0.yaml
 
 Rollback creates a new Revision (incrementing number) with previous content, maintaining linear history while allowing bidirectional updates.
 
+### Preview and Diff Tools
+
+To help users understand Strategic Merge behavior, KEP-8 provides preview/diff capabilities via kubectl plugins:
+
+#### Preview: Visualize Merged Configurations
+
+```bash
+kubectl rbg preview my-rbg.yaml
+```
+
+Resolves and displays the final PodTemplateSpec for each role after merging `roleTemplate.template` + `role.templatePatch`.
+
+**Output Example**:
+```yaml
+# Role: prefill (resolved)
+---
+spec:
+  containers:
+  - name: sglang
+    image: sglang:v0.5.1      # from roleTemplate
+    command: [...]             # from templatePatch
+    resources:
+      limits:
+        nvidia.com/gpu: "2"    # from roleTemplate
+
+# Role: decode (resolved)
+---
+spec:
+  containers:
+  - name: sglang
+    image: sglang:v0.5.1      # from roleTemplate
+    command: [...]             # from templatePatch (different from prefill)
+```
+
+#### Diff: Compare Configurations
+
+```bash
+kubectl rbg diff my-rbg.yaml
+```
+
+Compares resolved configurations between local YAML file and cluster state.
+
+**Output Example**:
+```diff
+Role: prefill
+- image: sglang:v0.5.0        # cluster
++ image: sglang:v0.5.1        # local
+
+Role: decode
+- image: sglang:v0.5.0
++ image: sglang:v0.5.1
+```
+
+**Use Cases**:
+- Validate merge behavior before applying changes
+- Understand impact of roleTemplate modifications on all referencing roles
+- Review configuration changes during PR reviews
+
+**Implementation**: Extends existing `kubectl-rbg-*` plugin pattern with new commands.
+
 ### Test Plan
 
 **Unit Tests**:
 - Template resolution by name
 - Strategic Merge logic for different field types (volumes, env, resources, command)
+- Validation: `templatePatch` requires `templateRef`
+- Validation: `template` and `templatePatch` are mutually exclusive
 - Validation: templateRef references non-existent template
 - Validation: templateRef references template in different RBG (should fail)
 
 **Integration Tests**:
 - Create RBG with roleTemplates, verify merge behavior
 - Update roleTemplate image, verify all referencing roles updated
 - LWS scenario: verify patchLeaderTemplate/patchWorkerTemplate applied after merge
+- Preview: verify resolved output correctness
+- Diff: verify comparison logic
 
 **E2E Tests**:
 - Deploy multi-role RBG using roleTemplates
@@ -346,8 +483,9 @@ Rollback creates a new Revision (incrementing number) with previous content, mai
 - LWS deployment: verify leader and worker Pods have correct merged configuration
 
 **Manual Verification**:
-- Use `rbgctl get rbg <name> --dry-run` to preview merged configuration before applying
-- Verify error messages when templateRef is invalid
+- Use `kubectl rbg preview` to visualize merged configuration before applying
+- Use `kubectl rbg diff` to compare local changes with cluster state
+- Verify error messages when templateRef is invalid or fields are mutually exclusive
 
 ## Alternatives
 
@@ -360,6 +498,8 @@ Rollback creates a new Revision (incrementing number) with previous content, mai
 - RBAC complexity (who can modify templates?)
 - Template versioning and compatibility issues
 
+**Single `template` field with dual semantics**: Simpler API but confusing semantics (same field, different meaning based on `templateRef`). Community feedback favored explicit `templatePatch` field.
+
 **RoleTemplate approach chosen because**:
 - Templates scoped to single RBG (simpler RBAC)
 - Native Strategic Merge Patch (Kubernetes-native behavior)
@@ -393,6 +533,11 @@ roles:
 - name: prefill
   templateRef:
     name: sglang-base
+  templatePatch:
+    spec:
+      containers:
+      - name: sglang
+        # extraArgs appended to base command
   extraArgs:
   - --disaggregation-mode
   - prefill
@@ -407,3 +552,4 @@ roles:
 - **2025-10-21**: Initial KEP-8 proposal submitted
 - **2025-10-24**: Revised to focus on RoleTemplates (Phase 1), defer ExtraArgs
 - **2025-10-27**: Supplemented Controller Behavior section with update workflow and ControllerRevision evolution details
+- **2025-10-30**: Updated API design with `role.templatePatch` field and preview/diff tooling, aligned naming with Kubernetes ecosystem conventions