Collaborator

@paulohtb6 paulohtb6 commented Oct 8, 2025

Description

Adds Shadowing docs.
Adds emergency runbook.

Resolves https://redpandadata.atlassian.net/browse/DOC-1665
Review deadline: Oct 17th

Page previews

Shadowing

Checks

  • [x] New feature
  • [ ] Content gap
  • [ ] Support Follow-up
  • [ ] Small fix (typos, links, copyedits, etc)

netlify bot commented Oct 8, 2025

Deploy Preview for redpanda-docs-preview ready!

Name Link
🔨 Latest commit 82147e4
🔍 Latest deploy log https://app.netlify.com/projects/redpanda-docs-preview/deploys/690123383716220008801885
😎 Deploy Preview https://deploy-preview-1381--redpanda-docs-preview.netlify.app

Contributor

coderabbitai bot commented Oct 8, 2025

📝 Walkthrough

  • Added a navigation entry for a new Shadowing guide under Redpanda deployment manual.
  • Introduced a comprehensive Shadowing documentation page covering architecture, scope, prerequisites, setup, configuration, filtering, monitoring, failover behavior, and best practices, with CLI/Admin API examples.
  • Added an emergency runbook page for disaster failover of Shadow Links, including assessment, verification, failover execution (cluster-wide or selective), monitoring, app reconfiguration, troubleshooting, recovery, and post-incident steps.
  • Included enterprise licensing cross-reference in the emergency guide.
  • No changes to exported/public code entities.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Admin as Operator
  participant Prim as Primary Cluster
  participant Shadow as Shadow Cluster
  participant Ctrl as Admin API / rpk
  participant Sec as Auth/TLS
  participant Obs as Monitoring

  rect rgb(235, 245, 255)
  note over Admin,Ctrl: Configure Shadowing
  Admin->>Ctrl: Create shadow link (templates, filters)
  Ctrl->>Sec: Authenticate / TLS handshake
  Ctrl->>Prim: Apply link config
  Prim-->>Shadow: Establish replication channel
  end

  rect rgb(245, 255, 235)
  note over Prim,Shadow: Ongoing Replication (normal ops)
  Prim-->>Shadow: Replicate topics/configs/ACLs/schema
  Prim-->>Shadow: Preserve offsets/timestamps (where applicable)
  Admin->>Ctrl: rpk/admin queries (status/metrics)
  Ctrl-->>Obs: Emit metrics/alerts
  end

  rect rgb(255, 245, 235)
  note right of Admin: Planned ops are handled in Shadowing guide
  end
sequenceDiagram
  autonumber
  actor Admin as Operator
  participant Prim as Primary Cluster
  participant Shadow as Shadow Cluster
  participant Ctrl as Admin API / rpk
  participant Apps as Applications/Clients
  participant Obs as Monitoring

  rect rgb(255, 245, 235)
  note over Admin,Prim: Emergency Failover Runbook
  Admin->>Prim: Assess incident, document state
  Admin->>Shadow: Verify readiness/health
  Admin->>Ctrl: Initiate failover (full or selective)
  Ctrl->>Shadow: Transition shadow links (FAILING_OVER→ACTIVE)
  Shadow-->>Obs: Report progress/status
  end

  rect rgb(245, 255, 235)
  note over Apps,Shadow: Post-failover
  Admin->>Apps: Update bootstrap/endpoints, TLS/ACLs
  Apps->>Shadow: Reconnect and resume traffic
  Admin->>Ctrl: Verify topics/consumer groups/offsets
  end

  alt Issues detected
    Obs-->>Admin: Alerts (PAUSED, stuck states, auth failures)
    Admin->>Ctrl: Troubleshoot per runbook steps
  else Stable
    Admin->>Prim: Plan recovery/back-sync later
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Title Check | ✅ Passed | The title succinctly conveys the main change by stating that shadowing documentation is being added and does not include extraneous details, making it clear and focused on the core update.
Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped.
Description Check | ✅ Passed | The pull request description follows the required template structure with all essential sections present and appropriately filled out. It includes a clear description of the changes (Shadowing docs and emergency runbook), properly references the Jira ticket (DOC-1665), provides a review deadline (Oct 17th), includes at least one page preview URL with proper formatting, and has the appropriate checkbox selected for "New feature." While the description text itself is concise, it accurately conveys the nature of the changes, and the supporting information (Jira link, preview, and checkbox) is complete and correct.

@paulohtb6 paulohtb6 marked this pull request as ready for review October 15, 2025 02:44
@paulohtb6 paulohtb6 requested a review from a team as a code owner October 15, 2025 02:44
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
modules/ROOT/nav.adoc (1)

88-91: Add the emergency failover doc to navigation

Shadowing entry looks good. Add a sibling nav item for the emergency runbook so users can find it.

Example:

 **** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
 **** xref:deploy:redpanda/manual/resilience/shadowing.adoc[Shadowing]
+**** xref:deploy:redpanda/manual/resilience/emergency-shadowing.adoc[Emergency Shadowing Failover]
 **** xref:deploy:redpanda/manual/sizing-use-cases.adoc[Sizing Use Cases]
modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (2)

290-299: Avoid promoting plaintext secrets in examples

Add a callout suggesting env vars or file-based secrets for credentials (and mTLS certs/keys), not inline plaintext.

Example:

  • Prefer env vars (RPK_SASL_PASSWORD) or reference secret files
  • Link to security guidance on managing secrets
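
As a rough illustration of the suggestion above, here is a minimal Python sketch that resolves a credential from an environment variable or a secret file rather than an inline literal. RPK_SASL_PASSWORD is the variable name mentioned in this comment; the helper name `load_secret` and the `_FILE` suffix convention are hypothetical and are not part of rpk.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_secret(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve a secret from the environment or from a secret file.

    Checks `name` first, then `name + "_FILE"` (a hypothetical convention
    used only in this sketch) which should hold a path to a file
    containing the secret.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        return value
    file_path = env.get(f"{name}_FILE")
    if file_path:
        # Strip the trailing newline that editors and `echo` usually add.
        return Path(file_path).read_text(encoding="utf-8").strip()
    raise KeyError(f"secret {name!r} not set via {name} or {name}_FILE")
```

A caller would use `load_secret("RPK_SASL_PASSWORD")` and pass the result to whatever client needs it, so the password never appears in a command line or a checked-in config file.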

38-38: Diagram TODO

If you need help, I can draft a diagram (draw.io/mermaid) showing active→shadow replication, preserved offsets/timestamps, and replicated artifacts.

modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (1)

74-83: Call out irreversibility before executing failover

Add an [IMPORTANT] note that failover promotion is irreversible; no automatic fallback. Place immediately before the commands.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 464513b and dafed89.

📒 Files selected for processing (3)
  • modules/ROOT/nav.adoc (1 hunks)
  • modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (1 hunks)
  • modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Redirect rules - redpanda-docs-preview
  • GitHub Check: Header rules - redpanda-docs-preview
  • GitHub Check: Pages changed - redpanda-docs-preview
🔇 Additional comments (6)
modules/deploy/pages/redpanda/manual/resilience/emergency-shadowing.adoc (2)

6-10: Enterprise license note is consistent; LGTM

Keep this partial include at the top across both docs for consistency.


48-56: Verify rpk shadow subcommands and flags

Confirm that rpk shadow list, status, failover, delete, resume and their flags (--all, --topic, --no-confirm) used in emergency-shadowing.adoc (and the corresponding sections in shadowing.adoc) match the current output of rpk shadow --help.

modules/deploy/pages/redpanda/manual/resilience/shadowing.adoc (4)

330-425: Verify ShadowLinkConfig schema alignment
Ensure the YAML example’s field names (client_options, authentication_configuration, topic_metadata_sync_options, synced_shadow_topic_properties, consumer_offset_sync_options, security_sync_options) exactly match the ShadowLinkConfig schema in the Admin API or rpk CLI.


54-57: Verify and cite Shadowing’s minimum version requirement

  • Confirm that Shadowing was introduced in Redpanda v25.3 and update the prerequisite if needed.
  • Add a link to the official v25.3 release notes or product specification where this requirement is defined.

557-576: Confirm shadow-link metrics are documented and standardize type/units
Verify that each redpanda_shadow_link_* metric appears in modules/reference/pages/public-metrics-reference.adoc and update every description to explicitly specify the Prometheus type (counter vs gauge) and units (bytes, records, offsets).


231-237: Verify rpk shadow config generate exists and --output flag

Confirm this subcommand and its --output flag are implemented in the CLI; update the docs if they’re missing.

@paulohtb6 paulohtb6 changed the base branch from main to beta October 16, 2025 15:16
@bharathv

@paulohtb6 I'm having a hard time finding these changes in the deploy preview (https://deploy-preview-1381--redpanda-docs-preview.netlify.app/current/get-started/intro-to-events/). Can you please point me to the exact URL?

@paulohtb6
Collaborator Author

@bharathv Hey Bharath. The changes are in the Page previews section of the PR description.

Copying them here too:
Page previews
Shadowing
Shadowing runbook

@@ -0,0 +1,212 @@
= Shadowing Runbook
Contributor

@paulohtb6 I think the page should be renamed too, so "emergency" is not in the URL. Also, the term runbook feels internal to me. What do you think about Failover for Disaster Recovery or Disaster Recovery Guide? cc @Feediver1

Collaborator Author

Changed to "Shadowing Guide". Let me know if it's ok or if I should change more.


Redpanda v25.3 introduces xref:deploy:redpanda/manual/resilience/shadowing.adoc[Shadowing], an Enterprise-licensed disaster recovery solution that provides asynchronous, offset-preserving replication between distinct Redpanda clusters. Shadowing enables cross-region data protection by replicating topic data, configurations, consumer group offsets, ACLs, and Schema Registry data with byte-level fidelity.

The shadow cluster operates in read-only mode while continuously receiving updates from the source cluster. During a disaster, you can fail over individual topics or an entire shadow link to make resources fully writable for production traffic. See xref:deploy:redpanda/manual/resilience/shadowing-guide.adoc[Emergency Shadowing Guide] for emergency procedures.
Contributor

Suggest using shadowing-guide.adoc[] (empty link text, so the page title is picked up automatically), since this page keeps getting renamed.

Contributor

Also suggest using failover (one word) here, since most of the docs use that. I still think we should discuss overall usage, but for now I'd keep it all consistent.

@michael-redpanda
Contributor

@paulohtb6 can we add a couple of restrictions/notifications for users:

  • Users should avoid using WASM data transforms on the Shadow Cluster. We will prevent WASM data transforms from writing to shadow topics, but the concern is the performance impact that transforms would have on the Shadow Link.
  • Users should not attempt to shadow source topics that have write caching enabled. Write caching is a write-path optimization that could result in data loss on the source cluster (details: https://docs.redpanda.com/current/develop/config-topics/#configure-write-caching). Data on a write-cache-enabled topic could be lost due to a broker reset. If the Shadow Link has already replicated that data, the shadow cluster would diverge from the source.
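
To make the write-caching restriction concrete, here is a minimal Python sketch that separates topics that look safe to shadow from those that should be excluded. It assumes topic configurations have already been fetched into a plain mapping; the function name and input shape are hypothetical, and the property is assumed to be named `write.caching` as on the linked page. A missing property is treated as disabled, which ignores any cluster-level default, so a real check would need the effective value.

```python
from typing import Dict, List, Tuple


def split_shadow_candidates(
    topic_configs: Dict[str, Dict[str, str]],
) -> Tuple[List[str], List[str]]:
    """Return (safe, excluded) topic names, sorted alphabetically.

    A topic is excluded when its `write.caching` property is "true".
    Assumption: an unset property means write caching is off; this does
    not account for a cluster-wide default.
    """
    safe: List[str] = []
    excluded: List[str] = []
    for topic, config in sorted(topic_configs.items()):
        if config.get("write.caching", "false").lower() == "true":
            excluded.append(topic)
        else:
            safe.append(topic)
    return safe, excluded
```

The excluded list could feed a warning in the docs tooling or a pre-flight check before a shadow link's topic filters are written.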

7 participants