@starlightromero commented Dec 26, 2025

Summary

This PR adds drift detection and automatic monitor recreation to the DatadogMonitor controller. When a monitor is deleted in Datadog outside the operator, the operator detects the drift and recreates the monitor, preserving its original configuration.

Key Features

  • Drift Detection: Detects when a monitor is missing from the Datadog API
  • Monitor Recreation: Recreates missing monitors while preserving their configuration
  • Status Management: New condition types (DriftDetected, Recreated) for tracking recreation events
  • Error Handling: Handles API unavailability, rate limiting, and validation errors
  • Event Emission: Kubernetes events are emitted for successful recreations
  • Concurrency Safety: Optimistic locking and conflict resolution for concurrent operations

Implementation Details

Core Changes

  • Enhanced reconciliation loop with drift detection before force sync
  • New handleMonitorRecreation() method for recreation logic
  • Added DriftDetected and Recreated condition types to API
  • Enhanced error categorization and retry logic
  • Added MonitorRecreated event type for Kubernetes events
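The flow described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual `handleMonitorRecreation()` implementation: `fakeAPI`, `Status`, `Condition`, `setCondition`, and `reconcileOnce` are hypothetical, and only the condition type names `DriftDetected` and `Recreated` come from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition and Status are simplified, hypothetical stand-ins for the resource status.
type Condition struct {
	Type    string
	Status  bool
	Message string
}

type Status struct {
	MonitorID  int
	Conditions []Condition
}

// setCondition updates the condition of the given type, or appends it if absent.
func setCondition(s *Status, typ string, status bool, msg string) {
	for i := range s.Conditions {
		if s.Conditions[i].Type == typ {
			s.Conditions[i].Status = status
			s.Conditions[i].Message = msg
			return
		}
	}
	s.Conditions = append(s.Conditions, Condition{Type: typ, Status: status, Message: msg})
}

var errNotFound = errors.New("monitor not found")

// fakeAPI is an in-memory stand-in for the Datadog monitors API.
type fakeAPI struct {
	monitors map[int]string // monitor ID -> query
	nextID   int
}

func (f *fakeAPI) Get(id int) error {
	if _, ok := f.monitors[id]; !ok {
		return errNotFound
	}
	return nil
}

func (f *fakeAPI) Create(query string) int {
	f.nextID++
	f.monitors[f.nextID] = query
	return f.nextID
}

// reconcileOnce checks for drift and recreates the monitor when it is missing.
// It reports whether a recreation happened.
func reconcileOnce(api *fakeAPI, query string, s *Status) (bool, error) {
	err := api.Get(s.MonitorID)
	if err == nil {
		setCondition(s, "DriftDetected", false, "monitor present")
		return false, nil
	}
	if !errors.Is(err, errNotFound) {
		return false, err // other errors are left to the caller's retry logic
	}
	old := s.MonitorID
	setCondition(s, "DriftDetected", true, fmt.Sprintf("monitor %d missing", old))
	s.MonitorID = api.Create(query)
	setCondition(s, "Recreated", true, fmt.Sprintf("recreated monitor %d as %d", old, s.MonitorID))
	return true, nil
}

func main() {
	api := &fakeAPI{monitors: map[int]string{}, nextID: 100}
	st := &Status{MonitorID: 42} // ID 42 was deleted externally
	recreated, err := reconcileOnce(api, "avg(last_5m):avg:system.cpu.user{*} > 90", st)
	fmt.Println(recreated, err, st.MonitorID)
}
```

In this sketch the new monitor ID replaces the stale one in status, so a second reconcile finds the monitor and clears `DriftDetected` without creating another copy.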

Testing

  • Property-based tests: 8 comprehensive properties with 100+ iterations each
  • Unit tests: Complete coverage of new methods and enhanced functionality
  • Integration tests: End-to-end workflow testing with envtest framework
  • Drift simulation: Tests for various API error scenarios and edge cases

Test Results

All tests pass:

  • 47 unit test cases covering all scenarios
  • Property-based tests validate correctness across random inputs
  • Integration tests verify end-to-end workflow
  • Backward compatibility maintained for existing functionality

Breaking Changes

None. This is a backward-compatible enhancement that adds new functionality without modifying existing behavior.

Related Issues

Implements automatic recreation of monitors that were deleted outside the operator.

Fixes #1962

Checklist

  • Tests added/updated
  • Documentation updated (inline comments)
  • Backward compatibility maintained
  • Error handling implemented
  • Logging added for debugging
  • Property-based testing for correctness validation

@starlightromero requested a review from a team as a code owner December 26, 2025 02:40

- Add drift detection to reconciliation loop to identify missing monitors
- Implement handleMonitorRecreation method for automatic recreation
- Add new condition types: DriftDetected, Recreated for status tracking
- Enhance error handling for API unavailability, rate limiting, and validation
- Add Kubernetes event emission for successful monitor recreations
- Implement concurrent operation safety with optimistic locking
- Add comprehensive property-based tests for all recreation scenarios
- Add integration tests for end-to-end recreation workflow

This feature automatically detects when monitors are deleted externally
and recreates them while preserving configuration and maintaining
proper status reporting.

Closes: #monitor-recreation-feature
Signed-off-by: Starlight Romero <[email protected]>

Development

Successfully merging this pull request may close these issues.

[DatadogMonitor] Monitor not found
